Deep Residual Learning for Image Recognition
Paper Summary
ResNet introduced residual connections that allow training of networks with hundreds or even thousands of layers. This architectural innovation solved the degradation problem in very deep networks and became a fundamental building block in modern deep learning.
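To make the building block concrete, here is a minimal sketch of a basic residual block, assuming a PyTorch environment; the class name BasicBlock and the two-convolution layout follow the paper's basic-block variant, but the code is illustrative rather than the authors' released implementation:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block sketch: out = relu(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions, each followed by batch norm.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                      # the shortcut carries the input unchanged
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity              # residual addition: F(x) + x
        return self.relu(out)
```

The shortcut adds no parameters when input and output dimensions match; only the residual branch is learned.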
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
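In the paper's notation, a building block computes

    y = F(x, {W_i}) + x

where x is the block input, F is the residual function realized by the stacked layers, and the addition is the identity shortcut; a linear projection W_s x replaces x on the shortcut only when dimensions change.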
Critical Analysis & Questions for Consideration
ResNet's skip connections solved the deep network training problem, but examining the paper's claims and explanations reveals interesting gaps between stated theory and actual mechanisms.
Fundamental Breakthrough
Skip connections enabled training of networks far deeper than was previously practical, breaking through a major barrier in deep learning. This simple yet powerful innovation became a cornerstone of modern architecture design.
Degradation Problem Mischaracterization
The paper names the training failure of deeper plain networks "degradation" and attributes it to optimization difficulty, but later work on loss-landscape visualization suggests skip connections help largely by smoothing the optimization landscape. The explanation offered in the paper does not fully match the empirical phenomenon.
Identity Mapping Assumption
The paper hypothesizes that it is easier for stacked layers to push a residual toward zero than to learn an identity mapping outright, but it provides limited theoretical justification for why this should hold or why identity is hard to learn without shortcuts.
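Stated explicitly, the intuition is this: if the desired mapping for a block is the identity H(x) = x, then with a shortcut the stacked layers only need to drive the residual to zero,

    F(x) = H(x) - x → 0,

which can be reached by pushing weights toward zero; without the shortcut, the same nonlinear stack must approximate x itself, which the authors argue (but do not prove) is harder for solvers to find.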
Overstated Depth Benefits
While the paper trains networks of more than 1,000 layers, the results show diminishing returns beyond roughly 100 layers: the 1202-layer CIFAR-10 model actually performs worse than the 110-layer one. The practical benefits of extreme depth are oversold relative to the computational cost.
Batch Normalization Conflation
ResNet's success is partly due to batch normalization, which is applied after every convolution throughout the network. The paper does not adequately separate the contribution of skip connections from that of batch norm.
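One way to probe this confound is an ablation that toggles batch norm and the shortcut independently inside an otherwise identical block. The sketch below is hypothetical (the flags use_bn and use_skip and this experiment are not from the paper), assuming the same PyTorch setting as above:

```python
import torch.nn as nn

class AblationBlock(nn.Module):
    """Residual block with optional batch norm and optional shortcut, for a skip-vs-BN ablation sketch."""

    def __init__(self, channels, use_bn=True, use_skip=True):
        super().__init__()
        norm = (lambda: nn.BatchNorm2d(channels)) if use_bn else (lambda: nn.Identity())
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=not use_bn), norm(), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=not use_bn), norm(),
        )
        self.use_skip = use_skip
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.body(x)
        if self.use_skip:
            out = out + x   # identity shortcut
        return self.relu(out)
```

Comparing the four flag combinations on the same training budget would give a cleaner attribution of the gains than the paper's plain-vs-residual comparison alone.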
Limited Analysis of Feature Reuse
Later work (Veit et al., 2016) showed that ResNets effectively behave like ensembles of relatively shallow networks. The paper misses this perspective, focusing on depth itself as the key factor.
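The ensemble view follows from unrolling the recurrence x_{l+1} = x_l + F_{l+1}(x_l); for two blocks, for example,

    x_2 = x_0 + F_1(x_0) + F_2(x_0 + F_1(x_0)),

so the output is a sum over exponentially many paths of varying depth, most of them short, which is the sense in which a deep ResNet behaves like an ensemble of shallow networks.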