Identity Mappings in Deep Residual Networks
arXiv:1603.05027v3 [cs.CV] 25 Jul 2016
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
Microsoft Research
Abstract. Deep residual networks [1] have emerged as a family of extremely
deep architectures showing compelling accuracy and nice convergence
behaviors. In this paper, we analyze the propagation formulations behind
the residual building blocks, which suggest that the forward and backward
signals can be directly propagated from one block to any other block, when
using identity mappings as the skip connections and after-addition
activation. A series of ablation experiments support the importance of
these identity mappings. This motivates us to propose a new residual unit,
which makes training easier and improves generalization. We report improved
results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100,
and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/
1 Introduction
Deep residual networks (ResNets) [1] consist of many stacked "Residual Units".
Each unit (Fig. 1(a)) can be expressed in a general form:
$$\mathbf{y}_l = h(\mathbf{x}_l) + \mathcal{F}(\mathbf{x}_l, \mathcal{W}_l),$$
$$\mathbf{x}_{l+1} = f(\mathbf{y}_l),$$
where $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ are the input and output of the $l$-th unit, and $\mathcal{F}$ is a residual
function. In [1], $h(\mathbf{x}_l) = \mathbf{x}_l$ is an identity mapping and $f$ is a ReLU [2] function.
ResNets that are over 100 layers deep have shown state-of-the-art accuracy for
several challenging recognition tasks in the ImageNet [3] and MS COCO [4]
competitions. The central idea of ResNets is to learn the additive residual function $\mathcal{F}$
with respect to $h(\mathbf{x}_l)$, with a key choice of using an identity mapping $h(\mathbf{x}_l) = \mathbf{x}_l$.
This is realized by attaching an identity skip connection ("shortcut").
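As a minimal sketch of the general form above, the following pure-Python snippet computes one residual unit with $h$ as the identity and $f$ as ReLU. The helper names (`residual_unit`, `toy_residual`) and the toy residual function are assumptions made for illustration only; in a real ResNet the residual function would be a stack of learned convolutional layers.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU, the choice of f in the original Residual Unit."""
    return [max(0.0, a) for a in v]


def identity(v: Vector) -> Vector:
    """Identity mapping, the choice of h in the original Residual Unit."""
    return list(v)


def residual_unit(
    x: Vector,
    residual_fn: Callable[[Vector], Vector],
    h: Callable[[Vector], Vector] = identity,
    f: Callable[[Vector], Vector] = relu,
) -> Vector:
    """Compute y_l = h(x_l) + F(x_l), then x_{l+1} = f(y_l)."""
    shortcut = h(x)
    residual = residual_fn(x)
    y = [s + r for s, r in zip(shortcut, residual)]
    return f(y)


def toy_residual(x: Vector) -> Vector:
    """Hypothetical stand-in for F(x_l, W_l): scale the input by -0.5."""
    return [-0.5 * a for a in x]


# Identity shortcut with after-addition ReLU.
x_next = residual_unit([2.0, -4.0], toy_residual)

# Same unit with f also set to the identity, so nothing is clipped.
x_next_identity = residual_unit([2.0, -4.0], toy_residual, f=identity)
```

Passing `f=identity` in the second call shows how the after-addition activation is the only step that alters the sum of the shortcut and the residual.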
In this paper, we analyze deep residual networks by focusing on creating a
"direct" path for propagating information, not only within a residual unit
but through the entire network. Our derivations reveal that if both $h(\mathbf{x}_l)$ and
$f(\mathbf{y}_l)$ are identity mappings, the signal can be directly propagated from one
unit to any other unit, in both forward and backward passes. Our experiments
empirically show that training in general becomes easier when the architecture
is closer to the above two conditions.
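To illustrate why these two conditions give a direct path, one can substitute $h(\mathbf{x}_l) = \mathbf{x}_l$ and $f(\mathbf{y}_l) = \mathbf{y}_l$ into the general form and unroll it. The following is a worked sketch under exactly those two assumptions; the index $L$ (any unit deeper than $l$) and the identity matrix $\mathbf{I}$ are notation introduced here for illustration.

$$
\begin{aligned}
\mathbf{x}_{l+1} &= \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \mathcal{W}_l), \\
\mathbf{x}_{l+2} &= \mathbf{x}_{l+1} + \mathcal{F}(\mathbf{x}_{l+1}, \mathcal{W}_{l+1})
                  = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \mathcal{W}_l) + \mathcal{F}(\mathbf{x}_{l+1}, \mathcal{W}_{l+1}), \\
\mathbf{x}_L &= \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, \mathcal{W}_i), \\
\frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l} &= \mathbf{I} + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, \mathcal{W}_i).
\end{aligned}
$$

The term $\mathbf{x}_l$ in the third line reaches $\mathbf{x}_L$ unchanged (forward pass), and the term $\mathbf{I}$ in the last line passes the gradient from unit $L$ back to unit $l$ without going through any weight layer (backward pass).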
To understand the role of skip connections, we analyze and compare various
types of $h(\mathbf{x}_l)$. We |