Identity Mappings in Deep Residual Networks
arXiv:1603.05027v3 [cs.CV] 25 Jul 2016

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
Microsoft Research
Abstract. Deep residual networks [1] have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.
1 Introduction
Deep residual networks (ResNets) [1] consist of many stacked "Residual Units". Each unit (Fig. 1(a)) can be expressed in a general form:

    y_l = h(x_l) + F(x_l, W_l),        (1)
    x_{l+1} = f(y_l),                  (2)

where x_l and x_{l+1} are input and output of the l-th unit, and F is a residual function. In [1], h(x_l) = x_l is an identity mapping and f is a ReLU [2] function. ResNets that are over 100-layer deep have shown state-of-the-art accuracy for several challenging recognition tasks on ImageNet [3] and MS COCO [4] competitions. The central idea of ResNets is to learn the additive residual function F with respect to h(x_l), with a key choice of using an identity mapping h(x_l) = x_l. This is realized by attaching an identity skip connection ("shortcut").
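For concreteness, below is a minimal sketch (not the authors' released code) of this original Residual Unit. The use of PyTorch, the conv-BN-ReLU-conv-BN stack standing in for F, and the fixed channel width are assumptions made purely for illustration; the paper itself does not prescribe this implementation.

# Minimal sketch of the original Residual Unit of [1] (assumption: PyTorch):
#   y_l = h(x_l) + F(x_l, W_l)  with  h(x_l) = x_l  (identity shortcut),
#   x_{l+1} = f(y_l)            with  f = ReLU applied after the addition.
import torch
from torch import nn


class OriginalResidualUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Residual function F(x_l, W_l); the two 3x3 conv layers with BN are
        # the common basic-block form from [1], assumed here for illustration.
        self.residual_fn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        y = x + self.residual_fn(x)   # h(x_l) = x_l: identity skip connection
        return torch.relu(y)          # f: after-addition ReLU activation


# Example: stacking shape-preserving units on a 64-channel feature map.
if __name__ == "__main__":
    block = nn.Sequential(OriginalResidualUnit(64), OriginalResidualUnit(64))
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])

Note that the after-addition ReLU here is exactly the f(y_l) of Eqn. (2); it sits on the path between consecutive units, which is what the analysis below is concerned with.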
In this paper, we analyze deep residual networks by focusing on creating a "direct" path for propagating information, not only within a residual unit, but through the entire network. Our derivations reveal that if both h(x_l) and f(y_l) are identity mappings, the signal could be directly propagated from one unit to any other unit, in both forward and backward passes. Our experiments empirically show that training in general becomes easier when the architecture is closer to the above two conditions.
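As a brief worked illustration of the forward claim (a sketch consistent with Eqn. (1)-(2), under the assumption that h(x_l) = x_l and f(y_l) = y_l), recursively applying the unit gives:

    x_{l+1} = x_l + F(x_l, W_l)
    x_{l+2} = x_{l+1} + F(x_{l+1}, W_{l+1}) = x_l + F(x_l, W_l) + F(x_{l+1}, W_{l+1})
    ...
    x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)

so any deeper unit L receives the shallow signal x_l directly, plus a sum of residual terms; by the chain rule, the gradient with respect to x_l likewise contains a term propagated straight from x_L through the additive skip path.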
To understand the role of skip connections, we analyze and compare various types of h(x_l). We