Identity Mappings in Deep Residual Networks

arXiv:1603.05027v3 [cs.CV] 25 Jul 2016

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
Microsoft Research

Abstract. Deep residual networks [1] have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.
1 Introduction

Deep residual networks (ResNets) [1] consist of many stacked "Residual Units". Each unit (Fig. 1(a)) can be expressed in a general form:

    y_l = h(x_l) + F(x_l, W_l),
    x_{l+1} = f(y_l),

where x_l and x_{l+1} are the input and output of the l-th unit, and F is a residual function. In [1], h(x_l) = x_l is an identity mapping and f is a ReLU [2] function. ResNets that are more than 100 layers deep have shown state-of-the-art accuracy for several challenging recognition tasks on the ImageNet [3] and MS COCO [4] competitions. The central idea of ResNets is to learn the additive residual function F with respect to h(x_l), with a key choice of using an identity mapping h(x_l) = x_l. This is realized by attaching an identity skip connection ("shortcut").
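
To make this general form concrete, here is a minimal sketch of the original (after-addition activation) unit in plain NumPy, using fully-connected features: h is the identity, F is a toy two-layer residual branch, and f is a ReLU applied after the addition. The layer shapes, initialization, and the use of dense layers instead of the paper's convolution/batch-norm branch are illustrative assumptions, not the architecture of [1].

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_branch(x, W1, W2):
    # Toy residual function F(x_l, W_l): two dense layers with a ReLU in
    # between (a stand-in for the conv/BN branch of the actual ResNet unit).
    return relu(x @ W1) @ W2

def original_unit(x_l, W1, W2):
    # y_l = h(x_l) + F(x_l, W_l), with h(x_l) = x_l (identity shortcut)
    y_l = x_l + residual_branch(x_l, W1, W2)
    # x_{l+1} = f(y_l), with f = ReLU applied after the addition
    return relu(y_l)

rng = np.random.default_rng(0)
d = 8                                    # feature width (illustrative)
x_l = rng.standard_normal(d)
W1 = 0.1 * rng.standard_normal((d, d))
W2 = 0.1 * rng.standard_normal((d, d))
print(original_unit(x_l, W1, W2))
```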

In this paper, we analyze deep residual networks by focusing on creating a "direct" path for propagating information, not only within a residual unit but through the entire network. Our derivations reveal that if both h(x_l) and f(y_l) are identity mappings, the signal can be directly propagated from one unit to any other unit, in both the forward and backward passes. Our experiments empirically show that training in general becomes easier when the architecture is closer to the above two conditions.
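
A quick way to see the forward half of this claim: if f is also the identity, the recursion becomes x_{l+1} = x_l + F(x_l, W_l), so any deeper feature equals the shallower feature plus a sum of residual outputs, i.e. the signal from x_l reaches every later unit through an unmodified additive path. The toy check below, using scalar features and arbitrary made-up residual functions purely for illustration, verifies this numerically.

```python
# With h and f both identities, x_{i+1} = x_i + F_i(x_i), so unrolling the
# recursion gives x_L = x_l + sum of the intermediate residual outputs.
residual_fns = [lambda x: 0.5 * x, lambda x: x ** 2, lambda x: -0.1 * x]

x = 2.0                        # x_l (a scalar feature for simplicity)
residual_sum = 0.0
for F in residual_fns:
    r = F(x)                   # residual computed at the current feature
    residual_sum += r
    x = x + r                  # identity shortcut, no after-addition ReLU

print(x)                       # x_L
print(2.0 + residual_sum)      # x_l + summed residuals: the same value
```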

To understand the role of skip connections, we analyze and compare various types of h(x_l). We