Identity Mappings in Deep Residual Networks


     arXiv:1603.05027v3 [cs.CV] 25 Jul 2016

     Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun

                            Microsoft Research


     Abstract. Deep residual networks [1] have emerged as a family of extremely
     deep architectures showing compelling accuracy and nice convergence
     behaviors. In this paper, we analyze the propagation formulations behind
     the residual building blocks, which suggest that the forward and backward
     signals can be directly propagated from one block to any other block when
     using identity mappings as the skip connections and after-addition
     activation. A series of ablation experiments support the importance of
     these identity mappings. This motivates us to propose a new residual unit,
     which makes training easier and improves generalization. We report improved
     results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100,
     and a 200-layer ResNet on ImageNet. Code is available at:
     https://github.com/KaimingHe/resnet-1k-layers.


     1 Introduction

     Deep residual networks (ResNets) [1] consist of many stacked "Residual Units".
     Each unit (Fig. 1(a)) can be expressed in a general form:

     \[ y_l = h(x_l) + \mathcal{F}(x_l, \mathcal{W}_l), \]
     \[ x_{l+1} = f(y_l), \]

     where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th unit, and
     $\mathcal{F}$ is a residual function. In [1], $h(x_l) = x_l$ is an identity
     mapping and $f$ is a ReLU [2] function.
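The general form above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: features are short lists of floats, and `toy_residual` is a hypothetical stand-in for the learned residual function $\mathcal{F}$ (a real unit would use convolutional layers). The names `residual_unit`, `identity`, and `toy_residual` are not from the paper.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def identity(v: Vector) -> Vector:
    """Identity mapping, the choice of h in [1]."""
    return list(v)


def residual_unit(x: Vector,
                  residual_fn: Callable[[Vector], Vector],
                  h: Callable[[Vector], Vector] = identity,
                  f: Callable[[Vector], Vector] = relu) -> Vector:
    """Compute y = h(x) + F(x), then return x_next = f(y)."""
    shortcut = h(x)
    residual = residual_fn(x)
    y = [s + r for s, r in zip(shortcut, residual)]
    return f(y)


def toy_residual(v: Vector) -> Vector:
    """Hypothetical residual function: negate and halve the input."""
    return [-0.5 * a for a in v]


x_l = [2.0, -4.0]
x_next_relu = residual_unit(x_l, toy_residual)                   # f = ReLU, as in [1]
x_next_identity = residual_unit(x_l, toy_residual, f=identity)   # f = identity
```

With $f$ = ReLU the negative component of $y_l$ is truncated to zero, while with identity $f$ it passes through unchanged.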
        ResNets that are over 100 layers deep have shown state-of-the-art accuracy for
     several challenging recognition tasks in the ImageNet [3] and MS COCO [4]
     competitions. The central idea of ResNets is to learn the additive residual
     function $\mathcal{F}$ with respect to $h(x_l)$, with a key choice of using an
     identity mapping $h(x_l) = x_l$. This is realized by attaching an identity skip
     connection ("shortcut").
        In this paper, we analyze deep residual networks by focusing on creating a
     "direct" path for propagating information, not only within a residual unit
     but through the entire network. Our derivations reveal that if both $h(x_l)$
     and $f(y_l)$ are identity mappings, the signal can be directly propagated from
     one unit to any other unit, in both forward and backward passes. Our
     experiments empirically show that training in general becomes easier when the
     architecture is closer to the above two conditions.
        To understand the role of skip connections, we analyze and compare various
     types of $h(x_l)$. We find that the identity mapping $h(x_l) = x_l$ chosen in [1]

     [Figure 1 graphic. Left: (a) original Residual Unit: x_l, weight, BN, ReLU,
     weight, BN, addition, ReLU, x_{l+1}; (b) proposed Residual Unit: x_l, BN, ReLU,
     weight, BN, ReLU, weight, addition, x_{l+1}. Right: training curves for
     "ResNet-1001, original (error: 7.61%)" and "ResNet-1001, proposed (error: 4.92%)";
     x-axis: Iterations (x 10^4); left y-axis: Training Loss; right y-axis:
     Test Error (%).]

     Figure 1. Left: (a) original Residual Unit in [1]; (b) proposed Residual Unit. The grey
     arrows indicate the easiest paths for the information to propagate, corresponding to
     the additive term "$x_l$" in Eqn.(4) (forward propagation) and the additive term "1" in
     Eqn.(5) (backward propagation). Right: training curves on CIFAR-10 of 1001-layer
     ResNets. Solid lines denote test error (y-axis on the right), and dashed lines denote
     training loss (y-axis on the left). The proposed unit makes ResNet-1001 easier to train.


     achieves the fastest error reduction and lowest training loss among all variants
     we investigated, whereas skip connections of scaling, gating [5,6,7], and 1×1
     convolutions all lead to higher training loss and error. These experiments suggest
     that keeping a "clean" information path (indicated by the grey arrows in Fig. 1, 2,
     and 4) is helpful for easing optimization.
        To construct an identity mapping $f(y_l) = y_l$, we view the activation
     functions (ReLU and BN [8]) as "pre-activation" of the weight layers, in contrast
     to the conventional wisdom of "post-activation". This point of view leads to a new
     residual unit design, shown in Fig. 1(b). Based on this unit, we present
     competitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier
     to train and generalizes better than the original ResNet in [1]. We further report
     improved results on ImageNet using a 200-layer ResNet, for which the counterpart
     of [1] starts to overfit. These results suggest that there is much room to
     exploit the dimension of network depth, a key to the success of modern deep
     learning.


     2 Analysis of Deep Residual Networks


     The ResNets developed in [1] are modularized architectures that stack building
     blocks of the same connecting shape. In this paper we call these blocks "Residual
     Units". The original Residual Unit in [1] performs the following computation:

     \[ y_l = h(x_l) + \mathcal{F}(x_l, \mathcal{W}_l), \tag{1} \]
     \[ x_{l+1} = f(y_l). \tag{2} \]

     Here $x_l$ is the input feature to the $l$-th Residual Unit.
     $\mathcal{W}_l = \{W_{l,k} \mid 1 \le k \le K\}$ is a set of weights (and biases)
     associated with the $l$-th Residual Unit, and $K$ is the number of layers in a
     Residual Unit ($K$ is 2 or 3 in [1]). $\mathcal{F}$ denotes the residual function,
     e.g., a stack of two 3×3 convolutional layers in [1]. The function $f$ is the
     operation after element-wise addition, and in [1] $f$ is ReLU. The function $h$
     is set as an identity mapping: $h(x_l) = x_l$ (see footnote 1).
        If $f$ is also an identity mapping, $x_{l+1} \equiv y_l$, we can put Eqn.(2)
     into Eqn.(1) and obtain:
     \[ x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l). \tag{3} \]
     Recursively ($x_{l+2} = x_{l+1} + \mathcal{F}(x_{l+1}, \mathcal{W}_{l+1})
     = x_l + \mathcal{F}(x_l, \mathcal{W}_l) + \mathcal{F}(x_{l+1}, \mathcal{W}_{l+1})$,
     etc.) we will have:
     \[ x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i), \tag{4} \]
     for any deeper unit $L$ and any shallower unit $l$. Eqn.(4) exhibits some nice
     properties. (i) The feature $x_L$ of any deeper unit $L$ can be represented as the
     feature $x_l$ of any shallower unit $l$ plus a residual function in the form of
     $\sum_{i=l}^{L-1} \mathcal{F}$, indicating that the model is in a residual fashion
     between any units $L$ and $l$. (ii) The feature
     $x_L = x_0 + \sum_{i=0}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)$ of any deep unit $L$
     is the summation of the outputs of all preceding residual functions (plus $x_0$).
     This is in contrast to a "plain network" where a feature $x_L$ is a series of
     matrix-vector products, say, $\prod_{i=0}^{L-1} W_i x_0$ (ignoring BN and ReLU).
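The unrolled form of Eqn.(4) can be checked numerically on a toy chain. The sketch below assumes scalar features and a hypothetical residual function $\mathcal{F}(x; w) = w \tanh(x)$ with made-up weights; none of these choices come from the paper, and the point is only that the recursion of Eqn.(3) and the sum of Eqn.(4) agree for every pair of units.

```python
import math


def residual_fn(x: float, w: float) -> float:
    """Hypothetical scalar residual function F(x; w) = w * tanh(x)."""
    return w * math.tanh(x)


def forward(x0: float, weights: list) -> list:
    """Apply x_{l+1} = x_l + F(x_l; w_l) and return [x_0, ..., x_L]."""
    xs = [x0]
    for w in weights:
        xs.append(xs[-1] + residual_fn(xs[-1], w))
    return xs


weights = [0.3, -0.2, 0.5, 0.1, -0.4]   # one toy weight per unit
xs = forward(1.0, weights)
L = len(weights)

# Eqn.(4): x_L = x_l + sum_{i=l}^{L-1} F(x_i; w_i) for every shallower unit l.
max_gap = 0.0
for l in range(L):
    summed = xs[l] + sum(residual_fn(xs[i], weights[i]) for i in range(l, L))
    max_gap = max(max_gap, abs(xs[L] - summed))
```

Here `max_gap` stays at the level of floating-point rounding, which is what Eqn.(4) predicts when $f$ is the identity.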
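        Eqn.(4) also leads to nice backward propagation properties. Denoting the
     loss function as $\mathcal{E}$, from the chain rule of backpropagation [9] we have:
     \[ \frac{\partial \mathcal{E}}{\partial x_l}
        = \frac{\partial \mathcal{E}}{\partial x_L} \frac{\partial x_L}{\partial x_l}
        = \frac{\partial \mathcal{E}}{\partial x_L}
          \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i) \right). \tag{5} \]
     Eqn.(5) indicates that the gradient $\partial \mathcal{E} / \partial x_l$ can be decomposed
     into two additive terms: a term of $\partial \mathcal{E} / \partial x_L$ that propagates
     information directly without concerning any weight layers, and another term of
     $\frac{\partial \mathcal{E}}{\partial x_L} \left( \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F} \right)$
     that propagates through the weight layers. The additive term of
     $\partial \mathcal{E} / \partial x_L$ ensures that information is directly propagated
     back to any shallower unit $l$. Eqn.(5) also suggests that it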
      1 It is noteworthy that there are Residual Units for increasing dimensions and reducing
        feature map sizes [1] in which $h$ is not identity. In this case the following derivations
        do not hold strictly. But as there are only a very few such units (two on CIFAR and
        three on ImageNet, depending on image sizes [1]), we expect that they do not have
        the exponential impact as we present in Sec. 3. One may also think of our derivations
        as applied to all Residual Units within the same feature map size.

     is unlikely for the gradient $\partial \mathcal{E} / \partial x_l$ to be canceled out for a
     mini-batch, because in general the term
     $\frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}$ cannot always be $-1$ for all
     samples in a mini-batch. This implies that the gradient of a layer does not vanish
     even when the weights are arbitrarily small.
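The decomposition in Eqn.(5) can be illustrated on a scalar toy chain. The sketch assumes a hypothetical residual function $\mathcal{F}(x; w) = w \tanh(x)$ and made-up weights (neither is from the paper). It compares the exact chain-rule value of $\partial x_L / \partial x_l$ with "1 plus the derivative of the summed residuals", where the second term is estimated by central finite differences.

```python
import math


def residual_fn(x, w):
    """Hypothetical scalar residual function F(x; w) = w * tanh(x)."""
    return w * math.tanh(x)


def residual_sum(x_l, weights):
    """Return sum_i F(x_i; w_i) along the chain that starts at x_l."""
    x, total = x_l, 0.0
    for w in weights:
        r = residual_fn(x, w)
        total += r
        x += r
    return total


def dxL_dxl(x_l, weights):
    """Exact dx_L/dx_l by the chain rule: product of (1 + dF/dx_i)."""
    x, grad = x_l, 1.0
    for w in weights:
        grad *= 1.0 + w * (1.0 - math.tanh(x) ** 2)
        x += residual_fn(x, w)
    return grad


weights = [0.3, -0.2, 0.5, 0.1, -0.4]
x_l, eps = 1.0, 1e-6

# Second term of Eqn.(5), estimated by central finite differences.
d_sum = (residual_sum(x_l + eps, weights)
         - residual_sum(x_l - eps, weights)) / (2 * eps)

exact = dxL_dxl(x_l, weights)
decomposed = 1.0 + d_sum          # "1" is the shortcut term of Eqn.(5)

# With very small weights the gradient stays near 1 instead of vanishing.
tiny = dxL_dxl(x_l, [1e-8] * 1000)
```

The two values agree up to finite-difference error, and the 1000-unit chain with tiny weights still has a gradient close to 1, in line with the remark that the gradient does not vanish when the weights are arbitrarily small.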

     Discussions
     Eqn.(4) and Eqn.(5) suggest that the signal can be directly propagated from
     any unit to another, both forward and backward. The foundation of Eqn.(4) is
     two identity mappings: (i) the identity skip connection $h(x_l) = x_l$, and (ii) the
     condition that $f$ is an identity mapping.
        These directly propagated information flows are represented by the grey
     arrows in Fig. 1, 2, and 4. The above two conditions are true when these grey
     arrows cover no operations (except addition) and thus are "clean". In the
     following two sections we separately investigate the impacts of the two conditions.

     3 On the Importance of Identity Skip Connections

     Let's consider a simple modification, $h(x_l) = \lambda_l x_l$, to break the identity
     shortcut:
     \[ x_{l+1} = \lambda_l x_l + \mathcal{F}(x_l, \mathcal{W}_l), \tag{6} \]
     where $\lambda_l$ is a modulating scalar (for simplicity we still assume $f$ is identity).
     Recursively applying this formulation we obtain an equation similar to Eqn.(4):
     $x_L = \left( \prod_{i=l}^{L-1} \lambda_i \right) x_l + \sum_{i=l}^{L-1} \left( \prod_{j=i+1}^{L-1} \lambda_j \right) \mathcal{F}(x_i, \mathcal{W}_i)$,
     or simply:
     \[ x_L = \left( \prod_{i=l}^{L-1} \lambda_i \right) x_l + \sum_{i=l}^{L-1} \hat{\mathcal{F}}(x_i, \mathcal{W}_i), \tag{7} \]
     where the notation $\hat{\mathcal{F}}$ absorbs the scalars into the residual functions.
     Similar to Eqn.(5), we have backpropagation of the following form:
     \[ \frac{\partial \mathcal{E}}{\partial x_l}
        = \frac{\partial \mathcal{E}}{\partial x_L}
          \left( \left( \prod_{i=l}^{L-1} \lambda_i \right)
                 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \hat{\mathcal{F}}(x_i, \mathcal{W}_i) \right). \tag{8} \]
     Unlike Eqn.(5), in Eqn.(8) the first additive term is modulated by a factor
     $\prod_{i=l}^{L-1} \lambda_i$. For an extremely deep network ($L$ is large), if
     $\lambda_i > 1$ for all $i$, this factor can be exponentially large; if $\lambda_i < 1$
     for all $i$, this factor can be exponentially small and vanish, which blocks the
     backpropagated signal from the shortcut and forces it to flow through the weight
     layers. This results in optimization difficulties as we show by experiments.
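To give a feel for the size of the factor $\prod_{i=l}^{L-1} \lambda_i$, the following sketch evaluates it for a constant scalar over 54 shortcuts, the number of Residual Units in the ResNet-110 used in the experiments. The helper name `shortcut_factor` and the sample values 0.5 and 1.5 are illustrative choices, not part of the paper's code.

```python
def shortcut_factor(lam: float, num_units: int) -> float:
    """Product of a constant scalar lam over num_units shortcuts."""
    factor = 1.0
    for _ in range(num_units):
        factor *= lam
    return factor


num_units = 54  # number of Residual Units in ResNet-110

identity_factor = shortcut_factor(1.0, num_units)   # stays at 1
halved_factor = shortcut_factor(0.5, num_units)     # 0.5 ** 54, about 5.6e-17
grown_factor = shortcut_factor(1.5, num_units)      # 1.5 ** 54, about 3.2e9
```

Only the identity case keeps the shortcut term of Eqn.(8) at its original scale; a modest constant on either side of 1 already changes it by many orders of magnitude.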
        In the above analysis, the original identity skip connection in Eqn.(3) is
     replaced with a simple scaling $h(x_l) = \lambda_l x_l$. If the skip connection $h(x_l)$
     represents more complicated transforms (such as gating and 1×1 convolutions), in
     Eqn.(8) the first term becomes $\prod_{i=l}^{L-1} h'_i$ where $h'$ is the derivative of
     $h$. This product may also impede information propagation and hamper the training
     procedure as witnessed in the following experiments.


     [Figure 2 graphic. Six Residual Unit diagrams, each with a weight path of 3×3 conv,
     ReLU, 3×3 conv, followed by addition and ReLU: (a) original; (b) constant scaling
     (0.5 on the shortcut, 0.5 on the weight path); (c) exclusive gating (1×1 conv and
     sigmoid gate); (d) shortcut-only gating (1×1 conv and sigmoid gate on the shortcut);
     (e) conv shortcut (1×1 conv on the shortcut); (f) dropout shortcut (dropout on the
     shortcut).]

     Figure 2. Various types of shortcut connections used in Table 1. The grey arrows
     indicate the easiest paths for the information to propagate. The shortcut connections
     in (b-f) are impeded by different components. For simplifying illustrations we do not
     display the BN layers, which are adopted right after the weight layers for all units here.


     3.1 Experiments on Skip Connections

     We experiment with the 110-layer ResNet as presented in [1] on CIFAR-10 [10].
     This extremely deep ResNet-110 has 54 two-layer Residual Units (consisting of
     3×3 convolutional layers) and is challenging for optimization. Our implementation
     details (see appendix) are the same as [1]. Throughout this paper we report
     the median accuracy of 5 runs for each architecture on CIFAR, reducing the
     impacts of random variations.
        Though our above analysis is driven by identity $f$, the experiments in this
     section are all based on $f$ = ReLU as in [1]; we address identity $f$ in the next
     section. Our baseline ResNet-110 has 6.61% error on the test set. The comparisons
     of other variants (Fig. 2 and Table 1) are summarized as follows:
        Constant scaling. We set $\lambda = 0.5$ for all shortcuts (Fig. 2(b)). We further
     study two cases of scaling $\mathcal{F}$: (i) $\mathcal{F}$ is not scaled; or (ii)
     $\mathcal{F}$ is scaled by a constant scalar of $1 - \lambda = 0.5$, which is similar to
     the highway gating [6,7] but with frozen gates. The former case does not converge
     well; the latter is able to converge, but the test error (Table 1, 12.35%) is
     substantially higher than the original ResNet-110. Fig. 3(a) shows that the training
     error is higher than that of the original ResNet-110, suggesting that the
     optimization has difficulties when the shortcut signal is scaled down.

     Table 1. Classification error on the CIFAR-10 test set using ResNet-110 [1], with
     different types of shortcut connections applied to all Residual Units. We report "fail"
     when the test error is higher than 20%.

     case                   Fig.        on shortcut   on F    error (%)   remark
     original [1]           Fig. 2(a)   1             1       6.61
     constant scaling       Fig. 2(b)   0             1       fail        This is a plain net
                                        0.5           1       fail
                                        0.5           0.5     12.35       frozen gating
     exclusive gating       Fig. 2(c)   1 - g(x)      g(x)    fail        init b_g = 0 to -5
                                        1 - g(x)      g(x)    8.70        init b_g = -6
                                        1 - g(x)      g(x)    9.81        init b_g = -7
     shortcut-only gating   Fig. 2(d)   1 - g(x)      1       12.86       init b_g = 0
                                        1 - g(x)      1       6.91        init b_g = -6
     1×1 conv shortcut      Fig. 2(e)   1×1 conv      1       12.22
     dropout shortcut       Fig. 2(f)   dropout 0.5   1       fail


        Exclusive gating. Following the Highway Networks [6,7] that adopt a gating
     mechanism [5], we consider a gating function $g(x) = \sigma(W_g x + b_g)$ where a
     transform is represented by weights $W_g$ and biases $b_g$ followed by the sigmoid
     function $\sigma(x) = \frac{1}{1 + e^{-x}}$. In a convolutional network $g(x)$ is
     realized by a 1×1 convolutional layer. The gating function modulates the signal by
     element-wise multiplication.
        We investigate the "exclusive" gates as used in [6,7]: the $\mathcal{F}$ path is
     scaled by $g(x)$ and the shortcut path is scaled by $1 - g(x)$. See Fig. 2(c). We find
     that the initialization of the biases $b_g$ is critical for training gated models, and
     following the guidelines (footnote 2) in [6,7], we conduct hyper-parameter search on
     the initial value of $b_g$ in the range of 0 to -10 with a step of -1 on the training
     set by cross-validation. The best value (-6 here) is then used for training on the
     training set, leading to a test result of 8.70% (Table 1), which still lags far behind
     the ResNet-110 baseline. Fig. 3(b) shows the training curves. Table 1 also reports the
     results of using other initialized values, noting that the exclusive gating network
     does not converge to a good solution when $b_g$ is not appropriately initialized.
        The impact of the exclusive gating mechanism is two-fold. When $1 - g(x)$
     approaches 1, the gated shortcut connections are closer to identity, which helps
     information propagation; but in this case $g(x)$ approaches 0 and suppresses the
     function $\mathcal{F}$. To isolate the effects of the gating functions on the shortcut
     path alone, we next investigate a non-exclusive gating mechanism.
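The two-fold effect of the bias initialization can be made concrete with the sigmoid itself. In the sketch below, `pre_activation` stands in for $W_g x$ and is taken as 0 at initialization; that simplification, and the helper name `initial_gates`, are assumptions made only for this illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def initial_gates(bias: float, pre_activation: float = 0.0):
    """Return (g, 1 - g) for g = sigmoid(pre_activation + bias).

    pre_activation stands in for W_g x; taking it as 0 at initialization
    is an assumption made only for this illustration.
    """
    g = sigmoid(pre_activation + bias)
    return g, 1.0 - g


g_zero, shortcut_zero = initial_gates(0.0)    # g = 0.5, shortcut scale 0.5
g_neg6, shortcut_neg6 = initial_gates(-6.0)   # g about 0.0025, shortcut about 0.9975
```

Under this simplification, a bias of -6 leaves the shortcut almost untouched but scales the $\mathcal{F}$ path by roughly 0.0025, while a bias of 0 halves both paths.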
        Shortcut-only gating. In this case the function $\mathcal{F}$ is not scaled; only
     the shortcut path is gated by $1 - g(x)$. See Fig. 2(d). The initialized value of $b_g$
     is still essential in this case. When the initialized $b_g$ is 0 (so initially the
     expectation of $1 - g(x)$ is 0.5), the network converges to a poor result of 12.86%
     (Table 1). This is also caused by higher training error (Fig. 3(c)).

      2 See also: people.idsia.ch/~rupesh/very_deep_learning/ by [6,7].

     [Figure 3 graphic. Four panels of training curves, each comparing "110, original"
     with one variant: (a) "110, const scaling (0.5, 0.5)"; (b) "110, exclusive gating
     (init b=-6)"; (c) "110, shortcut-only gating (init b=0)"; (d) "110, 1x1 conv
     shortcut". x-axis: Iterations (x 10^4); left y-axis: Training Loss; right y-axis:
     Test Error (%).]


     Figure 3. Training curves on CIFAR-10 of various shortcuts. Solid lines denote test
     error (y-axis on the right), and dashed lines denote training loss (y-axis on the left).


        When the initialized $b_g$ is very negatively biased (e.g., $-6$), the value of
     $1 - g(x)$ is closer to 1 and the shortcut connection is nearly an identity mapping.
     Therefore, the result (6.91%, Table 1) is much closer to the ResNet-110 baseline.
        1×1 convolutional shortcut. Next we experiment with 1×1 convolutional
     shortcut connections that replace the identity. This option has been investigated
     in [1] (known as option C) on a 34-layer ResNet (16 Residual Units) and shows
     good results, suggesting that 1×1 shortcut connections could be useful. But we
     find that this is not the case when there are many Residual Units. The 110-layer
     ResNet has a poorer result (12.22%, Table 1) when using 1×1 convolutional
     shortcuts. Again, the training error becomes higher (Fig. 3(d)). When stacking
     so many Residual Units (54 for ResNet-110), even the shortest path may still
     impede signal propagation. We witnessed similar phenomena on ImageNet with
     ResNet-101 when using 1×1 convolutional shortcuts.
        Dropout shortcut. Last we experiment with dropout [11] (at a ratio of 0.5)
     which we adopt on the output of the identity shortcut (Fig. 2(f)). The network
     fails to converge to a good solution. Dropout statistically imposes a scale of
     $\lambda$ with an expectation of 0.5 on the shortcut, and similar to constant scaling
     by 0.5, it impedes signal propagation.

     Table 2. Classification error (%) on the CIFAR-10 test set using different activation
     functions.

     case                         Fig.        ResNet-110   ResNet-164
     original Residual Unit [1]   Fig. 4(a)   6.61         5.93
     BN after addition            Fig. 4(b)   8.17         6.50
     ReLU before addition         Fig. 4(c)   7.84         6.14
     ReLU-only pre-activation     Fig. 4(d)   6.71         5.91
     full pre-activation          Fig. 4(e)   6.37         5.46

     [Figure 4 graphic. Five Residual Unit diagrams, each mapping x_l to x_{l+1}:
     (a) original: weight, BN, ReLU, weight, BN, addition, ReLU;
     (b) BN after addition: weight, BN, ReLU, weight, addition, BN, ReLU;
     (c) ReLU before addition: weight, BN, ReLU, weight, BN, ReLU, addition;
     (d) ReLU-only pre-activation: ReLU, weight, BN, ReLU, weight, BN, addition;
     (e) full pre-activation: BN, ReLU, weight, BN, ReLU, weight, addition.]

     Figure 4. Various usages of activation in Table 2. All these units consist of the same
     components; only the orders are different.


     3.2 Discussions
     As indicated by the grey arrows in Fig. 2, the shortcut connections are the
     most direct paths for the information to propagate. Multiplicative manipulations
     (scaling, gating, 1×1 convolutions, and dropout) on the shortcuts can hamper
     information propagation and lead to optimization problems.
        It is noteworthy that the gating and 1×1 convolutional shortcuts introduce
     more parameters, and should have stronger representational abilities than
     identity shortcuts. In fact, the shortcut-only gating and 1×1 convolution cover the
     solution space of identity shortcuts (i.e., they could be optimized as identity
     shortcuts). However, their training error is higher than that of identity
     shortcuts, indicating that the degradation of these models is caused by optimization
     issues, instead of representational abilities.
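The statement that a 1×1 convolutional shortcut covers the solution space of the identity shortcut can be seen directly: a 1×1 convolution without bias is a channel-mixing matrix applied at every spatial position, so an identity matrix reproduces the input. The sketch below uses plain Python lists and a flattened list of positions; the helper names `conv1x1` and `identity_weight` are illustrative, not from any library.

```python
from typing import List

Pixel = List[float]          # one value per channel
FeatureMap = List[Pixel]     # flattened spatial positions


def conv1x1(feature_map: FeatureMap, weight: List[List[float]]) -> FeatureMap:
    """Apply a 1x1 convolution (no bias): a channel-mixing matrix per pixel."""
    return [
        [sum(w * c for w, c in zip(row, pixel)) for row in weight]
        for pixel in feature_map
    ]


def identity_weight(channels: int) -> List[List[float]]:
    """Weight matrix that makes the 1x1 convolution an identity mapping."""
    return [[1.0 if i == j else 0.0 for j in range(channels)]
            for i in range(channels)]


x = [[1.0, -2.0, 3.0], [0.5, 0.0, -1.5]]   # 2 positions, 3 channels
same = conv1x1(x, identity_weight(3))       # reproduces x
halved = conv1x1(x, [[0.5 * v for v in row] for row in identity_weight(3)])
```

The identity solution exists, so the higher training error reported for these shortcuts points to the optimizer not finding it rather than to a lack of capacity. The `halved` case corresponds to constant scaling by 0.5.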


     4 On the Usage of Activation Functions

     Experiments in the above section support the analysis in Eqn.(5) and Eqn.(8),
     both being derived under the assumption that the after-addition activation $f$
     is the identity mapping. But in the above experiments $f$ is ReLU as designed
     in [1], so Eqn.(5) and (8) are approximate in the above experiments. Next we
     investigate the impact of $f$.
        We want to make $f$ an identity mapping, which is done by re-arranging
     the activation functions (ReLU and/or BN). The original Residual Unit in [1]
     has the shape shown in Fig. 4(a): BN is used after each weight layer, and ReLU is
     adopted after BN, except that the last ReLU in a Residual Unit is after
     element-wise addition ($f$ = ReLU). Fig. 4(b-e) show the alternatives we
     investigated, explained as follows.

     4.1 Experiments on Activation
     In this section we experiment with ResNet-110 and a 164-layer Bottleneck [1]
     architecture (denoted as ResNet-164). A bottleneck Residual Unit consists of a
     1×1 layer for reducing dimension, a 3×3 layer, and a 1×1 layer for restoring
     dimension. As designed in [1], its computational complexity is similar to the
     two-3×3 Residual Unit. More details are in the appendix. The baseline
     ResNet-164 has a competitive result of 5.93% on CIFAR-10 (Table 2).
        BN after addition. Before turning $f$ into an identity mapping, we go the
     opposite way by adopting BN after addition (Fig. 4(b)). In this case $f$ involves
     BN and ReLU. The results become considerably worse than the baseline
     (Table 2). Unlike the original design, now the BN layer alters the signal that passes
     through the shortcut and impedes information propagation, as reflected by the
     difficulties in reducing training loss at the beginning of training (Fig. 6 left).
        ReLU before addition. A naive choice of making $f$ into an identity mapping
     is to move the ReLU before addition (Fig. 4(c)). However, this leads to a
     non-negative output from the transform $\mathcal{F}$, while intuitively a "residual"
     function should take values in $(-\infty, +\infty)$. As a result, the forward
     propagated signal is monotonically increasing. This may impact the representational
     ability, and the result is worse (7.84%, Table 2) than the baseline. We expect to have
     a residual function taking values in $(-\infty, +\infty)$. This condition is satisfied
     by other Residual Units including the following ones.
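The monotonic growth caused by a ReLU placed before the addition can be seen on a scalar toy chain. The raw branch outputs in `raw` are hypothetical numbers chosen for illustration; in a real network they would depend on the input and the learned weights.

```python
def relu(a: float) -> float:
    return max(0.0, a)


def forward_relu_before_add(x0: float, branch_outputs: list) -> list:
    """Apply x_{l+1} = x_l + ReLU(r_l), where r_l is the raw branch output."""
    xs = [x0]
    for r in branch_outputs:
        xs.append(xs[-1] + relu(r))
    return xs


def forward_unconstrained(x0: float, branch_outputs: list) -> list:
    """Apply x_{l+1} = x_l + r_l, so the residual can take either sign."""
    xs = [x0]
    for r in branch_outputs:
        xs.append(xs[-1] + r)
    return xs


# Hypothetical raw outputs of the residual branch in four successive units.
raw = [0.5, -1.0, 0.25, -2.0]

clipped = forward_relu_before_add(1.0, raw)    # [1.0, 1.5, 1.5, 1.75, 1.75]
free = forward_unconstrained(1.0, raw)         # [1.0, 1.5, 0.5, 0.75, -1.25]
never_decreases = all(b >= a for a, b in zip(clipped, clipped[1:]))
```

With the ReLU before the addition, the signal can only stay level or grow from unit to unit, whereas an unconstrained residual lets later units subtract from the feature as well.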
        Post-activation or pre-activation? In the original design (Eqn.(1) and
     Eqn.(2)), the activation $x_{l+1} = f(y_l)$ affects both paths in the next Residual
     Unit: $y_{l+1} = f(y_l) + \mathcal{F}(f(y_l), \mathcal{W}_{l+1})$. Next we develop an
     asymmetric form where an activation $\hat{f}$ only affects the $\mathcal{F}$ path:
     $y_{l+1} = y_l + \mathcal{F}(\hat{f}(y_l), \mathcal{W}_{l+1})$, for any $l$
     (Fig. 5(a) to (b)). By renaming the notations, we have the following form:

     \[ x_{l+1} = x_l + \mathcal{F}(\hat{f}(x_l), \mathcal{W}_l). \tag{9} \]

     It is easy to see that Eqn.(9) is similar to Eqn.(4), and can enable a backward
     formulation similar to Eqn.(5). For this new Residual Unit as in Eqn.(9), the new
     after-addition activation becomes an identity mapping. This design means that
     if a new after-addition activation $\hat{f}$ is asymmetrically adopted, it is
     equivalent to recasting $\hat{f}$ as the pre-activation of the next Residual Unit.
     This is illustrated in Fig. 5.
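The difference between the original (post-activation) unit and the asymmetric form of Eqn.(9) shows up most clearly on the shortcut path. The sketch below is a simplification under stated assumptions: features are short lists of floats, $\hat{f}$ is taken to be ReLU alone (BN is omitted), and `zero_branch` is a hypothetical weight path with all-zero weights so that only the shortcut behavior is visible.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def original_unit(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Post-activation form: x_next = ReLU(x + F(x))."""
    return relu([a + r for a, r in zip(x, branch(x))])


def pre_activation_unit(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Eqn.(9): x_next = x + F(f_hat(x)), with f_hat = ReLU for simplicity."""
    return [a + r for a, r in zip(x, branch(relu(x)))]


def zero_branch(v: Vector) -> Vector:
    """Hypothetical weight path whose weights are all zero."""
    return [0.0 for _ in v]


x0 = [1.5, -2.0, 0.0]
x_post, x_pre = x0, x0
for _ in range(3):                      # stack three units of each kind
    x_post = original_unit(x_post, zero_branch)
    x_pre = pre_activation_unit(x_pre, zero_branch)
```

After three units the pre-activation stack returns the input unchanged, while the post-activation stack has truncated the negative entry, because its activation sits on the path shared by the shortcut.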

     [Figure 5 graphic. Three stacks of Residual Units (weight, act., weight, addition):
     (a) original Residual Unit, with the activation placed after each addition;
     (b) asymmetric output activation, adopted only on the weight path;
     (c) pre-activation Residual Unit, equivalent to (b).]

     Figure 5. Using asymmetric after-addition activation is equivalent to constructing a
     pre-activation Residual Unit.

     Table 3. Classification error (%) on the CIFAR-10/100 test set using the original
     Residual Units and our pre-activation Residual Units.

     dataset     network                     baseline unit   pre-activation unit
     CIFAR-10    ResNet-110 (1layer skip)    9.90            8.91
                 ResNet-110                  6.61            6.37
                 ResNet-164                  5.93            5.46
                 ResNet-1001                 7.61            4.92
     CIFAR-100   ResNet-164                  25.16           24.33
                 ResNet-1001                 27.82           22.71


        The distinction between post-activation/pre-activation is caused by the
     presence of the element-wise addition. For a plain network that has $N$ layers,
     there are $N - 1$ activations (BN/ReLU), and it does not matter whether we think of
     them as post- or pre-activations. But for branched layers merged by addition,
     the position of activation matters.
        We experiment with two such designs: (i) ReLU-only pre-activation (Fig. 4(d)),
     and (ii) full pre-activation (Fig. 4(e)) where BN and ReLU are both adopted
     before weight layers. Table 2 shows that the ReLU-only pre-activation performs
     very similarly to the baseline on ResNet-110/164. This ReLU layer is not used in
     conjunction with a BN layer, and may not enjoy the benefits of BN [8].
        Somewhat surprisingly, when BN and ReLU are both used as pre-activation,
     the results are improved by healthy margins (Table 2 and Table 3). In Table 3 we
     report results using various architectures: (i) ResNet-110, (ii) ResNet-164, (iii)
     a 110-layer ResNet architecture in which each shortcut skips only 1 layer (i.e.,

     [Figure 6 graphic. Two panels of training curves. Left: "110, original" and
     "110, BN after add". Right: "164, original" and "164, proposed (pre-activation)".
     x-axis: Iterations (x 10^4); left y-axis: Training Loss; right y-axis:
     Test Error (%).]

     Figure 6. Training curves on CIFAR-10. Left: BN after addition (Fig. 4(b)) using
     ResNet-110. Right: pre-activation unit (Fig. 4(e)) on ResNet-164. Solid lines denote
     test error, and dashed lines denote training loss.


     a Residual Unit has only 1 layer), denoted as "ResNet-110(1layer)", and (iv)
     a 1001-layer bottleneck architecture that has 333 Residual Units (111 on each
     feature map size), denoted as "ResNet-1001". We also experiment on CIFAR-100.
     Table 3 shows that our "pre-activation" models are consistently better than
     the baseline counterparts. We analyze these results in the following.
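The layer counts in the network names can be reproduced with simple arithmetic. The sketch assumes the CIFAR design of [1], in which the total depth is the number of weight layers inside the Residual Units plus one initial convolution and one final fully connected layer; that "+ 2" is not spelled out in this section. The unit count of 54 for ResNet-164 is likewise inferred from the arithmetic rather than quoted from the text.

```python
def cifar_resnet_depth(num_units: int, layers_per_unit: int) -> int:
    """Weight layers in the residual units plus the first conv and final fc.

    The "+ 2" (one initial convolution and one final fully connected layer)
    follows the CIFAR design of [1] and is an assumption here, since this
    section does not spell it out.
    """
    return num_units * layers_per_unit + 2


resnet110 = cifar_resnet_depth(54, 2)     # 54 two-layer units
resnet164 = cifar_resnet_depth(54, 3)     # bottleneck units, 3 layers each (inferred count)
resnet1001 = cifar_resnet_depth(333, 3)   # 111 bottleneck units per feature map size
```

Under these assumptions the three calls give 110, 164, and 1001, matching the names used in Table 3.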


     4.2 Analysis

     We find the impact of pre-activation is twofold. First, the optimization is further
     eased (compared with the baseline ResNet) because $f$ is an identity mapping.
     Second, using BN as pre-activation improves regularization of the models.
        Ease of optimization. This effect is particularly obvious when training
     the 1001-layer ResNet. Fig. 1 shows the curves. Using the original design in
     [1], the training error is reduced very slowly at the beginning of training. For
     $f$ = ReLU, the signal is impacted if it is negative, and when there are many
     Residual Units, this effect becomes prominent and Eqn. (3) (so Eqn. (5)) is not
     a good approximation. On the other hand, when $f$ is an identity mapping, the
     signal can be propagated directly between any two units. Our 1001-layer network
     reduces the training loss very quickly (Fig. 1). It also achieves the lowest loss
     among all models we investigated, suggesting the success of optimization.
        We also find that the impact of $f$ = ReLU is not severe when the ResNet
     has fewer layers (e.g., 164 in Fig. 6(right)). The training curve seems to suffer
     a little at the beginning of training, but soon recovers. By monitoring the
     responses we observe that this is because, after some training, the weights are
     adjusted such that $y_l$ in Eqn. (1) is more frequently above zero and $f$ does
     not truncate it ($x_l$ is always non-negative due to the previous ReLU, so $y_l$
     is below zero only when $\mathcal{F}$ is strongly negative).
     The truncation, however, is more frequent when there are 1000 layers.
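
     The contrast between the two choices of $f$ can be checked numerically. The
     sketch below is a toy illustration, not the paper's experiment: it assumes a
     random `tanh` layer as a stand-in for the residual function, and the names
     `run_stack`, `gap_identity` and `gap_relu` are hypothetical. With $f$ as the
     identity, the recursion unrolls into "input plus the sum of all residual outputs";
     with $f$ = ReLU, truncation breaks that additive form.

```python
import numpy as np


def run_stack(x0, weights, f):
    """Apply x_{l+1} = f(x_l + F(x_l)) and record every residual output F(x_l)."""
    x, residuals = x0, []
    for w in weights:
        r = np.tanh(x @ w)  # toy residual function F (an assumption)
        residuals.append(r)
        x = f(x + r)
    return x, residuals


rng = np.random.default_rng(0)
dim, depth = 16, 100
x0 = np.abs(rng.normal(size=dim))  # non-negative input, as after a ReLU
weights = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(depth)]

x_id, res_id = run_stack(x0, weights, lambda y: y)
x_relu, res_relu = run_stack(x0, weights, lambda y: np.maximum(y, 0.0))

# Largest deviation from the additive form "x_0 + sum of residuals".
gap_identity = np.max(np.abs(x_id - (x0 + np.sum(res_id, axis=0))))
gap_relu = np.max(np.abs(x_relu - (x0 + np.sum(res_relu, axis=0))))
print("identity gap:", gap_identity)
print("ReLU gap:    ", gap_relu)
```

     The identity gap is at the level of floating-point rounding, while the ReLU gap
     grows with every truncated response.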

     Table 4. Comparisons with state-of-the-art methods on CIFAR-10 and CIFAR-100
     using "moderate data augmentation" (flip/translation), except for ELU [12] with no
     augmentation. Better results of [13,14] have been reported using stronger data
     augmentation and ensembling. For the ResNets we also report the number of parameters.
     Our results are the median of 5 runs, with mean ± std in brackets. All ResNet results
     are obtained with a mini-batch size of 128, except † with a mini-batch size of 64 (code
     available at https://github.com/KaimingHe/resnet-1k-layers).

         CIFAR-10                       error (%)             CIFAR-100                    error (%)
         NIN [15]                       8.81                  NIN [15]                     35.68
         DSN [16]                       8.22                  DSN [16]                     34.57
         FitNet [17]                    8.39                  FitNet [17]                  35.04
         Highway [7]                    7.72                  Highway [7]                  32.39
         All-CNN [14]                   7.25                  All-CNN [14]                 33.71
         ELU [12]                       6.55                  ELU [12]                     24.28
         FitResNet, LSUV [18]           5.84                  FitNet, LSUV [18]            27.66
         ResNet-110 [1] (1.7M)          6.61                  ResNet-164 [1] (1.7M)        25.16
         ResNet-1202 [1] (19.4M)        7.93                  ResNet-1001 [1] (10.2M)      27.82
         ResNet-164 [ours] (1.7M)       5.46                  ResNet-164 [ours] (1.7M)     24.33
         ResNet-1001 [ours] (10.2M)     4.92 (4.89 ± 0.14)    ResNet-1001 [ours] (10.2M)   22.71 (22.68 ± 0.22)
         ResNet-1001 [ours] (10.2M) †   4.62 (4.69 ± 0.20)


        Reducing overfitting. Another impact of using the proposed pre-activation
     unit is on regularization, as shown in Fig. 6(right). The pre-activation version
     reaches slightly higher training loss at convergence, but produces lower test
     error. This phenomenon is observed on ResNet-110, ResNet-110(1-layer), and
     ResNet-164 on both CIFAR-10 and 100. This is presumably caused by BN’s
     regularization effect [8]. In the original Residual Unit (Fig. 4(a)), although the BN
     normalizes the signal, this is soon added to the shortcut and thus the merged
     signal is not normalized. This unnormalized signal is then used as the input of
     the next weight layer. On the contrary, in our pre-activation version, the inputs
     to all weight layers have been normalized.
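
     The normalization argument above can be illustrated with a few lines of NumPy.
     This is a simplified sketch under stated assumptions: `batch_norm` has no learned
     scale or shift, random arrays stand in for the shortcut and the residual branch,
     and the ReLU of the original unit is left out so that only the effect of the
     addition is visible.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch axis (no learned scale or shift)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


rng = np.random.default_rng(0)
shortcut = rng.normal(loc=2.0, scale=3.0, size=(512, 8))  # unnormalized identity path
branch = rng.normal(loc=-1.0, scale=0.5, size=(512, 8))   # stand-in for the residual branch

# Original unit: BN sits inside the branch, then the shortcut is added.
merged = shortcut + batch_norm(branch)
# Pre-activation unit: BN is applied to the merged signal before the next weight layer.
next_input = batch_norm(merged)

print("merged     mean/std:", merged.mean(), merged.std())
print("next_input mean/std:", next_input.mean(), next_input.std())
```

     The merged signal inherits the statistics of the shortcut, whereas the signal
     fed to the next weight layer in the pre-activation ordering has per-feature
     mean close to 0 and standard deviation close to 1.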


     5 Results

     Comparisons on CIFAR-10/100. Table 4 compares the state-of-the-art methods
     on CIFAR-10/100, where we achieve competitive results. We note that we
     do not specially tailor the network width or filter sizes, nor use regularization
     techniques (such as dropout) which are very effective for these small datasets.
     We obtain these results via a simple but essential concept: going deeper. These
     results demonstrate the potential of pushing the limits of depth.

     Comparisons on ImageNet. Next we report experimental results on the 1000-class
     ImageNet dataset [3]. We have done preliminary experiments using the skip
     connections studied in Fig. 2 & 3 on ImageNet with ResNet-101 [1], and observed
     similar optimization difficulties. The training error of these non-identity shortcut
     networks is obviously higher than the original ResNet at the first learning rate

     Table 5. Comparisons of single-crop error on the ILSVRC 2012 validation set. All
     ResNets are trained using the same hyper-parameters and implementations as [1].
     Our Residual Units are the full pre-activation version (Fig. 4(e)). †: code/model
     available at https://github.com/facebook/fb.resnet.torch/tree/master/pretrained,
     using scale and aspect ratio augmentation in [20].

       method                                    augmentation      train crop   test crop   top-1    top-5
       ResNet-152, original Residual Unit [1]    scale             224×224      224×224     23.0     6.7
       ResNet-152, original Residual Unit [1]    scale             224×224      320×320     21.3     5.5
       ResNet-152, pre-act Residual Unit         scale             224×224      320×320     21.1     5.5
       ResNet-200, original Residual Unit [1]    scale             224×224      320×320     21.8     6.0
       ResNet-200, pre-act Residual Unit         scale             224×224      320×320     20.7     5.3
       ResNet-200, pre-act Residual Unit         scale+asp ratio   224×224      320×320     20.1 †   4.8 †
       Inception v3 [19]                         scale+asp ratio   299×299      299×299     21.2     5.6


     (similar to Fig. 3), and we decided to halt training due to limited resources.
     But we did finish a "BN after addition" version (Fig. 4(b)) of ResNet-101 on
     ImageNet and observed higher training loss and validation error. This model’s
     single-crop (224×224) validation error is 24.6%/7.5%, vs. the original
     ResNet-101’s 23.6%/7.1%. This is in line with the results on CIFAR in Fig. 6(left).
        Table 5 shows the results of ResNet-152 [1] and ResNet-200 (footnote 3), all trained
     from scratch. We notice that the original ResNet paper [1] trained the models using
     scale jittering with shorter side $s \in [256, 480]$, and so the test of a 224×224 crop
     on $s = 256$ (as done in [1]) is negatively biased. Instead, we test a single 320×320
     crop from $s = 320$, for all original and our ResNets. Even though the ResNets
     are trained on smaller crops, they can be easily tested on larger crops because
     the ResNets are fully convolutional by design. This size is also close to 299×299
     used by Inception v3 [19], allowing a fairer comparison.
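
     Why a fully convolutional network accepts a larger test crop can be seen in a
     minimal sketch. The toy model below is an assumption for illustration only (one
     3×3 convolution, ReLU, global average pooling, and a linear classifier; the names
     `conv3x3` and `classify` are hypothetical). Because pooling collapses whatever
     spatial size the convolution produces, the same weights yield the same number of
     class scores for a 224×224 and a 320×320 input.

```python
import numpy as np


def conv3x3(x, w):
    """'Valid' 3x3 convolution. x: (C, H, W), w: (O, C, 3, 3) -> (O, H-2, W-2)."""
    _, h, wd = x.shape
    out = np.zeros((w.shape[0], h - 2, wd - 2))
    for i in range(3):
        for j in range(3):
            patch = x[:, i:i + h - 2, j:j + wd - 2]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out


def classify(image, w_conv, w_fc):
    """Conv -> ReLU -> global average pooling -> fully-connected scores."""
    features = np.maximum(conv3x3(image, w_conv), 0.0)
    pooled = features.mean(axis=(1, 2))  # shape (O,), independent of H and W
    return w_fc @ pooled


rng = np.random.default_rng(0)
w_conv = rng.normal(size=(8, 3, 3, 3))
w_fc = rng.normal(size=(10, 8))

scores_224 = classify(rng.normal(size=(3, 224, 224)), w_conv, w_fc)
scores_320 = classify(rng.normal(size=(3, 320, 320)), w_conv, w_fc)
print(scores_224.shape, scores_320.shape)  # (10,) (10,)
```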
        The original ResNet-152 [1] has top-1 error of 21.3% on a 320×320 crop, and
     our pre-activation counterpart has 21.1%. The gain is not big on ResNet-152
     because this model has not shown severe generalization difficulties. However,
     the original ResNet-200 has an error rate of 21.8%, higher than the baseline
     ResNet-152. But we find that the original ResNet-200 has lower training error
     than ResNet-152, suggesting that it suffers from overfitting.
        Our pre-activation ResNet-200 has an error rate of 20.7%, which is 1.1%
     lower than the baseline ResNet-200 and also lower than the two versions of
     ResNet-152. When using the scale and aspect ratio augmentation of [20,19], our
     ResNet-200 has a result better than Inception v3 [19] (Table 5). Concurrent
     with our work, an Inception-ResNet-v2 model [21] achieves a single-crop result
     of 19.9%/4.9%. We expect our observations and the proposed Residual Unit will
     help this type and generally other types of ResNets.

     Computational Cost. Our models’ computational complexity is linear in
     depth (so a 1001-layer net is about 10× as complex as a 100-layer net). On CIFAR,
     ResNet-1001 takes about 27 hours to train on 2 GPUs; on ImageNet, ResNet-200
     takes about 3 weeks to train on 8 GPUs (on par with VGG nets [22]).

      3 The ResNet-200 has 16 more 3-layer bottleneck Residual Units than ResNet-152,
        which are added on the feature map of 28×28.


     6 Conclusions


     This paper investigates the propagation formulations behind the connection
     mechanisms of deep residual networks. Our derivations imply that identity short-
     cut connections and identity after-addition activation are essential for making
     information propagation smooth. Ablation experiments demonstrate phenom-
     ena that are consistent with our derivations. We also present 1000-layer deep
     networks that can be easily trained and achieve improved accuracy.


     Appendix: Implementation Details. The implementation details and hyper-parameters
     are the same as those in [1]. On CIFAR we use only the translation
     and flipping augmentation in [1] for training. The learning rate starts from 0.1,
     and is divided by 10 at 32k and 48k iterations. Following [1], for all CIFAR
     experiments we warm up the training by using a smaller learning rate of 0.01 for
     the first 400 iterations and go back to 0.1 after that, although we remark
     that this is not necessary for our proposed Residual Unit. The mini-batch size
     is 128 on 2 GPUs (64 each), the weight decay is 0.0001, the momentum is 0.9,
     and the weights are initialized as in [23].
        On ImageNet, we train the models using the same data augmentation as in
     [1]. The learning rate starts from 0.1 (no warming up), and is divided by 10 at
     30 and 60 epochs. The mini-batch size is 256 on 8 GPUs (32 each). The weight
     decay, momentum, and weight initialization are the same as above.
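
     The two learning-rate schedules described in this appendix can be written as
     small step functions. This is a sketch rather than the released training code:
     the function names `cifar_lr` and `imagenet_lr` are hypothetical, and counting
     iterations and epochs from zero (so that a change takes effect exactly at the
     stated boundary) is an assumption.

```python
def cifar_lr(iteration: int) -> float:
    """CIFAR: 0.01 warm-up for the first 400 iterations, then 0.1, /10 at 32k and 48k."""
    if iteration < 400:
        return 0.01
    if iteration < 32_000:
        return 0.1
    if iteration < 48_000:
        return 0.01
    return 0.001


def imagenet_lr(epoch: int) -> float:
    """ImageNet: 0.1 with no warm-up, divided by 10 at epochs 30 and 60."""
    if epoch < 30:
        return 0.1
    if epoch < 60:
        return 0.01
    return 0.001


for it in (0, 400, 32_000, 48_000):
    print("CIFAR iteration", it, "->", cifar_lr(it))
for ep in (0, 30, 60):
    print("ImageNet epoch", ep, "->", imagenet_lr(ep))
```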
        When using the pre-activation Residual Units (Fig. 4(d)(e) and Fig. 5), we
     pay special attention to the first and the last Residual Units of the entire
     network. For the first Residual Unit (that follows a stand-alone convolutional layer,
     conv1), we adopt the first activation right after conv1 and before splitting into
     two paths; for the last Residual Unit (followed by average pooling and a
     fully-connected classifier), we adopt an extra activation right after its element-wise
     addition. These two special cases are the natural outcome when we obtain the
     pre-activation network via the modification procedure as shown in Fig. 5.
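
     The placement of activations in the first and last units can be sketched as a
     toy forward pass. The code below is illustrative only and rests on several
     assumptions: dense matrices stand in for convolutions, `act` is BN without
     learned parameters followed by ReLU, average pooling is omitted because the toy
     has no spatial dimensions, and the names `act` and `preact_network` are
     hypothetical.

```python
import numpy as np


def act(x, eps=1e-5):
    """Stand-in for BN followed by ReLU (no learned scale or shift)."""
    x = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return np.maximum(x, 0.0)


def preact_network(x, w_conv1, unit_weights, w_fc):
    """Toy pre-activation ResNet with two weight layers per Residual Unit."""
    x = x @ w_conv1                 # stand-alone conv1
    for idx, (w1, w2) in enumerate(unit_weights):
        if idx == 0:
            x = act(x)              # first unit: activate before splitting into two paths
            residual = x @ w1       # so the shortcut also carries the activated signal
        else:
            residual = act(x) @ w1  # regular unit: activation on the residual path only
        residual = act(residual) @ w2
        x = x + residual            # identity shortcut, nothing applied after the addition
    x = act(x)                      # last unit: extra activation after its addition
    return x @ w_fc                 # fully-connected classifier


rng = np.random.default_rng(0)
dim, n_units, n_classes = 16, 3, 10
w_conv1 = rng.normal(scale=0.1, size=(8, dim))
unit_weights = [
    (rng.normal(scale=0.1, size=(dim, dim)), rng.normal(scale=0.1, size=(dim, dim)))
    for _ in range(n_units)
]
w_fc = rng.normal(scale=0.1, size=(dim, n_classes))

logits = preact_network(rng.normal(size=(32, 8)), w_conv1, unit_weights, w_fc)
print(logits.shape)  # (32, 10)
```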
        The bottleneck Residual Units (for ResNet-164/1001 on CIFAR) are constructed
     following [1]. For example, a
     $\begin{bmatrix} 3\times3,\ 16 \\ 3\times3,\ 16 \end{bmatrix}$
     unit in ResNet-110 is replaced with a
     $\begin{bmatrix} 1\times1,\ 16 \\ 3\times3,\ 16 \\ 1\times1,\ 64 \end{bmatrix}$
     unit in ResNet-164, both of which have roughly the same number
     of parameters. For the bottleneck ResNets, when reducing the feature map
     size we use projection shortcuts [1] for increasing dimensions, and when
     pre-activation is used, these projection shortcuts are also with pre-activation.
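
     The claim that the two units have roughly the same number of parameters can be
     checked by counting convolution weights. The sketch assumes that the basic unit
     has 16 input channels and that the bottleneck unit has 64 input channels
     (matching its 64-channel output so that an identity shortcut applies); biases and
     BN parameters are ignored, and `conv_params` is a hypothetical helper.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k convolution, ignoring biases and BN parameters."""
    return k * k * c_in * c_out


# Basic unit: two 3x3 convolutions with 16 channels.
basic = conv_params(3, 16, 16) + conv_params(3, 16, 16)
# Bottleneck unit: 1x1 reduce (64 -> 16), 3x3 (16 -> 16), 1x1 expand (16 -> 64).
bottleneck = conv_params(1, 64, 16) + conv_params(3, 16, 16) + conv_params(1, 16, 64)

print(basic, bottleneck)  # 4608 4352
```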

      References

        1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
           In: CVPR. (2016)
        2. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann
           machines. In: ICML. (2010)
        3. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
           Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
           Scale Visual Recognition Challenge. IJCV (2015)
        4. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
           Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. (2014)
        5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation
           (1997)
        6. Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. In: ICML
           workshop. (2015)
        7. Srivastava, R.K., Greff, K., Schmidhuber, J.: Training very deep networks. In:
           NIPS. (2015)
        8. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
           reducing internal covariate shift. In: ICML. (2015)
        9. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
           Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural
           Computation (1989)
       10. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech Report
           (2009)
       11. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
           Improving neural networks by preventing co-adaptation of feature detectors.
           arXiv:1207.0580 (2012)
       12. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network
           learning by exponential linear units (ELUs). In: ICLR. (2016)
       13. Graham, B.: Fractional max-pooling. arXiv:1412.6071 (2014)
       14. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for
           simplicity: The all convolutional net. arXiv:1412.6806 (2014)
       15. Lin, M., Chen, Q., Yan, S.: Network in network. In: ICLR. (2014)
       16. Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In:
           AISTATS. (2015)
       17. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets:
           Hints for thin deep nets. In: ICLR. (2015)
       18. Mishkin, D., Matas, J.: All you need is a good init. In: ICLR. (2016)
       19. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the
           inception architecture for computer vision. In: CVPR. (2016)
       20. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
           Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
       21. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact
           of residual connections on learning. arXiv:1602.07261 (2016)
       22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
           image recognition. In: ICLR. (2015)
       23. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing
           human-level performance on ImageNet classification. In: ICCV. (2015)