A guide to convolution arithmetic for deep learning
The authors of this guide would like to thank David Warde-Farley, Guillaume Alain and Caglar Gulcehre for their valuable feedback. We are likewise grateful to all those who helped improve this tutorial with helpful comments, constructive criticisms and code contributions. Keep them coming!

Special thanks to Ethan Schoonover, creator of the Solarized color scheme, whose colors were used for the figures.
Feedback

Your feedback is welcome! We did our best to be as precise, informative and to the point as possible, but should there be anything you feel might be an error or could be rephrased to be more precise or comprehensible, please don't refrain from contacting us. Likewise, drop us a line if you think there is something that might fit this technical report and you would like us to discuss – we will make our best effort to update this document.
Source code and animations

The code used to generate this guide along with its figures is available on GitHub. There the reader can also find an animated version of the figures.
Contents

1 Introduction
  1.1 Discrete convolutions
  1.2 Pooling

2 Convolution arithmetic
  2.1 No zero padding, unit strides
  2.2 Zero padding, unit strides
    2.2.1 Half (same) padding
    2.2.2 Full padding
  2.3 No zero padding, non-unit strides
  2.4 Zero padding, non-unit strides

3 Pooling arithmetic

4 Transposed convolution arithmetic
  4.1 Convolution as a matrix operation
  4.2 Transposed convolution
  4.3 No zero padding, unit strides, transposed
  4.4 Zero padding, unit strides, transposed
    4.4.1 Half (same) padding, transposed
    4.4.2 Full padding, transposed
  4.5 No zero padding, non-unit strides, transposed
  4.6 Zero padding, non-unit strides, transposed

5 Miscellaneous convolutions
  5.1 Dilated convolutions
Chapter 1

Introduction
Deep convolutional neural networks (CNNs) have been at the heart of spectacular advances in deep learning. Although CNNs have been used as early as the nineties to solve character recognition tasks (Le Cun et al., 1997), their current widespread application is due to much more recent work, when a deep CNN was used to beat the state of the art in the ImageNet image classification challenge (Krizhevsky et al., 2012).

Convolutional neural networks therefore constitute a very useful tool for machine learning practitioners. However, learning to use CNNs for the first time is generally an intimidating experience. A convolutional layer's output shape is affected by the shape of its input as well as the choice of kernel shape, zero padding and strides, and the relationship between these properties is not trivial to infer. This contrasts with fully-connected layers, whose output size is independent of the input size. Additionally, CNNs also usually feature a pooling stage, adding yet another level of complexity with respect to fully-connected networks. Finally, so-called transposed convolutional layers (also known as fractionally strided convolutional layers) have been employed in more and more work as of late (Zeiler et al., 2011; Zeiler and Fergus, 2014; Long et al., 2015; Radford et al., 2015; Visin et al., 2015; Im et al., 2016), and their relationship with convolutional layers has been explained with various degrees of clarity.

This guide's objective is twofold:

1. Explain the relationship between convolutional layers and transposed convolutional layers.

2. Provide an intuitive understanding of the relationship between input shape, kernel shape, zero padding, strides and output shape in convolutional, pooling and transposed convolutional layers.
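As a preview of the arithmetic developed in the following chapters, the relationship between these properties along a single axis can be sketched with the standard output-size formula for a convolution (a minimal illustration; the function name and argument names are ours):

```python
import math

def conv_output_size(i, k, p, s):
    """Output size along one axis of a convolution with input size i,
    kernel size k, zero padding p and stride s."""
    return math.floor((i + 2 * p - k) / s) + 1

# A 5-wide input convolved with a 3-wide kernel, padding 1, stride 2
# yields a 3-wide output.
print(conv_output_size(5, 3, 1, 2))
```

For instance, with i = 5, k = 3, p = 1 and s = 1 (the "half padding" setting of Section 2.2.1), the output size equals the input size.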
In order to remain broadly applicable, the results shown in this guide are independent of implementation details and apply to all commonly used machine learning frameworks, such as Theano (Bergstra et al., 2010; Bastien et al., 2012), Torch (Collobert et al., 2011), Tensorflow (Abadi et al., 2015) and Caffe (Jia et al., 2014).

This chapter briefly reviews the main building blocks of CNNs, namely discrete convolutions and pooling. For an in-depth treatment of the subject, see Chapter 9 of the Deep Learning textbook (Goodfellow et al., 2016).
1.1 Discrete convolutions

The bread and butter of neural networks is the affine transformation: a vector is received as input and is multiplied with a matrix to produce an output (to which a bias vector is usually added before passing the result through a non-linearity). This is applicable to any type of input, be it an image, a sound clip or an unordered collection of features: whatever their dimensionality, their representation can always be flattened into a vector before the transformation.
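This flatten-then-transform pipeline can be sketched as follows (a minimal illustration with made-up shapes; NumPy is assumed, and ReLU stands in for the non-linearity):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 3x3 "image" flattened into a length-9 vector.
x = rng.standard_normal((3, 3)).reshape(-1)

# Affine transformation: multiply by a weight matrix, add a bias,
# then apply a non-linearity (here ReLU).
W = rng.standard_normal((4, 9))
b = np.zeros(4)
y = np.maximum(0.0, W @ x + b)  # output vector of shape (4,)
```

Note that flattening discards the grid structure of the image; exploiting that structure rather than throwing it away is precisely what discrete convolutions are designed for.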
Images, sound clips and many other similar kinds of data have an intrinsic structure. More formally, they share these important properties: