A guide to convolution arithmetic for deep learning

The authors of this guide would like to thank David Warde-Farley, Guillaume Alain and Caglar Gulcehre for their valuable feedback. We are likewise grateful to all those who helped improve this tutorial with helpful comments, constructive criticisms and code contributions. Keep them coming!

Special thanks to Ethan Schoonover, creator of the Solarized color scheme,¹ whose colors were used for the figures.

Feedback

Your feedback is welcome! We did our best to be as precise, informative and to the point as possible, but should there be anything you feel might be an error or could be rephrased to be more precise or comprehensible, please don't refrain from contacting us. Likewise, drop us a line if you think there is something that might fit this technical report and you would like us to discuss – we will make our best effort to update this document.

Source code and animations

The code used to generate this guide along with its figures is available on GitHub.² There the reader can also find an animated version of the figures.

Contents

1 Introduction
  1.1 Discrete convolutions
  1.2 Pooling

2 Convolution arithmetic
  2.1 No zero padding, unit strides
  2.2 Zero padding, unit strides
    2.2.1 Half (same) padding
    2.2.2 Full padding
  2.3 No zero padding, non-unit strides
  2.4 Zero padding, non-unit strides

3 Pooling arithmetic

4 Transposed convolution arithmetic
  4.1 Convolution as a matrix operation
  4.2 Transposed convolution
  4.3 No zero padding, unit strides, transposed
  4.4 Zero padding, unit strides, transposed
    4.4.1 Half (same) padding, transposed
    4.4.2 Full padding, transposed
  4.5 No zero padding, non-unit strides, transposed
  4.6 Zero padding, non-unit strides, transposed

5 Miscellaneous convolutions
  5.1 Dilated convolutions

Chapter 1

Introduction

Deep convolutional neural networks (CNNs) have been at the heart of spectacular advances in deep learning. Although CNNs have been used as early as the nineties to solve character recognition tasks (Le Cun et al., 1997), their current widespread application is due to much more recent work, when a deep CNN was used to beat the state of the art in the ImageNet image classification challenge (Krizhevsky et al., 2012).

Convolutional neural networks therefore constitute a very useful tool for machine learning practitioners. However, learning to use CNNs for the first time is generally an intimidating experience. A convolutional layer's output shape is affected by the shape of its input as well as the choice of kernel shape, zero padding and strides, and the relationship between these properties is not trivial to infer. This contrasts with fully-connected layers, whose output size is independent of the input size. Additionally, CNNs also usually feature a pooling stage, adding yet another level of complexity with respect to fully-connected networks. Finally, so-called transposed convolutional layers (also known as fractionally strided convolutional layers) have been employed in more and more work as of late (Zeiler et al., 2011; Zeiler and Fergus, 2014; Long et al., 2015; Radford et al., 2015; Visin et al., 2015; Im et al., 2016), and their relationship with convolutional layers has been explained with various degrees of clarity.

This guide's objective is twofold:

1. Explain the relationship between convolutional layers and transposed convolutional layers.

2. Provide an intuitive understanding of the relationship between input shape, kernel shape, zero padding, strides and output shape in convolutional, pooling and transposed convolutional layers (a minimal sketch of this relationship is given after this list).

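As a small taste of the second objective, the following Python sketch computes a convolutional layer's output size along one axis from its input size i, kernel size k, zero padding p and strides s, using the general relationship this guide derives in Chapter 2 (the helper name is ours, chosen for illustration):

    import math

    def conv_output_size(i, k, p=0, s=1):
        # Output size of a discrete convolution along one axis:
        #   o = floor((i + 2p - k) / s) + 1
        return math.floor((i + 2 * p - k) / s) + 1

    # A 5x5 input convolved with a 3x3 kernel, no zero padding and
    # unit strides, produces a 3x3 output:
    assert conv_output_size(5, 3, p=0, s=1) == 3
    # Half (same) padding with a 3x3 kernel preserves the input size:
    assert conv_output_size(5, 3, p=1, s=1) == 5
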
In order to remain broadly applicable, the results shown in this guide are independent of implementation details and apply to all commonly used machine learning frameworks, such as Theano (Bergstra et al., 2010; Bastien et al., 2012), Torch (Collobert et al., 2011), Tensorflow (Abadi et al., 2015) and Caffe (Jia et al., 2014).

This chapter briefly reviews the main building blocks of CNNs, namely discrete convolutions and pooling. For an in-depth treatment of the subject, see Chapter 9 of the Deep Learning textbook (Goodfellow et al., 2016).

1.1 Discrete convolutions

The bread and butter of neural networks is the affine transformation: a vector is received as input and is multiplied with a matrix to produce an output (to which a bias vector is usually added before passing the result through a non-linearity). This is applicable to any type of input, be it an image, a sound clip or an unordered collection of features: whatever their dimensionality, their representation can always be flattened into a vector before the transformation.
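
For concreteness, here is a minimal numpy sketch of such an affine transformation; the shapes, the zero bias and the ReLU non-linearity are illustrative choices, not prescriptions from this guide:

    import numpy as np

    rng = np.random.default_rng(0)

    x = rng.normal(size=(4, 4)).reshape(-1)  # a 4x4 "image" flattened to a 16-vector
    W = rng.normal(size=(8, 16))             # weight matrix mapping 16 inputs to 8 outputs
    b = np.zeros(8)                          # bias vector

    y = np.maximum(0.0, W @ x + b)           # affine transformation followed by a ReLU
    print(y.shape)                           # (8,)

Note that flattening the image discards its spatial arrangement, which is precisely the observation the next paragraph builds on.
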
Images, sound clips and many other similar kinds of data have an intrinsic structure. More formally, they share these important properties: