A guide to convolution arithmetic for deep learning
The authors of this guide would like to thank David Warde-Farley, Guillaume Alain and Caglar Gulcehre for their valuable feedback. We are likewise grateful to all those who helped improve this tutorial with helpful comments, constructive criticisms and code contributions. Keep them coming!

Special thanks to Ethan Schoonover, creator of the Solarized color scheme, whose colors were used for the figures.
Feedback

Your feedback is welcome! We did our best to be as precise, informative and to the point as possible, but should there be anything you feel might be an error or could be rephrased to be more precise or comprehensible, please don't refrain from contacting us. Likewise, drop us a line if you think there is something that might fit this technical report and you would like us to discuss – we will make our best effort to update this document.
Source code and animations

The code used to generate this guide along with its figures is available on GitHub. There the reader can also find an animated version of the figures.
Contents

1 Introduction
  1.1 Discrete convolutions
  1.2 Pooling

2 Convolution arithmetic
  2.1 No zero padding, unit strides
  2.2 Zero padding, unit strides
    2.2.1 Half (same) padding
    2.2.2 Full padding
  2.3 No zero padding, non-unit strides
  2.4 Zero padding, non-unit strides

3 Pooling arithmetic

4 Transposed convolution arithmetic
  4.1 Convolution as a matrix operation
  4.2 Transposed convolution
  4.3 No zero padding, unit strides, transposed
  4.4 Zero padding, unit strides, transposed
    4.4.1 Half (same) padding, transposed
    4.4.2 Full padding, transposed
  4.5 No zero padding, non-unit strides, transposed
  4.6 Zero padding, non-unit strides, transposed

5 Miscellaneous convolutions
  5.1 Dilated convolutions
Chapter 1

Introduction
Deep convolutional neural networks (CNNs) have been at the heart of spectacular advances in deep learning. Although CNNs have been used as early as the nineties to solve character recognition tasks (Le Cun et al., 1997), their current widespread application is due to much more recent work, when a deep CNN was used to beat the state of the art in the ImageNet image classification challenge (Krizhevsky et al., 2012).

Convolutional neural networks therefore constitute a very useful tool for machine learning practitioners. However, learning to use CNNs for the first time is generally an intimidating experience. A convolutional layer's output shape is affected by the shape of its input as well as the choice of kernel shape, zero padding and strides, and the relationship between these properties is not trivial to infer. This contrasts with fully-connected layers, whose output size is independent of the input size. Additionally, CNNs also usually feature a pooling stage, adding yet another level of complexity with respect to fully-connected networks. Finally, so-called transposed convolutional layers (also known as fractionally strided convolutional layers) have been employed in more and more work as of late (Zeiler et al., 2011; Zeiler and Fergus, 2014; Long et al., 2015; Radford et al., 2015; Visin et al., 2015; Im et al., 2016), and their relationship with convolutional layers has been explained with various degrees of clarity.

This guide's objective is twofold:

1. Explain the relationship between convolutional layers and transposed convolutional layers.

2. Provide an intuitive understanding of the relationship between input shape, kernel shape, zero padding, strides and output shape in convolutional, pooling and transposed convolutional layers.
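As a preview of the arithmetic developed in the following chapters, the relationship between these properties along a single axis can be sketched with the standard output-size formula for a convolution (a minimal illustration; the function name and argument names are ours):

```python
import math

def conv_output_size(i, k, p, s):
    """Output size along one axis of a convolution with input size i,
    kernel size k, zero padding p and stride s."""
    return math.floor((i + 2 * p - k) / s) + 1

# A 5-wide input convolved with a 3-wide kernel, padding 1, stride 2
# yields a 3-wide output.
print(conv_output_size(5, 3, 1, 2))
```

For instance, with i = 5, k = 3, p = 1 and s = 1 (the "half padding" setting of Section 2.2.1), the output size equals the input size.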
In order to remain broadly applicable, the results shown in this guide are independent of implementation details and apply to all commonly used machine learning frameworks, such as Theano (Bergstra et al., 2010; Bastien et al., 2012), Torch (Collobert et al., 2011), Tensorflow (Abadi et al., 2015) and Caffe (Jia et al., 2014).

This chapter briefly reviews the main building blocks of CNNs, namely discrete convolutions and pooling. For an in-depth treatment of the subject, see Chapter 9 of the Deep Learning textbook (Goodfellow et al., 2016).
1.1 Discrete convolutions

The bread and butter of neural networks is the affine transformation: a vector is received as input and is multiplied with a matrix to produce an output (to which a bias vector is usually added before passing the result through a non-linearity). This is applicable to any type of input, be it an image, a sound clip or an unordered collection of features: whatever their dimensionality, their representation can always be flattened into a vector before the transformation.
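This flatten-then-transform pipeline can be sketched as follows (a minimal illustration with made-up shapes; NumPy is assumed, and ReLU stands in for the non-linearity):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 3x3 "image" flattened into a length-9 vector.
x = rng.standard_normal((3, 3)).reshape(-1)

# Affine transformation: multiply by a weight matrix, add a bias,
# then apply a non-linearity (here ReLU).
W = rng.standard_normal((4, 9))
b = np.zeros(4)
y = np.maximum(0.0, W @ x + b)  # output vector of shape (4,)
```

Note that flattening discards the grid structure of the image; exploiting that structure rather than throwing it away is precisely what discrete convolutions are designed for.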
Images, sound clips and many other similar kinds of data have an intrinsic structure. More formally, they share these important properties: