Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Dept. IRO, Université de Montréal
P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, Qc, Canada
{bengioy,lerouxni,vincentp,delallea,marcotte}@iro.umontreal.ca
Abstract
Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors.
1 Introduction
The objective of this paper is not to present yet another learning algorithm, but rather to point to a previously unnoticed relation between multi-layer neural networks (NNs), Boosting (Freund and Schapire, 1997) and convex optimization. Its main contributions concern the mathematical analysis of an algorithm that is similar to previously proposed incremental NNs, with L1 regularization on the output weights. This analysis helps to understand the underlying convex optimization problem that one is trying to solve.
This paper was motivated by the unproven conjecture (based on anecdotal experience) that when the number of hidden units is “large”, the resulting average error is rather insensitive to the random initialization of the NN parameters. One way to justify this assertion is that to really stay stuck in a local minimum, one must have second derivatives positive simultaneously in all directions. When the number of hidden units is large, it seems implausible for none of them to offer a descent direction. Although this paper does not prove or disprove the above conjecture, in trying to do so we found an interesting characterization of the optimization problem for NNs as a convex program if the output loss function is convex in the NN output and if the output layer weights are regularized by a convex penalty. More specifically, if the regularization is the L1 norm of the output layer weights, then we show that a “reasonable” solution exists, involving a finite number of hidden units (no more than the number of examples, and in practice typically much less). We present a theoretical algorithm that is reminiscent of Column Generation (Chvátal, 1983), in which hidden neurons are inserted one at a time. Each insertion requires solving a weighted classification problem, very much like in Boosting (Freund and Schapire, 1997) and in particular Gradient Boosting (Mason et al., 2000; Friedman, 2001).
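To make the preceding description concrete, here is a minimal sketch (not the authors' code) of such an incremental procedure: hidden units are inserted one at a time, each new unit is a linear classifier fit to a problem weighted by the current residuals (as in Gradient Boosting), and the output weights are then refit under an L1 penalty, which is the convex sub-problem once the hidden units are fixed. The squared loss, the sign activation, scikit-learn's LogisticRegression and Lasso as stand-ins for the weighted classification and L1-regularized steps, the fixed number of units, and the helper name fit_incremental_nn are illustrative assumptions, not details taken from the paper.

import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

def fit_incremental_nn(X, y, n_units=10, l1_penalty=0.01):
    """Grow a one-hidden-layer NN one unit at a time (illustrative sketch only)."""
    n, d = X.shape
    X_tilde = np.hstack([X, np.ones((n, 1))])   # x~ : input extended with a constant 1
    V = []                                       # hidden-unit weight vectors v_i
    out = None
    residual = y.astype(float).copy()            # residual of current predictor (squared loss)
    for _ in range(n_units):
        labels = np.where(residual >= 0, 1, -1)  # sign of the residual
        if len(np.unique(labels)) < 2:           # nothing left to classify
            break
        # Weighted linear classification: each example weighted by |residual|,
        # standing in for the "weighted sum of errors" step described in the text.
        clf = LogisticRegression(fit_intercept=False)
        clf.fit(X_tilde, labels, sample_weight=np.abs(residual) + 1e-12)
        V.append(np.ravel(clf.coef_))
        # Hidden-unit outputs h_i(x) = sign(v_i . x~) for all units inserted so far.
        H = np.sign(X_tilde @ np.array(V).T)
        # Convex sub-problem: refit all output weights w under an L1 penalty.
        out = Lasso(alpha=l1_penalty).fit(H, y)
        residual = y - out.predict(H)
    return np.array(V), out

# Toy usage:
# rng = np.random.default_rng(0)
# X = rng.normal(size=(200, 5)); y = np.sign(X[:, 0] * X[:, 1])
# V, out = fit_incremental_nn(X, y)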
Neural Networks, Gradient Boosting, and Column Generation
Denote $\tilde{x} \in \mathbb{R}^{d+1}$ the extension of vector $x \in \mathbb{R}^d$ with one element with value 1. What we call “Neural Network” (NN) here is a predictor for supervised learning of the form $\hat{y}(x) = \sum_{i=1}^{m} w_i h_i(x)$ where $x$ is an input vector, $h_i(x)$ is obtained from a linear discriminant function $h_i(x) = s(v_i \cdot \tilde{x})$