Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Dept. IRO, Université de Montréal
P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, Qc, Canada
{bengioy,lerouxni,vincentp,delallea,marcotte}@iro.umontreal.ca
Abstract
Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors.
1 Introduction
The objective of this paper is not to present yet another learning algorithm, but rather to point to a previously unnoticed relation between multi-layer neural networks (NNs), Boosting (Freund and Schapire, 1997) and convex optimization. Its main contributions concern the mathematical analysis of an algorithm that is similar to previously proposed incremental NNs, with L1 regularization on the output weights. This analysis helps to understand the underlying convex optimization problem that one is trying to solve.

This paper was motivated by the unproven conjecture (based on anecdotal experience) that when the number of hidden units is “large”, the resulting average error is rather insensitive to the random initialization of the NN parameters. One way to justify this assertion is that to really stay stuck in a local minimum, one must have second derivatives positive simultaneously in all directions. When the number of hidden units is large, it seems implausible for none of them to offer a descent direction. Although this paper does not prove or disprove the above conjecture, in trying to do so we found an interesting characterization of the optimization problem for NNs as a convex program if the output loss function is convex in the NN output and if the output layer weights are regularized by a convex penalty. More specifically, if the regularization is the L1 norm of the output layer weights, then we show that a “reasonable” solution exists, involving a finite number of hidden units (no more than the number of examples, and in practice typically much less). We present a theoretical algorithm that is reminiscent of Column Generation (Chvátal, 1983), in which hidden neurons are inserted one at a time. Each insertion requires solving a weighted classification problem, very much like in Boosting (Freund and Schapire, 1997) and in particular Gradient Boosting (Mason et al., 2000; Friedman, 2001).
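To make the flavour of this incremental scheme concrete, the following Python sketch adds one hidden unit per round: a candidate linear discriminant is chosen to align with the current residuals (a crude stand-in for solving the weighted classification subproblem exactly), and its output weight is then set by a closed-form line search under squared loss. This is only an illustrative sketch, not the exact algorithm analyzed in this paper; the random candidate pool, the squared loss, the fixed number of rounds, and the omission of the L1 penalty are all assumptions made for brevity.

import numpy as np

def fit_incremental_nn(X, y, n_rounds=20, seed=0):
    """Illustrative boosting-style training: insert one hidden unit per round."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xt = np.hstack([X, np.ones((n, 1))])        # x~ : append a constant 1
    residual = y.astype(float).copy()           # current residuals y - y^
    V, w = [], []

    for _ in range(n_rounds):
        # Candidate hidden units h(x) = sign(v . x~), scored by how well they
        # align with the residuals (a proxy for the weighted classification step).
        candidates = rng.standard_normal((50, d + 1))
        H = np.sign(Xt @ candidates.T)           # (n, 50) candidate outputs
        scores = np.abs(H.T @ residual)
        v = candidates[np.argmax(scores)]
        h = np.sign(Xt @ v)

        # Closed-form line search for the new output weight under squared loss.
        alpha = (h @ residual) / (h @ h)
        V.append(v)
        w.append(alpha)
        residual -= alpha * h                    # update residuals

    return np.array(V), np.array(w)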
Neural Networks, Gradient Boosting, and Column Generation
Denote $\tilde{x} \in \mathbb{R}^{d+1}$ the extension of vector $x \in \mathbb{R}^d$ with one element with value 1. What we call “Neural Network” (NN) here is a predictor for supervised learning of the form $\hat{y}(x) = \sum_{i=1}^{m} w_i h_i(x)$ where $x$ is an input vector, $h_i(x)$ is obtained from a linear discriminant function $h_i(x) = s(v_i \cdot \tilde{x})$.
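As a small companion to this definition, here is a minimal NumPy sketch of the predictor $\hat{y}(x) = \sum_{i=1}^{m} w_i\, s(v_i \cdot \tilde{x})$. Taking $s$ to be the sign function and treating the hidden weights $v_i$ and output weights $w_i$ as given are assumptions made purely for illustration.

import numpy as np

def extend(X):
    # Append a constant 1 to each row: x -> x~ in R^{d+1}.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def nn_predict(X, V, w, s=np.sign):
    # X: (n, d) inputs, V: (m, d+1) hidden-unit weights v_i, w: (m,) output weights.
    # s is the hidden-unit nonlinearity (sign here, as an assumption).
    H = s(extend(X) @ V.T)    # (n, m) matrix of hidden-unit outputs h_i(x)
    return H @ w              # y^(x) = sum_i w_i h_i(x)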