Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Dept. IRO, Université de Montréal
P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, Qc, Canada
{bengioy,lerouxni,vincentp,delallea,marcotte}@iro.umontreal.ca
Abstract
Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors.
1 Introduction
The objective of this paper is not to present yet another learning algorithm, but rather to point to a previously unnoticed relation between multi-layer neural networks (NNs), Boosting (Freund and Schapire, 1997) and convex optimization. Its main contributions concern the mathematical analysis of an algorithm that is similar to previously proposed incremental NNs, with L1 regularization on the output weights. This analysis helps to understand the underlying convex optimization problem that one is trying to solve.
This paper was motivated by the unproven conjecture (based on anecdotal experience) that when the number of hidden units is "large", the resulting average error is rather insensitive to the random initialization of the NN parameters. One way to justify this assertion is that to really stay stuck in a local minimum, one must have second derivatives positive simultaneously in all directions. When the number of hidden units is large, it seems implausible for none of them to offer a descent direction. Although this paper does not prove or disprove the above conjecture, in trying to do so we found an interesting characterization of the optimization problem for NNs as a convex program, if the output loss function is convex in the NN output and if the output layer weights are regularized by a convex penalty. More specifically, if the regularization is the L1 norm of the output layer weights, then we show that a "reasonable" solution exists, involving a finite number of hidden units (no more than the number of examples, and in practice typically much less). We present a theoretical algorithm that is reminiscent of Column Generation (Chvátal, 1983), in which hidden neurons are inserted one at a time. Each insertion requires solving a weighted classification problem, very much like in Boosting (Freund and Schapire, 1997) and in particular Gradient Boosting (Mason et al., 2000; Friedman, 2001).
Neural Networks, Gradient Boosting, and Column Generation
Denote $\tilde{x} \in \mathbb{R}^{d+1}$ the extension of vector $x \in \mathbb{R}^d$ with one element with value 1. What we call a "Neural Network" (NN) here is a predictor for supervised learning of the form
$$\hat{y}(x) = \sum_{i=1}^m w_i h_i(x)$$
where $x$ is an input vector and $h_i(x)$ is obtained from a linear discriminant function $h_i(x) = s(v_i \cdot \tilde{x})$ with e.g. $s(a) = \mathrm{sign}(a)$, $s(a) = \tanh(a)$ or $s(a) = \frac{1}{1+e^{-a}}$. A learning algorithm must specify how to select $m$, the $w_i$'s and the $v_i$'s. The classical solution (Rumelhart, Hinton and Williams, 1986) involves (a) selecting a loss function $Q(\hat{y}, y)$ that specifies how to penalize for mismatches between $\hat{y}(x)$ and the observed $y$'s (target output or target class), (b) optionally selecting a regularization penalty that favors "small" parameters, and (c) choosing a method to approximately minimize the sum of the losses on the training data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ plus the regularization penalty. Note that in this formulation, an output non-linearity can still be used, by inserting it in the loss function $Q$. Examples of such loss functions are the quadratic loss $\|\hat{y} - y\|^2$, the hinge loss $\max(0, 1 - y\hat{y})$ (used in SVMs), the cross-entropy loss $-y \log \hat{y} - (1-y)\log(1-\hat{y})$ (used in logistic regression), and the exponential loss $e^{-y\hat{y}}$ (used in Boosting).
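To make the notation concrete, here is a minimal numpy sketch (ours, not from the paper) of the predictor $\hat{y}(x) = \sum_i w_i s(v_i \cdot \tilde{x})$ and a few of the losses listed above; all function names and shapes are illustrative assumptions.

```python
import numpy as np

def extend(x):
    """Append the constant 1 to x, giving x_tilde in R^{d+1}."""
    return np.append(x, 1.0)

def nn_predict(x, V, w, s=np.tanh):
    """y_hat(x) = sum_{i=1}^m w_i * s(v_i . x_tilde).
    V has shape (m, d+1): one hidden weight vector v_i per row; w has shape (m,)."""
    return w @ s(V @ extend(x))

# Loss functions Q(y_hat, y); y is in {-1, +1} for the hinge and exponential losses.
quadratic   = lambda y_hat, y: (y_hat - y) ** 2
hinge       = lambda y_hat, y: max(0.0, 1.0 - y * y_hat)
exponential = lambda y_hat, y: np.exp(-y * y_hat)
```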
Gradient Boosting has been introduced in (Friedman, 2001) and (Mason et al., 2000) as a non-parametric greedy-stagewise supervised learning algorithm in which one adds a function at a time to the current solution $\hat{y}(x)$, in a steepest-descent fashion, to form an additive model as above but with the functions $h_i$ typically taken in other kinds of sets of functions, such as those obtained with decision trees. In a stagewise approach, when the $(m+1)$-th basis $h_{m+1}$ is added, only $w_{m+1}$ is optimized (by a line search), like in matching pursuit algorithms. Such a greedy-stagewise approach is also at the basis of Boosting algorithms (Freund and Schapire, 1997), which are usually applied using decision trees as bases and $Q$ the exponential loss.
It may be difficult to minimize exactly for $w_{m+1}$ and $h_{m+1}$ when the previous bases and weights are fixed, so (Friedman, 2001) proposes to "follow the gradient" in function space, i.e., look for a base learner $h_{m+1}$ that is best correlated with the gradient of the average loss on the $\hat{y}(x_i)$ (that would be the residue $\hat{y}(x_i) - y_i$ in the case of the square loss). The algorithm analyzed here also involves maximizing the correlation between $Q'$ (the derivative of $Q$ with respect to its first argument, evaluated on the training predictions) and the next basis $h_{m+1}$. However, we follow a "stepwise", less greedy, approach, in which all the output weights are optimized at each step, in order to obtain convergence guarantees.
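The stagewise/stepwise distinction can be summarized in a few lines of code. The sketch below (our own, for the square loss only, with all names illustrative) shows the stagewise step: pick the basis most correlated with the loss gradient $Q'$, then line-search its weight alone; a stepwise variant would re-solve for all output weights after each addition.

```python
import numpy as np

def gradient_boosting_step(H_outputs, y_hat, y):
    """One stagewise step for Q(y_hat, y) = (y_hat - y)^2, so Q' = 2*(y_hat - y).
    H_outputs[t, i] = h_i(x_t): precomputed outputs of a finite pool of bases."""
    q = 2.0 * (y_hat - y)                 # Q'(y_hat(x_t), y_t), the "residue" direction
    scores = np.abs(H_outputs.T @ q)      # correlation of each basis with the gradient
    j = int(np.argmax(scores))            # basis best correlated with the gradient
    h = H_outputs[:, j]
    w_j = -(h @ q) / (2.0 * (h @ h))      # exact line search for the square loss
    return j, w_j, y_hat + w_j * h

# A stepwise approach, as followed in this paper, would instead re-optimize *all*
# output weights (e.g., by least squares on the selected columns) after adding h_j.
```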
Our approach adapts the Column Generation principle (Chvátal, 1983), a decomposition
technique initially proposed for solving linear programs with many variables and few con-
straints. In this framework, active variables, or “columns”, are only generated as they are
required to decrease the objective. In several implementations, the column-generation sub-
problem is frequently a combinatorial problem for which efficient algorithms are available.
In our case, the subproblem corresponds to determining an “optimal” linear classifier.
2 Core Ideas
Informally, consider the set $H$ of all possible hidden unit functions (i.e., of all possible hidden unit weight vectors $v_i$). Imagine a NN that has all the elements in this set as hidden units. We might want to impose precision limitations on those weights to obtain either a countable or even a finite set. For such a NN, we only need to learn the output weights. If we end up with a finite number of non-zero output weights, we will have at the end an ordinary feedforward NN. This can be achieved by using a regularization penalty on the output weights that yields sparse solutions, such as the L1 penalty. If in addition the loss function is convex in the output layer weights (which is the case of squared error, hinge loss, $\epsilon$-tube regression loss, and logistic or softmax cross-entropy), then it is easy to show that the overall training criterion is convex in the parameters (which are now only the output weights). The only problem is that there are as many variables in this convex program as there are elements in the set $H$, which may be very large (possibly infinite). However, we find that with L1 regularization, a finite solution is obtained, and that such a solution can be obtained by greedily inserting one hidden unit at a time. Furthermore, it is theoretically possible to check that the global optimum has been reached.
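The core idea can be illustrated directly with a finite pool of hidden units: once the $v_i$'s are fixed, the problem in the output weights is an ordinary L1-penalized convex regression, and most weights come out exactly zero. The following sketch (ours, using scikit-learn's Lasso for the convex problem; the random sign units, sizes, and toy target are illustrative assumptions) shows this:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, m = 100, 2, 5000                    # m = |H|: size of the candidate pool of hidden units
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] * X[:, 1])            # a toy non-linearly separable target

V = rng.normal(size=(m, d + 1))           # candidate hidden-unit weight vectors v_i
X_tilde = np.hstack([X, np.ones((n, 1))])
H = np.sign(X_tilde @ V.T)                # activations h_i(x_t), shape (n, m)

# Convex problem in the output weights only: squared loss + L1 penalty.
lasso = Lasso(alpha=0.05, fit_intercept=False).fit(H, y)
active = np.flatnonzero(lasso.coef_)
print(f"{len(active)} active hidden units out of {m}")  # typically far fewer than n
```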
Definition 2.1. Let $H$ be a set of functions from an input space $X$ to $\mathbb{R}$. Elements of $H$ can be understood as "hidden units" in a NN. Let $W$ be the Hilbert space of functions from $H$ to $\mathbb{R}$, with an inner product denoted by $a \cdot b$ for $a, b \in W$. An element of $W$ can be understood as the output weights vector in a neural network. Let $h(x): H \to \mathbb{R}$ be the function that maps any element $h_i$ of $H$ to $h_i(x)$. $h(x)$ can be understood as the vector of activations of hidden units when input $x$ is observed. Let $w \in W$ represent a parameter (the output weights). The NN prediction is denoted $\hat{y}(x) = w \cdot h(x)$. Let $Q: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a cost function convex in its first argument that takes a scalar prediction $\hat{y}(x)$ and a scalar target value $y$ and returns a scalar cost. This is the cost to be minimized on example pair $(x, y)$. Let $D = \{(x_i, y_i) : 1 \le i \le n\}$ be a training set. Let $\Omega: W \to \mathbb{R}$ be a convex regularization functional that penalizes for the choice of more "complex" parameters (e.g., $\Omega(w) = \|w\|_1$ according to a 1-norm in $W$, if $H$ is countable). We define the convex NN criterion $C(H, Q, \Omega, D, w)$ with parameter $w$ as follows:
$$C(H, Q, \Omega, D, w) = \Omega(w) + \sum_{t=1}^n Q(w \cdot h(x_t), y_t). \quad (1)$$
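To make criterion (1) concrete for a finite $H$, here is a small sketch (our own, not from the paper) that evaluates $C$ given a matrix of hidden-unit activations; $Q$ and $\Omega$ are passed in as callables:

```python
import numpy as np

def convex_nn_criterion(w, H_outputs, y, Q, Omega):
    """C(H, Q, Omega, D, w) = Omega(w) + sum_t Q(w . h(x_t), y_t).
    H_outputs[t, i] = h_i(x_t); w is the output weight vector."""
    y_hat = H_outputs @ w
    return Omega(w) + sum(Q(p, t) for p, t in zip(y_hat, y))

# Example: hinge loss and a 1-norm penalty, both convex, so C is convex in w.
Q = lambda y_hat, y: max(0.0, 1.0 - y * y_hat)
Omega = lambda w: np.abs(w).sum()
```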
The following is a trivial lemma, but it is conceptually very important as it is the basis for the rest of the analysis in this paper.
Lemma 2.2. The convex NN cost $C(H, Q, \Omega, D, w)$ is a convex function of $w$.
Proof. $Q(w \cdot h(x_t), y_t)$ is convex in $w$ and $\Omega$ is convex in $w$, by the above construction. $C$ is additive in $Q(w \cdot h(x_t), y_t)$ and additive in $\Omega$. Hence $C$ is convex in $w$.
Note that there are no constraints in this convex optimization program, so that at the global minimum all the partial derivatives of $C$ with respect to elements of $w$ cancel.
Let $|H|$ be the cardinality of the set $H$. If it is not finite, it is not obvious that an optimal solution can be achieved in finitely many iterations.
Lemma 2.2 says that training NNs from a very large class (with one or more hidden layers) can be seen as a convex optimization problem, usually in a very high dimensional space, as long as we allow the number of hidden units to be selected by the learning algorithm. By choosing a regularizer that promotes sparse solutions, we obtain a solution that has a finite number of "active" hidden units (non-zero entries in the output weights vector $w$). This assertion is proven below, in Theorem 3.1, for the case of the hinge loss.
However, even if the solution involves a finite number of active hidden units, the convex optimization problem could still be computationally intractable because of the large number of variables involved. One approach to this problem is to apply the principles already successfully embedded in Gradient Boosting, but more specifically in Column Generation (an optimization technique for very large scale linear programs), i.e., add one hidden unit at a time in an incremental fashion. The important ingredient here is a way to know that we have reached the global optimum, thus not requiring to actually visit all the possible hidden units. We show that this can be achieved as long as we can solve the sub-problem of finding a linear classifier that minimizes the weighted sum of classification errors. This can be done exactly only on low-dimensional data sets, but can be well approximated using weighted linear SVMs, weighted logistic regression, or Perceptron-type algorithms, as sketched below.
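For instance, the weighted-classification sub-problem can be approximated with weighted logistic regression via scikit-learn's sample_weight argument, with targets $\mathrm{sign}(q_t)$ and weights $|q_t|$. This sketch is ours and only approximate (the exact problem is NP-hard in general, as discussed in section 4):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def approx_weighted_classifier(X, q):
    """Approximately maximize sum_t q_t h(x_t) over linear classifiers h,
    by fitting targets sign(q_t) with per-example weights |q_t|."""
    targets = np.where(q >= 0, 1, -1)
    clf = LogisticRegression().fit(X, targets, sample_weight=np.abs(q))
    return lambda X_new: clf.predict(X_new)  # h(x) in {-1, +1}
```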
Another idea (not followed up here) would be to consider first a smaller set $H_1$, for which the convex problem can be solved in polynomial time, and whose solution can theoretically be selected as initialization for minimizing the criterion $C(H_2, Q, \Omega, D, w)$, with $H_1 \subset H_2$, and where $H_2$ may have infinite cardinality (countable or not). In this way we could show that we can find a solution whose cost satisfies $C(H_2, Q, \Omega, D, w) \le C(H_1, Q, \Omega, D, w)$, i.e., is at least as good as the solution of a more restricted convex optimization problem. The second minimization can be performed with a local descent algorithm, without the necessity to guarantee that the global optimum will be found.
3 Finite Number of Hidden Neurons
In this section we consider the special case with $Q(\hat{y}, y) = \max(0, 1 - y\hat{y})$ the hinge loss, and L1 regularization, and we show that the global optimum of the convex cost involves at most $n+1$ hidden neurons, using an approach already exploited in (Rätsch, Demiriz and Bennett, 2002) for L1-loss regression Boosting with L1 regularization of output weights.
The training criterion is $C(w) = K\|w\|_1 + \sum_{t=1}^n \max(0, 1 - y_t\, w \cdot h(x_t))$. Let us rewrite this cost function as the constrained optimization problem:
$$\min_{w, \xi}\ L(w, \xi) = K\|w\|_1 + \sum_{t=1}^n \xi_t \quad \text{s.t.} \quad y_t [w \cdot h(x_t)] \ge 1 - \xi_t \quad (C_1)$$
$$\text{and} \quad \xi_t \ge 0, \quad t = 1, \ldots, n \quad (C_2)$$
Using a standard technique, the above program can be recast as a linear program. Defining $\lambda = (\lambda_1, \ldots, \lambda_n)$ the vector of Lagrangian multipliers for the constraints $C_1$, its dual problem $(P)$ takes the form (in the case of a finite number $J$ of base learners):
$$(P): \quad \max_{\lambda} \sum_{t=1}^n \lambda_t \quad \text{s.t.} \quad \lambda \cdot Z_i \le K,\ i \in I \quad (C_1') \quad \text{and} \quad \lambda_t \le 1,\ t = 1, \ldots, n \quad (C_2')$$
with $(Z_i)_t = y_t h_i(x_t)$. In the case of a finite number $J$ of base learners, $I = \{1, \ldots, J\}$. If the number of hidden units is uncountable, then $I$ is a closed bounded interval of $\mathbb{R}$.
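Before checking the semi-infinite programming conditions, note that for a finite pool of $J$ base learners the dual $(P)$ is an ordinary LP. A hedged sketch with scipy (ours; variable names and the use of linprog are our assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import linprog

def solve_dual(Z, K):
    """max sum(lambda) s.t. Z @ lambda <= K (one row per base learner, (Z_i)_t = y_t h_i(x_t)),
    0 <= lambda_t <= 1. linprog minimizes, so we negate the objective."""
    J, n = Z.shape
    res = linprog(c=-np.ones(n), A_ub=Z, b_ub=np.full(J, K), bounds=(0.0, 1.0))
    return res.x

# Base learners whose constraints are tight (lambda . Z_i = K) correspond to hidden
# units active in the primal; Theorem 3.1 below says n + 1 of them suffice.
```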
Such an optimization problem satisfies all the conditions needed for using Theorem 4.2 from (Hettich and Kortanek, 1993). Indeed:
- $I$ is compact (as a closed bounded interval of $\mathbb{R}$);
- $F: \lambda \mapsto \sum_{t=1}^n \lambda_t$ is a concave function (it is even a linear function);
- $g: (\lambda, i) \mapsto \lambda \cdot Z_i - K$ is convex in $\lambda$ (it is actually linear in $\lambda$);
- $\nu(P) \le n$ (therefore finite), where $\nu(P)$ is the largest value of $F$ satisfying the constraints;
- for every set of $n+1$ points $i_0, \ldots, i_n \in I$, there exists $\tilde{\lambda}$ such that $g(\tilde{\lambda}, i_j) < 0$ for $j = 0, \ldots, n$ (one can take $\tilde{\lambda} = 0$ since $K > 0$).
Then, from Theorem 4.2 from (Hettich and Kortanek, 1993), the following theorem holds:
Theorem 3.1. The solution of $(P)$ can be attained with all the constraints $C_2'$ and only $n+1$ of the constraints $C_1'$ (i.e., there exists a subset of $n+1$ constraints $C_1'$ giving rise to the same maximum as when using the whole set of constraints). Therefore, the associated primal problem is the minimization of the cost function of a NN with $n+1$ hidden neurons.
4 Incremental Convex NN Algorithm
In this section we present a stepwise algorithm to optimize a NN, and show that there is a criterion that allows one to verify whether the global optimum has been reached. This is a specialization of minimizing $C(H, Q, \Omega, D, w)$, with $\Omega(w) = \lambda\|w\|_1$ and $H = \{h : h(x) = s(v \cdot \tilde{x})\}$ the set of soft or hard linear classifiers (depending on the choice of $s(\cdot)$).
Algorithm ConvexNN($D$, $Q$, $\lambda$, $s$)
Input: training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, convex loss function $Q$, and scalar regularization penalty $\lambda$. $s$ is either the sign function or the tanh function.
(1) Set $v_1 = (0, 0, \ldots, 1)$ and select $w_1 = \mathrm{argmin}_{w_1} \sum_t Q(w_1 s(1), y_t) + \lambda |w_1|$.
(2) Set $i = 2$.
(3) Repeat:
(4)   Let $q_t = Q'(\sum_{j=1}^{i-1} w_j h_j(x_t), y_t)$.
(5)   If $s = \mathrm{sign}$:
(5a)    train linear classifier $h_i(x) = \mathrm{sign}(v_i \cdot \tilde{x})$ with examples $\{(x_t, \mathrm{sign}(q_t))\}$ and errors weighted by $|q_t|$, $t = 1 \ldots n$ (i.e., maximize $\sum_t q_t h_i(x_t)$)
(5b)  else ($s = \tanh$):
(5c)    train linear classifier $h_i(x) = \tanh(v_i \cdot \tilde{x})$ to maximize $\sum_t q_t h_i(x_t)$.
(6)   If $\sum_t q_t h_i(x_t) < \lambda$, stop.
(7)   Select $w_1, \ldots, w_i$ (and optionally $v_2, \ldots, v_i$) minimizing (exactly or approximately) $C = \sum_t Q(\sum_{j=1}^i w_j h_j(x_t), y_t) + \lambda \sum_{j=1}^i |w_j|$ such that $\frac{\partial C}{\partial w_j} = 0$ for $j = 1 \ldots i$.
(8) Return the predictor $\hat{y}(x) = \sum_{j=1}^i w_j h_j(x)$.
A key property of the above algorithm is that, at termination, the global optimum is reached, i.e., no hidden unit (linear classifier) can improve the objective. In the case where $s = \mathrm{sign}$, we obtain a Boosting-like algorithm, i.e., it involves finding a classifier which minimizes the weighted cost $\sum_t q_t\, \mathrm{sign}(v \cdot \tilde{x}_t)$.
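The following is a compact, hedged implementation of the above steps for the squared loss $Q(\hat{y}, y) = (\hat{y} - y)^2$ and $s = \mathrm{sign}$. It is a sketch, not the authors' code: step (5a) is approximated with weighted logistic regression, and step (7) uses scikit-learn's Lasso on the hidden activations (whose alpha must be rescaled, since sklearn minimizes $\frac{1}{2n}\|y - Hw\|^2 + \alpha\|w\|_1$).

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

def convex_nn(X, y, lam, max_units=50):
    """Sketch of ConvexNN with Q = squared loss, s = sign."""
    n = X.shape[0]
    X_tilde = np.hstack([X, np.ones((n, 1))])
    H = np.ones((n, 1))                      # step (1): v_1 = (0,...,0,1) gives h_1(x) = sign(1) = 1
    lasso = Lasso(alpha=lam / (2 * n), fit_intercept=False)
    w = lasso.fit(H, y).coef_                # step (1): optimal w_1
    while H.shape[1] < max_units:            # step (3): repeat
        q = 2.0 * (H @ w - y)                # step (4): q_t = Q'(y_hat(x_t), y_t)
        targets = np.where(q >= 0, 1, -1)
        if np.all(targets == targets[0]):
            break                            # degenerate case: nothing left to separate
        clf = LogisticRegression().fit(      # step (5a): weighted linear classifier,
            X_tilde, targets,                # targets sign(q_t), weights |q_t|
            sample_weight=np.abs(q))
        h_new = np.where(clf.decision_function(X_tilde) >= 0, 1.0, -1.0)
        if q @ h_new < lam:                  # step (6): no unit can improve C -> stop
            break
        H = np.column_stack([H, h_new])      # insert the new hidden unit
        w = lasso.fit(H, y).coef_            # step (7): re-optimize all output weights
    return H, w                              # step (8): y_hat(x) = sum_j w_j h_j(x)
```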
Theorem 4.1. Algorithm ConvexNN stops when it reaches the global optimum of $C(w) = \sum_t Q(w \cdot h(x_t), y_t) + \lambda\|w\|_1$.
Proof. Let $w$ be the output weights vector when the algorithm stops. Because the set of hidden units $H$ we consider is such that when $h$ is in $H$, $-h$ is also in $H$, we can assume all weights to be non-negative. By contradiction, if $w' \ne w$ is the global optimum, with $C(w') < C(w)$, then, since $C$ is convex in the output weights, for any $\epsilon \in (0, 1)$, we have $C(\epsilon w' + (1-\epsilon)w) \le \epsilon C(w') + (1-\epsilon) C(w) < C(w)$. Let $w_\epsilon = \epsilon w' + (1-\epsilon) w$. For $\epsilon$ small enough, we can assume all weights in $w$ that are strictly positive to be also strictly positive in $w_\epsilon$. Let us denote by $I_p$ the set of strictly positive weights in $w$ (and $w_\epsilon$), by $I_z$ the set of weights set to zero in $w$ but to a non-zero value in $w_\epsilon$, and by $\delta_k$ the difference $w_{\epsilon,k} - w_k$ in the weight of hidden unit $h_k$ between $w$ and $w_\epsilon$. We can assume $\delta_j < 0$ for $j \in I_z$, because instead of setting a small positive weight to $h_j$, one can decrease the weight of $-h_j$ by the same amount, which will give either the same cost, or possibly a lower one when the weight of $-h_j$ is positive. With $o(\epsilon)$ denoting a quantity such that $\epsilon^{-1} o(\epsilon) \to 0$ when $\epsilon \to 0$, the difference $\Delta(w) = C(w_\epsilon) - C(w)$ can now be written:
$$\Delta(w) = \lambda\left(\|w_\epsilon\|_1 - \|w\|_1\right) + \sum_t \left( Q(w_\epsilon \cdot h(x_t), y_t) - Q(w \cdot h(x_t), y_t) \right)$$
$$= \lambda \left( \sum_{i \in I_p} \delta_i - \sum_{j \in I_z} \delta_j \right) + \sum_t \sum_k Q'(w \cdot h(x_t), y_t)\, \delta_k h_k(x_t) + o(\epsilon)$$
$$= \sum_{i \in I_p} \delta_i \left( \lambda + \sum_t q_t h_i(x_t) \right) + \sum_{j \in I_z} \delta_j \left( -\lambda + \sum_t q_t h_j(x_t) \right) + o(\epsilon)$$
$$= \sum_{i \in I_p} \delta_i \frac{\partial C}{\partial w_i}(w) + \sum_{j \in I_z} \delta_j \left( -\lambda + \sum_t q_t h_j(x_t) \right) + o(\epsilon)$$
$$= 0 + \sum_{j \in I_z} \delta_j \left( -\lambda + \sum_t q_t h_j(x_t) \right) + o(\epsilon)$$
since for $i \in I_p$, thanks to step (7) of the algorithm, we have $\frac{\partial C}{\partial w_i}(w) = 0$. Thus the inequality $\frac{1}{\epsilon}\Delta(w) < 0$ rewrites into
$$\sum_{j \in I_z} \frac{\delta_j}{\epsilon} \left( -\lambda + \sum_t q_t h_j(x_t) \right) + \frac{1}{\epsilon}\, o(\epsilon) < 0$$
which, when $\epsilon \to 0$, yields (note that $\frac{\delta_j}{\epsilon}$ does not depend on $\epsilon$, since $\delta_j$ is linear in $\epsilon$):
$$\sum_{j \in I_z} \frac{\delta_j}{\epsilon} \left( -\lambda + \sum_t q_t h_j(x_t) \right) \le 0 \quad (2)$$
But, $h_i$ being the optimal classifier chosen in step (5a) or (5c), all hidden units $h_j$ verify $\sum_t q_t h_j(x_t) \le \sum_t q_t h_i(x_t) < \lambda$, and $\forall j \in I_z$, $\delta_j \left( -\lambda + \sum_t q_t h_j(x_t) \right) > 0$ (since $\delta_j < 0$), contradicting eq. 2.
(Mason et al., 2000) prove a related global convergence result for the AnyBoost algorithm, a non-parametric Boosting algorithm that is also similar to Gradient Boosting (Friedman, 2001). Again, this requires solving as a sub-problem an exact minimization to find a function $h_i \in H$ that is maximally correlated with the gradient $Q'$ on the output. We now show a simple procedure to select a hyperplane with the best weighted classification error.
Exact Minimization
In step (5a) we are required to find a linear classifier that minimizes the weighted sum of classification errors. Unfortunately, this is an NP-hard problem (w.r.t. $d$, see theorem 4 in (Marcotte and Savard, 1992)). However, an exact solution can be easily found in $O(n^3)$ computations for $d = 2$ inputs.
Proposition 4.2. Finding a linear classifier that minimizes the weighted sum of classification errors can be achieved in $O(n^3)$ steps when the input dimension is $d = 2$.
Proof. We want to maximize $\sum_i c_i\, \mathrm{sign}(u \cdot x_i + b)$ with respect to $u$ and $b$, the $c_i$'s being in $\mathbb{R}$. Consider $u$ fixed, sort the $x_i$'s according to their dot product with $u$, and denote $r$ the function which maps $i$ to $r(i)$ such that $x_{r(i)}$ is in $i$-th position in the sort. Depending on the value of $b$, we will have $n+1$ possible sums, respectively $-\sum_{i=1}^k c_{r(i)} + \sum_{i=k+1}^n c_{r(i)}$, $k = 0, \ldots, n$. It is obvious that those sums only depend on the order of the products $u \cdot x_i$, $i = 1, \ldots, n$. When $u$ varies smoothly on the unit circle, as the dot product is a continuous function of its arguments, the changes in the order of the dot products will occur only when there is a pair $(i, j)$ such that $u \cdot x_i = u \cdot x_j$. Therefore, there are at most as many order changes as there are pairs of different points, i.e., $n(n-1)/2$. In the case of $d = 2$, we can enumerate all the different angles for which there is a change, namely $a_1, \ldots, a_z$ with $z \le n(n-1)/2$. We then need to test at least one $u = [\cos\theta, \sin\theta]$ for each interval $a_i < \theta < a_{i+1}$, and also one $u$ for $\theta < a_1$, which makes a total of $n(n-1)/2$ possibilities.
It is possible to generalize this result in higher dimensions, and as shown in (Marcotte and
Savard, 1992), one can achieve O(log(n)nd )time.
Algorithm 1: Optimal linear classifier search
Maximizing $\sum_{i=1}^n c_i\, \Phi(\mathrm{sign}(w \cdot x_i), y_i)$ in dimension 2
(1) for $i = 1, \ldots, n$
(2)   for $j = i+1, \ldots, n$
(3)     $\theta_{i,j} = \varphi(x_i, x_j) + \frac{\pi}{2}$, where $\varphi(x_i, x_j)$ is the angle between $x_i$ and $x_j$
(6) sort the $\theta_{i,j}$ in increasing order
(7) $w_0 = (1, 0)$
(8) for $k = 1, \ldots, \frac{n(n-1)}{2}$
(9)   $w_k = (\cos\theta_{i,j}, \sin\theta_{i,j})$, $u_k = \frac{w_k + w_{k-1}}{2}$
(10)  sort the $x_i$ according to the value of $u_k \cdot x_i$
(11)  compute $S(u_k) = \sum_{i=1}^n c_i\, \Phi(\mathrm{sign}(u_k \cdot x_i), y_i)$
(12) output: $\mathrm{argmax}_{u_k} S(u_k)$
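A runnable Python sketch of this exhaustive 2-D search follows (ours, not the authors' code). It maximizes $\sum_i c_i\, \mathrm{sign}(u \cdot x_i + b)$ as in the proof, with a bias handled via the $n+1$ threshold positions and prefix sums; the epsilon offsets and the specific bias placement are our implementation choices.

```python
import numpy as np

def best_linear_classifier_2d(X, c):
    """X: (n, 2) inputs, c: (n,) signed example weights.
    Exhaustively maximizes sum_i c_i * sign(u . x_i + b) over (u, b)."""
    n = X.shape[0]
    # Directions at which the projection order changes: orthogonal to x_i - x_j.
    diffs = [X[i] - X[j] for i in range(n) for j in range(i + 1, n)]
    angles = np.sort([np.arctan2(d[1], d[0]) + np.pi / 2 for d in diffs])
    # One test direction inside each interval between consecutive angles, plus one
    # before the first; the +pi copies cover the opposite half of the circle.
    mids = np.concatenate([[angles[0] - 1e-3], (angles[:-1] + angles[1:]) / 2])
    candidates = np.concatenate([mids, mids + np.pi])
    best_score, best_u, best_b = -np.inf, None, None
    for theta in candidates:
        u = np.array([np.cos(theta), np.sin(theta)])
        proj = X @ u
        order = np.argsort(proj)
        # prefix[k] = sum of the k smallest-projection weights; assigning those
        # points sign -1 and the rest +1 gives score total - 2 * prefix[k].
        prefix = np.concatenate([[0.0], np.cumsum(c[order])])
        scores = prefix[-1] - 2.0 * prefix
        k = int(np.argmax(scores))
        if scores[k] > best_score:
            sp = proj[order]
            if k == 0:
                b = -(sp[0] - 1.0)              # everything on the positive side
            elif k == n:
                b = -(sp[-1] + 1.0)             # everything on the negative side
            else:
                b = -(sp[k - 1] + sp[k]) / 2.0  # threshold between the k-th and (k+1)-th points
            best_score, best_u, best_b = scores[k], u, b
    return best_u, best_b, best_score
```

For clarity this version re-sorts for each direction, giving $O(n^3 \log n)$ overall; an incremental update of the sort order at each angle change recovers the $O(n^3)$ bound of Proposition 4.2.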
Approximate Minimization
For data in higher dimensions, the exact minimization scheme to find the optimal linear classifier is not practical. Therefore it is interesting to consider approximate schemes for obtaining a linear classifier with weighted costs. Popular schemes for doing so are the linear SVM (i.e., linear classifier with hinge loss), the logistic regression classifier, and variants of the Perceptron algorithm. In that case, step (5c) of the algorithm is not an exact minimization, and one cannot guarantee that the global optimum will be reached. However, it might be reasonable to believe that finding a linear classifier by minimizing a weighted hinge loss should yield solutions close to the exact minimization. Unfortunately, this is not generally true, as we have found out on a simple toy data set described below. On the other hand, if in step (7) one performs an optimization not only of the output weights $w_j$ ($j \le i$) but also of the corresponding weight vectors $v_j$, then the algorithm finds a solution close to the global optimum (we could only verify this on 2-D data sets, where the exact solution can be computed easily). It means that at the end of each stage, one first performs a few training iterations of the whole NN (for the hidden units $j \le i$) with an ordinary gradient descent mechanism (we used conjugate gradients, but stochastic gradient descent would work too), optimizing the $w_j$'s and the $v_j$'s, and then one fixes the $v_j$'s and obtains the optimal $w_j$'s for these $v_j$'s (using a convex optimization procedure). In our experiments we used a quadratic $Q$, for which the optimization of the output weights can be done with a neural network, using the outputs of the hidden layer as inputs. A sketch of this two-phase step follows.
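The sketch below (ours; the paper used conjugate gradients, while this uses plain gradient descent with an illustrative learning rate) runs a few joint gradient steps on both the $v_j$'s and $w_j$'s for tanh units and quadratic loss, then freezes the $v_j$'s and re-solves the convex problem in the output weights:

```python
import numpy as np
from sklearn.linear_model import Lasso

def finetune_then_refit(X_tilde, y, V, w, lam, lr=0.01, iters=20):
    """V: (m, d+1) hidden weight vectors v_j; w: (m,) output weights."""
    n = X_tilde.shape[0]
    for _ in range(iters):                      # phase 1: joint (approximate) descent
        A = np.tanh(X_tilde @ V.T)              # activations h_j(x_t), shape (n, m)
        r = A @ w - y                           # residuals for the quadratic loss
        grad_w = 2.0 * (A.T @ r) + lam * np.sign(w)          # d/dw of sum r^2 + lam||w||_1
        grad_V = 2.0 * ((r[:, None] * (1 - A**2) * w).T @ X_tilde)  # d/dV via tanh' = 1 - tanh^2
        V -= lr * grad_V
        w -= lr * grad_w
    A = np.tanh(X_tilde @ V.T)                  # phase 2: freeze the v_j's...
    w = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(A, y).coef_
    return V, w                                 # ...and re-solve the convex problem in w
```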
Let us now consider a bit more carefully what it means to tune the $v_j$'s in step (7). Indeed, changing the weight vector $v_j$ of a selected hidden neuron to decrease the cost is equivalent to a change in the output weights $w$'s. More precisely, consider the step in which the value of $v_j$ becomes $v_j'$. This is equivalent to the following operation on the $w$'s, when $w_j$ is the corresponding output weight value: the output weight associated with the value $v_j$ of a hidden neuron is set to 0, and the output weight associated with the value $v_j'$ of a hidden neuron is set to $w_j$. This corresponds to an exchange between two variables in the convex program. We are justified to take any such step as long as it allows us to decrease the cost $C(w)$. The fact that we are simultaneously making such exchanges on all the hidden units when we tune the $v_j$'s allows us to move faster towards the global optimum.
Extension to multiple outputs
The multiple outputs case is more involved than the single-output case because it is not enough to check the condition $\sum_t h_t q_t > \lambda$. Consider a new hidden neuron whose output is $h_t$ when the input is $x_t$. Let us also denote $\alpha = [\alpha_1, \ldots, \alpha_{n_o}]'$ the vector of output weights between the new hidden neuron and the $n_o$ output neurons. The gradient with respect to $\alpha_j$ is $g_j = \frac{\partial C}{\partial \alpha_j} = \sum_t h_t q_{tj} + \lambda\, \mathrm{sign}(\alpha_j)$, with $q_{tj}$ the value of $Q'$ for the $j$-th output neuron with input $x_t$. This means that if, for a given $j$, we have $|\sum_t h_t q_{tj}| < \lambda$, moving $\alpha_j$ away from 0 can only increase the cost. Therefore, the right quantity to consider is $\left( |\sum_t h_t q_{tj}| - \lambda \right)_+$.
We must therefore find $\mathrm{argmax}_v \sum_j \left( \left( |\sum_t h_t q_{tj}| - \lambda \right)_+ \right)^2$. As before, this sub-problem is not convex, but it is not as obvious how to approximate it by a convex problem. The stopping criterion becomes: if there is no $j$ such that $|\sum_t h_t q_{tj}| > \lambda$, then all weights must remain equal to 0 and a global minimum is reached.
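A short numpy sketch of these quantities (ours; names and shapes are illustrative) makes the multi-output score and stopping test explicit:

```python
import numpy as np

def multi_output_gain(h, q, lam):
    """h: (n,) outputs of a candidate hidden unit; q: (n, n_o) derivatives q_tj."""
    corr = np.abs(h @ q)                  # |sum_t h_t q_tj| for each output j
    gains = np.maximum(corr - lam, 0.0)   # (|.| - lambda)_+ : zero means "leave alpha_j at 0"
    return (gains ** 2).sum()             # the score to maximize over candidate units

def should_stop(h_candidates, q, lam):
    """Global-minimum test: no output j with |sum_t h_t q_tj| > lambda for any candidate."""
    return all(multi_output_gain(h, q, lam) == 0.0 for h in h_candidates)
```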
Experimental Results
We performed experiments on the 2-D double moon toy dataset (as used in (Delalleau, Bengio and Le Roux, 2005)), to be able to compare with the exact version of the algorithm. In these experiments, $Q(w \cdot h(x_t), y_t) = (w \cdot h(x_t) - y_t)^2$. The set-up is the following:
- Select a new linear classifier, either (a) the optimal one or (b) an approximate one using logistic regression.
- Optimize the output weights using a convex optimizer.
- In case (b), tune both input and output weights by conjugate gradient descent on $C$ and finally re-optimize the output weights using LASSO regression.
- Optionally, remove neurons whose output weight has been set to 0.
Using the approximate algorithm yielded, for 100 training examples, an average penalized ($\lambda = 1$) squared error of 17.11 (over 10 runs), an average test classification error of 3.68%, and an average number of neurons of 5.5. The exact algorithm yielded a penalized squared error of 8.09, an average test classification error of 5.3%, and required 3 hidden neurons. A penalty of $\lambda = 1$ was nearly optimal for the exact algorithm, whereas a smaller penalty further improved the test classification error of the approximate algorithm. Besides, when running the approximate algorithm for a long time, it converges to a solution whose quadratic error is extremely close to that of the exact algorithm.
5 Conclusion
We have shown that training a NN can be seen as a convex optimization problem, and have analyzed an algorithm that can exactly or approximately solve this problem. We have shown that the solution with the hinge loss involves a number of non-zero weights bounded by the number of examples, and much smaller in practice. We have shown that there exists a stopping criterion to verify if the global optimum has been reached, but it involves solving a sub-learning problem involving a linear classifier with weighted errors, which can be computationally hard if the exact solution is sought, but can be easily implemented for toy data sets (in low dimension), for comparing exact and approximate solutions.
The above experimental results are in agreement with our initial conjecture: when there are many hidden units we are much less likely to stall in the optimization procedure, because there are many more ways to descend on the convex cost $C(w)$. They also suggest, based on experiments in which we can compare with the exact sub-problem minimization, that applying Algorithm ConvexNN with an approximate minimization for adding each hidden unit, while continuing to tune the previous hidden units, tends to lead to fast convergence to the global minimum. What can get us stuck in a "local minimum" (in the traditional sense, i.e., of optimizing $w$'s and $v$'s together) is simply the inability to find a new hidden unit weight vector that can improve the total cost (fit and regularization term) even if there exists one.
Note that as a side-effect of the results presented here, we have a simple way to train neural networks with hard-threshold hidden units, since increasing $\sum_t Q'(\hat{y}(x_t), y_t)\, \mathrm{sign}(v_i \cdot \tilde{x}_t)$ can be either achieved exactly (at great price) or approximately (e.g., by using a cross-entropy or hinge loss on the corresponding linear classifier).
Acknowledgments
The authors thank the following for support: NSERC, MITACS, and the Canada Research
Chairs. They are also grateful for the feedback and stimulating exchanges with Sam Roweis,
Nathan Srebro, and Aaron Courville.
References
Chvátal, V. (1983). Linear Programming. W.H. Freeman.
Delalleau, O., Bengio, Y., and Le Roux, N. (2005). Efficient non-parametric function induction in semi-supervised learning. In Cowell, R. and Ghahramani, Z., editors, Proceedings of AISTATS 2005, pages 96-103.
Freund, Y. and Schapire, R. E. (1997). A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139.
Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232.
Hettich, R. and Kortanek, K. (1993). Semi-infinite programming: theory, methods, and applications. SIAM Review, 35(3):380-429.
Marcotte, P. and Savard, G. (1992). Novel approaches to the discrimination problem. Zeitschrift für Operations Research (Theory), 36:517-545.
Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. (2000). Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12, pages 512-518.
Rätsch, G., Demiriz, A., and Bennett, K. P. (2002). Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning, 48(1-3):189-218.
Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323:533-536.