11.2 Logistic regression

Logistic regression is an example of a binary classifier, where the output takes one of two values 0 or 1 for each data point. We call these two values classes.

Formulation as an optimization problem

Define the sigmoid function

\[S(x)=\frac{1}{1+\exp(-x)}.\]

Next, given an observation \(x\in\real^d\) and a weight vector \(\theta\in\real^d\) we set

\[h_\theta(x)=S(\theta^Tx)=\frac{1}{1+\exp(-\theta^Tx)}.\]

The weight vector \(\theta\) is part of the setup of the classifier. The expression \(h_\theta(x)\) is interpreted as the probability that \(x\) belongs to class 1. When asked to classify \(x\) the classifier returns

\[x\mapsto \begin{cases}1 & h_\theta(x)\geq 1/2, \\ 0 & h_\theta(x)<1/2.\end{cases}\]
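
For illustration, the hypothesis and the decision rule are straightforward to evaluate once \(\theta\) is known. The sketch below is ours and not part of the MOSEK example code; the helper names Sigmoid, Hypothesis and Predict are illustrative, and we assume a using System; directive is in scope.

    // Illustrative helpers (not part of the MOSEK example code)
    public static double Sigmoid(double x)
    {
      // S(x) = 1 / (1 + exp(-x))
      return 1.0 / (1.0 + Math.Exp(-x));
    }

    public static double Hypothesis(double[] theta, double[] x)
    {
      // h_theta(x) = S(theta' * x)
      double dot = 0.0;
      for (int j = 0; j < theta.Length; j++)
        dot += theta[j] * x[j];
      return Sigmoid(dot);
    }

    public static int Predict(double[] theta, double[] x)
    {
      // Return class 1 if h_theta(x) >= 1/2, otherwise class 0
      return Hypothesis(theta, x) >= 0.5 ? 1 : 0;
    }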

When training a logistic regression classifier we are given a sequence of training examples \(x_i\), each labelled with its class \(y_i\in \{0,1\}\), and we seek the weights \(\theta\) which maximize the likelihood function

\[\prod_i h_\theta(x_i)^{y_i}(1-h_\theta(x_i))^{1-y_i}.\]

Since each \(y_i\) equals 0 or 1, exactly one of the two factors for each training example is different from 1. Taking the negative logarithm, maximizing the likelihood is equivalent to minimizing the logistic loss function:

\[J(\theta) = -\sum_{i:y_i=1} \log(h_\theta(x_i))-\sum_{i:y_i=0}\log(1-h_\theta(x_i)).\]
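
As a sanity check, \(J(\theta)\) can be evaluated directly from this formula. The sketch below is ours (LogisticLoss is an illustrative helper, not part of the MOSEK example); it uses the identities \(-\log(h_\theta(x_i))=\log(1+\exp(-\theta^Tx_i))\) and \(-\log(1-h_\theta(x_i))=\log(1+\exp(\theta^Tx_i))\), evaluated so that large \(|\theta^Tx_i|\) does not cause overflow.

    // Illustrative helper: evaluate the logistic loss J(theta) on a dataset
    public static double LogisticLoss(double[] theta, double[,] X, bool[] y)
    {
      int n = X.GetLength(0), d = X.GetLength(1);
      double J = 0.0;
      for (int i = 0; i < n; i++)
      {
        // z = theta' * x_i
        double z = 0.0;
        for (int j = 0; j < d; j++) z += theta[j] * X[i, j];
        // Contribution is log(1 + exp(-z)) if y_i = 1 and log(1 + exp(z)) if y_i = 0
        double s = y[i] ? -z : z;
        // Stable evaluation of log(1 + exp(s)) = max(s, 0) + log(1 + exp(-|s|))
        J += Math.Max(s, 0.0) + Math.Log(1.0 + Math.Exp(-Math.Abs(s)));
      }
      return J;
    }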

The training problem with regularization (a standard technique to prevent overfitting) is now equivalent to

\[\min_\theta J(\theta) + \lambda\|\theta\|_2.\]

Introducing auxiliary variables \(t_i\) bounding the loss terms and \(r\geq\|\theta\|_2\) bounding the regularization term, this can equivalently be phrased as

(11.22)\[\begin{split}\begin{array}{lrllr} \minimize & \sum_i t_i +\lambda r & & & \\ \st & t_i & \geq - \log(h_\theta(x_i)) & = \log(1+\exp(-\theta^Tx_i)) & \mathrm{if}\ y_i=1, \\ & t_i & \geq - \log(1-h_\theta(x_i)) & = \log(1+\exp(\theta^Tx_i)) & \mathrm{if}\ y_i=0, \\ & r & \geq \|\theta\|_2. & & \end{array}\end{split}\]

Implementation

As can be seen from (11.22) the key point is to implement the softplus bound \(t\geq \log(1+e^u)\), which is the simplest example of a log-sum-exp constraint in two scalar variables \(t,u\). Exponentiating both sides and dividing by \(e^t\), this is equivalent to

\[\exp(u-t) + \exp(-t)\leq 1\]

and further to

(11.23)\[\begin{split}\begin{array}{rclr} (z_1, 1, u-t) & \in & \EXP & (z_1\geq \exp(u-t)), \\ (z_2, 1, -t) & \in & \EXP & (z_2\geq \exp(-t)), \\ z_1+z_2 & \leq & 1. & \end{array}\end{split}\]
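
For reference, the parenthesized interpretations follow from the definition of the primal exponential cone,

\[\EXP = \mathrm{cl}\left\{(x_1,x_2,x_3)\in\real^3 : x_1\geq x_2\exp(x_3/x_2),\ x_2>0\right\};\]

fixing the second coordinate to \(1\) gives precisely \(z_1\geq \exp(u-t)\) and \(z_2\geq\exp(-t)\).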

To feed these constraints into MOSEK we add more auxiliary variables \(q_1,q_2,v_1,v_2\) with constraints \((z_1, q_1, v_1) \in \EXP\), \((z_2, q_2, v_2) \in \EXP\), \(q_1=q_2=1\), \(v_1=u-t\) and \(v_2=-t\).

Listing 11.10 Implementation of \(t\geq \log(1+e^u)\) as in (11.23).
    // Adds auxiliary variables, linear constraints and exponential cones modeling
    //   t_i >= log( 1 + exp(u_i) ),  i = 0..n-1,
    // where t and u are the indices of the first entries of two length-n variable slices.
    public static void softplus(Task task, int t, int u, int n)
    {
      int nvar = task.getnumvar();
      int ncon = task.getnumcon();
      task.appendvars(6*n);
      task.appendcons(3*n);
      int z1 = nvar, z2 = nvar+n, v1 = nvar+2*n, v2 = nvar+3*n, q1 = nvar+4*n, q2 = nvar+5*n;
      int zcon = ncon, v1con = ncon+n, v2con=ncon+2*n;
      int[]    subi = new int[7*n];
      int[]    subj = new int[7*n];
      double[] aval = new double[7*n];
      int         k = 0;

      // Linear constraints
      for(int i = 0; i < n; i++)
      {
        // z1 + z2 = 1
        subi[k] = zcon+i;  subj[k] = z1+i;  aval[k] = 1;  k++;
        subi[k] = zcon+i;  subj[k] = z2+i;  aval[k] = 1;  k++;
        // u - t - v1 = 0
        subi[k] = v1con+i; subj[k] = u+i;   aval[k] = 1;  k++;
        subi[k] = v1con+i; subj[k] = t+i;   aval[k] = -1; k++;
        subi[k] = v1con+i; subj[k] = v1+i;  aval[k] = -1; k++;
        // - t - v2 = 0
        subi[k] = v2con+i; subj[k] = t+i;   aval[k] = -1; k++;
        subi[k] = v2con+i; subj[k] = v2+i;  aval[k] = -1; k++;
      }
      task.putaijlist(subi, subj, aval);
      task.putconboundsliceconst(ncon, ncon+n, boundkey.fx, 1, 1);
      task.putconboundsliceconst(ncon+n, ncon+3*n, boundkey.fx, 0, 0);

      // Bounds for variables
      task.putvarboundsliceconst(nvar, nvar+4*n, boundkey.fr, -inf, inf);
      task.putvarboundsliceconst(nvar+4*n, nvar+6*n, boundkey.fx, 1, 1);

      // Cones
      for(int i = 0; i < n; i++)
      {
        task.appendcone(conetype.pexp, 0.0, new int[]{z1+i, q1+i, v1+i});
        task.appendcone(conetype.pexp, 0.0, new int[]{z2+i, q2+i, v2+i});
      }
    }

Once we have this subroutine, it is easy to implement a function that builds the regularized loss function model (11.22).

Listing 11.11 Implementation of (11.22).
    // Model logistic regression (regularized with the full 2-norm of theta)
    // X    - n x d matrix of data points
    // y    - length n vector classifying training points (true = class 1, false = class 0)
    // lamb - regularization parameter
    public static double[] logisticRegression(Env        env,
                                              double[,]  X, 
                                              bool[]     y,
                                              double     lamb)
    {
      int n = X.GetLength(0);
      int d = X.GetLength(1);       // num samples, dimension

      using (Task task = new Task(env, 0, 0))
      {    
        // Variables [r; theta; t; u]
        int nvar = 1+d+2*n;
        task.appendvars(nvar);
        task.putvarboundsliceconst(0, nvar, boundkey.fr, -inf, inf);
        int r = 0, theta = 1, t = 1+d, u = 1+d+n;
        
        // Constraints: theta'*X +/- u = 0
        task.appendcons(n);
        task.putconboundsliceconst(0, n, boundkey.fx, 0, 0);
        
        // Objective lambda*r + sum(t)
        task.putcj(r, lamb);
        for(int i = 0; i < n; i++) 
          task.putcj(t+i, 1.0);

        // The X block in theta'*X +/- u = 0
        int[]    subi   = new int[d*n+n];
        int[]    subj   = new int[d*n+n];
        double[] aval   = new double[d*n+n];
        int         k   = 0;
        for(int i = 0; i < n; i++)
        {
          for(int j = 0; j < d; j++)
          {
            subi[k] = i; subj[k] = theta+j; aval[k] = X[i,j]; k++;
          }
          subi[d*n+i] = i; subj[d*n+i] = u+i;
          if (y[i]) aval[d*n+i] = 1; else aval[d*n+i] = -1;
        }
        task.putaijlist(subi, subj, aval);

        // Softplus function constraints
        softplus(task, t, u, n);

        // Regularization
        task.appendconeseq(conetype.quad, 0.0, 1+d, r);

        // Solution
        task.optimize();
        double[] xx = new double[d];
        task.getxxslice(soltype.itr, theta, theta+d, xx);

        return xx;
      }
    }
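
For illustration, a call to logisticRegression on a tiny made-up dataset could look as follows. The data values and the Main wrapper are ours, not part of the MOSEK example; we assume the usual using System; and using mosek; directives of the full example file.

    // Hypothetical usage example with a small synthetic dataset
    public static void Main(string[] args)
    {
      double[,] X = { { 0.0, 1.0 }, { 1.0, 0.0 }, { 2.0, 2.0 }, { 3.0, 4.0 } };
      bool[]    y = { false, false, true, true };
      double lamb = 0.1;

      using (Env env = new Env())
      {
        double[] theta = logisticRegression(env, X, y, lamb);
        for (int j = 0; j < theta.Length; j++)
          Console.WriteLine("theta[" + j + "] = " + theta[j]);
      }
    }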

Example: 2D dataset fitting

In the next figure we apply logistic regression to a training set of 2D points taken from the example file ex2data2.txt. Each two-dimensional point was converted into a feature vector \(x\in\real^{28}\) using all monomials in its coordinates of degree at most 6.
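
The feature mapping itself is not shown in the listings above. A minimal sketch of such a mapping is given below (the name mapFeature is illustrative); it produces all \(28\) monomials \(x_1^{i-j}x_2^j\) with \(0\leq j\leq i\leq 6\), including the constant term.

    // Illustrative helper: map a 2D point (x1, x2) to the 28 monomials
    // x1^(i-j) * x2^j for 0 <= j <= i <= 6 (the constant term included)
    public static double[] mapFeature(double x1, double x2)
    {
      double[] f = new double[28];
      int k = 0;
      for (int i = 0; i <= 6; i++)
        for (int j = 0; j <= i; j++)
          f[k++] = Math.Pow(x1, i - j) * Math.Pow(x2, j);
      return f;
    }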

[Image: logistic-regression.png]

Fig. 11.1 Logistic regression example with no, medium and strong regularization (small, medium, large \(\lambda\)). Without regularization we get obvious overfitting.