10.3 Logistic regression

Logistic regression is an example of a binary classifier, where the output takes one of two values, 0 or 1, for each data point. We call the two values classes.

Formulation as an optimization problem

Define the sigmoid function

\[
S(x) = \frac{1}{1+\exp(-x)}.
\]

Next, given an observation $x\in\mathbb{R}^d$ and a weight vector $\theta\in\mathbb{R}^d$ we set

\[
h_\theta(x) = S(\theta^T x) = \frac{1}{1+\exp(-\theta^T x)}.
\]

The weight vector $\theta$ is part of the setup of the classifier. The expression $h_\theta(x)$ is interpreted as the probability that $x$ belongs to class 1. When asked to classify $x$, the returned answer is

\[
x \mapsto \begin{cases} 1 & \text{if } h_\theta(x) \geq 1/2, \\ 0 & \text{if } h_\theta(x) < 1/2. \end{cases}
\]
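
As a quick illustration (not part of the conic model below), the decision rule can be written directly in plain R; the names sigmoid, hTheta and classify are purely illustrative:

sigmoid  <- function(x) 1 / (1 + exp(-x))                  # S(x)
hTheta   <- function(theta, x) sigmoid(sum(theta * x))     # probability of class 1
classify <- function(theta, x) as.integer(hTheta(theta, x) >= 0.5)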

When training a logistic regression algorithm we are given a sequence of training examples $x_i$, each labelled with its class $y_i\in\{0,1\}$, and we seek the weight vector $\theta$ which maximizes the likelihood function

\[
\prod_i h_\theta(x_i)^{y_i} \, (1-h_\theta(x_i))^{1-y_i}.
\]

Of course every single $y_i$ equals 0 or 1, so just one factor appears in the product for each training data point. By taking logarithms we can define the logistic loss function:

\[
J(\theta) = -\sum_{i:\,y_i=1} \log(h_\theta(x_i)) - \sum_{i:\,y_i=0} \log(1-h_\theta(x_i)).
\]
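
For intuition, a direct (non-conic) evaluation of $J(\theta)$ in R could look like the sketch below; the name logisticLoss is illustrative, and for large $|\theta^T x_i|$ the softplus form used later is numerically safer:

logisticLoss <- function(theta, X, y) {
    # h_theta(x_i) for every row of X
    h <- 1 / (1 + exp(-as.vector(X %*% theta)))
    # Negative log-likelihood, split by class as in the definition of J
    -sum(log(h[y == 1])) - sum(log(1 - h[y == 0]))
}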

The training problem with regularization (a standard technique to prevent overfitting) is now equivalent to

\[
\min_\theta \; J(\theta) + \lambda \|\theta\|_2.
\]

This can equivalently be phrased as

\[
\begin{array}{lll}
\text{minimize}   & \sum_i t_i + \lambda r & \\
\text{subject to} & t_i \geq -\log(h_\theta(x_i)) = \log(1+\exp(-\theta^T x_i)) & \text{if } y_i = 1, \\
                  & t_i \geq -\log(1-h_\theta(x_i)) = \log(1+\exp(\theta^T x_i)) & \text{if } y_i = 0, \\
                  & r \geq \|\theta\|_2.
\end{array}
\tag{10.21}
\]

Implementation

As can be seen from (10.21), the key point is to implement the softplus bound $t \geq \log(1+e^u)$, which is the simplest example of a log-sum-exp constraint with two terms. Here $t$ is a scalar variable and $u$ will be an affine expression of the form $\pm\theta^T x_i$. This is equivalent to

\[
\exp(u-t) + \exp(-t) \leq 1
\]

and further to

\[
\begin{array}{l}
(z_1, 1, u-t) \in K_{\mathrm{exp}} \quad (z_1 \geq \exp(u-t)), \\
(z_2, 1, -t)  \in K_{\mathrm{exp}} \quad (z_2 \geq \exp(-t)), \\
z_1 + z_2 \leq 1.
\end{array}
\tag{10.22}
\]

This formulation can be entered using affine conic constraints (see Sec. 6.2 (From Linear to Conic Optimization)).
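
A quick numerical sanity check of this split, using nothing beyond base R: at the tight softplus bound $t = \log(1+e^u)$ the smallest feasible $z_1$ and $z_2$ from (10.22) sum to exactly 1.

u  <- 1.7                 # any value of the affine expression
t  <- log(1 + exp(u))     # tight softplus bound t = log(1 + e^u)
z1 <- exp(u - t)          # smallest z1 allowed by the first exponential cone
z2 <- exp(-t)             # smallest z2 allowed by the second exponential cone
z1 + z2                   # equals 1 (up to rounding), so (10.22) holds with equality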

Listing 10.13 Implementation of (10.21).
library("Rmosek")    # provides mosek()
library("Matrix")    # provides sparseMatrix() and Matrix()

logisticRegression <- function(X, y, lamb)
{
    prob <- list(sense="min")
    n <- dim(X)[1];
    d <- dim(X)[2];
    
    # Variables: r, theta(d), t(n), z1(n), z2(n)
    prob$c <- c(lamb, rep(0,d), rep(1, n), rep(0,n), rep(0,n));
    prob$bx <-rbind(rep(-Inf,1+d+3*n), rep(Inf,1+d+3*n));

    # z1 + z2 <= 1
    prob$A <- sparseMatrix( rep(1:n, 2), 
                            c((1:n)+1+d+n, (1:n)+1+d+2*n),
                            x = rep(1, 2*n));
    prob$bc <- rbind(rep(-Inf, n), rep(1, n));

    # (r, theta) \in \Q, i.e. r >= ||theta||_2 (quadratic cone for the regularizer)
    FQ <- cbind(diag(rep(1, d+1)), matrix(0, d+1, 3*n));
    gQ <- rep(0, 1+d);

    # Two exponential cones per observation i:
    # (z1(i), 1, -t(i)) \in \EXP,
    # (z2(i), 1, (1-2*y(i))*theta'*X[i,] - t(i)) \in \EXP
    FE <- Matrix(nrow=0, ncol = 1+d+3*n);
    for(i in 1:n) {
        FE <- rbind(FE,
                    sparseMatrix( c(1, 3, 4, rep(6, d), 6),
                                  c(1+d+n+i, 1+d+i, 1+d+2*n+i, 2:(d+1), 1+d+i),
                                  x = c(1, -1, 1, (1-2*y[i])*X[i,], -1),
                                  dims = c(6, 1+d+3*n) ) );
    }
    gE <- rep(c(0, 1, 0, 0, 1, 0), n);

    prob$F <- rbind(FQ, FE)
    prob$g <- c(gQ, gE)
    prob$cones <- cbind(matrix(list("QUAD", 1+d, NULL), nrow=3, ncol=1),
                        matrix(list("PEXP", 3, NULL), nrow=3, ncol=2*n));
    rownames(prob$cones) <- c("type","dim","conepar")

    # Solve, no error handling!
    r <- mosek(prob, list(soldetail=1))

    # Return theta
    r$sol$itr$xx[2:(d+1)]
}
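
A minimal usage sketch on synthetic data (the data and the choice lamb = 0.1 are made up for illustration and assume the Rmosek package is installed):

# Synthetic binary-classification data: 30 points in R^3
set.seed(0)
X <- matrix(rnorm(90), nrow=30, ncol=3)
y <- as.numeric(X %*% c(1, -1, 0.5) > 0)

# Train with mild regularization and print the learned weights
theta <- logisticRegression(X, y, 0.1)
print(theta)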

Example: 2D dataset fitting

In the next figure we apply logistic regression to a training set of 2D points taken from the example file ex2data2.txt. The two-dimensional dataset was converted into a feature vector $x\in\mathbb{R}^{28}$ using monomial coordinates of degrees at most 6.
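
The feature map itself is not shown in this section; a possible sketch in R (the helper name mapFeature and the monomial ordering are assumptions) producing the 28 monomials $x_1^i x_2^j$ with $i+j \leq 6$ is:

mapFeature <- function(p, degree = 6) {
    # All monomials p[1]^i * p[2]^j with i + j <= degree (28 terms for degree 6)
    feats <- c()
    for (k in 0:degree) {
        for (i in 0:k) {
            feats <- c(feats, p[1]^i * p[2]^(k - i))
        }
    }
    feats
}

# Example: map every row of a 2-column point matrix P into R^28
# X <- t(apply(P, 1, mapFeature))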

[Figure: _images/logistic-regression.png]

Fig. 10.4 Logistic regression example with no, medium and strong regularization (small, medium, large $\lambda$). Without regularization we get obvious overfitting.