11.5 Logistic regression

Logistic regression is an example of a binary classifier, where the output takes one of two values, 0 or 1, for each data point. We call the two values classes.

Formulation as an optimization problem

Define the sigmoid function

S (x) = \frac{1}{1 + \exp (- x)} .

Next, given an observation $x \in R^{d}$ and a weight vector $θ \in R^{d}$ we set

h_{θ} (x) = S (θ^{T} x) = \frac{1}{1 + \exp (- θ^{T} x)} .

The weight vector $θ$ is part of the setup of the classifier. The expression $h_{θ} (x)$ is interpreted as the probability that $x$ belongs to class 1. When asked to classify $x$, the returned answer is

x \mapsto \begin{cases} 1 & h_{θ} (x) \geq 1 / 2, \\ 0 & h_{θ} (x) < 1 / 2. \end{cases}
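The classification rule above can be sketched in a few lines of NumPy. This is only an illustration: the function names `sigmoid`, `h` and `classify` and the sample data are assumptions made for this example, not part of the Fusion code in this section. Note that $h_{θ} (x) \geq 1 / 2$ holds exactly when $θ^{T} x \geq 0$.

```python
import numpy as np

def sigmoid(x):
    """S(x) = 1 / (1 + exp(-x)), evaluated elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

def h(theta, X):
    """Estimated probability of class 1 for each row of X."""
    return sigmoid(X @ theta)

def classify(theta, X):
    """Return 1 where h_theta(x) >= 1/2 and 0 otherwise."""
    return (h(theta, X) >= 0.5).astype(int)

# Hypothetical weights and observations (one observation per row)
theta = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0],    # theta^T x =  1  -> class 1
              [0.0, 1.0],    # theta^T x = -2  -> class 0
              [2.0, 1.0]])   # theta^T x =  0  -> h = 1/2 -> class 1
print(classify(theta, X))    # [1 0 1]
```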

When training a logistic regression algorithm we are given a sequence of training examples $x_{i}$, each labelled with its class $y_{i} \in \{0, 1\}$, and we seek the weights $θ$ which maximize the likelihood function

\prod_{i} h_{θ} (x_{i})^{y_{i}} (1 - h_{θ} (x_{i}))^{1 - y_{i}} .

Since every $y_{i}$ equals 0 or 1, exactly one of the two factors differs from 1 for each training data point. Taking the negative logarithm of the likelihood, we define the logistic loss function:

J (θ) = - \sum_{i : y_{i} = 1} \log (h_{θ} (x_{i})) - \sum_{i : y_{i} = 0} \log (1 - h_{θ} (x_{i})) .
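As an illustration, the loss $J (θ)$ can be evaluated numerically as follows. This is a sketch under assumptions: the function name `logistic_loss` and the sample data are hypothetical, and the code relies on the identities $- \log (h_{θ} (x)) = \log (1 + \exp (- θ^{T} x))$ and $- \log (1 - h_{θ} (x)) = \log (1 + \exp (θ^{T} x))$, which follow directly from the definition of $h_{θ}$.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """J(theta) = -sum_{y_i=1} log h(x_i) - sum_{y_i=0} log(1 - h(x_i))."""
    u = X @ theta
    # Class 1 contributes log(1 + exp(-u)), class 0 contributes log(1 + exp(u))
    signed = np.where(y == 1, -u, u)
    # logaddexp(0, s) computes log(1 + exp(s)) without overflow for large |s|
    return float(np.sum(np.logaddexp(0.0, signed)))

# Hypothetical training data: three observations in R^2 with labels in {0, 1}
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1, 0, 1])

# At theta = 0 every h(x_i) equals 1/2, so each term contributes log 2
print(logistic_loss(np.zeros(2), X, y))
```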

The training problem with regularization (a standard technique to prevent overfitting) is now equivalent to

\min_{θ} J (θ) + λ ‖ θ ‖_{2} .

This can equivalently be phrased as

(11.20)

\begin{array}{lrllr}
\mathrm{minimize} & \sum_{i} t_{i} + λ r & & & \\
\mathrm{subject\ to} & t_{i} & \geq - \log (h_{θ} (x_{i})) & = \log (1 + \exp (- θ^{T} x_{i})) & \mathrm{if}\ y_{i} = 1, \\
 & t_{i} & \geq - \log (1 - h_{θ} (x_{i})) & = \log (1 + \exp (θ^{T} x_{i})) & \mathrm{if}\ y_{i} = 0, \\
 & r & \geq ‖ θ ‖_{2} . & &
\end{array}

Implementation

As can be seen from (11.20), the key point is to implement the softplus bound $t \geq \log (1 + e^{u})$, which is the simplest example of a log-sum-exp constraint for two terms. Here $t$ is a scalar variable and $u$ will be an affine expression of the form $\pm θ^{T} x_{i}$. Exponentiating both sides gives $e^{t} \geq 1 + e^{u}$, and dividing by $e^{t}$ shows that the bound is equivalent to

\exp (u - t) + \exp (- t) \leq 1

and further to

(11.21)

\begin{array}{rclr}
(z_{1}, 1, u - t) & \in & K_{\exp} & (z_{1} \geq \exp (u - t)), \\
(z_{2}, 1, - t) & \in & K_{\exp} & (z_{2} \geq \exp (- t)), \\
z_{1} + z_{2} & \leq & 1. &
\end{array}
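The reformulation (11.21) can be sanity-checked numerically without a solver. The sketch below is an illustration only: the helper name `smallest_z` and the test values are assumptions. It picks the smallest $z_{1}, z_{2}$ allowed by the two cone memberships and inspects $z_{1} + z_{2}$ for values of $t$ at, above and below $\log (1 + e^{u})$.

```python
import numpy as np

def smallest_z(u, t):
    """Smallest z1, z2 allowed by the cone constraints: z1 = exp(u - t), z2 = exp(-t)."""
    return np.exp(u - t), np.exp(-t)

u = np.array([-3.0, 0.0, 2.5])       # hypothetical values of the affine expression
t_tight = np.logaddexp(0.0, u)       # t = log(1 + exp(u))

# At t = log(1 + exp(u)) the linear constraint holds with equality
z1, z2 = smallest_z(u, t_tight)
tight = np.allclose(z1 + z2, 1.0)

# A larger t leaves slack, so the point is feasible
z1, z2 = smallest_z(u, t_tight + 0.1)
feasible_above = bool(np.all(z1 + z2 < 1.0))

# A smaller t violates z1 + z2 <= 1 even for the smallest z1, z2
z1, z2 = smallest_z(u, t_tight - 0.1)
infeasible_below = bool(np.all(z1 + z2 > 1.0))

print(tight, feasible_above, infeasible_below)
```

Because $z_{1}$ and $z_{2}$ are only bounded from below by the cone constraints, requiring $z_{1} + z_{2} = 1$ instead of $z_{1} + z_{2} \leq 1$ describes the same set of feasible $(t, u)$.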

Listing 11.11 Implementation of $t \geq \log (1 + e^{u})$ as in (11.21).

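from mosek.fusion import *

# t >= log( 1 + exp(u) ) coordinatewise
def softplus(M, t, u):
    n = t.getShape()[0]
    # Auxiliary variables: z1 >= exp(u - t), z2 >= exp(-t)
    z1 = M.variable(n)
    z2 = M.variable(n)
    # z1 + z2 == 1 is equivalent to z1 + z2 <= 1, since z1, z2 can always be increased
    M.constraint(Expr.add(z1, z2), Domain.equalsTo(1.0))
    # Each row (z, 1, w) lies in the primal exponential cone, i.e. z >= exp(w)
    M.constraint(Expr.hstack(z1, Expr.constTerm(n, 1.0), Expr.sub(u, t)), Domain.inPExpCone())
    M.constraint(Expr.hstack(z2, Expr.constTerm(n, 1.0), Expr.neg(t)), Domain.inPExpCone())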

Once we have this subroutine, it is easy to implement a function that builds the regularized loss function model (11.20).

Listing 11.12 Implementation of (11.20).

# Model logistic regression (regularized with full 2-norm of theta)
# X - n x d matrix of data points
# y - length n vector classifying training points
# lamb - regularization parameter
def logisticRegression(X, y, lamb=1.0):
    n, d = int(X.shape[0]), int(X.shape[1])         # num samples, dimension
    M = Model()
    theta = M.variable(d)
    t     = M.variable(n)
    reg   = M.variable()

    M.objective(ObjectiveSense.Minimize, Expr.sum(t) + lamb * reg)
    # reg >= ||theta||_2
    M.constraint(Var.vstack(reg, theta), Domain.inQCone())

    # u_i = -theta^T x_i if y_i == 1 and +theta^T x_i if y_i == 0
    signs = [-1.0 if yi == 1 else 1.0 for yi in y]
    softplus(M, t, Expr.mulElm(X @ theta, signs))

    return M, theta

Example: 2D dataset fitting

In the next figure we apply logistic regression to the training set of 2D points taken from the example ex2data2.txt. Each two-dimensional point was converted into a feature vector $x \in R^{28}$ using monomial coordinates of degree at most 6.
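A feature map of this kind can be sketched as follows. The function name `monomial_features` and the ordering of the monomials are assumptions made for illustration; the text above only fixes the degree bound and the resulting dimension. For two variables there are $1 + 2 + \cdots + 7 = 28$ monomials of degree at most 6, including the constant term.

```python
import numpy as np

def monomial_features(points, degree=6):
    """Map each 2D point (a, b) to all monomials a^i * b^j with i + j <= degree."""
    points = np.asarray(points, dtype=float)
    a, b = points[:, 0], points[:, 1]
    # For each total degree k, list a^k, a^(k-1) b, ..., b^k
    cols = [a ** (k - j) * b ** j for k in range(degree + 1) for j in range(k + 1)]
    return np.column_stack(cols)

# Hypothetical 2D points, one per row
pts = np.array([[2.0, 3.0], [0.5, -1.0]])
F = monomial_features(pts)
print(F.shape)   # (2, 28)
```

The first column is the constant monomial, so the corresponding entry of $θ$ plays the role of an intercept.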

_images/logistic-regression.png: Fig. 11.6 Logistic regression example with no, medium and strong regularization (small, medium, large $λ$). Without regularization we get obvious overfitting.
