# 11.2 Logistic regression¶

Logistic regression is an example of a binary classifier, where the output takes one two values 0 or 1 for each data point. We call the two values classes.

Formulation as an optimization problem

Define the sigmoid function

$S(x)=\frac{1}{1+\exp(-x)}.$

Next, given an observation $$x\in\real^d$$ and a weights $$\theta\in\real^d$$ we set

$h_\theta(x)=S(\theta^Tx)=\frac{1}{1+\exp(-\theta^Tx)}.$

The weights vector $$\theta$$ is part of the setup of the classifier. The expression $$h_\theta(x)$$ is interpreted as the probability that $$x$$ belongs to class 1. When asked to classify $$x$$ the returned answer is

$\begin{split}x\mapsto \begin{cases}\begin{array}{ll}1 & h_\theta(x)\geq 1/2, \\ 0 & h_\theta(x)<1/2.\end{array}\end{cases}\end{split}$

When training a logistic regression algorithm we are given a sequence of training examples $$x_i$$, each labelled with its class $$y_i\in \{0,1\}$$ and we seek to find the weights $$\theta$$ which maximize the likelihood function

$\prod_i h_\theta(x_i)^{y_i}(1-h_\theta(x_i))^{1-y_i}.$

Of course every single $$y_i$$ equals 0 or 1, so just one factor appears in the product for each training data point. By taking logarithms we can define the logistic loss function:

$J(\theta) = -\sum_{i:y_i=1} \log(h_\theta(x_i))-\sum_{i:y_i=0}\log(1-h_\theta(x_i)).$

The training problem with regularization (a standard technique to prevent overfitting) is now equivalent to

$\min_\theta J(\theta) + \lambda\|\theta\|_2.$

This can equivalently be phrased as

(11.13)$\begin{split}\begin{array}{lrllr} \minimize & \sum_i t_i +\lambda r & & & \\ \st & t_i & \geq - \log(h_\theta(x)) & = \log(1+\exp(-\theta^Tx_i)) & \mathrm{if}\ y_i=1, \\ & t_i & \geq - \log(1-h_\theta(x)) & = \log(1+\exp(\theta^Tx_i)) & \mathrm{if}\ y_i=0, \\ & r & \geq \|\theta\|_2. & & \end{array}\end{split}$

Implementation

As can be seen from (11.13) the key point is to implement the softplus bound $$t\geq \log(1+e^u)$$, which is the simplest example of a log-sum-exp constraint for two terms. Here $$t$$ is a scalar variable and $$u$$ will be the affine expression of the form $$\pm \theta^Tx_i$$. This is equivalent to

$\exp(u-t) + \exp(-t)\leq 1$

and further to

(11.14)$\begin{split}\begin{array}{rclr} (z_1, 1, u-t) & \in & \EXP & (z_1\geq \exp(u-t)), \\ (z_2, 1, -t) & \in & \EXP & (z_2\geq \exp(-t)), \\ z_1+z_2 & \leq & 1. & \end{array}\end{split}$

This formulation can be entered using affine conic constraints (see Sec. 6.2 (From Linear to Conic Optimization)).

Listing 11.8 Implementation of $$t\geq \log(1+e^u)$$ as in (11.14). Click here to download.
#[allow(non_snake_case)]
fn softplus(task : & mut TaskCB, d : i32, n : i32, theta : i32, t : i32, X : &[f64], Y : &[bool]) -> Result<(),String> {
task.append_cons(n)?;     // z1 + z2 = 1
task.append_afes(4*n as i64)?;   //theta * X[i] - t[i], -t[i], z1[i], z2[i]
let z1 = nvar;
let z2 = nvar+n;
let zcon = ncon;
let thetaafe = nafe;
let tafe = nafe+n   as i64;
let z1afe = tafe+n  as i64;
let z2afe = z1afe+n as i64;

// Linear constraints
{
let mut subi = vec![0i32; 2*n as usize];
let mut subj = vec![0i32; 2*n as usize];
let mut aval = vec![0.0;  2*n as usize];

for i in 0..n {
}
for ((i,&zx),si, sj, av) in izip!(iproduct!(0..n,&[z1,z2]),
subi.iter_mut(),
subj.iter_mut(),
aval.iter_mut()) {
*si = zcon+i;
*sj = zx+i as i32;
*av = 1.0;
}

}
// Affine conic expressions
let mut afeidx = vec![0i64; (d*n+4*n) as usize];
let mut varidx = vec![0i32; (d*n+4*n) as usize];
let mut fval   = vec![0.0;  (d*n+4*n) as usize];

// Thetas
let mut k : usize = 0;
for ((i,j),afei,vari) in izip!(iproduct!(0..n,0..d),
& mut afeidx[k..k+(n*d) as usize],
& mut varidx[k..k+(n*d) as usize]) {
*afei = thetaafe + i as i64;
*vari = theta    + j;
}

for ((&yi,_j),&xij,fv) in izip!(iproduct!(Y,0..d), X, & mut fval[k..k+(n*d) as usize]) {
*fv = (if yi {-1.0} else {1.0}) * xij;
}
k += (n*d) as usize;

for fv in fval[k..k+(2*n) as usize].iter_mut() { *fv = -1.0; }
for (i,afei,vari) in izip!(0..n,
&mut afeidx[k..k+(2*n) as usize].iter_mut().step_by(2),
&mut varidx[k..k+(2*n) as usize].iter_mut().step_by(2)) {
*afei = thetaafe+i as i64; *vari = t+i;
}
for (i,afei,vari) in izip!(0..n,
&mut afeidx[k+1..k+(2*n) as usize].iter_mut().step_by(2),
&mut varidx[k+1..k+(2*n) as usize].iter_mut().step_by(2)) {
*afei = tafe+i as i64; *vari = t+i;
}
k += (n*2) as usize;

for fv in fval[k..k+(2*n) as usize].iter_mut() { *fv = 1.0; }
for (i,afei,vari) in izip!(0..n,
&mut afeidx[k..k+(2*n) as usize].iter_mut().step_by(2),
&mut varidx[k..k+(2*n) as usize].iter_mut().step_by(2)) {
*afei = z1afe+i as i64; *vari = z1+i;
}
for (i,afei,vari) in izip!(0..n,
&mut afeidx[k+1..k+(2*n) as usize].iter_mut().step_by(2),
&mut varidx[k+1..k+(2*n) as usize].iter_mut().step_by(2)) {
*afei = z2afe+i as i64; *vari = z2+i;
}

{
// Add a single row with the constant expression "1.0"

// Add an exponential cone domain

// Conic constraints
let zeros = &[0.0,0.0,0.0];
for i in 0..n as i64 {

acci += 2;
}
}

Ok(())
}


Once we have this subroutine, it is easy to implement a function that builds the regularized loss function model (11.13).

Listing 11.9 Implementation of (11.13). Click here to download.
#[allow(non_snake_case)]
fn logistic_regression(X : &[f64],
Y : &[bool],
lamb : f64) -> Result<Vec<f64>,String> {
let n = Y.len() as i32;
let d = (X.len()/Y.len()) as i32;  // num samples, dimension

/* Create the optimization task. */
Some(e) => e,
None => return Err("Failed to create task".to_string()),
}.with_callbacks();

// Variables [r; theta; t]
let nvar : i32 = 1+d+n;
let (r,theta,t) = (0i32,1i32,1+d);
for j in 0..d { task.put_var_name(theta+j,format!("theta[{}]",j).as_str())?; }
for j in 0..n { task.put_var_name(t+j,format!("t[{}]",j).as_str())?; }

// Objective lambda*r + sum(t)
for i in 0..n { task.put_c_j(t+i, 1.0)?; }

// Softplus function constraints
softplus(& mut task, d, n, theta, t, X, Y)?;

// Regularization
// Append a sequence of linear expressions (r, theta) to F
for i in 0..d {
task.put_afe_f_entry(numafe + i as i64 + 1, theta + i, 1.0)?;
}

{
numafe,
vec![0.0; 1+d as usize].as_slice())?;
}
// Solution

In the next figure we apply logistic regression to the training set of 2D points taken from the example ex2data2.txt . The two-dimensional dataset was converted into a feature vector $$x\in\real^{28}$$ using monomial coordinates of degrees at most 6.