12 Regression and regularization

Regression is a statistical tool used in many disciplines, in particular in finance and investing. It is used to determine the type and strength of the functional relationship between statistical variables. When data is scarce or contains errors, regularization can also be useful. In this chapter we introduce regression and regularization in the context of conic optimization along the lines of [SHAD13], and show how to apply these methods to portfolio optimization.

12.1 Linear regression

The most basic regression problem assumes a linear relationship \(\mathbf{y} = \mathbf{X}\mathbf{w} + \varepsilon\) between \(N\) explanatory variables \(\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N] \in \R^{T\times N}\) and the dependent variable \(\mathbf{y} \in \R^T\), and attempts to estimate \(\mathbf{w}\). This is commonly done using the method of ordinary least squares (OLS), which minimizes the norm of the residual term \(\mathbf{y} - \mathbf{X}\mathbf{w}\). In other words, we are looking for the coefficient vector \(\mathbf{w}\) that is the optimal solution of the unconstrained least squares optimization problem

(12.1)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2. & &\\ \end{array}\end{split}\]

The same problem can also be written using the squared norm:

(12.2)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2. & &\\ \end{array}\end{split}\]

The geometric interpretation is that we are looking for the vector in the column space of \(\mathbf{X}\) that is closest to \(\mathbf{y}\). If the problem is unconstrained, the solution is simply an orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). [1]
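As a quick illustration, the following sketch solves (12.1) on synthetic data with NumPy and verifies the projection property, i.e., that the residual is orthogonal to the column space of \(\mathbf{X}\). The data and variable names are illustrative only.

# Solve the unconstrained least squares problem (12.1) on synthetic data
import numpy as np

rng = np.random.default_rng(0)
T, N = 200, 5
X = rng.normal(size=(T, N))
y = X @ rng.normal(size=N) + 0.1 * rng.normal(size=T)

# Least squares solution; the residual is orthogonal to the columns of X
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ w
print(np.allclose(X.T @ residual, 0.0))   # True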

12.1.1 Assumptions

Let \(\mathbf{X}_i\) and \(\mathbf{X}_j\) be two observations. For the OLS method to give meaningful results, we have to impose some assumptions:

  • Exogeneity: \(\mathbb{E}(\varepsilon\mid \mathbf{X}_i) = 0\), meaning that the error term is orthogonal to the explanatory variables, so there are no endogenous drivers of \(\mathbf{X}\) in the model. This also implies \(\mathrm{Cov}(\varepsilon, \mathbf{X}) = 0\).

  • No autocorrelation: The error terms are uncorrelated between observations, i.e., \(\mathrm{Cov}(\varepsilon_i, \varepsilon_j\mid \mathbf{X}_i,\mathbf{X}_j) = 0\) for \(i \neq j\), so the off-diagonal entries of the error covariance matrix are zero.

  • Homoscedasticity: The error term has the same variance for all observations, i.e., for any values of the explanatory variables. This implies that \(\mathrm{Var}(\varepsilon\mid \mathbf{X}_i) = \sigma^2\mathbf{1}\).

  • No linear dependence: The observation matrix \(\mathbf{X}\) must have full column rank.

We also assume that \(\mathbf{X}\) has at least as many rows as columns, i.e., \(T \geq N\).
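For illustration, the following small check (synthetic data, illustrative names) shows how an exact linear dependence between columns violates the full column rank assumption:

# Check the "no linear dependence" assumption on a synthetic data matrix
import numpy as np

rng = np.random.default_rng(0)
T, N = 100, 4
X = rng.normal(size=(T, N))
X[:, 3] = 2 * X[:, 2]                    # introduce exact linear dependence

print(T >= N)                            # True: enough observations
print(np.linalg.matrix_rank(X) == N)     # False: X lacks full column rank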

Some of the above assumptions can be relaxed by using specific extensions of the OLS method. However, these extensions tend to be more complex and may require more data to produce an equally precise model.

12.1.2 Solution

12.1.2.1 Normal equations

If problem (12.1) is unconstrained, we can also characterize its solution explicitly through the normal equations:

(12.3)\[\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{w} = \mathbf{X}^\mathsf{T}\mathbf{y}.\]

Note, however, that this should be treated mainly as a theoretical result. In practice, solving for \(\mathbf{w}\) by explicitly forming and inverting \(\mathbf{X}^\mathsf{T}\mathbf{X}\) can be numerically unstable.
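The following sketch (synthetic, ill-conditioned data with illustrative names) contrasts the normal equations with a standard least squares solver; forming \(\mathbf{X}^\mathsf{T}\mathbf{X}\) squares the condition number, which is where the loss of accuracy comes from.

# Compare the normal equations (12.3) with a QR/SVD-based solver
import numpy as np

rng = np.random.default_rng(0)
T, N = 100, 6
X = rng.normal(size=(T, 1)) + 1e-7 * rng.normal(size=(T, N))  # nearly dependent columns
y = X @ np.ones(N)

# The condition number of X^T X is roughly the square of that of X
print(np.linalg.cond(X), np.linalg.cond(X.T @ X))

w_normal = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]     # QR/SVD-based solver
print(np.linalg.norm(X @ w_normal - y), np.linalg.norm(X @ w_lstsq - y))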

12.1.2.2 Conic optimization

If we convert problem (12.2) into a conic optimization problem, we can not only solve it in a more efficient and numerically stable way, but also extend it with constraints. Here we state the conic equivalent of (12.1):

(12.4)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & t & &\\ \mbox{subject to} & (t, \mathbf{y} - \mathbf{X}\mathbf{w}) & \in & \Q^{T+1}. \end{array}\end{split}\]

For (12.2) we use the rotated quadratic cone,

(12.5)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & t & &\\ \mbox{subject to} & (t, 0.5, \mathbf{y} - \mathbf{X}\mathbf{w}) & \in & \Q_\mathrm{r}^{T+2}. \end{array}\end{split}\]

Let us now extend (12.5) with constraints, for example requiring the coefficients to be nonnegative and to sum to one:

(12.6)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & t & &\\ \mbox{subject to} & (t, 0.5, \mathbf{y} - \mathbf{X}\mathbf{w}) & \in & \Q_\mathrm{r}^{T+2},\\ & \mathbf{1}^\mathsf{T}\mathbf{w} & = & 1,\\ & \mathbf{w} & \geq & 0. \end{array}\end{split}\]
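For concreteness, a minimal Fusion sketch of (12.6) could look as follows, following the same modeling patterns used later in this chapter (the function name constrained_ls and its arguments are illustrative; X and y are assumed to be NumPy arrays of shape (T, N) and (T,)):

from mosek.fusion import Model, Domain, Expr, ObjectiveSense

def constrained_ls(X, y):
    T, N = X.shape
    with Model("constrained-ls") as M:
        # Nonnegative coefficients summing to one, as in (12.6)
        w = M.variable("w", N, Domain.greaterThan(0.0))
        M.constraint("sum_w", Expr.sum(w), Domain.equalsTo(1.0))

        # t >= ||X w - y||^2 via the rotated quadratic cone
        t = M.variable("t", 1, Domain.unbounded())
        residual = Expr.sub(Expr.mul(X, w), y)
        M.constraint("obj_cone", Expr.vstack(t, 0.5, residual),
                     Domain.inRotatedQCone())

        M.objective("obj", ObjectiveSense.Minimize, t)
        M.solve()
        return w.level()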

12.1.3 Portfolio optimization as regression problem

Portfolio optimization problems can also take the form of (12.2), if instead of working with the covariance matrix of securities \(\ECov\), we work directly with the security return observation matrix (scenario matrix) \(\mathbf{R} \in \R^{N\times T}\) defined in Sec. 3.1 (Scenarios), where \(T\) is the number of observations. We assume here that the matrix is centered. Then for the sample covariance matrix, we have

(12.7)\[\ECov = \frac{1}{T-1}\mathbf{R}\mathbf{R}^\mathsf{T}.\]

Equation (12.7) gives us a way to define the matrix \(\mathbf{X}\) in problem (12.2): \(\mathbf{X} = \mathbf{R}^\mathsf{T}/\sqrt{T-1}\).
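The following quick NumPy check (synthetic data, illustrative names) verifies relation (12.7) and the resulting choice of \(\mathbf{X}\):

# Verify (12.7) and X = R^T / sqrt(T - 1) on a centered scenario matrix
import numpy as np

rng = np.random.default_rng(0)
N, T = 8, 500
R = rng.normal(size=(N, T))
R = R - R.mean(axis=1, keepdims=True)     # center each security's returns

S = R @ R.T / (T - 1)                     # sample covariance as in (12.7)
print(np.allclose(S, np.cov(R)))          # True

X = R.T / np.sqrt(T - 1)
print(np.allclose(X.T @ X, S))            # True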

Then we can write a simple benchmark-relative optimization problem that minimizes the tracking error as

(12.8)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & \|\mathbf{R}^\mathsf{T}\mathbf{x} - \mathbf{r}_\mathrm{bm}\|_2^2, & &\\ \mbox{subject to} & \mathbf{1}^\mathsf{T}\mathbf{x} & = & 1,\\ & \mathbf{x} & \geq & 0. \end{array}\end{split}\]

where we omitted the factor \(1/\sqrt{T-1}\) from the objective, and \(\mathbf{r}_\mathrm{bm} = \mathbf{R}^\mathsf{T}\mathbf{x}_\mathrm{bm}\) is the benchmark return series, assuming \(\mathbf{x}\) and \(\mathbf{x}_\mathrm{bm}\) represent the same security universe. In general, \(\mathbf{x}\) and \(\mathbf{x}_\mathrm{bm}\) could also represent different security universes with different return data \(\mathbf{R}\) and \(\mathbf{R}_\mathrm{bm}\). That is equivalent to assuming the same (larger) universe for both, with the weights of securities outside each original universe fixed to zero.

The conic form of (12.8) becomes

(12.9)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & t & &\\ \mbox{subject to} & (t, 0.5, \mathbf{R}^\mathsf{T}\mathbf{x} - \mathbf{r}_\mathrm{bm}) & \in & \Q_\mathrm{r}^{T+2},\\ & \mathbf{1}^\mathsf{T}\mathbf{x} & = & 1,\\ & \mathbf{x} & \geq & 0. \end{array}\end{split}\]

12.2 Regularization

In some cases the data matrix \(\mathbf{X}\) in (12.2) has nearly dependent columns, i.e., there are high correlations between the corresponding explanatory variables. We can also have fewer observations than variables (\(T < N\)). In these situations, the results can be unreliable and highly sensitive to small perturbations and rounding errors. We discuss this topic in detail in the context of portfolio optimization in Sec. 4 (Dealing with estimation error). To handle these issues, we can use regularization techniques to stabilize the computational results.

Regularization typically means that we add a penalty term to the objective function, which helps direct the solution procedure towards optimal solutions \(\mathbf{w}\) that have favorable properties. This way of adjusting the objective can also be viewed as a soft constraint, meaning that we express a preference regarding the quality of the optimal solution, but without using exact bounds.

12.2.1 Example penalty terms

Here we discuss three examples of regularization terms, two of which are commonly applied. Let \(\mathbf{w}_0\) be a specific solution vector with desired properties.

12.2.1.1 Ridge regularization

Ridge regularization adds a quadratic penalty term to the objective function:

(12.10)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\Gamma (\mathbf{w}-\mathbf{w}_0)\|_2^2. & &\\ \end{array}\end{split}\]

If \(\Gamma = \mathbf{I}\), then it is called L2 regularization, because it simply gives preference to solution vectors with smaller norms.

We can derive the normal equations in this case as well, and observe a relationship between ridge regularization and covariance shrinkage:

(12.11)\[(\mathbf{X}^\mathsf{T}\mathbf{X} + \lambda\Gamma^\mathsf{T}\Gamma)\mathbf{w} = \mathbf{X}^\mathsf{T}\mathbf{y} + \lambda\Gamma^\mathsf{T}\Gamma\mathbf{w}_0.\]
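As a small illustration (synthetic data, with \(\Gamma = \mathbf{I}\) and \(\mathbf{w}_0 = 0\)), the ridge solution can be computed directly from (12.11) and shrinks towards \(\mathbf{w}_0\) as \(\lambda\) grows:

# Ridge solution from (12.11) with Gamma = I and w_0 = 0
import numpy as np

rng = np.random.default_rng(0)
T, N, lam = 100, 10, 0.5
X = rng.normal(size=(T, N))
y = rng.normal(size=T)

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# The penalty shrinks the solution towards w_0 = 0
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))   # True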

The conic equivalent of (12.10) will introduce an extra quadratic cone constraint for the regularization term:

(12.12)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & t + \lambda u & &\\ \mbox{subject to} & (t, 0.5, \mathbf{X}\mathbf{w} - \mathbf{y}) & \in & \Q_\mathrm{r}^{T+2},\\ & (u, 0.5, \Gamma(\mathbf{w}-\mathbf{w}_0)) & \in & \Q_\mathrm{r}^{N+2}. \end{array}\end{split}\]

12.2.1.2 LASSO regularization

LASSO regularization adds a linear penalty term to the objective function:

(12.13)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}-\mathbf{w}_0\|_1. & &\\ \end{array}\end{split}\]

It is also called L1 regularization or sparse regularization, because it gives preference to sparse solutions, i.e., ones with few nonzeros.

The conic equivalent of (12.13) will model the 1-norm constraint for the regularization term, as described in Sec. 13.1.1.6 (Manhattan norm (1-norm)):

(12.14)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & t + \lambda u & &\\ \mbox{subject to} & (t, 0.5, \mathbf{X}\mathbf{w} - \mathbf{y}) & \in & \Q_\mathrm{r}^{T+2},\\ & \mathbf{z} - (\mathbf{w}-\mathbf{w}_0) & \geq & 0,\\ & \mathbf{z} + (\mathbf{w}-\mathbf{w}_0) & \geq & 0,\\ & \mathbf{1}^\mathsf{T}\mathbf{z} - u & = & 0. \end{array}\end{split}\]

12.2.1.3 The \(3/2\) regularization

The penalty term discussed here is not common, but has applications in modeling market impact cost (see Sec. 6.3 (Market impact costs)). We add a subquadratic penalty term to the objective function:

(12.15)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}-\mathbf{w}_0\|_{3/2}^{3/2}, & &\\ \end{array}\end{split}\]

where \(\|\mathbf{w}-\mathbf{w}_0\|_{3/2}^{3/2} = \sum_i |w_i-w_{0,i}|^{3/2}\).

The conic equivalent of (12.15) uses the power cone for the regularization term,

(12.16)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & t + \lambda \sum_i u_i & &\\ \mbox{subject to} & (t, 0.5, \mathbf{X}\mathbf{w} - \mathbf{y}) & \in & \Q_\mathrm{r}^{T+2},\\ & (u_i, 1, w_i-w_{0,i}) & \in & \POW_3^{2/3, 1/3},\ i=1,\dots,N.\\ \end{array}\end{split}\]
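As a quick numerical check of the cone membership used in (12.16) (illustrative values), a point \((u_i, 1, w_i - w_{0,i})\) belongs to \(\POW_3^{2/3,1/3}\) exactly when \(u_i \geq |w_i - w_{0,i}|^{3/2}\):

# Check that (u, 1, w) in POW_3^{2/3,1/3} is equivalent to u >= |w|^(3/2)
def in_pow_cone(u, s, w, alpha=2.0 / 3.0):
    return u >= 0 and s >= 0 and u**alpha * s**(1 - alpha) >= abs(w)

w = 0.3
print(in_pow_cone(2.0 * abs(w)**1.5, 1.0, w))   # True:  u above |w|^(3/2)
print(in_pow_cone(0.5 * abs(w)**1.5, 1.0, w))   # False: u below |w|^(3/2)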

12.2.2 Regularization in portfolio optimization

In the context of portfolio optimization, examples of regularization terms appear when we penalize trading costs or do robust optimization. Here we show an example of a trading cost penalty. For robust optimization, see Sec. 10 (Robust optimization).

Let \(\tilde{\mathbf{x}} = \mathbf{x} - \mathbf{x}_0\) denote the traded amount with respect to the initial holdings \(\mathbf{x}_0\). In constrast to Sec. 6 (Transaction costs), where costs are part of the budget constraint, making the portfolio self-financing, here we implement trading costs as penalties in the objectve function.

Suppose we have linear trading costs, expressed by the penalty \(\|\tilde{\mathbf{x}}\|_1\) and market impact costs, expressed by the penalty \(\|\tilde{\mathbf{x}}\|_{3/2}^{3/2}\). Then problem (12.8) with trading costs will be

(12.17)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & \|\mathbf{R}^\mathsf{T}\mathbf{x} - \mathbf{r}_\mathrm{bm}\|_2^2 + \lambda_1\|\tilde{\mathbf{x}}\|_1 + \lambda_2\|\tilde{\mathbf{x}}\|_{3/2}^{3/2}, & &\\ \mbox{subject to} & \mathbf{1}^\mathsf{T}\mathbf{x} & = & 1,\\ & \mathbf{x} & \geq & 0. \end{array}\end{split}\]

The conic form of (12.17) becomes

(12.18)\[\begin{split}\begin{array}{lrcl} \mbox{minimize} & t + \lambda_1 u + \lambda_2 \sum_i v_i & &\\ \mbox{subject to} & (t, 0.5, \mathbf{R}^\mathsf{T}\mathbf{x} - \mathbf{r}_\mathrm{bm}) & \in & \Q_\mathrm{r}^{T+2},\\ & \mathbf{z} - \tilde{\mathbf{x}} & \geq & 0,\\ & \mathbf{z} + \tilde{\mathbf{x}} & \geq & 0,\\ & \mathbf{1}^\mathsf{T}\mathbf{z} - u & = & 0,\\ & (v_i, 1, \tilde{x}_i) & \in & \POW_3^{2/3, 1/3},\ i=1,\dots,N,\\ & \mathbf{1}^\mathsf{T}\mathbf{x} & = & 1,\\ & \mathbf{x} & \geq & 0. \end{array}\end{split}\]

12.3 Example

In this section we present the example (12.18) as MOSEK Fusion code. We will use yearly linear return scenario data from the same eight stocks used in other examples. The benchmark will be the return series of the SPY ETF.

12.3.1 Data preparation

We generate the data the same way as in Sec. 3.4 (Example) up to the point where we have the expected yearly logarithmic return vector \(\EMean_h^\mathrm{log}\) and covariance matrix \(\ECov_h^\mathrm{log}\). Then we generate Monte Carlo scenarios using these parameters, and convert these to linear returns. The number \(T\) is the number of observations we generate.

# Generate logarithmic return observations assuming normal distribution
scenarios_log = np.random.default_rng().\
                multivariate_normal(m_log, S_log, T)

# Convert logarithmic return observations to linear return observations
scenarios_lin = np.exp(scenarios_log) - 1

Next, we center the data and separate the security data from the benchmark data. Note that we also scale the returns by the factor \(1/\sqrt{T - 1}\) coming from (12.7). We do this because the relative magnitude of the tracking error term matters compared to the magnitude of the regularization terms. We would like to control these relative magnitudes only by setting the \(\lambda_1\) and \(\lambda_2\) parameters.

# Center the return data
centered_return = scenarios_lin - scenarios_lin.mean(axis=0)

# Security return scenarios (scaled)
security_return = centered_return[:, :N] / np.sqrt(T - 1)

# Benchmark return scenarios (scaled)
benchmark_return = centered_return[:, -1] / np.sqrt(T - 1)

12.3.2 Optimization model

We start by defining the variables. The variable \(\mathbf{x}\) represents the portfolio, \(t\) the tracking error term, and \(u\) and \(\mathbf{v}\) the penalty terms. The latter three help us model the objective terms as constraints. We also add the usual budget constraint here.

# Variables
# The variable x is the fraction of holdings in each security.
# It is restricted to be positive - no short-selling.
x = M.variable("x", N, Domain.greaterThan(0.0))
xt = Expr.sub(x, x0)

# The variable t models the OLS objective function term (tracking error).
t = M.variable("t", 1, Domain.unbounded())
# The variables u and v model the regularization terms
# (transaction cost penalties).
u = M.variable("u", 1, Domain.unbounded())
v = M.variable("v", N, Domain.unbounded())

# Budget constraint
M.constraint('budget', Expr.sum(x), Domain.equalsTo(1.0))

The objective function is then the weighted sum of the above variables:

# Objective
penalty_lin = Expr.mul(lambda_1, u)
penalty_32 = Expr.mul(lambda_2, Expr.sum(v))
M.objective('obj', ObjectiveSense.Minimize,
                   Expr.add([t, penalty_lin, penalty_32]))

The constraints corresponding to the penalty terms are modeled by the following lines:

# Constraints for the penalties
norm1(M, xt, u)
M.constraint('market_impact',
             Expr.hstack(v, Expr.constTerm(N, 1.0), xt),
             Domain.inPPowerCone(1.0 / beta))

The definition of the custom function norm1 is given below, where the full code of the Fusion model is presented.

Finally, we implement the tracking error constraint:

# Constraint for the tracking error
residual = Expr.sub(Expr.mul(R.T, x), r_bm)
M.constraint('tracking_error',
             Expr.vstack(t, 0.5, residual),
             Domain.inRotatedQCone())

The complete Fusion model will then be the following code:

def absval(M, x, z):
    M.constraint(Expr.sub(z, x), Domain.greaterThan(0.0))
    M.constraint(Expr.add(z, x), Domain.greaterThan(0.0))

def norm1(M, x, t):
    z = M.variable(x.getSize(), Domain.greaterThan(0.0))
    absval(M, x, z)
    M.constraint(Expr.sub(Expr.sum(z), t), Domain.equalsTo(0.0))

def MinTrackingError(N, R, r_bm, x0, lambda_1, lambda_2, beta=1.5):
    with Model("Case study") as M:
        # Settings
        M.setLogHandler(sys.stdout)

        # Variables
        # The variable x is the fraction of holdings in each security.
        # It is restricted to be positive - no short-selling.
        x = M.variable("x", N, Domain.greaterThan(0.0))
        xt = Expr.sub(x, x0)

        # The variable t models the OLS objective function term
        # (tracking error).
        t = M.variable("t", 1, Domain.unbounded())
        # The variables u and v model the regularization terms
        # (transaction cost penalties).
        u = M.variable("u", 1, Domain.unbounded())
        v = M.variable("v", N, Domain.unbounded())

        # Budget constraint
        M.constraint('budget', Expr.sum(x), Domain.equalsTo(1.0))

        # Objective
        penalty_lin = Expr.mul(lambda_1, u)
        penalty_32 = Expr.mul(lambda_2, Expr.sum(v))
        M.objective('obj', ObjectiveSense.Minimize,
                           Expr.add([t, penalty_lin, penalty_32]))

        # Constraints for the penalties
        norm1(M, xt, u)
        M.constraint('market_impact',
                     Expr.hstack(v, Expr.constTerm(N, 1.0), xt),
                     Domain.inPPowerCone(1.0 / beta))

        # Constraint for the tracking error
        residual = Expr.sub(Expr.mul(R.T, x), r_bm)
        M.constraint('tracking_error',
                     Expr.vstack(t, 0.5, residual),
                     Domain.inRotatedQCone())

        # Create DataFrame to store the results.
        # Last security name (the SPY) is removed.
        columns = ["track_err", "lin_tcost", "mkt_tcost"] + \
                  df_prices.columns[:N].tolist()
        df_result = pd.DataFrame(columns=columns)

        # Solve optimization
        M.solve()

        # Save results
        tracking_error = t.level()[0]
        linear_tcost = u.level()[0]
        market_impact_tcost = np.sum(v.level())
        row = pd.Series(
                 [tracking_error, linear_tcost, market_impact_tcost] + \
                 list(x.level()), index=columns)
        df_result = pd.concat([df_result, pd.DataFrame([row])],
                              ignore_index=True)

        return df_result

12.3.3 Results

Here we show how to run the optimization model and present the results. First we set the number of stocks, then the penalty coefficients \(\lambda_1\) and \(\lambda_2\), and finally the initial holdings vector \(\mathbf{x}_0\), which in this case is the equally weighted portfolio.

N = 8
lambda_1 = 0.0001
lambda_2 = 0.0001
x0 = np.ones(N) / N

Then we run the optimization model:

df_result = MinTrackingError(N, security_return.T,
                             benchmark_return, x0, lambda_1, lambda_2)

The optimal portfolio composition is

\(\mathbf{x} = [0.0900, 0.1667, 0.1681, 0.1712, 0.0481, 0.1014, 0.1173, 0.1373]\).

Footnotes