# 7 Practical optimization¶

In this chapter we discuss various practical aspects of creating optimization models.

## 7.1 Conic reformulations¶

Many nonlinear constraint families were reformulated to conic form in previous chapters, and these examples are often sufficient to construct conic models for the optimization problems at hand. In case you do need to go beyond the cookbook examples, however, a few composition rules are worth knowing to guarantee correctness of your reformulation.

### 7.1.1 Nested nonlinearity¶

In representing nonlinear inequalities with composites, such as \(\exp( \| x \| )\) or \(x^4 = (x^2)^2\), it is often helpful to extract the nested nonlinearity into a separate constraint. That is, we would like to express \(f(g(x))\) as \(f(r)\) for a new artificial variable \(r\) constrained as \(r = g(x)\). The last constraint is usually not convex (unless \(g\) is linear), so we need to derive an equivalent formulation using a relaxed inequality \(r\geq g(x)\) or \(r\leq g(x)\) instead.

Given an inequality \(t \geq f(g(x))\), the nested function \(g(x)\) can be replaced by a new artificial variable \(r\) if either holds:

- \(g(x)\) is convex and \(f(r)\) is convex and nondecreasing on the image of \(g\). Then \(t\geq f(g(x))\) is equivalent to \(t\geq f(r),\ r\geq g(x)\).
- \(g(x)\) is concave and \(f(r)\) is convex and nonincreasing on the image of \(g\). Then \(t\geq f(g(x))\) is equivalent to \(t\geq f(r),\ r\leq g(x)\).

In case \(g(x)\) is affine, that is \(g(x) = a^T x + b\), it can always be extracted using \(r=a^Tx+b\).

Here are some simple substitutions.

- The inequality \(t \geq \exp(g(x))\) can be represented as \(t \geq \exp(r)\) (using an exponential cone) with \(r \geq g(x)\) for all convex functions \(g(x)\). This is intuitive as we expect solvers to push \(r \rightarrow -\infty\) in attempt to relax \(t \geq \exp(r)\), why only the lower bound on \(r\) is needed to prevent it.
- The inequality \(2st \geq g(x)^2\) can be represented as \(2st \geq r^2\) (a rotated quadratic cone) with \(r \geq g(x)\) if \(g(x)\) is nonnegative convex. This is intuitive as we expect solvers to push \(r \rightarrow 0\) in attempt to relax \(2st \geq r^2\), why only the lower bound on \(r\) is needed if \(r=g(x)\) is nonnegative.
- If we try to express \(2st \geq \exp(x)\) as \(2st \geq r\geq \exp(x)\) then we fail because the first constraint \(2st\geq r\) is not convex. Nevertheless, we may restate it as \(2st \geq \exp(x/2)^2\) and subsequently split into \(2st\geq r^2\) and \(r \geq \exp(x/2)\) following the previous example.

As in the last example, it may require some work to find the correct reformulation of a nested nonlinear constraint. For instance, suppose that for \(x,y,z\geq 0\) with \(xyz>1\) we want to write

A natural first attempt is:

that is, we try to apply the decomposition (2) with \(f(r)=1/r\) and \(g(x,y,z)=xyz-1\). The function \(f\) is indeed convex and nonincreasing on \(r>0\) and the inequality \(tr\geq 1\) is representable with a rotated quadratic cone. Unfortunately \(g\) is not concave. We know that a monomial like \(xyz\) appears in connection with the power cone, but that requires a homogenous constraint such as \(xyz\geq u^3\). This gives us an idea to try

that is \(f(r)=1/r^3\) and \(g(x,y,z)=(xyz-1)^{1/3}\). This provides the right balance: all conditions in (2) are satisfied. Introducing another variable \(u\) we get the following model:

We refer to Sec. 4 (The power cone) to verify that all the constraints above are representable using power cones. We leave it as an exercise to find other conic representations, based on other transformations of the original inequality.

### 7.1.2 Convex univariate piecewise-defined functions¶

Consider a univariate function with \(k\) pieces:

Suppose \(f(x)\) is convex (in particular each piece is convex by itself) and \(f_i(\alpha_i) = f_{i+1}(\alpha_i)\). In representing the epigraph \(t \geq f(x)\) it is helpful to proceed via the equivalent representation:

As the reformulation grows in size with the number of pieces, it is preferable to keep this number low. Trivially, if \(f_i(x) = f_{i+1}(x)\), these two pieces can be merged. Substituting \(f_i(x)\) and \(f_{i+1}(x)\) for \(\max(f_i(x), f_{i+1}(x))\) is sometimes an invariant change to facilitate this merge. For instance, it always works for affine functions \(f_i(x)\) and \(f_{i+1}(x)\). Finally, if \(f(x)\) is symmetric around some point \(\alpha\), we can represent its epigraph via a piecewise function, with only half the number of pieces, defined in terms of a new variable \(z = |x-\alpha|+\alpha\).

The Huber loss function

is convex and its epigraph \(t\geq f(x)\) has an equivalent representation

where \(t_2 \geq x_2^2\) is representable by a rotated quadratic cone. No two pieces of \(f(x)\) can be merged to reduce the size of this formulation, but the loss function does satisfy a simple symmetry; namely \(f(x) = f(-x)\). We can thus represent its epigraph by \(t \geq \hat f(z)\) and \(z \geq |x|\), where

In this particular example, however, unless the absolute value \(z\) from \(z \geq |x|\) is used elsewhere, the cost of introducing it does not outweigh the savings achieved by going from three pieces in \(x\) to two pieces in \(z\).

## 7.2 Avoiding ill-posed problems¶

For a well-posed continuous problem a small change in the input data should induce a small change of the optimal solution. A problem is *ill-posed* if small perturbations of the problem data result in arbitrarily large perturbations of the solution, or even change the feasibility status of the problem. In such cases small rounding errors, or solving the problem on a different computer can result in different or wrong solutions.

In fact, from an algorithmic point of view, even computing a *wrong* solution is numerically difficult for ill-posed problems. A rigorous definition of the degree of ill-posedness is possible by defining a *condition number*. This is an attractive, but not very practical metric, as its evaluation requires solving several auxiliary optimization problems. Therefore we only make the modest recommendations to avoid problems

- that are nearly infeasible,
- with constraints that are linearly dependent, or nearly linearly dependent.

To give some idea about the perils of near linear dependence, consider this pair of optimization problems:

and

The first problem has a unique optimal solution \((x,y)=(10^4,2\cdot 10^4+1)\), while the second problem is unbounded. This is caused by the fact that the hyperplanes (here: straight lines) defining the constraints in each problem are almost parallel. Moreover, we can consider another modification:

with optimal solution \((x,y)=(10^3,2\cdot 10^3+1)\), which shows how a small perturbation of the coefficients induces a large change of the solution.

Typical examples of problems with nearly linearly dependent constraints are discretizations of continuous processes, where the constraints invariably become more correlated as we make the discretization finer; as such there may be nothing wrong with the discretization or problem formulation, but we should expect numerical difficulties for sufficiently fine discretizations.

One should also be careful not to specify problems whose optimal value is only achieved in the limit. A trivial example is

The infimum value 0 is not attained for any finite \(x\), which can lead to unspecified behaviour of the solver. More examples of ill-posed conic problems are discussed in Fig. 7.1 and Sec. 8 (Duality in conic optimization). In particular, Fig. 7.1 depicts some troublesome problems in two dimensions. In (a) minimizing \(y \geq 1 / x\) on the nonnegative orthant is unattained approaching zero for \(x \rightarrow \infty\). In (b) the only feasible point is \((0,0)\), but objectives maximizing \(x\) are not blocked by the purely vertical normal vectors of active constraints at this point, falsely suggesting local progress could be made. In (c) the intersection of the two subsets is empty, but the distance between them is zero. Finally in (d) minimizing \(x\) on \(y \geq x^2\) is unbounded at minus infinity for \(y \rightarrow \infty\), but there is no improving ray to follow. Although seemingly unrelated, the four cases are actually primal-dual pairs; (a) with (b) and (c) with (d). In fact, the missing normal vector property in (b)—desired to certify optimality—can be attributed to (a) not attaining the best among objective values at distance zero to feasibility, and the missing positive distance property in (c)—desired to certify infeasibility—is because (d) has no improving ray.

## 7.3 Scaling¶

Another difficulty encountered in practice involves models that are badly scaled. Loosely speaking, we consider a model to be badly scaled if

- variables are measured on very different scales,
- constraints or bounds are measured on very different scales.

For example if one variable \(x_1\) has units of molecules and another variable \(x_2\) measures temperature, we might expect an objective function such as

and in finite precision the second term dominates the objective and renders the contribution from \(x_1\) insignificant and unreliable.

A similar situation (from a numerical point of view) is encountered when using *penalization* or *big*-\(M\) strategies. Assume we have a standard linear optimization problem (2.12) with the additional constraint that \(x_1=0\). We may eliminate \(x_1\) completely from the model, or we might add an additional constraint, but suppose we choose to formulate a penalized problem instead

reasoning that the large penalty term will force \(x_1=0\). However, if \(\| c \|\) is small we have the exact same problem, namely that in finite precision the penalty term will completely dominate the objective and render the contribution \(c^T x\) insignificant or unreliable. Therefore, the penalty term should be chosen carefully.

Suppose we want to solve an efficient frontier variant of the optimal portfolio problem from Sec. 3.3.3 (Markowitz portfolio optimization) with an objective of the form

where the parameter \(\gamma\) controls the tradeoff between return and risk. The user might choose a very small \(\gamma\) to discount the risk term. This, however, may lead to numerical issues caused by a very large solution. To see why, consider a simple one-dimensional, unconstrained analogue:

and note that the maximum is attained for \(x=1/\gamma\), which may be very large.

Consider again the problem (2.12), but with additional (redundant) constraints \(x_i \leq \gamma\). This is a common approach for some optimization practitioners. The problem we solve is then

with a dual problem

Suppose we do not know *a-priori* an upper bound on \(\| x \|_\infty\), so we choose a very large \(\gamma=10^{12}\) reasoning that this will not change the optimal solution. Note that the large variable bound becomes a penalty term in the dual problem; in finite precision such a large bound will effectively destroy accuracy of the solution.

Suppose we have a mixed integer model which involves a big-M constraint (see Sec. 9 (Mixed integer optimization)), for instance

as in Sec. 9.1.3 (Indicator constraints). The constant \(M\) must be big enough to guarantee that the constraint becomes redundant when \(z=0\), or we risk shrinking the feasible set, perhaps to the point of creating an infeasible model. On the other hand, the user might set \(M\) to a huge, safe value, say \(M=10^{12}\). While this is mathematically correct, it will lead to a model with poorer linear relaxations and most likely increase solution time (sometimes dramatically). It is therefore very important to put some effort into estimating a reasonable value of \(M\) which is safe, but not too big.

## 7.4 The huge and the tiny¶

Some types of problems and constraints are especially prone to behave badly with very large or very small numbers. We discuss such examples briefly in this section.

Sum of squares

The sum of squares \(x_1^2+\cdots+x_n^2\) can be bounded above using the rotated quadratic cone:

but in most cases it is better to bound the square root with a quadratic cone:

The latter has slightly better numerical properties: \(t'\) is roughly comparable with \(x\) and is measured on the same scale, while \(t\) grows as the square of \(x\) which will sooner lead to numerical instabilities.

Exponential cone

Using the function \(e^x\) for \(x\leq -30\) or \(x\geq 30\) would be rather questionable if \(e^x\) is supposed to represent any realistic value. Note that \(e^{30}\approx 10^{13}\) and that \(e^{30.0000001}-e^{30}\approx 10^6\), so a tiny perturbation of the exponent produces a huge change of the result. Ideally, \(x\) should have a single-digit absolute value. For similar reasons, it is even more important to have well-scaled data. For instance, in floating point arithmetic we have the “equality”

since the smaller summand disappears in comparison to the other one, \(10^{17}\) times bigger.

For the same reason it is advised to replace explicit inequalities involving \(\exp(x)\) with log-sum-exp variants (see Sec. 5.2.6 (Log-sum-exp)). For example, suppose we have a constraint such as

Then after a substitution \(t=e^{t'}\) we have

which has the advantage that \(t'\) is of the same order of magnitude as \(x_i\), and that the exponents \(x_i-t'\) are likely to have much smaller absolute values than simply \(x_i\).

Power cone

The power cone is not reliable when one of the exponents is very small. For example, consider the function \(f(x)=x^{0.01}\), which behaves almost like the indicator function of \(x>0\) in the sense that

Now suppose \(x=10^{-20}\). Is the constraint \(f(x)>0.5\) satisfied? In principle yes, but in practice \(x\) could have been obtained as a solution to an optimization problem and it may in fact represent \(0\) up to numerical error. The function \(f(x)\) is sensitive to changes in \(x\) well below standard numerical accuracy. The point \(x=0\) does not satisfy \(f(x)>0.5\) but it is only \(10^{-30}\) from doing so.

## 7.5 The quality of a solution¶

In this section we will discuss how to validate an obtained solution. Assume we have a conic model with continuous variables only and that the optimization software has reported an optimal primal and dual solution. Given such a solution, we might ask how to verify that it is indeed feasible and optimal.

To that end, consider a simple model

where a solver might approximate the solution as

and therefore the approximate optimal objective value is

The true objective value is \(-\sqrt{2}\), so the approximate objective value is wrong by the amount

Most likely this difference is irrelevant for all practical purposes. Nevertheless, in general a solution obtained using floating point arithmetic is only an approximation. Most (if not all) commercial optimization software uses double precision floating point arithmetic, implying that about 16 digits are correct in the computations performed internally by the software. This also means that irrational numbers such as \(\sqrt{2}\) and \(\pi\) can only be stored accurately within 16 digits.

Verifying feasibility

A good practice after solving an optimization problem is to evaluate the reported solution. At the very least this process should be carried out during the initial phase of building a model, or if the reported solution is unexpected in some way. The first step in that process is to check that the solution is feasible; in case of the small example (7.1) this amounts to checking that:

which demonstrates that one constraint is slightly violated due to computations in finite precision. It is up to the user to assess the significance of a specific violation; for example a violation of one unit in

may or may not be more significant than a violation of one unit in

The right-hand side of \(10^9\) may itself be the result of a computation in finite precision, and may only be known with, say 3 digits of accuracy. Therefore, a violation of 1 unit is not significant since the true right-hand side could just as well be \(1.001 \cdot 10^9\). Certainly a violation of \(4\cdot 10^{-16}\) as in our example is within the numerical tolerance for zero and we would accept the solution as feasible. In practice it would be standard to see violations of order \(10^{-8}\).

Verifying optimality

Another question is how to verify that a feasible approximate solution is actually optimal, which is answered by duality theory, which is discussed in Sec. 2.4 (Duality in linear optimization) for linear problems and in Sec. 8 (Duality in conic optimization) in full generality. Following Sec. 8 (Duality in conic optimization) we can derive an equivalent version of the dual problem as:

This problem has a feasible point \((y_1,y_2,y_3)=(0,1/\sqrt{2},1/\sqrt{2})\) with objective value \(-\sqrt{2}\), matching the objective value of the primal problem. By duality theory this proves that both the primal and dual solutions are optimal. Again, in finite precision the dual objective value may be reported as, for example

leading to a duality gap of about \(10^{-16}\) between the approximate primal and dual objective values. It is up to the user to assess the significance of any duality gap and accept or reject the solution as sufficiently good approximation of the optimal value.

## 7.6 Distance to a cone¶

The violation of a linear constraint \(a^Tx\leq b\) under a given solution \(x^*\) is obviously

It is less obvious how to assess violations of conic constraints. To this end suppose we have a convex cone \(K\). The (Euclidean) projection of a point \(x\) onto \(K\) is defined as the solution of the problem

We denote the projection by \(\mathrm{proj}_ K (x)\). The distance of \(x\) to \(K\)

i.e. the objective value of (7.2), measures the violation of a conic constraint involving \(K\). Obviously

This distance measure is attractive because it depends only on the set \(K\) and not on any particular representation of \(K\) using (in)equalities.

Let \(K=\POW_3^{0.1, 0.9}\) be the power cone defined by the inequality \(x^{0.1}y^{0.9}\geq |z|\). The point

is clearly not in the cone, in fact \(x^{0.1}y^{0.9}=0\) while \(|z|=500\), so the violation of the conic inequality is seemingly large. However, the point

belongs to \(\POW_3^{0.1, 0.9}\) (since \((10^{-8})^{0.1}\cdot 10000^{0.9}\approx 630\)), hence

Therefore the distance of \(x^*\) to the cone is actually very small.

Distance to certain cones

For some basic cone types the projection problem (7.2) can be solved analytically. Denoting \([x]_+ = \max(0,x)\) we have that:

For \(x\in\real\) the projection onto the nonnegative half-line is \(\mathrm{proj}_{\real_+}(x) = [x]_+\).

For \((t,x)\in\real\times\real^{n-1}\) the projection onto the quadratic cone is

\[\begin{split}\mathrm{proj}_{\Q^n}(t,x) = \begin{cases} (t,x) & \mbox{if}\ t\geq \|x\|_2, \\ \frac12\left(\frac{t}{\|x\|_2}+1\right)(\|x\|_2,x) & \mbox{if}\ -\|x\|_2< t< \|x\|_2,\\ 0 & \mbox{if}\ t\leq -\|x\|_2. \end{cases}\end{split}\]For \(X=\sum_{i=1}^n \lambda_iq_iq_i^T\) the projection onto the semidefinite cone is

\[\mathrm{proj}_{\PSD^n}(X) = \sum_{i=1}^n[\lambda_i]_+q_iq_i^T.\]