7 Practical optimization

In this chapter we discuss different practical aspects of creating optimization models.

7.1 Avoiding ill-posed problems

For a well-posed continuous problem a small change in the input data should induce a small change of the optimal solution. A problem is ill-posed if small perturbations of the problem data result in arbitrarily large perturbations of the solution, or even change the feasibility status of the problem. In such cases small rounding errors, or solving the problem on a different computer can result in different or wrong solutions.

In fact, from an algorithmic point of view, even computing a wrong solution is numerically difficult for ill-posed problems. A rigorous definition of the degree of ill-posedness is possible by defining a condition number. This is an attractive, but not very practical metric, as its evaluation requires solving several auxiliary optimization problems. Therefore we only make the modest recommendations to avoid problems

  • that are nearly infeasible,
  • with constraints that are linearly dependent, or nearly linearly dependent.
Example 7.1 Near linear dependencies

To give some idea about the perils of near linear dependence, consider this pair of optimization problems:

\[\begin{split}\begin{array}{lrcl} \mbox{maximize} & x & & \\ \mbox{subject to }& 2x-y & \geq & -1,\\ & 2.0001x-y & \leq & 0, \end{array}\end{split}\]

and

\[\begin{split}\begin{array}{lrcl} \mbox{maximize} & x & & \\ \mbox{subject to }& 2x-y & \geq & -1,\\ & 1.9999x-y & \leq & 0. \end{array}\end{split}\]

The first problem has a unique optimal solution \((x,y)=(10^4,2\cdot 10^4+1)\), while the second problem is unbounded. This is caused by the fact that the hyperplanes (here: straight lines) defining the constraints in each problem are almost parallel. Moreover, we can consider another modification:

\[\begin{split}\begin{array}{lrcl} \mbox{maximize} & x & & \\ \mbox{subject to }& 2x-y & \geq & -1,\\ & 2.001x-y & \leq & 0, \end{array}\end{split}\]

with optimal solution \((x,y)=(10^3,2\cdot 10^3+1)\), which shows how a small perturbation of the coefficients induces a large change of the solution.

Typical examples of problems with nearly linearly dependent constraints are discretizations of continuous processes, where the constraints invariably become more correlated as we make the discretization finer; as such there may be nothing wrong with the discretization or problem formulation, but we should expect numerical difficulties for sufficiently fine discretizations.

One should also be careful not to specify problems whose optimal value is only achieved in the limit. A trivial example is

\[\minimize\ e^x, \ x\in\real.\]

The infimum value 0 is not attained for any finite \(x\), which can lead to unspecified behaviour of the solver. More examples of ill-posed conic problems are discussed in Fig. 7.1 and Sec. 8 (Duality in conic optimization). In particular, Fig. 7.1 depicts some troublesome problems in two dimensions. In (a) minimizing \(y \geq 1 / x\) on the nonnegative orthant is unattained approaching zero for \(x \rightarrow \infty\). In (b) the only feasible point is \((0,0)\), but objectives maximizing \(x\) are not blocked by the purely vertical normal vectors of active constraints at this point, falsely suggesting local progress could be made. In (c) the intersection of the two subsets is empty, but the distance between them is zero. Finally in (d) minimizing \(x\) on \(y \geq x^2\) is unbounded at minus infinity for \(y \rightarrow \infty\), but there is no improving ray to follow. Although seemingly unrelated, the four cases are actually primal-dual pairs; (a) with (b) and (c) with (d). In fact, the missing normal vector property in (b)—desired to certify optimality—can be attributed to (a) not attaining the best among objective values at distance zero to feasibility, and the missing positive distance property in (c)—desired to certify infeasibility—is because (d) has no improving ray.

_images/illposed.png

Fig. 7.1 Geometric representations of some ill-posed situations.

7.2 Scaling

Another difficulty encountered in practice involves models that are badly scaled. Loosely speaking, we consider a model to be badly scaled if

  • variables are measured on very different scales,
  • constraints or bounds are measured on very different scales.

For example if one variable \(x_1\) has units of molecules and another variable \(x_2\) measures temperature, we might expect an objective function such as

\[x_1 + 10^{12}x_2\]

and in finite precision the second term dominates the objective and renders the contribution from \(x_1\) insignificant and unreliable.

A similar situation (from a numerical point of view) is encountered when using penalization or big-\(M\) strategies. Assume we have a standard linear optimization problem (2.12) with the additional constraint that \(x_1=0\). We may eliminate \(x_1\) completely from the model, or we might add an additional constraint, but suppose we choose to formulate a penalized problem instead

\[\begin{split}\begin{array}{ll} \mbox{minimize} & c^T x + 10^{12}x_1\\ \mbox{subject to}& Ax = b, \\ & x \geq 0, \end{array}\end{split}\]

reasoning that the large penalty term will force \(x_1=0\). However, if \(\| c \|\) is small we have the exact same problem, namely that in finite precision the penalty term will completely dominate the objective and render the contribution \(c^T x\) insignificant or unreliable. Therefore, the penalty term should be chosen carefully.

Example 7.2 Risk-return tradeoff

Suppose we want to solve an efficient frontier variant of the optimal portfolio problem from Sec. 3.3.3 (Markowitz portfolio optimization) with an objective of the form

\[\mbox{maximize}\ \mu^Tx-\frac12\gamma x^T\Sigma x\]

where the parameter \(\gamma\) controls the tradeoff between return and risk. The user might choose a very small \(\gamma\) to discount the risk term. This, however, may lead to numerical issues caused by a very large solution. To see why, consider a simple one-dimensional, unconstrained analogue:

\[\mbox{maximize}\ x-\frac12 \gamma x^2,\ x\in\real\]

and note that the maximum is attained for \(x=1/\gamma\), which may be very large.

Example 7.3 Explicit redundant bounds

Consider again the problem (2.12), but with additional (redundant) constraints \(x_i \leq \gamma\). This is a common approach for some optimization practitioners. The problem we solve is then

\[\begin{split}\begin{array}{ll} \mbox{minimize} & c^T x\\ \mbox{subject to} & Ax = b,\\ & x \geq 0,\\ & x \leq \gamma e, \end{array}\end{split}\]

with a dual problem

\[\begin{split}\begin{array}{ll} \mbox{maximize} & b^T y - \gamma e^T z\\ \mbox{subject to} & A^T y + s - z = c,\\ & s, z \geq 0. \end{array}\end{split}\]

Suppose we do not know a-priori an upper bound on \(\| x \|_\infty\), so we choose a very large \(\gamma=10^{12}\) reasoning that this will not change the optimal solution. Note that the large variable bound becomes a penalty term in the dual problem; in finite precision such a large bound will effectively destroy accuracy of the solution.

Example 7.4 Big-M

Suppose we have a mixed integer model which involves a big-M constraint (see Sec. 9 (Mixed integer optimization)), for instance

\[a^Tx\leq b+M(1-z)\]

as in Sec. 9.1.3 (Indicator constraints). The constant \(M\) must be big enough to guarantee that the constraint becomes redundant when \(z=0\), or we risk shrinking the feasible set, perhaps to the point of creating an infeasible model. On the other hand, the user might set \(M\) to a huge, safe value, say \(M=10^{12}\). While this is mathematically correct, it will lead to a model with poorer linear relaxations and most likely increase solution time (sometimes dramatically). It is therefore very important to put some effort into estimating a reasonable value of \(M\) which is safe, but not too big.

7.3 The huge and the tiny

Some types of problems and constraints are especially prone to behave badly with very large or very small numbers. We discuss such examples briefly in this section.

Sum of squares

The sum of squares \(x_1^2+\cdots+x_n^2\) can be bounded above using the rotated quadratic cone:

\[t\geq x_1^2+\cdots+ x_n^2 \iff (\frac12,t,x)\in\Qr^n,\]

but in most cases it is better to bound the square root with a quadratic cone:

\[t'\geq \sqrt{x_1^2+\cdots+ x_n^2} \iff (t',x)\in\Q^n.\]

The latter has slightly better numerical properties: \(t'\) is roughly comparable with \(x\) and is measured on the same scale, while \(t\) grows as the square of \(x\) which will sooner lead to numerical instabilities.

Exponential cone

Using the function \(e^x\) for \(x\leq -30\) or \(x\geq 30\) would be rather questionable if \(e^x\) is supposed to represent any realistic value. Note that \(e^{30}\approx 10^{13}\) and that \(e^{30.0000001}-e^{30}\approx 10^6\), so a tiny perturbation of the exponent produces a huge change of the result. Ideally, \(x\) should have a single-digit absolute value. For similar reasons, it is even more important to have well-scaled data. For instance, in floating point arithmetic we have the “equality”

\[e^{20}+e^{-20} = e^{20}\]

since the smaller summand disappears in comparison to the other one, \(10^{17}\) times bigger.

For the same reason it is advised to replace explicit inequalities involving \(\exp(x)\) with log-sum-exp variants (see Sec. 5.2.6 (Log-sum-exp)). For example, suppose we have a constraint such as

\[t \geq e^{x_1}+e^{x_2}+e^{x_3}.\]

Then after a substitution \(t=e^{t'}\) we have

\[1 \geq e^{x_1-t'}+e^{x_2-t'}+e^{x_3-t'}\]

which has the advantage that \(t'\) is of the same order of magnitude as \(x_i\), and that the exponents \(x_i-t'\) are likely to have much smaller absolute values than simply \(x_i\).

Power cone

The power cone is not reliable when one of the exponents is very small. For example, consider the function \(f(x)=x^{0.01}\), which behaves almost like the indicator function of \(x>0\) in the sense that

\[f(0)=0,\ \mbox{and} \ f(x)>0.5\ \mbox{for}\ x>10^{-30}.\]

Now suppose \(x=10^{-20}\). Is the constraint \(f(x)>0.5\) satisfied? In principle yes, but in practice \(x\) could have been obtained as a solution to an optimization problem and it may in fact represent \(0\) up to numerical error. The function \(f(x)\) is sensitive to changes in \(x\) well below standard numerical accuracy. The point \(x=0\) does not satisfy \(f(x)>0.5\) but it is only \(10^{-30}\) from doing so.

7.4 The quality of a solution

In this section we will discuss how to validate an obtained solution. Assume we have a conic model with continuous variables only and that the optimization software has reported an optimal primal and dual solution. Given such a solution, we might ask how to verify that it is indeed feasible and optimal.

To that end, consider a simple model

(7.1)\[\begin{split}\begin{array}{lrcl} \mbox{minimize}& -x_2 & \\ \mbox{subject to}& x_1 + x_2 & \leq & 3, \\ & x_2^2 & \leq & 2, \\ & x_1, x_2 & \geq & 0, \end{array}\end{split}\]

where a solver might approximate the solution as

\[x_1 = 0.0000000000000000 \mbox{ and } x_2 = 1.4142135623730951\]

and therefore the approximate optimal objective value is

\[- 1.4142135623730951.\]

The true objective value is \(-\sqrt{2}\), so the approximate objective value is wrong by the amount

\[1.4142135623730951 - \sqrt{2} \approx 10^{-16}.\]

Most likely this difference is irrelevant for all practical purposes. Nevertheless, in general a solution obtained using floating point arithmetic is only an approximation. Most (if not all) commercial optimization software uses double precision floating point arithmetic, implying that about 16 digits are correct in the computations performed internally by the software. This also means that irrational numbers such as \(\sqrt{2}\) and \(\pi\) can only be stored accurately within 16 digits.

Verifying feasibility

A good practice after solving an optimization problem is to evaluate the reported solution. At the very least this process should be carried out during the initial phase of building a model, or if the reported solution is unexpected in some way. The first step in that process is to check that the solution is feasible; in case of the small example (7.1) this amounts to checking that:

\[\begin{split}\begin{array}{rcrcl} x_1 + x_2 & = & 0.0000000000000000 + 1.4142135623730951 & \leq & 3, \\ x_2^2 & = & 2.0000000000000004 & \leq & 2,\ (?) \\ x_1 & = & 0.0000000000000000 & \geq & 0, \\ x_2 & = & 1.4142135623730951 & \geq & 0, \\ \end{array}\end{split}\]

which demonstrates that one constraint is slightly violated due to computations in finite precision. It is up to the user to assess the significance of a specific violation; for example a violation of one unit in

\[x_1 + x_2 \leq 1\]

may or may not be more significant than a violation of one unit in

\[x_1 + x_2 \leq 10^9.\]

The right-hand side of \(10^9\) may itself be the result of a computation in finite precision, and may only be known with, say 3 digits of accuracy. Therefore, a violation of 1 unit is not significant since the true right-hand side could just as well be \(1.001 \cdot 10^9\). Certainly a violation of \(4\cdot 10^{-16}\) as in our example is within the numerical tolerance for zero and we would accept the solution as feasible. In practice it would be standard to see violations of order \(10^{-8}\).

Verifying optimality

Another question is how to verify that a feasible approximate solution is actually optimal, which is answered by duality theory, which is discussed in Sec. 2.4 (Duality in linear optimization) for linear problems and in Sec. 8 (Duality in conic optimization) in full generality. Following Sec. 8 (Duality in conic optimization) we can derive an equivalent version of the dual problem as:

\[\begin{split}\begin{array}{lrcl} \mbox{maximize} & -3 y_1 - y_2 -y_3& &\\ \mbox{subject to}& 2y_2y_3 & \geq &(y_1+1)^2, \\ & y_1 & \geq & 0. \\ \end{array}\end{split}\]

This problem has a feasible point \((y_1,y_2,y_3)=(0,1/\sqrt{2},1/\sqrt{2})\) with objective value \(-\sqrt{2}\), matching the objective value of the primal problem. By duality theory this proves that both the primal and dual solutions are optimal. Again, in finite precision the dual objective value may be reported as, for example

\[-1.4142135623730950\]

leading to a duality gap of about \(10^{-16}\) between the approximate primal and dual objective values. It is up to the user to assess the significance of any duality gap and accept or reject the solution as sufficiently good approximation of the optimal value.

7.5 Distance to a cone

The violation of a linear constraint \(a^Tx\leq b\) under a given solution \(x^*\) is obviously

\[\max(0, a^Tx^*-b).\]

It is less obvious how to assess violations of conic constraints. To this end suppose we have a convex cone \(K\). The (Euclidean) projection of a point \(x\) onto \(K\) is defined as the solution of the problem

(7.2)\[\begin{split}\begin{array}{ll} \mbox{minimize} & \|p-x\|_2 \\ \mbox{subject to} & p\in K. \end{array}\end{split}\]

We denote the projection by \(\mathrm{proj}_ K (x)\). The distance of \(x\) to \(K\)

\[\mathrm{dist}(x, K ) = \|x-\mathrm{proj}_ K (x)\|_2,\]

i.e. the objective value of (7.2), measures the violation of a conic constraint involving \(K\). Obviously

\[\mathrm{dist}(x, K ) = 0 \iff x\in K .\]

This distance measure is attractive because it depends only on the set \(K\) and not on any particular representation of \(K\) using (in)equalities.

Example 7.5 Surprisingly small distance

Let \(K=\POW_3^{0.1, 0.9}\) be the power cone defined by the inequality \(x^{0.1}y^{0.9}\geq |z|\). The point

\[x^* = (x,y,z) = (0, 10000, 500)\]

is clearly not in the cone, in fact \(x^{0.1}y^{0.9}=0\) while \(|z|=500\), so the violation of the conic inequality is seemingly large. However, the point

\[p = (10^{-8}, 10000, 500)\]

belongs to \(\POW_3^{0.1, 0.9}\) (since \((10^{-8})^{0.1}\cdot 10000^{0.9}\approx 630\)), hence

\[\mathrm{dist}(x^*, \POW_3^{0.1, 0.9}) \leq \|x^*-p\|_2 = 10^{-8}.\]

Therefore the distance of \(x^*\) to the cone is actually very small.

Distance to certain cones

For some basic cone types the projection problem (7.2) can be solved analytically. Denoting \([x]_+ = \max(0,x)\) we have that:

  • For \(x\in\real\) the projection onto the nonnegative half-line is \(\mathrm{proj}_{\real_+}(x) = [x]_+\).

  • For \((t,x)\in\real\times\real^{n-1}\) the projection onto the quadratic cone is

    \[\begin{split}\mathrm{proj}_{\Q^n}(t,x) = \begin{cases} (t,x) & \mbox{if}\ t\geq \|x\|_2, \\ \frac12\left(\frac{t}{\|x\|_2}+1\right)(\|x\|_2,x) & \mbox{if}\ -\|x\|_2< t< \|x\|_2,\\ 0 & \mbox{if}\ t\leq -\|x\|_2. \end{cases}\end{split}\]
  • For \(X=\sum_{i=1}^n \lambda_iq_iq_i^T\) the projection onto the semidefinite cone is

    \[\mathrm{proj}_{\PSD^n}(X) = \sum_{i=1}^n[\lambda_i]_+q_iq_i^T.\]