7 Practical optimization

In this chapter we discuss various practical aspects of creating optimization models.

7.1 Conic reformulations

Many nonlinear constraint families were reformulated to conic form in previous chapters, and these examples are often sufficient to construct conic models for the optimization problems at hand. In case you do need to go beyond the cookbook examples, however, a few composition rules are worth knowing to guarantee correctness of your reformulation.

7.1.1 Perspective functions

The perspective of a function f(x) is defined as sf(x/s) on s>0. From any conic representation of tf(x) one can reach a representation of tsf(x/s) by exchanging all constants k with their homogenized counterpart ks.

Example 7.1

In Sec. 5.2.6 (Log-sum-exp) it was shown that the logarithm of sum of exponentials

tlog(ex1++exn),

can be modeled as:

ui1,(ui,1,xit)Kexp, i=1,,n.

It follows immediately that its perspective

tslog(ex1/s++exn/s),

can be modeled as:

uis,(ui,s,xit)Kexp, i=1,,n.
Proof
7.1

If [tf(x)] [tc(x,r)+c0,A(x,r)+bK] for some linear functions c and A, a cone K, and artificial variable r, then [tsf(x/s)] [ts(c(x/s,r)+c0),A(x/s,r)+bK] [tc(x,r¯)+c0s,A(x,r¯)+bsK], where r¯ is taken as a new artificial variable in place of rs.

7.1.2 Composite functions

In representing composites f(g(x)), it is often helpful to move the inner function g(x) to a separate constraint. That is, we would like tf(g(x)) to be reformulated as tf(r) for a new variable r constrained as r=g(x). This reformulation works directly as stated if g is affine; that is, g(x)=aTx+b. Otherwise, one of two possible relaxations may be considered:

  1. If f is convex and g is convex, then tf(r), rg(x) is equivalent to t{f(g(x)) if g(x)F+,fmin otherwise ,

  2. If f is convex and g is concave, then tf(r), rg(x) is equivalent to t{f(g(x)) if g(x)F,fmin otherwise ,

where F+ (resp. F) is the domain on which f is nondecreasing (resp. nonincreasing), and fmin is the minimum of f, if finite, and otherwise. Note that Relaxation 1 provides an exact representation if g(x)F+ for all x in domain, and likewise for Relaxation 2 if g(x)F for all x in domain.

Proof
7.2

Consider Relaxation 1. Intuitively, if rF+, you can expect clever solvers to drive r towards in attempt to relax tf(r), and this push continues until rg(x) becomes active or else — in case g(x)F+ — until fmin is reached. On the other hand, if rF (and thus rg(x)F) then solvers will drive r towards + until fmin is reached.

Example 7.2

Here are some simple substitutions.

  • The inequality texp(g(x)) can be rewritten as texp(r) and rg(x) for all convex functions g(x). This is because f(r)=exp(r) is nondecreasing everywhere.

  • The inequality 2stg(x)2 can be rewritten as 2str2 and rg(x) for all nonnegative convex functions g(x). This is because f(r)=r2 is nondecreasing on r0.

  • The nonconvex inequality t(x21)2 can be relaxed as tr2 and rx21. This relaxation is exact on x210 where f(r)=r2 is nondecreasing, i.e., on the disjoint domain |x|1. On the domain |x|1 it represents the strongest possible convex cut given by tfmin=0.

Example 7.3

Some amount of work is generally required to find the correct reformulation of a nested nonlinear constraint. For instance, suppose that for x,y,z0 with xyz>1 we want to write

t1xyz1.

A natural first attempt is:

t1r,rxyz1,

which corresponds to Relaxation 2 with f(r)=1/r and g(x,y,z)=xyz1>0. The function f is indeed convex and nonincreasing on all of g(x,y,z), and the inequality tr1 is moreover representable with a rotated quadratic cone. Unfortunately g is not concave. We know that a monomial like xyz appears in connection with the power cone, but that requires a homogeneous constraint such as xyzu3. This gives us an idea to try

t1r3,r3xyz1,

which is f(r)=1/r3 and g(x,y,z)=(xyz1)1/3>0. This provides the right balance: all conditions for validity and exactness of Relaxation 2 are satisfied. Introducing another variable u we get the following model:

tr31,xyzu3,u(r3+13)1/3,

We refer to Sec. 4 (The power cones) to verify that all the constraints above are representable using power cones. We leave it as an exercise to find other conic representations, based on other transformations of the original inequality.

7.1.3 Convex univariate piecewise-defined functions

Consider a univariate function with k pieces:

f(x)={f1(x) if xα1,fi(x) if αi1xαi, for i=2,,k1,fk(x) if αk1x,

Suppose f(x) is convex (in particular each piece is convex by itself) and fi(αi)=fi+1(αi). In representing the epigraph tf(x) it is helpful to proceed via the equivalent representation:

t=i=1ktii=1k1fi(αi),x=i=1kxii=1k1αi,tifi(xi) for i=1,,k,x1α1,αi1xiαi for i=2,,k1,αk1xk.
Proof
7.3

In the special case when k=2 and α1=f1(α1)=f2(α1)=0 the epigraph of f is equal to the Minkowski sum {(t1+t2,x1+x2) : tifi(xi)}. In general the epigraph over two consecutive pieces is obtained by shifting them to (0,0), computing the Minkowski sum and shifting back. Finally more than two pieces can be joined by continuing this argument by induction.

As the reformulation grows in size with the number of pieces, it is preferable to keep this number low. Trivially, if fi(x)=fi+1(x), these two pieces can be merged. Substituting fi(x) and fi+1(x) for max(fi(x),fi+1(x)) is sometimes an invariant change to facilitate this merge. For instance, it always works for affine functions fi(x) and fi+1(x). Finally, if f(x) is symmetric around some point α, we can represent its epigraph via a piecewise function, with only half the number of pieces, defined in terms of a new variable z=|xα|+α.

Example 7.4 Huber loss function

The Huber loss function

f(x)={2x1 if x1,x2 if 1x1,2x1 if 1x,

is convex and its epigraph tf(x) has an equivalent representation

tt1+t2+t32,x=x1+x2+x3,t1=2x11, t2x22, t3=2x31,x11,1x21,1x3,

where t2x22 is representable by a rotated quadratic cone. No two pieces of f(x) can be merged to reduce the size of this formulation, but the loss function does satisfy a simple symmetry; namely f(x)=f(x). We can thus represent its epigraph by tf^(z) and z|x|, where

f^(z)={z2 if z1,2z1 if 1z.

In this particular example, however, unless the absolute value z from z|x| is used elsewhere, the cost of introducing it does not outweigh the savings achieved by going from three pieces in x to two pieces in z.

7.2 Avoiding ill-posed problems

For a well-posed continuous problem a small change in the input data should induce a small change of the optimal solution. A problem is ill-posed if small perturbations of the problem data result in arbitrarily large perturbations of the solution, or even change the feasibility status of the problem. In such cases small rounding errors, or solving the problem on a different computer can result in different or wrong solutions.

In fact, from an algorithmic point of view, even computing a wrong solution is numerically difficult for ill-posed problems. A rigorous definition of the degree of ill-posedness is possible by defining a condition number. This is an attractive, but not very practical metric, as its evaluation requires solving several auxiliary optimization problems. Therefore we only make the modest recommendations to avoid problems

  • that are nearly infeasible,

  • with constraints that are linearly dependent, or nearly linearly dependent.

Example 7.5 Near linear dependencies

To give some idea about the perils of near linear dependence, consider this pair of optimization problems:

maximizexsubject to 2xy1,2.0001xy0,

and

maximizexsubject to 2xy1,1.9999xy0.

The first problem has a unique optimal solution (x,y)=(104,2104+1), while the second problem is unbounded. This is caused by the fact that the hyperplanes (here: straight lines) defining the constraints in each problem are almost parallel. Moreover, we can consider another modification:

maximizexsubject to 2xy1,2.001xy0,

with optimal solution (x,y)=(103,2103+1), which shows how a small perturbation of the coefficients induces a large change of the solution.

Typical examples of problems with nearly linearly dependent constraints are discretizations of continuous processes, where the constraints invariably become more correlated as we make the discretization finer; as such there may be nothing wrong with the discretization or problem formulation, but we should expect numerical difficulties for sufficiently fine discretizations.

One should also be careful not to specify problems whose optimal value is only achieved in the limit. A trivial example is

minimize ex, xR.

The infimum value 0 is not attained for any finite x, which can lead to unspecified behaviour of the solver. More examples of ill-posed conic problems are discussed in Fig. 7.1 and Sec. 8 (Duality in conic optimization). In particular, Fig. 7.1 depicts some troublesome problems in two dimensions. In (a) minimizing y1/x on the nonnegative orthant is unattained approaching zero for x. In (b) the only feasible point is (0,0), but objectives maximizing x are not blocked by the purely vertical normal vectors of active constraints at this point, falsely suggesting local progress could be made. In (c) the intersection of the two subsets is empty, but the distance between them is zero. Finally in (d) minimizing x on yx2 is unbounded at minus infinity for y, but there is no improving ray to follow. Although seemingly unrelated, the four cases are actually primal-dual pairs; (a) with (b) and (c) with (d). In fact, the missing normal vector property in (b)—desired to certify optimality—can be attributed to (a) not attaining the best among objective values at distance zero to feasibility, and the missing positive distance property in (c)—desired to certify infeasibility—is because (d) has no improving ray.

_images/illposed.png

Fig. 7.1 Geometric representations of some ill-posed situations.

7.3 Scaling

Another difficulty encountered in practice involves models that are badly scaled. Loosely speaking, we consider a model to be badly scaled if

  • variables are measured on very different scales,

  • constraints or bounds are measured on very different scales.

For example if one variable x1 has units of molecules and another variable x2 measures temperature, we might expect an objective function such as

x1+1012x2

and in finite precision the second term dominates the objective and renders the contribution from x1 insignificant and unreliable.

A similar situation (from a numerical point of view) is encountered when using penalization or big-M strategies. Assume we have a standard linear optimization problem (2.12) with the additional constraint that x1=0. We may eliminate x1 completely from the model, or we might add an additional constraint, but suppose we choose to formulate a penalized problem instead

minimizecTx+1012x1subject toAx=b,x0,

reasoning that the large penalty term will force x1=0. However, if c is small we have the exact same problem, namely that in finite precision the penalty term will completely dominate the objective and render the contribution cTx insignificant or unreliable. Therefore, the penalty term should be chosen carefully.

Example 7.6 Risk-return tradeoff

Suppose we want to solve an efficient frontier variant of the optimal portfolio problem from Sec. 3.3.3 (Markowitz portfolio optimization) with an objective of the form

maximize μTx12γxTΣx

where the parameter γ controls the tradeoff between return and risk. The user might choose a very small γ to discount the risk term. This, however, may lead to numerical issues caused by a very large solution. To see why, consider a simple one-dimensional, unconstrained analogue:

maximize x12γx2, xR

and note that the maximum is attained for x=1/γ, which may be very large.

Example 7.7 Explicit redundant bounds

Consider again the problem (2.12), but with additional (redundant) constraints xiγ. This is a common approach for some optimization practitioners. The problem we solve is then

minimizecTxsubject toAx=b,x0,xγe,

with a dual problem

maximizebTyγeTzsubject toATy+sz=c,s,z0.

Suppose we do not know a-priori an upper bound on x, so we choose a very large γ=1012 reasoning that this will not change the optimal solution. Note that the large variable bound becomes a penalty term in the dual problem; in finite precision such a large bound will effectively destroy accuracy of the solution.

Example 7.8 Big-M

Suppose we have a mixed integer model which involves a big-M constraint (see Sec. 9 (Mixed integer optimization)), for instance

aTxb+M(1z)

as in Sec. 9.1.3 (Indicator constraints). The constant M must be big enough to guarantee that the constraint becomes redundant when z=0, or we risk shrinking the feasible set, perhaps to the point of creating an infeasible model. On the other hand, the user might set M to a huge, safe value, say M=1012. While this is mathematically correct, it will lead to a model with poorer linear relaxations and most likely increase solution time (sometimes dramatically). It is therefore very important to put some effort into estimating a reasonable value of M which is safe, but not too big.

7.4 The huge and the tiny

Some types of problems and constraints are especially prone to behave badly with very large or very small numbers. We discuss such examples briefly in this section.

Sum of squares

The sum of squares x12++xn2 can be bounded above using the rotated quadratic cone:

tx12++xn2(12,t,x)Qrn,

but in most cases it is better to bound the square root with a quadratic cone:

tx12++xn2(t,x)Qn.

The latter has slightly better numerical properties: t is roughly comparable with x and is measured on the same scale, while t grows as the square of x which will sooner lead to numerical instabilities.

Exponential cone

Using the function ex for x30 or x30 would be rather questionable if ex is supposed to represent any realistic value. Note that e301013 and that e30.0000001e30106, so a tiny perturbation of the exponent produces a huge change of the result. Ideally, x should have a single-digit absolute value. For similar reasons, it is even more important to have well-scaled data. For instance, in floating point arithmetic we have the “equality”

e20+e20=e20

since the smaller summand disappears in comparison to the other one, 1017 times bigger.

For the same reason it is advised to replace explicit inequalities involving exp(x) with log-sum-exp variants (see Sec. 5.2.6 (Log-sum-exp)). For example, suppose we have a constraint such as

tex1+ex2+ex3.

Then after a substitution t=et we have

1ex1t+ex2t+ex3t

which has the advantage that t is of the same order of magnitude as xi, and that the exponents xit are likely to have much smaller absolute values than simply xi.

Power cone

The power cone is not reliable when one of the exponents is very small. For example, consider the function f(x)=x0.01, which behaves almost like the indicator function of x>0 in the sense that

f(0)=0, and f(x)>0.5 for x>1030.

Now suppose x=1020. Is the constraint f(x)>0.5 satisfied? In principle yes, but in practice x could have been obtained as a solution to an optimization problem and it may in fact represent 0 up to numerical error. The function f(x) is sensitive to changes in x well below standard numerical accuracy. The point x=0 does not satisfy f(x)>0.5 but it is only 1030 from doing so.

7.5 Semidefinite variables

Special care should be given to models involving semidefinite matrix variables (see Sec. 6 (Semidefinite optimization)), otherwise it is easy to produce an unnecessarily inefficient model. The general rule of thumb is:

  • having many small matrix variables is more efficient than one big matrix variable,

  • efficiency drops with growing number of semidefinite terms involved in linear constraints. This can have much bigger impact on the solution time than increasing the dimension of the semidefinite variable.

Let us consider a few examples.

Block matrices

Given two matrix variables X,Y0 do not assemble them in a block matrix and write the constraints as

[X00Y]0.

This increases the dimension of the problem and, even worse, introduces unnecessary constraints for a large portion of entries of the block matrix.

Schur complement

Suppose we want to model a relaxation of a rank-one constraint:

[XxxT1]0.

where xRn and XS+n. The correct way to do this is to set up a matrix variable YS+n+1 with only a linear number of constraints:

Yi,n+1=xi,i=1,,nYn+1,n+1=1,Y0,

and use the upper-left n×n part of Y as the original X. Going the other way around, i.e. starting with a variable X and aligning it with the corner of another, bigger semidefinite matrix Y introduces n(n+1)/2 equality constraints and will quickly have formidable solution times.

Sparse LMIs

Suppose we want to model a problem with a sparse linear matrix inequality (see Sec. 6.2.1 (Linear matrix inequalities)) such as:

minimizecTxsubject toA0+i=1kAixi0,

where xRk and Ai are symmetric n×n matrices. The representation of this problem in primal form is:

minimizecTxsubject toA0+i=1kAixi=X,X0,

and the linear constraint requires a full set of n(n+1)/2 equalities, many of which are just Xk,l=0, regardless of the sparsity of Ai. However the dual problem (see Sec. 8.6 (Semidefinite duality and LMIs)) is:

maximizeA0,Zsubject toAi,Z=ci, i=1,,k,Z0,

and the number of nonzeros in linear constraints is just joint number of nonzeros in Ai. It means that large, sparse LMIs should almost always be dualized and entered in that form for efficiency reasons.

7.6 The quality of a solution

In this section we will discuss how to validate an obtained solution. Assume we have a conic model with continuous variables only and that the optimization software has reported an optimal primal and dual solution. Given such a solution, we might ask how to verify that it is indeed feasible and optimal.

To that end, consider a simple model

(7.1)minimizex2subject tox1+x23,x222,x1,x20,

where a solver might approximate the solution as

x1=0.0000000000000000 and x2=1.4142135623730951

and therefore the approximate optimal objective value is

1.4142135623730951.

The true objective value is 2, so the approximate objective value is wrong by the amount

1.414213562373095121016.

Most likely this difference is irrelevant for all practical purposes. Nevertheless, in general a solution obtained using floating point arithmetic is only an approximation. Most (if not all) commercial optimization software uses double precision floating point arithmetic, implying that about 16 digits are correct in the computations performed internally by the software. This also means that irrational numbers such as 2 and π can only be stored accurately within 16 digits.

Verifying feasibility

A good practice after solving an optimization problem is to evaluate the reported solution. At the very least this process should be carried out during the initial phase of building a model, or if the reported solution is unexpected in some way. The first step in that process is to check that the solution is feasible; in case of the small example (7.1) this amounts to checking that:

x1+x2=0.0000000000000000+1.41421356237309513,x22=2.00000000000000042, (?)x1=0.00000000000000000,x2=1.41421356237309510,

which demonstrates that one constraint is slightly violated due to computations in finite precision. It is up to the user to assess the significance of a specific violation; for example a violation of one unit in

x1+x21

may or may not be more significant than a violation of one unit in

x1+x2109.

The right-hand side of 109 may itself be the result of a computation in finite precision, and may only be known with, say 3 digits of accuracy. Therefore, a violation of 1 unit is not significant since the true right-hand side could just as well be 1.001109. Certainly a violation of 41016 as in our example is within the numerical tolerance for zero and we would accept the solution as feasible. In practice it would be standard to see violations of order 108.

Verifying optimality

Another question is how to verify that a feasible approximate solution is actually optimal, which is answered by duality theory, which is discussed in Sec. 2.4 (Duality in linear optimization) for linear problems and in Sec. 8 (Duality in conic optimization) in full generality. Following Sec. 8 (Duality in conic optimization) we can derive an equivalent version of the dual problem as:

maximize3y1y2y3subject to2y2y3(y1+1)2,y10.

This problem has a feasible point (y1,y2,y3)=(0,1/2,1/2) with objective value 2, matching the objective value of the primal problem. By duality theory this proves that both the primal and dual solutions are optimal. Again, in finite precision the dual objective value may be reported as, for example

1.4142135623730950

leading to a duality gap of about 1016 between the approximate primal and dual objective values. It is up to the user to assess the significance of any duality gap and accept or reject the solution as sufficiently good approximation of the optimal value.

7.7 Distance to a cone

The violation of a linear constraint aTxb under a given solution x is obviously

max(0,aTxb).

It is less obvious how to assess violations of conic constraints. To this end suppose we have a convex cone K. The (Euclidean) projection of a point x onto K is defined as the solution of the problem

(7.2)minimizepx2subject topK.

We denote the projection by projK(x). The distance of x to K

dist(x,K)=xprojK(x)2,

i.e. the objective value of (7.2), measures the violation of a conic constraint involving K. Obviously

dist(x,K)=0xK.

This distance measure is attractive because it depends only on the set K and not on any particular representation of K using (in)equalities.

Example 7.9 Surprisingly small distance

Let K=P30.1,0.9 be the power cone defined by the inequality x0.1y0.9|z|. The point

x=(x,y,z)=(0,10000,500)

is clearly not in the cone, in fact x0.1y0.9=0 while |z|=500, so the violation of the conic inequality is seemingly large. However, the point

p=(108,10000,500)

belongs to P30.1,0.9 (since (108)0.1100000.9630), hence

dist(x,P30.1,0.9)xp2=108.

Therefore the distance of x to the cone is actually very small.

Distance to certain cones

For some basic cone types the projection problem (7.2) can be solved analytically. Denoting [x]+=max(0,x) we have that:

  • For xR the projection onto the nonnegative half-line is projR+(x)=[x]+.

  • For (t,x)R×Rn1 the projection onto the quadratic cone is

    projQn(t,x)={(t,x)if tx2,12(tx2+1)(x2,x)if x2<t<x2,0if tx2.
  • For X=i=1nλiqiqiT the projection onto the semidefinite cone is

    projS+n(X)=i=1n[λi]+qiqiT.