5 Exponential cone optimization

So far we have discussed optimization problems involving the major “polynomial” families of cones: linear, quadratic and power cones. In this chapter we introduce a single new object, namely the three-dimensional exponential cone, together with examples and applications. The exponential cone can be used to model a variety of constraints involving exponentials and logarithms.

5.1 Exponential cone

The exponential cone is a convex subset of $\mathbb{R}^3$ defined as

(5.1) $K_{\exp} = \left\{(x_1,x_2,x_3)\ :\ x_1 \geq x_2 e^{x_3/x_2},\ x_2 > 0\right\} \cup \left\{(x_1,0,x_3)\ :\ x_1 \geq 0,\ x_3 \leq 0\right\}.$

Thus the exponential cone is the closure in $\mathbb{R}^3$ of the set of points which satisfy

(5.2) $x_1 \geq x_2 e^{x_3/x_2},\quad x_1, x_2 > 0.$

When working with logarithms, a convenient reformulation of (5.2) is

(5.3) $x_3 \leq x_2 \log(x_1/x_2),\quad x_1, x_2 > 0.$

Alternatively, one can write the same condition as

$x_1/x_2 \geq e^{x_3/x_2},\quad x_1, x_2 > 0,$

which immediately shows that $K_{\exp}$ is in fact a cone, i.e. $\alpha x \in K_{\exp}$ for $x \in K_{\exp}$ and $\alpha \geq 0$. Convexity of $K_{\exp}$ follows from the fact that the Hessian of $f(x,y) = y\exp(x/y)$, namely

$D^2(f) = e^{x/y}\begin{bmatrix} y^{-1} & -xy^{-2} \\ -xy^{-2} & x^2y^{-3} \end{bmatrix},$

is positive semidefinite for $y>0$.
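The positive semidefiniteness claim is easy to spot-check numerically. Below is a minimal sketch (assuming NumPy is available) that samples points with $y>0$ and verifies that the eigenvalues of the Hessian are nonnegative:

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(1000):
    x = rng.uniform(-2.0, 2.0)
    y = rng.uniform(0.5, 5.0)
    # Hessian of f(x, y) = y * exp(x / y), as derived above
    H = np.exp(x / y) * np.array([[1.0 / y, -x / y**2],
                                  [-x / y**2, x**2 / y**3]])
    # H has rank <= 1 with nonnegative trace, so both eigenvalues
    # are >= 0 (up to floating-point rounding)
    assert np.linalg.eigvalsh(H).min() >= -1e-8
```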

_images/expcone.png

Fig. 5.1 The boundary of the exponential cone $K_{\exp}$.

5.2 Modeling with the exponential cone

Extending the conic optimization toolbox with the exponential cone leads to new types of constraint building blocks and new types of representable sets. In this section we list the basic operations available using the exponential cone.

5.2.1 Exponential

The epigraph $t \geq e^x$ is a section of $K_{\exp}$:

(5.4) $t \geq e^x \iff (t,1,x) \in K_{\exp}.$

5.2.2 Logarithm

Similarly, we can express the hypograph $t \leq \log x$, $x \geq 0$:

(5.5) $t \leq \log x \iff (x,1,t) \in K_{\exp}.$

5.2.3 Entropy

The entropy function $H(x) = -x\log x$ can be maximized using the following representation, which follows directly from (5.3):

(5.6) $t \leq -x\log x \iff t \leq x\log(1/x) \iff (1,x,t) \in K_{\exp}.$

5.2.4 Relative entropy

The relative entropy or Kullback-Leibler divergence of two probability distributions is defined in terms of the function $D(x,y) = x\log(x/y)$. It is convex, and the inequality $t \geq D(x,y)$ is equivalent to

(5.7) $t \geq D(x,y) \iff -t \leq x\log(y/x) \iff (y,x,-t) \in K_{\exp}.$

Because of this reparametrization the exponential cone is also referred to as the relative entropy cone, leading to a class of problems known as REPs (relative entropy problems). Having the relative entropy function available makes it possible to express epigraphs of other functions appearing in REPs, for instance:

$x\log(1+x/y) = D(x+y,y) + D(y,x+y).$
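These scalar building blocks map directly onto conic modeling software. As an illustration, here is a minimal sketch (assuming CVXPY with an exponential-cone-capable solver is installed) that evaluates $D(x,y)$ for fixed data by minimizing $t$ over the single cone in (5.7); note that CVXPY's `ExpCone(a, b, c)` encodes $b\,e^{a/b} \leq c$, i.e. the triple $(c,b,a)$ in the $(x_1,x_2,x_3)$ ordering used in this chapter:

```python
import cvxpy as cp
import numpy as np

x0, y0 = 2.0, 3.0        # fixed positive data
t = cp.Variable()

# (5.7): t >= D(x0, y0)  <=>  (y0, x0, -t) in Kexp
constraints = [cp.ExpCone(-t, cp.Constant(x0), cp.Constant(y0))]
cp.Problem(cp.Minimize(t), constraints).solve()

print(t.value)                     # ≈ -0.8109
print(x0 * np.log(x0 / y0))        # the same value, computed directly
```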

5.2.5 Softplus function

In neural networks the function $f(x) = \log(1+e^x)$, known as the softplus function, is used as an analytic approximation to the rectifier activation function $r(x) = x^+ = \max(0,x)$. The softplus function is convex and we can express its epigraph $t \geq \log(1+e^x)$ by combining two exponential cones. Note that

$t \geq \log(1+e^x) \iff e^{x-t} + e^{-t} \leq 1,$

and therefore $t \geq \log(1+e^x)$ is equivalent to the following set of conic constraints:

(5.8) $u + v \leq 1,\quad (u,1,x-t) \in K_{\exp},\quad (v,1,-t) \in K_{\exp}.$
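A direct implementation of (5.8) is a good consistency check. A sketch, under the same CVXPY assumptions and `ExpCone` ordering as above:

```python
import cvxpy as cp
import numpy as np

x0 = 1.5
t, u, v = cp.Variable(), cp.Variable(), cp.Variable()

constraints = [
    u + v <= 1,
    cp.ExpCone(x0 - t, cp.Constant(1.0), u),   # (u, 1, x0 - t) in Kexp
    cp.ExpCone(-t, cp.Constant(1.0), v),       # (v, 1, -t) in Kexp
]
cp.Problem(cp.Minimize(t), constraints).solve()

print(t.value)                     # ≈ 1.7014
print(np.log(1 + np.exp(x0)))      # softplus evaluated directly
```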

5.2.6 Log-sum-exp

We can generalize the previous example to a log-sum-exp (logarithm of a sum of exponentials) expression

$t \geq \log\left(e^{x_1} + \cdots + e^{x_n}\right).$

This is equivalent to the inequality

$e^{x_1-t} + \cdots + e^{x_n-t} \leq 1,$

and so it can be modeled as follows:

(5.9) $\sum_i u_i \leq 1,\quad (u_i,1,x_i-t) \in K_{\exp},\ i=1,\ldots,n.$
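The same pattern extends verbatim to (5.9). A sketch with fixed data, again assuming CVXPY:

```python
import cvxpy as cp
import numpy as np

x0 = np.array([0.5, -1.0, 2.0])    # fixed data
t = cp.Variable()
u = cp.Variable(len(x0))

constraints = [cp.sum(u) <= 1]
for i in range(len(x0)):
    # (u_i, 1, x_i - t) in Kexp  <=>  exp(x_i - t) <= u_i
    constraints.append(cp.ExpCone(x0[i] - t, cp.Constant(1.0), u[i]))
cp.Problem(cp.Minimize(t), constraints).solve()

print(t.value)                         # ≈ 2.2413
print(np.log(np.sum(np.exp(x0))))      # log-sum-exp evaluated directly
```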

5.2.7 Log-sum-inv

The following type of bound has applications in capacity optimization for wireless network design:

$t \geq \log\left(\frac{1}{x_1} + \cdots + \frac{1}{x_n}\right),\quad x_i > 0.$

Since the logarithm is increasing, we can model this using a log-sum-exp and an exponential as:

$t \geq \log\left(e^{y_1} + \cdots + e^{y_n}\right),\quad x_i \geq e^{-y_i},\ i=1,\ldots,n.$

Alternatively, one can also rewrite the original constraint in the equivalent form

$e^{-t} \leq s \leq \left(\frac{1}{x_1} + \cdots + \frac{1}{x_n}\right)^{-1},\quad x_i > 0,$

and then model the right-hand side inequality using the technique from Sec. 3.2.7 (Harmonic mean). This approach requires only one exponential cone.

5.2.8 Arbitrary exponential

The inequality

$t \geq a_1^{x_1} a_2^{x_2} \cdots a_n^{x_n},$

where $a_i$ are arbitrary positive constants, is of course equivalent to

$t \geq \exp\left(\sum_i x_i \log a_i\right)$

and therefore to $\left(t, 1, \sum_i x_i\log a_i\right) \in K_{\exp}$.

5.2.9 Lambert W-function

The Lambert function $W:\mathbb{R}_+\to\mathbb{R}_+$ is the unique function satisfying the identity

$W(x)e^{W(x)} = x.$

It is the real branch of a more general function which appears in applications such as diode modeling. The $W$ function is concave. Although there is no explicit analytic formula for $W(x)$, the hypograph $\{(x,t)\ :\ 0 \leq x,\ 0 \leq t \leq W(x)\}$ has an equivalent description:

$x \geq te^t = te^{t^2/t},$

and so it can be modeled with a mix of exponential and quadratic cones (see Sec. 3.1.2 (Rotated quadratic cones)):

(5.10) $(x,t,u) \in K_{\exp}\ \ (x \geq t\exp(u/t)),\qquad (1/2,u,t) \in Q_r\ \ (u \geq t^2).$
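One way to test (5.10) is to maximize $t$ for a fixed $x$ and compare against a reference implementation of $W$. A sketch, assuming CVXPY and SciPy (the constraint `cp.square(t) <= u` plays the role of the rotated quadratic cone $(1/2,u,t)\in Q_r$):

```python
import cvxpy as cp
from scipy.special import lambertw

x0 = 5.0
t, u = cp.Variable(), cp.Variable()

constraints = [
    cp.ExpCone(u, t, cp.Constant(x0)),   # (x0, t, u) in Kexp: t*exp(u/t) <= x0
    cp.square(t) <= u,                   # (1/2, u, t) in Qr:  u >= t^2
]
cp.Problem(cp.Maximize(t), constraints).solve()

print(t.value)               # ≈ 1.3267
print(lambertw(x0).real)     # W(5), the same value
```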

5.2.10 Other simple sets

Here are a few more typical sets which can be expressed using the exponential and quadratic cones. The presentations should be self-explanatory; we leave the simple verifications to the reader.

Table 5.1 Sets representable with the exponential cone

| Set | Conic representation |
| --- | --- |
| $t \geq (\log x)^2$, $0 < x \leq 1$ | $(\tfrac{1}{2},t,u) \in Q_r^3$, $(x,1,u) \in K_{\exp}$, $x \leq 1$ |
| $t \leq \log\log x$, $x > 1$ | $(u,1,t) \in K_{\exp}$, $(x,1,u) \in K_{\exp}$ |
| $t \geq (\log x)^{-1}$, $x > 1$ | $(u,t,\sqrt{2}) \in Q_r^3$, $(x,1,u) \in K_{\exp}$ |
| $t \leq \sqrt{\log x}$, $x > 1$ | $(\tfrac{1}{2},u,t) \in Q_r^3$, $(x,1,u) \in K_{\exp}$ |
| $t \leq \sqrt{x\log x}$, $x > 1$ | $(x,u,\sqrt{2}\,t) \in Q_r^3$, $(x,1,u) \in K_{\exp}$ |
| $t \leq \log(1-1/x)$, $x > 1$ | $(x,u,\sqrt{2}) \in Q_r^3$, $(1-u,1,t) \in K_{\exp}$ |
| $t \geq \log(1+1/x)$, $x > 0$ | $(x+1,u,\sqrt{2}) \in Q_r^3$, $(1-u,1,-t) \in K_{\exp}$ |

5.3 Geometric programming

Geometric optimization problems form a family of optimization problems with objective and constraints in a special polynomial form. It is a rich class of problems solved by reformulating in logarithmic-exponential form, and thus a major area of application for the exponential cone $K_{\exp}$. Geometric programming is used in circuit design, chemical engineering, mechanical engineering and finance, to mention just a few applications. We refer to [BKVH07] for a survey and extensive bibliography.

5.3.1 Definition and basic examples

A monomial is a real-valued function of the form

(5.11) $f(x_1,\ldots,x_n) = cx_1^{a_1}x_2^{a_2}\cdots x_n^{a_n},$

where the exponents $a_i$ are arbitrary real numbers and $c > 0$. A posynomial (positive polynomial) is a sum of monomials. Thus the difference between a posynomial and the standard notion of a multivariate polynomial known from algebra or calculus is that (i) a posynomial can have arbitrary exponents, not just integers, but (ii) it can only have positive coefficients.

For example, the following functions are monomials (in variables $x,y,z$):

(5.12) $xy,\quad 2x^{1.5}y^{-1}x^{0.3},\quad 3xy/z,\quad 1,$

and the following are examples of posynomials:

(5.13) $2x+yz,\quad 1.5x^3z+5/y,\quad (x^2y^2+3z^{0.3})^4+1.$

A geometric program (GP) is an optimization problem of the form

(5.14)
minimize    $f_0(x)$
subject to  $f_i(x) \leq 1,\quad i=1,\ldots,m,$
            $x_j > 0,\quad j=1,\ldots,n,$

where f0,,fm are posynomials and x=(x1,,xn) is the variable vector.

A geometric program (5.14) can be modeled in exponential conic form by making the substitution

$x_j = e^{y_j},\quad j=1,\ldots,n.$

Under this substitution a monomial of the form (5.11) becomes

$ce^{a_1y_1}e^{a_2y_2}\cdots e^{a_ny_n} = \exp(a^Ty + \log c)$

for $a=(a_1,\ldots,a_n)$. Consequently, the optimization problem (5.14) takes the equivalent form

(5.15)
minimize    $t$
subject to  $\log\left(\sum_k \exp(a_{0,k}^Ty + \log c_{0,k})\right) \leq t,$
            $\log\left(\sum_k \exp(a_{i,k}^Ty + \log c_{i,k})\right) \leq 0,\quad i=1,\ldots,m,$

where $a_{i,k} \in \mathbb{R}^n$ and $c_{i,k} \in \mathbb{R}$ for all $i,k$. These are now the log-sum-exp constraints we already discussed in Sec. 5.2.6 (Log-sum-exp). In particular, the problem (5.15) is convex, as opposed to the posynomial formulation (5.14).

Example

We demonstrate this reduction on a simple example. Take the geometric problem

minimize    $x + y^2z$
subject to  $0.1\sqrt{x} + 2y^{-1} \leq 1,$
            $z^{-1} + yx^{-2} \leq 1.$

By substituting $x=e^u$, $y=e^v$, $z=e^w$ we get

minimize    $t$
subject to  $\log\left(e^u + e^{2v+w}\right) \leq t,$
            $\log\left(e^{0.5u+\log 0.1} + e^{-v+\log 2}\right) \leq 0,$
            $\log\left(e^{-w} + e^{v-2u}\right) \leq 0,$

and using the log-sum-exp reduction from Sec. 5.2.6 (Log-sum-exp) we write an explicit conic problem:

minimize    $t$
subject to  $(p_1,1,u-t),\ (q_1,1,2v+w-t) \in K_{\exp},\quad p_1+q_1 \leq 1,$
            $(p_2,1,0.5u+\log 0.1),\ (q_2,1,-v+\log 2) \in K_{\exp},\quad p_2+q_2 \leq 1,$
            $(p_3,1,-w),\ (q_3,1,v-2u) \in K_{\exp},\quad p_3+q_3 \leq 1.$

Solving this problem yields $(x,y,z) \approx (3.14, 2.43, 1.32)$.
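This whole pipeline is automated in some modeling tools. For instance, CVXPY accepts the posynomial form directly and performs the logarithmic substitution internally when `gp=True` is passed; a minimal sketch, assuming an exponential-cone-capable solver is installed:

```python
import cvxpy as cp

x = cp.Variable(pos=True)
y = cp.Variable(pos=True)
z = cp.Variable(pos=True)

prob = cp.Problem(
    cp.Minimize(x + y**2 * z),
    [0.1 * x**0.5 + 2 / y <= 1,   # 0.1*sqrt(x) + 2/y <= 1
     1 / z + y / x**2 <= 1],
)
prob.solve(gp=True)
print(x.value, y.value, z.value)   # ≈ (3.14, 2.43, 1.32)
```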

5.3.2 Generalized geometric models

In this section we briefly discuss more general types of constraints which can be modeled with geometric programs.

Monomials

If $m(x)$ is a monomial then the constraint $m(x)=c$ is equivalent to the two posynomial inequalities $m(x)c^{-1} \leq 1$ and $m(x)^{-1}c \leq 1$, so it can be expressed in the language of geometric programs. In practice it should be added to the model (5.15) as a linear constraint

$\sum_k a_ky_k = \log c.$

Monomial inequalities $m(x) \leq c$ and $m(x) \geq c$ should similarly be modeled as linear inequalities in the $y_i$ variables.

In a similar vein, if $f(x)$ is a posynomial and $m(x)$ is a monomial then $f(x) \leq m(x)$ is still a posynomial inequality because it can be written as $f(x)m(x)^{-1} \leq 1$.

It also means that we can add lower and upper variable bounds: $0 < c_1 \leq x \leq c_2$ is equivalent to $x^{-1}c_1 \leq 1$ and $c_2^{-1}x \leq 1$.

Products and positive powers

Expressions involving products and positive powers (possibly iterated) of posynomials can again be modeled with posynomials. For example, a constraint such as

$\left((xy^2+z)^{0.3}+y\right)\left(1/x+1/y\right)^{2.2} \leq 1$

can be replaced with

$xy^2+z \leq t,\quad t^{0.3}+y \leq u,\quad x^{-1}+y^{-1} \leq v,\quad uv^{2.2} \leq 1.$

Other transformations and extensions

  • If $f_1,f_2$ are already expressed by posynomials then $\max\{f_1(x),f_2(x)\} \leq t$ is clearly equivalent to $f_1(x) \leq t$ and $f_2(x) \leq t$. Hence we can add the maximum operator to the list of building blocks for geometric programs.

  • If $f,g$ are posynomials, $m$ is a monomial and we know that $m(x) \geq g(x)$ then the constraint $\frac{f(x)}{m(x)-g(x)} \leq t$ is equivalent to $t^{-1}f(x)m(x)^{-1} + g(x)m(x)^{-1} \leq 1$.

  • The objective function of a geometric program (5.14) can be extended to include other terms, for example:

    minimize $f_0(x) + \sum_k \log m_k(x)$

    where $m_k(x)$ are monomials. After the change of variables $x=e^y$ the logarithmic terms become linear (up to an additive constant), with $b$ collecting the exponent vectors of the $m_k$, and we get a slightly modified version of (5.15):

    minimize    $t + b^Ty$
    subject to  $\sum_k \exp(a_{0,k}^Ty + \log c_{0,k}) \leq t,$
                $\log\left(\sum_k \exp(a_{i,k}^Ty + \log c_{i,k})\right) \leq 0,\quad i=1,\ldots,m,$

    (note the lack of one logarithm) which can still be expressed with exponential cones.

5.3.3 Geometric programming case studies

Frobenius norm diagonal scaling

Suppose we have a matrix $M \in \mathbb{R}^{n\times n}$ and we want to rescale the coordinate system using a diagonal matrix $D = \mathrm{Diag}(d_1,\ldots,d_n)$ with $d_i > 0$. In the new basis the linear transformation given by $M$ will now be described by the matrix $DMD^{-1} = (d_iM_{ij}d_j^{-1})_{i,j}$. To choose a $D$ which leads to a “small” rescaling we can for example minimize the Frobenius norm

$\left\|DMD^{-1}\right\|_F^2 = \sum_{i,j}\left((DMD^{-1})_{ij}\right)^2 = \sum_{i,j} M_{ij}^2 d_i^2 d_j^{-2}.$

Minimizing the last sum is an example of a geometric program with variables di (and without constraints).

Maximum likelihood estimation

Geometric programs appear naturally in connection with maximum likelihood estimation of parameters of random distributions. Consider a simple example. Suppose we have two biased coins, with head probabilities $p$ and $2p$, respectively. We repeatedly toss both coins and count the total number of heads. Given that, in the long run, we observed $i$ heads $n_i$ times for $i=0,1,2$, estimate the value of $p$.

The probability of obtaining the given outcome equals

$\binom{n_0+n_1+n_2}{n_0}\binom{n_1+n_2}{n_1}\left(p\cdot 2p\right)^{n_2}\left(p(1-2p)+2p(1-p)\right)^{n_1}\left((1-p)(1-2p)\right)^{n_0},$

and, up to constant factors, maximizing that expression is equivalent to solving the problem

maximize    $p^{2n_2+n_1}s^{n_1}q^{n_0}r^{n_0}$
subject to  $q \leq 1-p,$
            $r \leq 1-2p,$
            $s \leq 3-4p,$

where we used $p(1-2p)+2p(1-p) = p(3-4p)$, or, as a geometric problem:

minimize    $p^{-2n_2-n_1}s^{-n_1}q^{-n_0}r^{-n_0}$
subject to  $q+p \leq 1,$
            $r+2p \leq 1,$
            $\tfrac{1}{3}s+\tfrac{4}{3}p \leq 1.$

For example, if $(n_0,n_1,n_2) = (30,53,16)$ then the above problem solves with $p = 0.29$.
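Because this particular likelihood depends on a single scalar parameter, the geometric-programming answer can be cross-checked by a direct one-dimensional maximization. A sketch, assuming SciPy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

n0, n1, n2 = 30, 53, 16

def neg_log_likelihood(p):
    # log of p^(2*n2+n1) * (3-4p)^n1 * (1-p)^n0 * (1-2p)^n0, negated;
    # this is the logarithm of the objective of the program above
    return -((2 * n2 + n1) * np.log(p) + n1 * np.log(3 - 4 * p)
             + n0 * np.log(1 - p) + n0 * np.log(1 - 2 * p))

res = minimize_scalar(neg_log_likelihood,
                      bounds=(1e-6, 0.5 - 1e-6), method="bounded")
print(res.x)   # ≈ 0.29
```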

An Olympiad problem

The 26th Vojtěch Jarník International Mathematical Competition, Ostrava 2016. Let $a,b,c$ be positive real numbers with $a+b+c=1$. Prove that

$\left(\frac{1}{a}+\frac{1}{bc}\right)\left(\frac{1}{b}+\frac{1}{ac}\right)\left(\frac{1}{c}+\frac{1}{ab}\right) \geq 1728.$

Using the tricks introduced in Sec. 5.3.2 (Generalized geometric models) we formulate this problem as a geometric program:

minimize    $pqr$
subject to  $p^{-1}a^{-1} + p^{-1}b^{-1}c^{-1} \leq 1,$
            $q^{-1}b^{-1} + q^{-1}a^{-1}c^{-1} \leq 1,$
            $r^{-1}c^{-1} + r^{-1}a^{-1}b^{-1} \leq 1,$
            $a+b+c \leq 1.$

Unsurprisingly, the optimal value of this program is $1728$, achieved for $(a,b,c,p,q,r) = \left(\tfrac{1}{3},\tfrac{1}{3},\tfrac{1}{3},12,12,12\right)$.

Power control and rate allocation in wireless networks

We consider a basic wireless network power control problem. In a wireless network with $n$ logical transmitter-receiver pairs, if the power output of transmitter $j$ is $p_j$ then the power received by receiver $i$ is $G_{ij}p_j$, where $G_{ij}$ models path gain and fading effects. If the $i$-th receiver's own noise is $\sigma_i$ then the signal-to-interference-plus-noise ratio (SINR) of receiver $i$ is given by

(5.16) $s_i = \frac{G_{ii}p_i}{\sigma_i + \sum_{j\neq i}G_{ij}p_j}.$

Maximizing the minimal SINR over all receivers ($\max\min_i s_i$), subject to bounded power output of the transmitters, is equivalent to the geometric program with variables $p_1,\ldots,p_n,t$:

(5.17)
minimize    $t^{-1}$
subject to  $p_{\min} \leq p_j \leq p_{\max},\quad j=1,\ldots,n,$
            $t\left(\sigma_i + \sum_{j\neq i}G_{ij}p_j\right)G_{ii}^{-1}p_i^{-1} \leq 1,\quad i=1,\ldots,n.$

In the low-SNR regime the problem of system rate maximization is approximated by the problem of maximizing $\sum_i \log s_i$, or equivalently minimizing $\prod_i s_i^{-1}$. This is a geometric problem with variables $p_i, s_i$:

(5.18)
minimize    $s_1^{-1}\cdots s_n^{-1}$
subject to  $p_{\min} \leq p_j \leq p_{\max},\quad j=1,\ldots,n,$
            $s_i\left(\sigma_i + \sum_{j\neq i}G_{ij}p_j\right)G_{ii}^{-1}p_i^{-1} \leq 1,\quad i=1,\ldots,n.$

For more information and examples see [BKVH07].

5.4 Exponential cone case studies

In this section we introduce some practical optimization problems where the exponential cone comes in handy.

5.4.1 Risk parity portfolio

Consider a simple version of the Markowitz portfolio optimization problem introduced in Sec. 3.3.3 (Markowitz portfolio optimization), where we simply ask to minimize the risk $r(x) = \sqrt{x^T\Sigma x}$ of a fully-invested long-only portfolio:

(5.19)
minimize    $\sqrt{x^T\Sigma x}$
subject to  $\sum_{i=1}^n x_i = 1,$
            $x_i \geq 0,\quad i=1,\ldots,n,$

where $\Sigma$ is a symmetric positive definite covariance matrix. We can derive from the first-order optimality conditions that the solution to (5.19) satisfies $\frac{\partial r}{\partial x_i} = \frac{\partial r}{\partial x_j}$ whenever $x_i,x_j > 0$, i.e. the marginal risk contributions of positively invested assets are equal. In practice this often leads to concentrated portfolios, whereas it would benefit diversification to consider portfolios where all assets have the same total contribution to risk:

(5.20) $x_i\frac{\partial r}{\partial x_i} = x_j\frac{\partial r}{\partial x_j},\quad i,j=1,\ldots,n.$

We call (5.20) the risk parity condition. It indeed models equal risk contribution from all the assets because, as one can easily check, $\frac{\partial r}{\partial x_i} = \frac{(\Sigma x)_i}{\sqrt{x^T\Sigma x}}$ and

$r(x) = \sum_i x_i\frac{\partial r}{\partial x_i}.$

Risk parity portfolios satisfying condition (5.20) can be found with an auxiliary optimization problem:

(5.21)
minimize    $\sqrt{x^T\Sigma x} - c\sum_i \log x_i$
subject to  $x_i > 0,\quad i=1,\ldots,n,$

for any $c > 0$. More precisely, the gradient of the objective function in (5.21) is zero when $\frac{\partial r}{\partial x_i} = c/x_i$ for all $i$, implying that the parity condition (5.20) holds. Since (5.20) is scale-invariant, we can rescale any solution of (5.21) and get a fully-invested risk parity portfolio with $\sum_i x_i = 1$.

The conic form of problem (5.21) is:

(5.22)
minimize    $t - ce^Ts$
subject to  $(t, \Sigma^{1/2}x) \in Q^{n+1}\quad (t \geq \sqrt{x^T\Sigma x}),$
            $(x_i, 1, s_i) \in K_{\exp}\quad (s_i \leq \log x_i).$
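Below is a sketch of (5.21) in CVXPY with a small made-up covariance matrix ($\sqrt{x^T\Sigma x}$ is written as $\|L^Tx\|_2$ for a Cholesky factor $\Sigma = LL^T$, and the `log` atom is handled through the exponential cone). After rescaling, the risk contributions $x_i\,\partial r/\partial x_i$ come out equal:

```python
import cvxpy as cp
import numpy as np

Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.08, 0.03],
                  [0.01, 0.03, 0.12]])    # made-up covariance matrix
L = np.linalg.cholesky(Sigma)             # Sigma = L @ L.T
n, c = 3, 1.0

x = cp.Variable(n)
# (5.21): minimize sqrt(x' Sigma x) - c * sum_i log(x_i)
objective = cp.Minimize(cp.norm(L.T @ x, 2) - c * cp.sum(cp.log(x)))
cp.Problem(objective).solve()

x0 = x.value / x.value.sum()              # rescale: fully invested portfolio
risk = np.sqrt(x0 @ Sigma @ x0)
print(x0 * (Sigma @ x0) / risk)           # equal risk contributions (5.20)
```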

5.4.2 Entropy maximization

A general entropy maximization problem has the form

(5.23)
maximize    $-\sum_i p_i\log p_i$
subject to  $\sum_i p_i = 1,$
            $p_i \geq 0,$
            $p \in I,$

where $I$ defines additional constraints on the probability distribution $p$ (these are known as prior information). In the absence of complete information about $p$, the maximum entropy principle of Jaynes posits choosing the distribution which maximizes uncertainty, that is entropy, subject to what is known. Practitioners think of the solution to (5.23) as the most random or most conservative of the distributions consistent with $I$.

Maximization of the entropy function $H(x) = -x\log x$ was explained in Sec. 5.2 (Modeling with the exponential cone).

Often one has an a priori distribution $q$, and one tries to minimize the distance between $p$ and $q$ while remaining consistent with $I$. In this case it is standard to minimize the Kullback-Leibler divergence

$D_{KL}(p\,\|\,q) = \sum_i p_i\log(p_i/q_i),$

which leads to the optimization problem

(5.24)
minimize    $\sum_i p_i\log(p_i/q_i)$
subject to  $\sum_i p_i = 1,$
            $p_i \geq 0,$
            $p \in I.$
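Modeling packages typically expose $x\log(x/y)$ as a ready-made atom built on the exponential cone. A sketch of (5.24) in CVXPY, with a made-up prior $q$ and a moment constraint standing in for the prior information $I$:

```python
import cvxpy as cp
import numpy as np

q = np.array([0.4, 0.3, 0.2, 0.1])      # a priori distribution
vals = np.array([1.0, 2.0, 3.0, 4.0])   # outcome values

p = cp.Variable(4, nonneg=True)
# rel_entr(p, q) = p * log(p / q) elementwise; the sum is D_KL(p || q)
objective = cp.Minimize(cp.sum(cp.rel_entr(p, q)))
constraints = [cp.sum(p) == 1,
               vals @ p == 2.5]          # example of prior information I
cp.Problem(objective, constraints).solve()
print(p.value)
```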

5.4.3 Hitting time of a linear system

Consider a linear dynamical system

(5.25) $x'(t) = Ax(t),$

where we assume for simplicity that $A = \mathrm{Diag}(a_1,\ldots,a_n)$ with $a_i < 0$, with initial condition $x(0)_i = x_i$. The resulting trajectory $x(t) = x(0)\exp(At)$ converges to $0$ and one can ask, for instance, for the time it takes to approach the limit up to distance $\varepsilon$. The resulting optimization problem is

minimize    $t$
subject to  $\sqrt{\textstyle\sum_i \left(x_i\exp(a_it)\right)^2} \leq \varepsilon,$

with the following conic form, where the variables are $t, q_1,\ldots,q_n$:

minimize    $t$
subject to  $(\varepsilon, x_1q_1, \ldots, x_nq_n) \in Q^{n+1},$
            $(q_i, 1, a_it) \in K_{\exp},\quad i=1,\ldots,n.$

See Fig. 5.2 for an example. Other criteria for the target set of the trajectories are also possible. For example, polyhedral constraints

$c^Tx \leq d,\quad c \in \mathbb{R}^n_+,\ d \in \mathbb{R}_+,$

are also expressible in exponential conic form for starting points $x(0) \in \mathbb{R}^n_+$, since they correspond to log-sum-exp constraints of the form

$\log\left(\sum_i \exp\left(a_it + \log(c_ix_i)\right)\right) \leq \log d.$

For a robust version involving uncertainty on A and x(0) see [CS14].

_images/dyn-sys-convergence.png

Fig. 5.2 With $A = \mathrm{Diag}(-0.3, -0.06)$ and starting point $x(0) = (2.2, 1.3)$ the trajectory reaches distance $\varepsilon = 0.5$ from the origin at time $t \approx 15.936$.
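In the diagonal case the distance to the origin is strictly decreasing in $t$, so the conic solution can be cross-checked with a scalar root-finder. A sketch, assuming SciPy, reproducing the numbers from Fig. 5.2:

```python
import numpy as np
from scipy.optimize import brentq

a = np.array([-0.3, -0.06])     # A = Diag(-0.3, -0.06)
x0 = np.array([2.2, 1.3])       # starting point x(0)
eps = 0.5

def gap(t):
    # distance of x(t) = x(0) * exp(a * t) to the origin, minus eps
    return np.linalg.norm(x0 * np.exp(a * t)) - eps

print(brentq(gap, 0.0, 100.0))  # ≈ 15.936
```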

5.4.4 Logistic regression

Logistic regression is a technique for training a binary classifier. We are given a training set of examples $x_1,\ldots,x_n \in \mathbb{R}^d$ together with labels $y_i \in \{0,1\}$. The goal is to train a classifier capable of assigning new data points to either class 0 or 1. Specifically, the labels are assigned by computing

(5.26) $h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)}$

and choosing label $y=0$ if $h_\theta(x) < \frac{1}{2}$ and $y=1$ if $h_\theta(x) \geq \frac{1}{2}$. Here $h_\theta(x)$ is interpreted as the probability that $x$ belongs to class 1. The optimal parameter vector $\theta$ should be learned from the training set, so as to maximize the likelihood function:

$\prod_i h_\theta(x_i)^{y_i}\left(1-h_\theta(x_i)\right)^{1-y_i}.$

By taking logarithms and adding regularization with respect to $\theta$ we reach an unconstrained optimization problem

(5.27) $\mathop{\mathrm{minimize}}_{\theta\in\mathbb{R}^d}\quad \lambda\|\theta\|_2 + \sum_i\left(-y_i\log(h_\theta(x_i)) - (1-y_i)\log(1-h_\theta(x_i))\right).$

Problem (5.27) is convex, and can be written more explicitly as

minimize    $\sum_i t_i + \lambda r$
subject to  $t_i \geq -\log(h_\theta(x_i)) = \log(1+\exp(-\theta^Tx_i))\ $ if $y_i = 1,$
            $t_i \geq -\log(1-h_\theta(x_i)) = \log(1+\exp(\theta^Tx_i))\ $ if $y_i = 0,$
            $r \geq \|\theta\|_2,$

involving softplus-type constraints (see Sec. 5.2 (Modeling with the exponential cone)) and a quadratic cone. See Fig. 5.3 for an example.

_images/logistic-regression.png

Fig. 5.3 Logistic regression example with no, medium and strong regularization (small, medium, large $\lambda$). The two-dimensional dataset was converted into a feature vector $x \in \mathbb{R}^{28}$ using monomial coordinates of degrees at most 6. Without regularization we get obvious overfitting.
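In conic modeling tools the softplus construction of Sec. 5.2.5 is usually available as a ready-made atom. A sketch of (5.27) on made-up synthetic data, assuming CVXPY (`cp.logistic(z)` is exactly $\log(1+e^z)$ and compiles to two exponential cones per term):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 5, 0.1
X = rng.standard_normal((n, d))          # training examples x_i (rows)
w_true = rng.standard_normal(d)
y = (X @ w_true + 0.3 * rng.standard_normal(n) > 0).astype(float)

theta = cp.Variable(d)
z = X @ theta                            # z_i = theta' x_i
# (5.27): the i-th term is logistic(-z_i) if y_i = 1, logistic(z_i) if y_i = 0
loss = cp.sum(cp.multiply(y, cp.logistic(-z))
              + cp.multiply(1 - y, cp.logistic(z)))
cp.Problem(cp.Minimize(loss + lam * cp.norm(theta, 2))).solve()
print(theta.value)
```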