ACE 592 - Lecture 4.1

Unconstrained optimization: introduction and derivative-free methods

Author
Affiliation

Diego S. Cardoso

University of Illinois Urbana-Champaign

0.1 Course Roadmap

  1. Introduction to Scientific Computing
  2. Fundamentals of numerical methods
  3. Systems of equations
  4. Optimization
    1. Unconstrained optimization: intro
    2. Unconstrained optimization: line search and trust region methods
    3. Constrained optimization: theory and methods
    4. Constrained optimization: modeling framework
  5. Function approximation
  6. Structural estimation

0.2 Agenda

  • Today we begin exploring one of the most important topics in numerical methods: optimization
  • We start with an introduction and general aspects of numerical optimization
  • Then, we will cover basic derivative-free optimization algorithms

0.3 Main references for today

  • Miranda & Fackler (2002), Ch. 4
  • Judd (1998), Ch. 4
  • Nocedal & Wright (2006), Ch. 2
  • A.K. Dixit (1990), Optimization in Economic Theory
  • Lecture notes from Ivan Rudik (Cornell) and Florian Oswald (SciencesPo)

0.4 Optimization problems

Optimization problems are ubiquitous in economics. Examples?

. . .

  • Agent behavior
    • Consumer utility maximization
    • Firm profit maximization

. . .

  • Social planner maximizing welfare

. . .

  • Econometrics
    • Minimization of squared errors (OLS)
    • Minimization of empirical moment functions (GMM)
    • Maximization of likelihood (MLE)

. . .

In this unit, we will review the fundamentals of optimization and cover the main algorithms in numerical optimization

1 Fundamentals of unconstrained optimization

1.1 Optimization setup

We want to minimize an objective function f(x)

\min_x f(x)

where x \in \mathbb{R}^n is a real vector with n \ge 1 components and f : \mathbb{R}^n \rightarrow \mathbb{R} is smooth

  • In unconstrained optimization, we impose no restrictions on x

1.2 Optimization setup

We will focus on minimization problems

  • That’s because the optimization literature and programming packages usually frame optimization as minimization

But it’s simple to convert a maximization problem into a minimization one. How?

. . .

  • Flip the sign of f(x)!

\min_x f(x) \Leftrightarrow \max_x g(x)

where g(x) = -f(x)
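As a quick illustrative sketch (the function g is hypothetical, not from the lecture), maximizing g is the same as minimizing its negative:

```julia
# Illustrative: to maximize g(x) = -(x - 2)^2, minimize its negative instead
g(x)     = -(x - 2)^2
neg_g(x) = -g(x)        # pass neg_g to any minimizer

# a minimizer of neg_g (here x = 2) is a maximizer of g,
# and min neg_g = -(max g)
neg_g(2)  # 0.0
```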

1.3 Optimization solutions: global vs. local optima

A point x^* is a global minimizer (or optimum) of f if

f(x^*) \le f(x) \;\; \forall x \in \mathbb{R}^n

Since we can’t evaluate the function at infinite points, finding global optima is generally a difficult problem

  • We can’t be sure if the function suddenly rises between two points we evaluate
  • Most algorithms can only find local optima

1.4 Optimization solutions: global vs. local optima

A point x^* is a local minimizer (or optimum) of f if there is a neighborhood \mathcal{N} of x^* such that

f(x^*) \le f(x) \;\; \forall x \in \mathcal{N}

  • A neighborhood of x^* is an open set that contains x^*

. . .

We call x^* a strict local minimizer if f(x^*) < f(x) \;\; \forall x \in \mathcal{N}

1.5 Identifying local optima

A first approach to checking whether a point is a local optimum is to evaluate the function at all points around it

But if f is smooth, calculus makes it much easier, especially if f is twice continuously differentiable

  • We only need to evaluate the gradient \nabla f(x^*) (first derivative) and the Hessian \nabla^2 f(x^*) (second derivative)

There are four key theorems to help us¹

1.6 Identifying local optima

Theorem 1: First-Order Necessary Conditions

If x^* is a local minimizer and f is continuously differentiable in an open neighborhood of x^*, then \nabla f(x^*) = 0

. . .

  • If n = 1, this means that f^{\prime}(x^*) = 0

. . .

Note that this is only a necessary condition

So, we can look for points where the first derivative is zero (with rootfinding)

But, once we find them, we can’t be sure yet if these points indeed are local minimizers
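Since the FOC turns minimization into rootfinding, we can sketch the idea with Newton’s method applied to f′ for f(x) = 2x² − 4x (the same function minimized later in these slides); `newton_critical_point` is an illustrative helper, not part of the lecture:

```julia
# Illustrative helper: Newton's method on the derivative to find a critical point
function newton_critical_point(fp, fpp, x; tol = 1e-10, maxiter = 50)
    for _ in 1:maxiter
        step = fp(x) / fpp(x)   # Newton step on the rootfinding problem fp(x) = 0
        x -= step
        abs(step) < tol && return x
    end
    return x
end

fp(x)  = 4x - 4   # derivative of f(x) = 2x^2 - 4x
fpp(x) = 4.0      # second derivative, constant and positive
newton_critical_point(fp, fpp, 0.0)  # 1.0
```

Here fpp > 0 everywhere, so the critical point found is indeed a minimizer; in general, the second-order conditions still need to be checked.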

1.7 Identifying local optima

Theorem 2: Second-Order Necessary Conditions

If x^* is a local minimizer of f and \nabla^2 f exists and is continuous in an open neighborhood of x^*, then \nabla f(x^*) = 0 and \nabla^2 f(x^*) is positive semidefinite

. . .

  • If n = 1, this means that f^{\prime \prime}(x^*) \geq 0, i.e., the function is locally convex
    • Positive semidefiniteness is the multidimensional analogue of local convexity in 1D
    • A matrix B is positive semidefinite if p^\prime B p \geq 0 for all vectors p

. . .

  • For maximization problems, we check whether \nabla^2 f(x^*) is negative semidefinite
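One way to check these definiteness conditions numerically (a sketch; the helper names are mine) is via the Hessian’s eigenvalues, using Julia’s standard-library LinearAlgebra package:

```julia
using LinearAlgebra  # standard library: eigvals, Symmetric

# Illustrative helpers; a small tolerance guards against floating-point noise
is_positive_semidefinite(B) = all(eigvals(Symmetric(B)) .>= -1e-12)
is_negative_semidefinite(B) = all(eigvals(Symmetric(B)) .<= 1e-12)

H = [2.0 0.0; 0.0 4.0]  # Hessian of f(x) = x_1^2 + 2x_2^2 (evaluated anywhere)
is_positive_semidefinite(H)  # true, consistent with a local minimizer at the origin
```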

1.8 Identifying local optima

Theorem 3: Second-Order Sufficient Conditions

Suppose \nabla^2 f is continuous in an open neighborhood of x^* and that \nabla f(x^*) = 0 and \nabla^2 f(x^*) is positive definite. Then x^* is a strict local minimizer of f

. . .

  • Note that these are sufficient conditions, not necessary conditions
    • For example, f(x) = x^4 has a local minimizer at x^* = 0. But this point does not satisfy the 2nd-order sufficient conditions

1.9 Conditions for global optima

Another theorem can help us characterize global optima

Theorem 4: Second-Order Sufficient Conditions for Global Optima

When f is convex, any local minimizer x^* is a global minimizer of f. If in addition f is differentiable, then any point x^* at which \nabla f(x^*) = 0 is a global minimizer of f

  • If the function is globally convex, any local minimizer we find is also a global minimizer

2 Optimization algorithms

2.1 Optimization algorithms

Optimization problems have many similarities to problems we’ve already seen in the course

  • FOCs of an unconstrained optimization problem are similar to a rootfinding problem
  • FOCs of a constrained optimization problem are similar to a complementarity problem

2.2 Optimization algorithms

We typically want to find a global optimum of our objective function f

Analytic problems are often set up to have a unique minimum, so any local solver can generally find the global optimum

  • But many problems in Economics don’t satisfy the typical sufficiency conditions for a unique minimum (e.g., global strict convexity), such as
    • Games with multiple equilibria
    • Concave state transitions
    • Certain types of estimation procedures

2.3 Optimization algorithms

We make two initial distinctions between solvers (i.e., optimization algorithms):

  • Local vs global: are we finding an optimum in a local region, or globally?
    • Most solvers search local optima

. . .

  • Derivative-using vs derivative-free: do we want to use higher-order information?

. . .

In this course, we’ll focus on local solvers

  • Global solvers are usually stochastic or subdivide the search space and apply local solvers
  • Common global solvers: genetic algorithms, simulated annealing, DIRECT, and StoGO

2.4 Optimization algorithms

How do we find a local minimum?

Do we need to evaluate every single point?

. . .

Optimization algorithms typically have the following set up:

  1. Start at some x_0
  2. Generate a sequence of iterates \{x^{(k)}\}_{k=1}^\infty until the sequence “converges” with sufficient accuracy

. . .

If the function is smooth, we can take advantage of that information about the function’s shape to figure out which direction to move in next

2.5 Solution strategies: line search vs. trust region

When we move from x^{(k)} to the next iteration, x^{(k+1)}, we have to decide

  • Which direction from x^{(k)}
  • How far to go from x^{(k)}

. . .

There are two fundamental solution strategies that differ in the order of those decisions

  • Line search methods first choose a direction and then select the optimal step size

. . .

  • Trust region methods first choose a maximum step size (the trust region radius) and then select the optimal direction and length within it

. . .

We’ll see the details of each strategy later. Let’s start with two relatively simple, derivative-free methods

2.7 Derivative-free optimization: golden search

  1. Select points x_1,x_2 \in [a,b] where x_2 > x_1
  2. Compare f(x_1) and f(x_2)
  • If f(x_1) < f(x_2), replace [a,b] with [a,x_2]
  • Else, replace [a,b] with [x_1,b]
  1. Repeat until a convergence criterion is met

. . .

Replace the interval endpoint adjacent to the evaluated point with the higher function value

\rightarrow keep the point with the lower evaluated value inside the interval

\rightarrow this guarantees that a local minimum still exists in the interval

2.8 Derivative-free optimization: golden search

How do we pick x_1 and x_2?

. . .

Achievable goals for the point-selection rule:

  • The new interval’s length is the same regardless of which bound is replaced
  • Only one new function evaluation is required per iteration

. . .

There’s one point-selection rule that satisfies both goals

2.9 Derivative-free optimization: golden search

Golden search algorithm for point selection:

\begin{gather} x_i = a + \alpha_i (b-a) \notag \\ \alpha_1 = \frac{3-\sqrt{5}}{2}, \qquad \alpha_2 = \frac{\sqrt{5} - 1}{2} \end{gather}

. . .

The value of \alpha_2 \approx 0.618 is the inverse of the golden ratio, from which the algorithm gets its name
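A quick numerical check of why these parameters work (illustrative, not part of the slides): after the interval is updated, one of the new interior points coincides with an old one, so only one new function evaluation is needed per iteration:

```julia
a, b = 0.0, 1.0
alpha_1 = (3 - sqrt(5))/2
alpha_2 = (sqrt(5) - 1)/2
x_1 = a + alpha_1*(b - a)
x_2 = a + alpha_2*(b - a)

# suppose f(x_1) < f(x_2): the interval becomes [a, x_2]
new_x_2 = a + alpha_2*(x_2 - a)
new_x_2 ≈ x_1  # true: alpha_2^2 = alpha_1, so the old x_1 is reused as the new x_2
```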

2.10 Golden search in practice

Let’s write a function to perform the golden search algorithm golden_search(f, lower_bound, upper_bound), then use it to find the minimizer of f(x) = 2x^2 - 4x between -4 and 4.

Steps:

  1. Calculate points x_1,x_2 \in [a,b]
  • x_1 = a + \frac{3-\sqrt{5}}{2} (b-a) and x_2 = a + \frac{\sqrt{5} - 1}{2} (b-a)
  2. Compare f(x_1) and f(x_2)
  • If f(x_1) < f(x_2), replace [a,b] with [a,x_2]
  • Else, replace [a,b] with [x_1,b]
  3. Repeat until a convergence criterion is met

2.11 Golden search in Julia

Code
function golden_search(f, lower_bound, upper_bound; tolerance = 1e-2)
    alpha_1 = (3 - sqrt(5))/2   # golden search parameter 1
    alpha_2 = (sqrt(5) - 1)/2   # golden search parameter 2
    difference = Inf            # distance between the two interior points
    while difference > tolerance
        x_1 = lower_bound + alpha_1*(upper_bound - lower_bound)  # new x_1
        x_2 = lower_bound + alpha_2*(upper_bound - lower_bound)  # new x_2
        if f(x_1) < f(x_2)      # minimum lies in [lower_bound, x_2]
            upper_bound = x_2
        else                    # minimum lies in [x_1, upper_bound]
            lower_bound = x_1
        end
        difference = x_2 - x_1
    end
    println("Minimum is at x = $((lower_bound + upper_bound)/2).")
end;

2.12 Golden search in Julia

Code
f(x) = 2x^2 - 4x;
golden_search(f, -4, 4)
Minimum is at x = 0.996894379984858.

2.13 Golden search in Julia

2.14 Derivative-free optimization: Nelder-Mead

Golden search is nice and simple but only works in one dimension

There are several derivative free methods for minimization that work in multiple dimensions. The most commonly used one is the Nelder-Mead (NM) algorithm

. . .

NM works by first constructing a simplex: we evaluate the function at n+1 points in an n-dimensional problem

It then manipulates the highest value point, similar to golden search

2.15 Derivative-free optimization: Nelder-Mead

There are six operations:

. . .

  • Order: order the value at the vertices of the simplex f(x_1)\leq \dots \leq f(x_{n+1})

. . .

  • Centroid: calculate x_0, the centroid of all points except x_{n+1}

2.16 Derivative-free optimization: Nelder-Mead

  1. Reflection: reflect x_{n+1} through the opposite face of the simplex and evaluate the new point: x_r = x_0 + \alpha(x_0 - x_{n+1}), \alpha > 0
  • If f(x_1) \leq f(x_r) < f(x_n) (better than the second-worst point but not the best), replace x_{n+1} with x_r and restart
  • If f(x_r) < f(x_1) (the best value so far), expand
  • If f(x_r) \geq f(x_n), contract

2.17 Derivative-free optimization: Nelder-Mead

  2. Expansion: push the reflected point further in the same direction

  3. Contraction: contract the worst point toward the centroid

  • Compute x_c = x_0 + \gamma(x_0 - x_{n+1}), 0 < \gamma \leq 0.5
  • If x_c is better than the worst point, replace x_{n+1} with x_c and restart
  • Else, shrink

2.18 Derivative-free optimization: Nelder-Mead

  4. Shrinkage: shrink the simplex toward the best point
  • Replace all points but the best one with x_i = x_1 + \sigma(x_i - x_1)

Repeat these steps until a convergence criterion is met: usually, stop when the sum of the distances between the vertices and the centroid is smaller than some tolerance level
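This stopping rule can be sketched as follows (a minimal illustration; `simplex_spread` is a hypothetical name, not Optim.jl’s internal criterion):

```julia
using LinearAlgebra  # standard library: norm

# Hypothetical helper: total distance from the simplex vertices to their centroid
function simplex_spread(vertices)
    centroid = sum(vertices) / length(vertices)
    return sum(norm(v - centroid) for v in vertices)
end

verts = [[0.0, 0.0], [1.0e-9, 0.0], [0.0, 1.0e-9]]  # a nearly collapsed simplex
simplex_spread(verts) < 1e-6  # true: this simplex would satisfy the criterion
```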

2.19 Nelder-Mead illustration

Let’s see NM in action minimizing a classic function in the optimization literature: the Rosenbrock (or banana) function

f(x, y) = (a - x)^2 + b(y - x^2)^2

Its global minimizer is (x, y) = (a, a^2 )
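A quick sanity check of this claim in Julia (illustrative; the keyword arguments are my naming choice):

```julia
# Rosenbrock function with default a = 1, b = 100
rosenbrock(x, y; a = 1.0, b = 100.0) = (a - x)^2 + b*(y - x^2)^2

rosenbrock(1.0, 1.0)  # 0.0 at the global minimizer (a, a^2) = (1, 1)
rosenbrock(0.0, 0.0)  # 1.0 away from it
```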

Contour (level) plot of the Rosenbrock function with a = 1, b = 100 \rightarrow

2.20 Nelder-Mead illustration

(Here, x_1 = x and x_2 = y)

2.21 Nelder-Mead in Julia

Nelder-Mead is a pain to code efficiently: don’t spend the time doing it yourself!

For unconstrained optimization, Julia’s most mature package is Optim.jl, which includes NM

. . .

Code
using Optim;
# Define Rosenbrock function with a = 1, b = 100
f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2; 

x0 = [0.0, 0.0]; # initial guess

2.22 Nelder-Mead in Julia

Code
soln = Optim.optimize(f, x0, NelderMead())
 * Status: success

 * Candidate solution
    Final objective value:     3.525527e-09

 * Found with
    Algorithm:     Nelder-Mead

 * Convergence measures
    √(Σ(yᵢ-ȳ)²)/n ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    60
    f(x) calls:    117

2.23 Nelder-Mead in Julia

We can check the solution and the minimum value attained using

Code
soln.minimizer
2-element Vector{Float64}:
 0.9999634355313174
 0.9999315506115275
Code
soln.minimum
3.5255270584829996e-9

2.24 Nelder-Mead in Julia

If you want to check the steps, run Optim.optimize with option store_trace = true

Code
soln = Optim.optimize(f, x0, NelderMead(), Optim.Options(store_trace=true));
Optim.trace(soln)
61-element Vector{OptimizationState{Float64, NelderMead{Optim.AffineSimplexer, Optim.AdaptiveParameters}}}:
      0     9.506641e-01     4.576214e-02
 * time: 9.5367431640625e-7

      1     9.506641e-01     2.023096e-02
 * time: 6.9141387939453125e-6

      2     9.506641e-01     2.172172e-02
 * time: 7.867813110351562e-6

      3     9.262175e-01     5.243757e-02
 * time: 7.867813110351562e-6

      4     8.292372e-01     4.259749e-02
 * time: 9.775161743164062e-6

      5     8.292372e-01     4.265507e-02
 * time: 9.775161743164062e-6

      6     8.138907e-01     3.109209e-02
 * time: 9.775161743164062e-6

      7     7.569606e-01     3.215435e-02
 * time: 9.775161743164062e-6

      8     7.382898e-01     2.418419e-02
 * time: 1.0967254638671875e-5

      9     6.989376e-01     2.426367e-02
 * time: 1.0967254638671875e-5

     10     6.800415e-01     2.124416e-02
 * time: 1.0967254638671875e-5

     11     6.475000e-01     1.809652e-02
 * time: 1.1920928955078125e-5

     12     6.377042e-01     2.151280e-02
 * time: 1.1920928955078125e-5

 ⋮
     49     3.589907e-05     1.447031e-05
 * time: 2.8848648071289062e-5

     50     1.331661e-05     1.424732e-05
 * time: 2.8848648071289062e-5

     51     1.565213e-06     4.917747e-06
 * time: 2.8848648071289062e-5

     52     1.565213e-06     3.456266e-06
 * time: 2.8848648071289062e-5

     53     1.565213e-06     9.104041e-07
 * time: 2.9802322387695312e-5

     54     1.565213e-06     6.870496e-07
 * time: 2.9802322387695312e-5

     55     4.966342e-07     5.777837e-07
 * time: 2.9802322387695312e-5

     56     2.272707e-07     1.307209e-07
 * time: 2.9802322387695312e-5

     57     2.120249e-07     7.108936e-08
 * time: 3.0994415283203125e-5

     58     6.942358e-08     8.182068e-08
 * time: 3.0994415283203125e-5

     59     1.876334e-08     2.129210e-08
 * time: 3.0994415283203125e-5

     60     1.876334e-08     9.270314e-09
 * time: 3.0994415283203125e-5

2.25 Final comments on Nelder-Mead

  • Nelder-Mead is commonly used in optimization packages but it’s slow and unreliable
    • It has no real useful convergence properties
  • Only use Nelder-Mead if you’re solving a problem with derivatives that are costly to calculate or approximate

Footnotes

  1. See proofs in Nocedal & Wright (2006), Chapter 2.