Subsections


Basics on Continuous Optimization

Most of the problems solved in this thesis can be formulated as optimization problems. In this section, we give some general points on optimization. We also give details on the most used optimization algorithms in this document. The statistical grounds of certain type of cost function will be explained in section 3.1.

Generalities on Optimization

Unconstrained optimization.

In its general form, an unconstrained minimization problem has the following form (25,52,154,136,22):

$\displaystyle \min_{\mathbf{x}} f(\mathbf{x}).$ (2.17)

The function $ f$ is a scalar-valued function named the cost function or the criterion. It is assumed that the cost function is defined on  $ \mathbb{R}^n$. In some cases later explained, $ f$ can be a vector-valued function instead of a scalar-valued one. Note that we only consider the case of the minimization of the cost function since the problem of maximization can easily be turned into a minimization problem (by taking the negative of the cost function).

Constrained optimization.

A constrained minimization problem is similar to an unconstrained minimization problem except that supplemental conditions on the solution must be satisfied. Such a problem has the following form:

\begin{displaymath}\begin{array}{rl}
 \displaystyle \min_{\mathbf{x}} & f(\mathbf{x})  
 \textrm{subject to} & \mathbf{x} \in C,
 \end{array}\end{displaymath} (2.18)

where $ C$ is a set defining the constraints. These constraints can take many forms. For instance, it can be inequality constraints such as $ x_i \geq 0$, linear equality constraints such as $ \mathsf{A}\mathbf{x} = \mathbf{b}$ for some matrix $ \mathsf{A}$ and some vector  $ \mathbf{b}$. It can also be more complex constraints such as $ g(\mathbf{x}) \leq 0$ where $ g$ is some function. These are just a few examples.

In this section we focus on unconstrained minimization problem since it is the main tool used in this thesis.

Optimization algorithms.

An optimization algorithm is an algorithm that provides a solution to an optimization problem. There exists a vast amount of optimization algorithms. The algorithm used to solve an optimization problem depends on the properties of the cost function and of the constraints. For instance, the optimization algorithm depends on the differentiability of the cost function. It can also exploit a particular form of the cost function (for instance, a cost function defined as a sum of squares).

Global optimization.

The ultimate goal of a minimization problem is to find a global minimum of the cost function, i.e. a vector  $ \mathbf{x}^\star$ such that:

$\displaystyle \forall \mathbf{x} \in \Omega, f(\mathbf{x}^\star) \leq f(\mathbf{x}).$ (2.19)

The solutions proposed by optimization algorithms are only guaranteed to be local minima. A vector  $ \mathbf{x}$ is said to be a local minimizer of the function $ f$ if the following statement is true:

$\displaystyle \exists r \in \mathbb{R}_+^\star : \forall \mathbf{y} : \Vert\mathbf{x} - \mathbf{y}\Vert \leq r : f(\mathbf{x}) \leq f(\mathbf{y}).$ (2.20)

When the function $ f$ is differentiable, a necessary condition for  $ \mathbf{x}$ to be a local minimizer is:

$\displaystyle \boldsymbol{\nabla}f(\mathbf{x}) = \mathbf{0}.$ (2.21)

A local minimum of the function $ f$ is not necessarily a global minimum of the function. Indeed, a cost function may have several local minima. Deciding whether a local minima is a global minimum or not is an undecidable problem.

Convex functions and global optimization.

Convexity is a particularly interesting property for a function. Indeed, if a cost function is convex then it has only one minimum which is therefore a global minimum of the function. Note that the converse is not true; there exist non-convex functions which have only one (global) minimum.

Convex function.

Let $ f : \Omega \subset \mathbb{R}^n \rightarrow \mathbb{R}$ be a scalar-valued function. Let $ \mathbf{x} \in \Omega$ and $ \mathbf{y} \in \Omega$. A scalar function $ f$ is said to be convex if the following condition holds for all $ \alpha \in [0,1]$:

$\displaystyle f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y}).$ (2.22)

Alternatively, if the function $ f$ is twice differentiable, then $ f$ is convex if the Hessian matrix $ \boldsymbol{\mathsf{H}}_f$ is positive definite.

A strictly convex function is a function that satisfies equation (2.22) with the $ \leq$ sign replaced by $ <$. Note that the Hessian matrix of a twice differentiable strictly convex function is not necessarily strictly positive definite: it might only be positive definite.


Optimization Algorithms

In this section, we present the unconstrained minimization algorithms mostly used in this thesis. The choice of an optimization algorithm depends on the properties of the cost function to be minimized. These algorithms can be classified according to several criterion. Table 2.1 summarizes the optimization algorithm reviewed in this section and the properties that must be satisfied by the cost function.

This section is organized as follows: we go from the less restrictive algorithms (in terms of conditions on the cost function) to most restrictive ones (i.e. the more specific). Two sections not dedicated to a specific optimization algorithm are intercalated in this section: section 2.2.2.1 gives some general points on iterative optimization algorithms, and section 2.2.2.6 talks about the normal equations.


Table 2.1: Optimization algorithms reviewed in section 2.2.2. They are ordered from the more generic to the less generic. `Abb.' in the second column stands for `Abbreviation'.
Name
Abb.
Cost function
Sec Downhill simplex
Gradient descent GD Once differentiable 2.2.2.3
Newton $ \times$ Twice differentiable 2.2.2.4
Gauss-Newton GN Sum of squared functions $ \rightarrow$ non-linear least-squares (NLS) 2.2.2.5
Levenberg-Marquardt LM NLS 2.2.2.7
Cholesky factorization $ \times$ Linear system of equations $ \mathsf{A}\mathbf{x} = \mathbf{b}$ (with $ \mathsf{A}$ positive definite) and, coincidently, sum of squared linear functions $ \rightarrow$ linear least-squares (LLS) 2.2.2.8
QR factorization $ \times$ LLS 2.2.2.9
Singular Value Decomposition SVD LLS 2.2.2.10
Iteratively Reweighted Least-Squares IRLS Weighted sum of squared functions with variable weights 2.2.2.11
Golden section search $ \times$ Mono-variate, scalar-valued, uni-modal 2.2.2.12



Iterative Optimization Algorithms

Before talking about the details, let us start with some generalities on the optimization algorithms. Most of these algorithms are iterative algorithms. This means that they start from an initial value  $ \mathbf{x}^{(0)}$ which is then iteratively updated. Therefore, an iterative optimization algorithm builds a sequence $ \{ \mathbf{x}^{(i)}\}_{i=1}^{i^\star}$ that may converge towards a local minimum of the cost function, i.e. $ \mathbf{x}^{(i^\star)}$ may be close to a local minimum of the solution. With a cost function that has several minima, the initial value  $ \mathbf{x}^{(0)}$ determines which minimum is found. This initial value is thus extremely important.

An important aspect of iterative optimization algorithm is the stopping criterion. For the algorithm to be valid, the stopping criterion must be a condition that will be satisfied in a finite and reasonable amount of time. A stopping criterion is usually the combination of several conditions. For instance, one can decide to stop the algorithm when the change in the solution becomes very small, i.e. when $ \Vert \mathbf{x}^{(i+1)} - \mathbf{x}^{(i)}\Vert < \varepsilon$ with  $ \varepsilon$ a small fixed constant (for instance, $ \varepsilon < 10^{-6}$). Of course, this approach does not account for the scale of the solution. One may replace this condition using, for instance, a relative change, i.e. $ \Vert \mathbf{x}^{(i+1)} - \mathbf{x}^{(i)}\Vert / \Vert\mathbf{x}^{(i)}\Vert < \varepsilon$. Another standard stopping criterion is the change in the cost function value: $ \vert f(\mathbf{x}^{(i+1)}) - f(\mathbf{x}^{(i)})\vert \leq \varepsilon$. A maximal number of iterations must also be determined to guarantee that the optimization algorithm will finish in a finite amount of time. Indeed, the previous two stopping criterion may never be satisfied. In the rest of this section, we will denote STOP-CRIT the stopping criterion. In addition to what was just presented, the content of STOP-CRIT will be further detailed depending on the optimization algorithm.


Downhill Simplex

The downhill simplex method is an optimization algorithm due to (134). It is a heuristic that does not make any assumption on the cost function to minimize. In particular, the cost function must not satisfy any condition of differentiability. It relies on the use of simplices, i.e. polytopes of dimension $ n+1$. For instance, in two dimensions, a simplex is a polytope with 3 vertices, most commonly known as a triangle. In three dimensions, a simplex is tetrahedron.

We now explain the mechanisms of the downhill simplex method of (134). Note that more sophisticated versions of this method have been proposed. This is the case, for instance, of the algorithm used by the fminsearch function of Matlab.

The downhill simplex method starts from an initial simplex. Each step of the method consists in an update of the current simplex. These updates are carried out using four operations: reflection, expansion, contraction, and multiple contraction. Let $ f : \mathbb{R}^n \rightarrow \mathbb{R}$ be the function to minimize and let $ \{\mathbf{x}_0, \ldots, \mathbf{x}_n\}$ be the current simplex ( $ \mathbf{x}_i \in \mathbb{R}^n$ for all $ i \in \llbracket 0, n \rrbracket $). Let $ h \in \llbracket 0, n\rrbracket $ be the index of the `worst vertex', i.e. the value $ h = \arg \max_i f(\mathbf{x}_i)$ and let $ l \in \llbracket 0,n \rrbracket $ be the index of the `best vertex', i.e. the value $ l = \arg \min_i f(\mathbf{x}_i)$. The downhill simplex method and the four fundamental operations are detailed in algorithm 1. Figure 2.1 illustrates the four fundamental operations on a 3-simplex. In figure 2.3, as an example, the downhill simplex method is used to optimize the Rosenbrock function (see below).


\begin{algorithm}
% latex2html id marker 1486
\SetCommentSty{textsf}
\SetKwIn...
...{x}_l$}
}
\caption{Downhill simplex of \citep{nelder1965}.}
\end{algorithm}

Figure 2.1: Illustration in 2 dimensions of the four fundamental operations applied to the current simplex by the downhill simplex method of (134): (a) reflection and expansion, (b) contraction, and (c) multiple contraction.
Image nelder-mead-expansion Image nelder-mead-contraction Image nelder-mead-mult-contraction
(a) (b) (c)

Stopping criterion.

The stopping criterion used by (134) is defined by:

$\displaystyle \sqrt{\frac{1}{n+1}\sum_{i=0}^n \left (f(\mathbf{x}_i) - \overline{f(\mathbf{x}_i)}\right )^2} \leq \varepsilon,$ (2.23)

with $ \overline{f(\mathbf{x}_i)}$ the average of the values $ \{ f(\mathbf{x}_i) \}_{i=0}^n$, and $ \varepsilon$ a predefined constant. This criterion has the advantage of linking the size of the simplex with an approximation of the local curvature of the cost function. A very accurate minimum is often obtained for a high curvature of the cost function. On the contrary, a minimum located in a flat valley of the cost function carries less information. Therefore, it does not make much sense to shrink an optimal simplex which would be almost flat.

Rosenbrock function.

When it is possible, the algorithms presented in this section are illustrated on the Rosenbrock function. This function, also known as the banana function, has been a standard test case for optimization algorithms. It is a two-dimensional function defined as:

$\displaystyle f(\mathbf{x}) = (1-x_1)^2 + \alpha (x_2 - x_1^2)^2,$ (2.24)

with $ \alpha$ a positive constant. In our illustrations, we use  $ \alpha = 10$. The minimum is inside a long, narrow, parabolic flat valley. Finding the valley is simple but finding the actual minimum of the function is less trivial. The minimum of the Rosenbrock function is located at the point of coordinate $ \mathbf{x}^\star = (1,1)$, whatever the value given to $ \alpha$. Figure 2.2 gives an illustration of this function.

Figure 2.2: Contour plot of the Rosenbrock banana function.
Image rosenbrock

Figure 2.3: Illustration of the downhill simplex method of (134) with the Rosenbrock function. The minimum of this function is at the point of coordinate (1,1), represented by the red point on the figures. Here, the convergence is reached in 85 iterations.
Image init Image 001 Image 002 Image 003
Initialization Iteration #1 Iteration #2 Iteration #3
Image 008 Image 020 Image 070 Image 085
Iteration #8 Iteration #20 Iteration #70 Iteration #85


Gradient descent

The method of gradient descent (also known as the method of steepest descent or Cauchy's method) is one the most simple algorithm for continuous optimization (52). It was first designed to solve linear system a long time ago (43). It is an iterative algorithm that relies on the fact that the value of the cost function decreases fastest in the direction of the negative gradient (for a differentiable cost function). The principle of the gradient descent algorithm is given in algorithm 2.


\begin{algorithm}
% latex2html id marker 1607
\SetKwInOut{Input}{input}
\SetK...
...
\Return{$\mathbf{x}^{(k)}$}
}
\caption{Gradient Descent}
\end{algorithm}

The exact line search procedure of line 5 in algorithm 2 can be replaced with an approximate line search procedure such that the value  $ \alpha^{(k)}$ satisfies the Wolfe conditions (136,126) (see below).

Properties.

As a first remark, we can notice that the successive search directions used by the gradient descent method are always orthogonal to the level sets of the cost functions. Second, we may also notice that if an exact line search procedure is used, then the method of gradient descent is guaranteed to converge to a local minimum. Under some mild assumptions, it has also been proven to converge with an inexact line search procedure (see, for instance, (40)). However, this convergence is usually very slow. This stems from the fact that, if an exact line search procedure is used, two successive directions are necessarily orthogonal. This is easy to see. Indeed, if $ \alpha^{(k)}$ minimizes $ f(\mathbf{x}^{(k)} - \alpha^{(k)} \boldsymbol{\nabla}f(\mathbf{x}^{(k)}))$ then we have:

  $\displaystyle \frac{\partial}{\partial \alpha} f(\mathbf{x}^{(k)} - \alpha^{(k)} \boldsymbol{\nabla}f(\mathbf{x}^{(k)})) = 0$ (2.25)
$\displaystyle \Leftrightarrow \quad$ $\displaystyle \boldsymbol{\nabla}f(\mathbf{x}^{(k)} - \alpha^{(k)} \boldsymbol{\nabla}f(\mathbf{x}^{(k)}))^\mathsf{T}\boldsymbol{\nabla}f(\mathbf{x}^{(k)}) = 0$ (2.26)
$\displaystyle \Leftrightarrow \quad$ $\displaystyle \boldsymbol{\nabla}f(\mathbf{x}^{(k+1)})^\mathsf{T}\boldsymbol{\nabla}f(\mathbf{x}^{(k)}) = 0,$ (2.27)

which proves that the search directions at iteration $ (k)$ and $ (k+1)$ are orthogonal. Therefore, the path formed by the values  $ \{\mathbf{x}^{(0)}, \ldots, \mathbf{x}^\star\}$ is shaped like a stairs. Figure 2.4 gives an illustration on the Rosenbrock function of this typical behaviour of the gradient descent method.

Figure 2.4: Illustration of the successive steps taken by the method of the gradient descent for optimizing the Rosenbrock functions. Notice that the successive search directions are orthogonal. In (b), we see the typical stair effect of this method. It makes the convergence very slow. Indeed, after 100 iterations, convergence has not yet been reached.
Image gradient-descent Image gradient-descent-zoom
(a) (b)

Approximate line search and Wolfe conditions.

Given a search direction  $ \mathbold{\delta}$, an exact line search procedure amounts to solve the following mono-dimensional minimization problem:

$\displaystyle \min_{\alpha} f(\mathbf{x} + \alpha \mathbold{\delta}).$ (2.28)

Problem (2.28) can be solved using an algorithm such as the golden section search that will be presented in section 2.2.2.12. Depending on the function $ f$, problem (2.28) can be heavy to compute. Another approach consists in doing an approximate line search. It consists in starting from an initial estimate of $ \alpha$ and then refine it so that the Wolfe conditions are satisfied. The Wolfe conditions gives some criteria to decide whether a step $ \alpha$ is satisfactory or not. In its basic form, there are two Wolfe conditions:
  1. $ f(\mathbf{x} + \alpha \mathbold{\delta}) \leq f(\mathbf{x}) + \omega_1 \alpha \mathbold{\delta}^\mathsf{T}\boldsymbol{\nabla}f(\mathbf{x})$
  2. $ \mathbold{\delta}^\mathsf{T}\boldsymbol{\nabla}f(\mathbf{x} + \alpha \mathbold...
...ta}) \geq \omega_2 \mathbold{\delta}^\mathsf{T}\boldsymbol{\nabla}f(\mathbf{x})$
with $ \omega_1$ and $ \omega_2$ two constants such that $ 0 < \omega_1 < \omega_2 < 1$. $ \omega_1$ is usually chosen as a small constant and $ \omega_2$ as a constant (much) bigger than $ \omega_1$. The first condition (also known as the Armijo condition) avoids long steps that would not decrease much the cost function. The second condition (also known as the curvature condition) forces the step to minimize the curvature of the cost function.


Newton's Method

Newton's method is another iterative optimization algorithm that minimizes a twice-differentiable function $ f : \mathbb{R}^n \rightarrow \mathbb{R}$. It relies on the second order Taylor expansion of the function $ f$ around the point  $ \mathbf{x}$:

$\displaystyle f(\mathbf{x} + \mathbold{\delta}) = f(\mathbf{x}) + \boldsymbol{\...
...{\mathsf{H}}_f(\mathbf{x}) \mathbold{\delta}+ O(\Vert\mathbold{\delta}\Vert^2).$ (2.29)

Each step of the Newton's method consists in finding the minimum of the quadratic approximation of the function $ f$ around the current point  $ \mathbf{x}$. This principle can be stated using the second order Taylor expansion of $ f$:

$\displaystyle \min_{\mathbold{\delta}} f(\mathbf{x}) + \boldsymbol{\nabla}f(\ma...
...bold{\delta}^\mathsf{T}\boldsymbol{\mathsf{H}}_f(\mathbf{x}) \mathbold{\delta}.$ (2.30)

A necessary condition for problem (2.30) is obtained by setting to zero the derivative (with respect to  $ \mathbold{\delta}$) of the cost function in equation (2.30). This amounts to solve the following linear system of equations:

$\displaystyle \boldsymbol{\mathsf{H}}_f(\mathbf{x}) \mathbold{\delta}= -\boldsymbol{\nabla}f(\mathbf{x}).$ (2.31)

The search direction  $ \mathbold{\delta}$ for the Newton method is thus given by:

$\displaystyle \mathbold{\delta}= - \big(\boldsymbol{\mathsf{H}}_f(\mathbf{x})\big)^{-1}\boldsymbol{\nabla}f(\mathbf{x}).$ (2.32)

Note that, in practice, the inverse Hessian matrix in equation (2.32) does not need to be explicitly calculated ; it is possible to compute  $ \mathbold{\delta}$ from equation (2.31) using an efficient solver of linear systems.

The complete Newton's method is summarized in algorithm 3.


\begin{algorithm}
% latex2html id marker 1735
\SetKwInOut{Input}{input}
\SetK...
...}
\Return{$\mathbf{x}^{(k)}$}
}
\caption{Newton's Method}
\end{algorithm}

Algorithm 3 is the constant step version of Newton's algorithm. The update  $ \mathbf{x} = \mathbf{x} + \mathbold{\delta}$ is often replaced by the update $ \mathbf{x} = \mathbf{x} + \gamma \mathbold{\delta}$ where $ \gamma$ is a positive value smaller than 1. This is done to insure that the Wolfe conditions are satisfied for each step of the algorithm.

Convergence.

Every local minimum of the function $ f$ has a neighbourhood $ N$ in which the convergence of Newton's method with constant step is quadratic (52). In other words, if we initialize the Newton's method with  $ \mathbf{x}^{(0)} \in N$, the convergence is quadratic. Outside of the neighbourhoods of the local minima of the cost function, there are no guarantee that Newton's method will converge. Indeed the condition of equation (2.31) is just a necessary condition for having a local minimum: it is not a sufficient condition. The condition of equation (2.31) is also satisfied for a local maximum or a saddle point of the cost function. Therefore, Newton's method can converge towards such points.

Illustration.

An illustration of Newton's method on the Rosenbrock function is given in figure 2.5. Two variants are presented in this figure: the constant step length variant and the optimal step length determined with a line search strategy.
Figure 2.5: Illustration of the successive steps taken by Newton's method for optimizing the Rosenbrock functions. (a) Constant step length variant. (b) Step length determined with an optimal line search strategy. On the Rosenbrock function, the Newton method is usually faster to converge than the gradient descent strategy. In this particular example, convergence is reached within 6 iterations for the constant step length variant and 10 iterations with the optimal line search strategy.
Image newton-method Image newton-line-search-method
(a) (b)

Quasi-Newton algorithms.

Newton's method is great since it usually converges quickly. This means that it does not take a lot of iterations to reach the minimum. However, each iteration of this algorithm can be costly since it requires one to compute the Hessian matrix. The goal of quasi-Newton approach is to replace the Hessian matrix by some good approximation, less heavy to compute. There exists several formula to make these approximations iteratively: DFP, SR1, BFGS (52). We will not detail these formula here but we will give the general principle. As for Newton's method, quasi-Newton relies on the second order Taylor expansion of the function to optimize, except that the Hessian matrix is replaced with an approximation  $ \mathsf{A}$:

$\displaystyle f(\mathbf{x} + \mathbold{\delta}) \approx f(\mathbf{x}) + \boldsy...
...{\delta}+ \frac{1}{2} \mathbold{\delta}^\mathsf{T}\mathsf{A} \mathbold{\delta}.$ (2.33)

The gradient of this approximation with respect to  $ \mathbold{\delta}$ is given by:

$\displaystyle \boldsymbol{\nabla}f(\mathbf{x} + \mathbold{\delta}) \approx \boldsymbol{\nabla}f(\mathbf{x}) + \mathsf{A} \mathbold{\delta}.$ (2.34)

The general principle of a quasi-Newton approach is to choose the matrix $ \mathsf{A}$ such that:

$\displaystyle \boldsymbol{\nabla}f(\mathbf{x} + \mathbold{\delta}) = \boldsymbol{\nabla}f(\mathbf{x}) + \mathsf{A} \mathbold{\delta}.$ (2.35)

The difference between the different formula are the properties that the matrix  $ \mathsf{A}$ satisfies at each iteration of the algorithm. For instance, the BFGS method guarantees that the matrix  $ \mathsf{A}$ will always be symmetric and positive definite. In this case, the approximation of the cost function is convex and, therefore, the search direction is guaranteed to be a descent direction.


Gauss-Newton Algorithm

The Gauss-Newton method is an optimization algorithm used to minimize a cost function that can be written as a sum of squares, i.e. if $ f : \mathbb{R}^n \rightarrow \mathbb{R}$:

$\displaystyle f(\mathbf{x}) = \sum_{i=1}^m \left( f_i(\mathbf{x}) \right)^2,$ (2.36)

where each $ f_i$ for $ i \in \llbracket 1,m \rrbracket $ is a function of the form $ f_i : \mathbb{R}\rightarrow \mathbb{R}$. The only hypothesis that must be satisfied for the Gauss-Newton algorithm is that the functions $ f_i$ must all be differentiable. Equation (2.36) is the very definition of a least-squares problem. For reasons that will become clear in the next section, least-squares cost functions often arises when estimating the parameters of a parametric model. The derivation of the Gauss-Newton algorithm is more conveniently done if we consider the minimization of a vector-valued function  $ \mathcal {F} : \mathbb{R}^n \rightarrow \mathbb{R}^m$:

$\displaystyle \min_{\mathbf{x}} \Vert \mathcal {F}(\mathbf{x}) \Vert^2,$ (2.37)

where $ \mathcal {F}$ is defined as:

$\displaystyle \mathcal {F}(\mathbf{x}) =
 \begin{pmatrix}
 f_1(\mathbf{x})  
 \vdots  
 f_m(\mathbf{x})
 \end{pmatrix}.$ (2.38)

Note that solving problem (2.37) is completely equivalent to minimizing the function $ f$ as defined in equation (2.36). The Gauss-Newton algorithm is an iterative algorithm where each step consists in minimizing the first-order approximation of the function  $ \mathcal {F}$ around the current solution. The first-order approximation of  $ \mathcal {F}$ is given by:

$\displaystyle \mathcal {F}(\mathbf{x} + \mathbold{\delta}) = \mathcal {F}(\mathbf{x}) + \boldsymbol{\mathsf{J}}_{\mathcal {F}}(\mathbf{x}) \mathbold{\delta}.$ (2.39)

Therefore, each step of the Gauss-Newton algorithm consists in determining the step  $ \mathbold{\delta}$ by solving the following minimization problem:

$\displaystyle \min_{\mathbold{\delta}} \Vert \mathcal {F}(\mathbf{x}) + \boldsymbol{\mathsf{J}}_{\mathcal {F}}(\mathbf{x}) \mathbold{\delta}\Vert^2.$ (2.40)

Problem (2.40) is a linear least-squares minimization problem. As it will be seen later, such a problem can easily be solved. Algorithm 4 gives the complete principle of the Gauss-Newton method. As for the others algorithms presented so far, a line-search procedure can be combined to the step.


\begin{algorithm}
% latex2html id marker 1859
\SetKwInOut{Input}{input}
\SetK...
...turn{$\mathbf{x}^{(k)}$}
}
\caption{Method of Gauss-Newton}
\end{algorithm}

Properties.

It can be shown that the step  $ \mathbold{\delta}$ is a descent direction (22). If the algorithm converges then the limit is a stationary point of the function $ f$ (but not necessarily a minimum). However, with no assumptions on the initial solution, there is no guarantee that the algorithm will converge, even locally.

With a good starting point and a `nice' function $ f$ (i.e. a mildly nonlinear function), the convergence speed of the Gauss-Newton method is almost quadratic. However, it can be worse than quadratic if the starting point is far from the minimum or if the matrix  $ \boldsymbol{\mathsf{J}}_{\mathcal {F}}^\mathsf{T}\boldsymbol{\mathsf{J}}_{\mathcal {F}}$ is ill-conditioned.

Illustration.

The Rosenbrock function can be written as a sum of squares if we define the functions $ f_1$ and $ f_2$ in the following way:

$\displaystyle f_1(\mathbf{x})$ $\displaystyle = 1-x_1$ (2.41)
$\displaystyle f_2(\mathbf{x})$ $\displaystyle = \sqrt{\alpha} (x_2 - x_1^2)$ (2.42)

Figure 2.6 shows the step taken by the Gauss-Newton algorithm for minimizing the Rosenbrock function.

Figure 2.6: Illustration of the Gauss-Newton method on the Rosenbrock function. For this simple example, the optimal solution is found in 2 iterations.
Image gauss-newton


The Normal Equations

We broke a little bit our classification of the optimization algorithms and present now a basic tool for linear least-squares problems: the normal equations. We do so because, as it was just seen, the iterations of the Gauss-Newton algorithm involves the solution of a linear least-squares minimization problem. Besides, a part of the algorithm we present next, namely the Levenberg-Marquardt algorithm, uses the so-called normal equations.

Let $ f : \mathbb{R}^n \rightarrow \mathbb{R}$ be the function to minimize. Let us suppose that $ f$ can be written as a sum of squares of the form $ f(\mathbf{x}) = \sum_{i=1}^m (f_i(\mathbf{x}) - y_i)^2$, where the functions  $ f_i : \mathbb{R}^n \rightarrow \mathbb{R}$ are linear with respect to  $ \mathbf{x}$ (i.e. $ \exists \mathbf{f}_i \in \mathbb{R}^n$ such that $ f_i(\mathbf{x}) = \mathbf{f}_i^\mathsf{T}\mathbf{x}$, for all $ i \in \llbracket 1,m \rrbracket $) and where $ y_i \in \mathbb{R}$ for all $ i \in \llbracket 1,m \rrbracket $. Minimizing the function $ f$ can be stated in matrix form:

$\displaystyle \min_{\mathbf{x}} \Vert\mathsf{F} \mathbf{x} - \mathbf{y}\Vert^2,$ (2.43)

where the matrix $ \mathsf{F} \in \mathbb{R}^{m \times n}$ and the vector  $ \mathbf{y} \in \mathbb{R}^m$ are defined by:

$\displaystyle \mathsf{F}^\mathsf{T}= \begin{bmatrix}\mathbf{f}_1 & \cdots & \ma...
...\qquad
 \mathbf{y}^\mathsf{T}= \begin{pmatrix}y_1 & \cdots & y_m \end{pmatrix}.$ (2.44)

If the matrix  $ \mathsf{F}$ has full column rank (22), problem (2.43) can be solved using the normal equations:

$\displaystyle \mathsf{F}^\mathsf{T}\mathsf{F} \mathbf{x} = \mathsf{F}^\mathsf{T}\mathbf{y}.$ (2.45)

The hypothesis that  $ \mathsf{F}$ has full column rank allows us to say that the square matrix  $ \mathsf{F}^\mathsf{T}\mathsf{F}$ is invertible. The solution of problem (2.45) (and therefore of problem (2.43)) is given by:

$\displaystyle \mathbf{x} = \left( \mathsf{F}^\mathsf{T}\mathsf{F} \right)^{-1}\mathsf{F}^\mathsf{T}\mathbf{y}.$ (2.46)

Since the matrix  $ \left( \mathsf{F}^\mathsf{T}\mathsf{F} \right)^{-1}\mathsf{F}^\mathsf{T}$ has full column rank, it is the Moore-Penrose pseudo-inverse matrix of  $ \mathsf{F}$ and is denoted  $ \mathsf{F}^\dagger $. It is generally a bad practice to explicitly compute the pseudo-inverse matrix (83). We will see later in this document several methods designed to efficiently solve the linear system of equation (2.45). Note that $ \mathsf{F}$ being full column rank necessary implies that  $ \mathsf{F}$ must have more rows than columns, i.e. $ m \geq n$. There exists several ways to derive the normal equations. For instance, one can get them by writing the optimality conditions of problem (2.43), i.e. :

$\displaystyle \frac{\partial}{\partial \mathbf{x}} \Vert\mathsf{F} \mathbf{x} - \mathbf{y}\Vert^2 = 0.$ (2.47)

The normal equations are trivially derived by noticing that $ \Vert\mathsf{F} \mathbf{x} - \mathbf{y}\Vert^2 = (\mathsf{F} \mathbf{x} - \mathbf{y})^\mathsf{T}(\mathsf{F} \mathbf{x} - \mathbf{y})$. Another way to obtain the normal equations is to consider the over-determined linear system of equations:

$\displaystyle \mathsf{F} \mathbf{x} = \mathbf{y}.$ (2.48)

Since this linear system of equations is over-determined, it does not have necessarily an exact solution. In this case, we can seek for an approximate solution by finding the point of the column space of  $ \mathsf{F}$ which is as close as possible to the point  $ \mathbf{y}$. The shortest distance between a point and an hyperplane is the distance between the point and its orthogonal projection into the hyperplane. This is illustrated by figure 2.7. In other words, the vector $ \mathbf{y} - \mathsf{F} \mathbf{x}$ must lie in the left nullspace of  $ \mathsf{F}$, which can be written:

$\displaystyle \mathsf{F}^\mathsf{T}(\mathbf{y} - \mathsf{F} \mathbf{x}) = 0.$ (2.49)

Equation (2.49) are exactly equivalent to the normal equations.

Figure 2.7: Graphical interpretation of the normal equations.
Image least-squares-proj


Levenberg-Marquardt

The Levenberg algorithm is another method to solve a least-squares minimization problem proposed by (110). (121) proposed a slight variation of the initial method of Levenberg known as the Levenberg-Marquardt algorithm. The content and the presentation of this section is inspired by (117). We use the same notation than in the section dedicated to the Gauss-Newton algorithm. The normal equations can be used for the step in the Gauss-Newton algorithm. This step, denoted  $ \mathbold{\delta}_{gn}$ in this section, can thus be written $ \mathbold{\delta}_{gn} = (\boldsymbol{\mathsf{J}}^\mathsf{T}\boldsymbol{\mathsf{J}})^{-1}\boldsymbol{\mathsf{J}}\mathbf{f}$, where $ \boldsymbol{\mathsf{J}}$ is the Jacobian matrix of the function  $ \mathcal {F}$ evaluated at  $ \mathbf{x}$, and $ \mathbf{f}^\mathsf{T}=\begin{pmatrix}f_1(\mathbf{x}) & \ldots & f_m(\mathbf{x}) \end{pmatrix}$. The Levenberg and the Levenberg-Marquardt algorithms are damped versions of the Gauss-Newton method. The step for the Levenberg algorithm, denoted  $ \mathbold{\delta}_{l}$, is defined as:

$\displaystyle (\boldsymbol{\mathsf{J}}^\mathsf{T}\boldsymbol{\mathsf{J}}+ \lambda \mathbf{I}) \mathbold{\delta}_{l} = \boldsymbol{\mathsf{J}}\mathbf{f},$ (2.50)

while the step for the Levenberg-Marquardt algorithm, denoted  $ \mathbold{\delta}_{lm}$, is defined as:

$\displaystyle (\boldsymbol{\mathsf{J}}^\mathsf{T}\boldsymbol{\mathsf{J}}+ \lamb...
...mbol{\mathsf{J}}) ) \mathbold{\delta}_{lm} = \boldsymbol{\mathsf{J}}\mathbf{f}.$ (2.51)

From now on, we only consider the Levenberg-Marquardt algorithm2.2. The linear system of equation (2.51) is called the augmented normal equations. The value $ \lambda$ is a positive value named the damping parameter. It plays several roles. First, the matrix $ \boldsymbol{\mathsf{J}}^\mathsf{T}\boldsymbol{\mathsf{J}}+ \lambda \mathrm{diag}(\boldsymbol{\mathsf{J}}^\mathsf{T}\boldsymbol{\mathsf{J}})$ is positive definite. Therefore, $ \mathbold{\delta}_{lm}$ is necessarily a descent direction. For large values of $ \lambda$, $ \mathbold{\delta}_{lm} \approx - \frac{1}{m} \boldsymbol{\nabla}f$. In this case, the Levenberg-Marquardt algorithm is almost a gradient descent method (with a short step). This is a good strategy when the current solution is far from the minimum. On the contrary, if $ \lambda$ is a small value, then the Levenberg-Marquardt step  $ \mathbold{\delta}_{lm}$ is almost identical to the Gauss-Newton step  $ \mathbold{\delta}_{gn}$. This is a desired behaviour for the final steps of the algorithm since, near the minimum, the convergence of the Gauss-Newton method can be almost quadratic.

Both the length and the direction of the step are influenced by the damping parameter. Consequently, there is no need for a line-search procedure in the iterations of this algorithm. The value of the damping parameter $ \lambda$ is updated along with the iterations according to the following strategy. If the current $ \lambda$ results in an improvement of the cost function, then the step is applied and $ \lambda$ is divided by a constant $ \nu$ (with, typically, $ \nu = 2$). On the contrary, if the step resulting of the current $ \lambda$ increases of the function, the step is discarded and $ \lambda$ is multiplied by $ \nu$. This is the most basic stratagem for updating the damping parameter. There exists more evolved approach, such as the one presented in (117). The whole Levenberg-Marquardt method is given in algorithm 5. Figure 2.8 gives an illustration of the Levenberg-Marquardt algorithm on the Rosenbrock function.


\begin{algorithm}
% latex2html id marker 2052
\SetKwInOut{Input}{input}
\SetK...
...urn{$\mathbf{x}$}
}
\caption{Levenberg-Marquardt algorithm}
\end{algorithm}

Figure 2.8: Illustration of the Levenberg-Marquardt algorithm for optimizing the Rosenbrock function. In this particular example, this algorithm took 25 iterations to reach the minimum.
Image levenberg-marquardt


Cholesky Factorization

A word about linear systems of equations.

We now provides some tools for solving efficiently a linear system of equations. This is of great interest in optimization since we have seen that solving a linear least-squares minimization problem is equivalent to finding a solution to the linear system formed by the normal equations. Besides, other optimization problems (not necessarily written as linear least-squares problem) involves a linear least-squares problem in their iterations. Let us consider the following linear system of equations:

$\displaystyle \mathsf{A} \mathbf{x} = \mathbf{b},$ (2.52)

with $ \mathsf{A} \in \mathbb{R}^{n \times n}$ a square matrix, and $ \mathbf{b} \in \mathbb{R}^n$. The problem equation (2.52) admits a single solution if the determinant of the matrix  $ \mathsf{A}$ is not null. In this case, the solution is theoretically given with the inverse matrix of  $ \mathsf{A}$, i.e. $ \mathbf{x} = \mathsf{A}^{-1}\mathbf{b}$. However, explicitly forming the matrix  $ \mathsf{A}^{-1}$ is generally not the best way of solving a linear system of equations (83). This stems from the fact that a direct computation of the inverse matrix is numerically unstable. Besides, it is not possible to exploit sparse properties of a matrix when explicitly forming the inverse matrix. A very basic tool that can be used to solve a linear system is the Gauss elimination algorithm (154,83). The main advantage of this algorithm is that it does not need strong assumption on the matrix of the system (except that it is square and invertible, of course). Here, we will not give the details of the Gauss elimination algorithm. Instead, we will present other algorithms more suited to the intended purpose, i.e. optimization algorithms.

The Cholesky factorization.

The Cholesky factorization is a method for solving a linear system of equations. If the matrix $ \mathsf{A}$ is positive definite, then the Cholesky factorization of  $ \mathsf{A}$ is:

$\displaystyle \mathsf{A} = \mathsf{R}^\mathsf{T}\mathsf{R},$ (2.53)

with $ \mathsf{R} \in \mathbb{R}^{n \times n}$ an upper triangular matrix with strictly positive diagonal entries. Note that, for example, the left-hand side of the augmented normal equations used in the iterations of the Levenberg-Marquardt algorithm is a positive definite matrix. If we substitute the matrix  $ \mathsf{A}$ in equation (2.52) by its Cholesky factorization, we obtain:

$\displaystyle \mathsf{R}^\mathsf{T}\mathsf{R} \mathbf{y} = \mathbf{b}.$ (2.54)

If we note $ \mathbf{y}' = \mathsf{R} \mathbf{y}$ then solving equation (2.52) is equivalent to successively solving the two linear systems $ \mathsf{R}^\mathsf{T}\mathbf{y}' = \mathbf{b}$ and $ \mathsf{R} \mathbf{x} = \mathbf{y}'$. This is easily and efficiently done using a back-substitution process because of the triangularity of the matrices $ \mathsf{R}$ and $ \mathsf{R}^\mathsf{T}$.

Although efficient, this approach works only if the problem is well-conditioned. If not, some of the diagonal entries of the matrix  $ \mathsf{R}$ are close to zero and the back-substitution process becomes numerically unstable. In practice, the Cholesky factorization is particularly interesting for sparse matrices. For instance, the package cholmod of (55), dedicated to sparse matrices, heavily uses Cholesky factorization.


QR Factorization

We now give a method based on the QR factorization for solving an over-determined linear least-squares minimization problem of the form:

$\displaystyle \min_{\mathbf{x}} \Vert\mathsf{F} \mathbf{x} - \mathbf{y}\Vert^2,$ (2.55)

with $ \mathsf{F} \in \mathbb{R}^{m \times n}$, $ m \geq n$, and $ \mathbf{y} \in \mathbb{R}^m$. As we have seen in section 2.2.2.6, solving problem (2.55) amounts to solve the normal equations, namely:

$\displaystyle \mathsf{F}^\mathsf{T}\mathsf{F} \mathbf{x} = \mathsf{F}^\mathsf{T}\mathbf{y}.$ (2.56)

The QR factorization of the matrix  $ \mathsf{F}$ is:

$\displaystyle \mathsf{F} = \mathsf{Q} \mathsf{R} 
 = \begin{bmatrix}\mathsf{Q}_...
...thsf{Q}_2 \end{bmatrix} \begin{bmatrix}\mathsf{R}_1  \mathbf{0}\end{bmatrix},$ (2.57)

with  $ \mathsf{Q} \in \mathbb{R}^{m \times m}$ an orthonormal matrix, $ \mathsf{Q}_1 \in \mathbb{R}^{m \times n}$, $ \mathsf{Q}_2 \in \mathbb{R}^{m \times (m-n)}$, and $ \mathsf{R}_1 \in \mathbb{R}^{n \times n}$ an upper triangular matrix. If the matrix  $ \mathsf{F}$ has full column rank, then the columns of the matrix $ \mathsf{Q}_1$ form an orthonormal basis for the range of  $ \mathsf{F}$ (84). If we replace the matrix  $ \mathsf{F}$ by its QR factorization into the normal equations, we obtain:

$\displaystyle \begin{bmatrix}\mathsf{R}_1^\mathsf{T}& \mathbf{0}\end{bmatrix} \...
...{bmatrix} \begin{bmatrix}\mathsf{Q}_1  \mathsf{Q}_2 \end{bmatrix} \mathbf{y}.$ (2.58)

This last expression can be simplified using the fact that the matrix  $ \mathsf{Q}_1$ is orthonormal:

$\displaystyle \mathsf{R}_1 \mathbf{x} = \mathsf{Q}_1^\mathsf{T}\mathbf{y}.$ (2.59)

As for the approach based on the Cholesky factorization, the linear system of equation (2.59) can be easily solved with a back-substitution algorithm.


Singular Value Decomposition

The singular value decomposition is a tool that can be used to solve certain linear least-squares minimization problems. The singular value decomposition of a real matrix  $ \mathsf{A} \in \mathbb{R}^{m \times n}$ is given by:

$\displaystyle \mathsf{A} = \mathsf{U} \mathsf{\Sigma} \mathsf{V}^\mathsf{T},$ (2.60)

with $ \mathsf{U} \in \mathbb{R}^{m \times n}$ a unitary matrix, $ \mathsf{\Sigma} \in \mathbb{R}^{n \times n}$ a diagonal matrix with non-negative elements on its diagonal, and $ \mathsf{V} \in \mathbb{R}^{n \times n}$ a unitary matrix. The diagonal entries of  $ \mathsf{\Sigma}$, named the singular values, are generally sorted from the largest to the smallest. If the matrix  $ \mathsf{A}$ has full column rank then the diagonal entries of  $ \mathsf{\Sigma}$ are all strictly positive. In fact, the number of strictly positive elements on the diagonal of  $ \mathsf{\Sigma}$ is the rank of the matrix  $ \mathsf{A}$.

Non-homogeneous linear least-squares.

Once again, we consider the problem of minimizing the over-determined linear least-squares function $ \Vert \mathsf{F} \mathbf{x} - \mathbf{y}\Vert^2$. This least-squares problem is called non-homogeneous because there exists a non-null right-hand side  $ \mathbf{y}$. The singular value decomposition can be used to solve this problem. Indeed, if we replace the matrix  $ \mathsf{F}$ in the normal equations by its singular value decomposition $ \mathsf{F} = \mathsf{U} \mathsf{\Sigma} \mathsf{V}^\mathsf{T}$, we obtain:

$\displaystyle \mathsf{V} \mathsf{\Sigma} \mathsf{U}^\mathsf{T}\mathsf{U} \maths...
...thsf{T}\mathbf{x} = \mathsf{V} \mathsf{\Sigma} \mathsf{U}^\mathsf{T}\mathbf{y}.$ (2.61)

Using the fact that $ \mathsf{U}$ is a unitary matrix and that  $ \mathsf{\Sigma}$ is a diagonal matrix, we can write that:

$\displaystyle \mathbf{x} = \mathsf{V} \mathsf{\Sigma}^{-1}\mathsf{U}^\mathsf{T}\mathbf{y}.$ (2.62)

In other words, we have that  $ \mathsf{F}^\dagger = \mathsf{V} \mathsf{\Sigma}^{-1}\mathsf{U}^\mathsf{T}$. As for the other algorithms, the computational complexity of this method is in $ \mathcal {O}(n^3)$. The main advantage of the SVD approach for linear least-squares problem is when the matrix  $ \mathsf{F}$ is almost rank deficient. In this case, the smallest singular values in  $ \mathsf{\Sigma}$ are very close to zero. Therefore, a simple computation of  $ \mathsf{\Sigma}^{-1}$ would lead to unsatisfactory results (the inverse of a diagonal matrix being obtained by taking the reciprocals of the diagonal entries of  $ \mathsf{\Sigma}$). In this case, one can replace the inverse matrix $ \mathsf{\Sigma}^{-1}$ in equation (2.62) by the pseudo-inverse matrix  $ \mathsf{\Sigma}^\dagger $. The pseudo-inverse of a diagonal matrix $ \mathsf{\Sigma} = \mathrm{diag}\begin{pmatrix}\sigma_1 & \cdots & \sigma_{n'} & \mathbf{0}\end{pmatrix} \in \mathbb{R}^{n \times n}$ (with $ n' \leq n$) is the matrix  $ \mathsf{\Sigma}^\dagger = \mathrm{diag}\begin{pmatrix}\frac{1}{\sigma_1} & \cdots & \frac{1}{\sigma_{n'}} & \mathbf{0}\end{pmatrix} \in \mathbb{R}^{n \times n}$.

Homogeneous linear least-squares.

We now consider the case of the homogeneous linear least-squares minimization problem. It corresponds to the following minimization problem:

$\displaystyle \min_{\mathbf{x}} \sum_{i=1}^m (f_i(\mathbf{x}))^2,$ (2.63)

where the functions $ f_i : \mathbb{R}^n \rightarrow \mathbb{R}$ are linear with respect to  $ \mathbf{x}$, i.e. $ \exists \mathbf{f}_i \in \mathbb{R}^n$ such that $ f_i(\mathbf{x}) = \mathbf{f}_i^\mathsf{T}\mathbf{x}$ for all $ i \in \llbracket 1,m \rrbracket $. This minimization problem can be written in matrix form:

$\displaystyle \min_{\mathbf{x}} \Vert \mathsf{F} \mathbf{x} \Vert^2,$ (2.64)

with $ \mathsf{F} = \begin{bmatrix}\mathbf{f}_1 & \ldots & \mathbf{f}_m \end{bmatrix}^\mathsf{T}\in \mathbb{R}^{m \times n}$. An obvious solution to problem (2.64) is $ \mathbf{x} = \mathbf{0}$. Of course, this dummy solution is generally not interesting. Instead of solving problem (2.64), one replaces it with the following constrained optimization problem:

\begin{displaymath}\begin{array}{rl}
 \displaystyle \min_{\mathbf{x}} & \Vert \m...
... \textrm{subject to} & \Vert \mathbf{x} \Vert = 1.
 \end{array}\end{displaymath} (2.65)

Problem (2.65) is easily solved using the singular value decomposition. Indeed, let  $ \mathsf{F} = \mathsf{U} \mathsf{\Sigma} \mathsf{V}^\mathsf{T}$ be the singular value decomposition of  $ \mathsf{F}$. If we write  $ \mathbf{x}' = \mathsf{V}^\mathsf{T}\mathbf{x}$, then problem (2.65) is equivalent to:

\begin{displaymath}\begin{array}{rl}
 \displaystyle \min_{\mathbf{x}'} & \Vert \...
...\textrm{subject to} & \Vert \mathbf{x}' \Vert = 1.
 \end{array}\end{displaymath} (2.66)

This stems from the fact that the matrices  $ \mathsf{U}$ and  $ \mathsf{V}$ are unitary and, consequently, we have that $ \Vert \mathsf{U} \mathsf{\Sigma} \mathsf{V}^\mathsf{T}\mathbf{x} \Vert = \Vert \mathsf{\Sigma} \mathsf{V}^\mathsf{T}\mathbf{x}\Vert$ and $ \Vert \mathbf{x}' \Vert = \Vert \mathsf{V}^\mathsf{T}\mathbf{x} \Vert = \Vert \mathbf{x}\Vert$. Remind the fact that for a full column rank matrix the diagonal elements of $ \mathsf{\Sigma}$ are $ \sigma_1 \geq \ldots \geq \sigma_{n} > 0$. Therefore, an evident minimizer of problem (2.66) is $ \mathbf{x}' = \begin{pmatrix}0 & \cdots & 0 & 1 \end{pmatrix}^\mathsf{T}$. Finally, since we have that $ \mathbf{x} = \mathsf{V} \mathbf{x}'$, the solution of our initial homogeneous linear least-squares problem is given by the last column of the matrix  $ \mathsf{V}$. If the matrix  $ \mathsf{F}$ has not full column rank, then the same principle can be applied except that one may take the column of $ \mathsf{V}$ corresponding the last non-zero singular value.


Iteratively Reweighed Least Squares

The method of Iteratively Reweighed Least Squares (IRLS) is an optimization algorithm designed to minimize a function  $ f : \mathbb{R}^n \rightarrow \mathbb{R}$ of the form:

$\displaystyle f(\mathbf{x}) = \sum_{i=1}^n w(\mathbf{x}) \Vert f_i(\mathbf{x}) - y_i \Vert^2,$ (2.67)

where $ w$ is a function from  $ \mathbb{R}^n$ to  $ \mathbb{R}$. For instance, such a cost function arises when M-estimators are involved. IRLS is an iterative method for which each step involves the resolution of the following weighted least squares problem:

$\displaystyle \mathbf{x}^{(k+1)} = \arg \min_{\mathbf{x}} \sum_{i=1}^n w(\mathbf{x}^{(k)}) \Vert f_i(\mathbf{x}) - y_i \Vert^2.$ (2.68)


Golden Section Search

We now arrive at the last optimization algorithm presented in this document: the golden section search algorithm. This algorithm allows one to minimize a monovariate function $ f : \mathbb{R}\rightarrow \mathbb{R}$. The minimization necessarily takes place over a given interval. The function $ f$ has to be continuous and unimodal on this interval for the golden section search algorithm to work. It can be used as the line search procedure for the previously presented algorithm of this section. It was first proposed in (106) and the refined in (8). The general principle of the golden section search is to bracket the location of the minimum of $ f$. This is achieved by updating and refining a set of 3 locations $ x_1 < x_2 < x_3$ with the assumptions that $ f(x_2) \leq f(x_1)$ and $ f(x_2) \leq f(x_3)$. These locations are then updated according to the value taken by $ f$ at a new point $ x_4$ such that $ x_2 < x_4 < x_3$. If $ f(x_4) < f(x_2)$ then the 3 locations become $ x_2 < x_4 < x_3$. Otherwise, i.e. if $ f(x_4) \geq f(x_2)$, they become $ x_1 < x_2 < x_4$. The principle is illustrated in figure 2.9.

Figure 2.9: Illustration of an update of the golden section search algorithm. (a) If the function to optimize is uni-modal and if $ f(x_4) > f(x_2)$ then the minimum of the function is necessarily located on the interval $ [x_1,x_4]$. (b) On the contrary, if $ f(x_4) < f(x_2)$ then the minimum of the function lies in the interval $ [x_2,x_3]$.
Image golden-section-search-1 Image golden-section-search-2
(a) (b)

As just described, the update of the 3 locations implies that the minimum of a function will be located either in the interval $ [x_1,x_4]$ or $ [x_2,x_3]$. The principle of the golden section search algorithm is to impose the length of these two intervals: they must be equal. It implies that we must have:

$\displaystyle x_3 - x_4 = \varphi (x_4 - x_2),$ (2.69)

where $ \varphi$ is the golden ratio, i.e. $ \varphi = \frac{1 + \sqrt{5}}{2}$.


Contributions to Parametric Image Registration and 3D Surface Reconstruction (Ph.D. dissertation, November 2010) - Florent Brunet
Webpage generated on July 2011
PDF version (11 Mo)