The way of nonlinear optimization


Mathematical knowledge

1、 Nonlinear function
A linear function is another name for a function of first degree, whose graph is a straight line. A nonlinear function is one whose graph is not a straight line.
Nonlinear functions include exponential functions, power functions, logarithmic functions, polynomial functions of degree two or higher, and so on.

2、 Taylor expansion
1. Taylor formula:
Taylor's formula approximates a function $f(x)$ that has an $n$-th derivative at $x = x_0$ by a polynomial of degree $n$ in $(x - x_0)$.
If $f(x)$ has derivatives up to order $n$ on a closed interval $[a, b]$ containing $x_0$, and a derivative of order $(n + 1)$ on the open interval $(a, b)$, then for any point $x$ in $[a, b]$ the following formula holds:
$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \dots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n + R_n(x)$, where $R_n(x)$ is the remainder term.

2. Taylor expansion:
In practice, Taylor's formula has to be truncated so that only finitely many terms are kept. This finite-term Taylor series of a function is called its Taylor expansion.

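3. Maclaurin expansion:
The Maclaurin expansion of a function is the Taylor formula above with $x_0 = 0$. If $f(x)$ is $n$ times continuously differentiable at $x = 0$, then the following formula holds:
$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \dots + \frac{f^{(n)}(0)}{n!}x^n + R_n(x)$, where the remainder $R_n(x)$ is $o(x^n)$ as $x \to 0$.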
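
As a small illustration of truncating a Maclaurin series, here is a minimal sketch that approximates $e^x$ with its degree-$n$ polynomial. The function name `maclaurin_exp` and the choice of $e^x$ are assumptions made for this example only.

```python
import math

def maclaurin_exp(x, n):
    """Degree-n Maclaurin polynomial of exp(x): sum of x**k / k! for k = 0..n."""
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

# The truncation error at x = 1 shrinks as more terms are kept.
for n in (1, 2, 4, 8):
    approx = maclaurin_exp(1.0, n)
    print(n, approx, abs(approx - math.e))
```

Keeping more terms gives a better approximation near $x = 0$, at the cost of more computation.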

3、 Nonlinear optimization
1. Simple nonlinear optimization
For example, a single-variable function such as $y = x^2$ can be minimized directly by setting its first derivative to 0.

2. Complex nonlinear optimization
With multiple variables, setting the partial derivatives to 0 often gives equations that have no analytical solution.

Solutions to complex nonlinear optimization problems:

4、 Jacobian matrix
Also known as the first derivative matrix.

In vector analysis, the Jacobian matrix is the matrix of a function's first-order partial derivatives, arranged in a fixed order.

For a function with $n$ inputs and $m$ outputs, the shape of the Jacobian matrix is $m \times n$.
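
To make the $m \times n$ shape concrete, the following is a minimal sketch that approximates a Jacobian with central differences. The helper `numerical_jacobian` and the test function `example` are hypothetical names introduced for illustration; entry `[i][j]` holds the partial derivative of output `i` with respect to input `j`.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian of f at x with central differences.

    f maps a list of n floats to a list of m floats.
    """
    n = len(x)
    m = len(f(x))
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus = f(x_plus)
        f_minus = f(x_minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac

# Hypothetical example with n = 2 inputs and m = 3 outputs.
def example(v):
    x, y = v
    return [x * y, x + y, x ** 2]

J = numerical_jacobian(example, [2.0, 3.0])
# Analytic Jacobian at (2, 3): rows (y, x), (1, 1), (2x, 0) = (3, 2), (1, 1), (4, 0)
```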
5、 Hessian matrix
It is also called the second derivative matrix.

It is the matrix of second-order partial derivatives of a multivariate function, and it describes the local curvature of the function.

The Hessian matrix is a symmetric $n \times n$ matrix composed of the second partial derivatives of the objective function $f$ at a point $x$.

The Hessian matrix controls the gradient update step size
The update $x_{t+1} = x_t - H^{-1} g$ is derived from the Taylor expansion; $H^{-1}$ plays the role of the step size in first-order optimization methods.

Positive definiteness of the Hessian matrix (all eigenvalues positive) guarantees that the update direction is a descent direction
$H^{-1} g = \sum_{i=1}^{n} \frac{e_i^T g}{\lambda_i} e_i$, where $e_i$ is a unit eigenvector of $H$ and $\lambda_i$ is the corresponding eigenvalue.
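
A short derivation may help connect the two statements above. It is a sketch under the assumption that $f$ is twice differentiable and $H$ is symmetric and invertible; the symbol $\Delta$ for the step is introduced here and is not part of the original notes.

Second-order Taylor expansion around $x_t$: $f(x_t + \Delta) \approx f(x_t) + g^T \Delta + \frac{1}{2} \Delta^T H \Delta$

Setting the derivative with respect to $\Delta$ to zero: $g + H \Delta = 0$, so $\Delta = -H^{-1} g$ and $x_{t+1} = x_t - H^{-1} g$

Using the eigendecomposition $H = \sum_{i=1}^{n} \lambda_i e_i e_i^T$: $H^{-1} = \sum_{i=1}^{n} \frac{1}{\lambda_i} e_i e_i^T$, hence $H^{-1} g = \sum_{i=1}^{n} \frac{e_i^T g}{\lambda_i} e_i$

Inner product of the step with the gradient: $g^T \Delta = -\sum_{i=1}^{n} \frac{(e_i^T g)^2}{\lambda_i}$

When every $\lambda_i > 0$ and $g \neq 0$, this inner product is negative, so the step points downhill. If some $\lambda_i$ is negative, the corresponding term changes sign and the guarantee is lost.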

6、 Others
1. Higher order partial derivative
An example of a higher-order partial derivative:
Given a multivariable function $z = f(x, y)$, $\frac{\partial^3 z}{\partial y \partial x^2}$ is a third-order partial derivative of $z$. It is computed in the following steps:
1. First, find the first-order partial derivative of $z$ with respect to $y$: $\frac{\partial z}{\partial y}$
2. Then differentiate the result of the previous step, $\frac{\partial z}{\partial y}$, with respect to $x$ to get the second-order partial derivative $\frac{\partial^2 z}{\partial y \partial x}$
3. Finally, differentiate the result of the previous step, $\frac{\partial^2 z}{\partial y \partial x}$, with respect to $x$ again to get the third-order partial derivative $\frac{\partial^3 z}{\partial y \partial x^2}$
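
As a worked instance of these three steps, take the function $z = x^3 y^2$, chosen here purely for illustration:

Step 1: $\frac{\partial z}{\partial y} = 2 x^3 y$

Step 2: $\frac{\partial^2 z}{\partial y \partial x} = \frac{\partial}{\partial x}\left(2 x^3 y\right) = 6 x^2 y$

Step 3: $\frac{\partial^3 z}{\partial y \partial x^2} = \frac{\partial}{\partial x}\left(6 x^2 y\right) = 12 x y$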

2. Positive definite matrix
When the Hessian matrix is not positive definite, there is no guarantee that the generated direction is a descent direction.

Classification of optimization methods

1、 According to the derivative order used
1. First order optimization
Stochastic gradient descent (SGD)

2. Second order optimization

2、 According to the adaptive learning rate
1. Fixed learning rate
Stochastic gradient descent (SGD)

2. Adaptive learning rate

3、 Yes no…

4、 Sort by time

Optimization methods


Parameters and data

Gradient descent method

1、 When can an optimization problem be solved by direct differentiation (as in maximum likelihood estimation), and when must gradient descent be used?

When the amount of data is relatively small, the solution can be obtained directly by setting the derivative to zero.

The reasons for not solving directly are as follows
1. With a large amount of data, all of the data must be substituted into the calculation, so the amount of computation is large and the memory requirement is high.
2. Complex functions cannot be solved in closed form.
3. The direct solution requires a matrix inverse, but not all matrices are invertible

Gradient descent updates the parameters at every step, and each step only needs the gradient to be calculated

2、 Can gradient ascent find the maximum value?

3、 Stopping conditions for gradient descent?
1. Set a maximum number of iteration steps, which can be chosen somewhat large
2. Stop when the difference in the loss function between two consecutive iterations falls below a defined threshold
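
The two stopping conditions can be sketched together as follows. This is a minimal one-dimensional example; the names `gradient_descent`, `max_iters`, and `tol`, and the test function $(x - 3)^2$, are assumptions made for illustration.

```python
def gradient_descent(grad, loss, x0, lr=0.1, max_iters=10_000, tol=1e-10):
    """Minimize a 1-D function with plain gradient descent.

    Stops when the loss changes by less than tol between two iterations,
    or when max_iters steps have been taken, whichever comes first.
    """
    x = x0
    prev_loss = loss(x)
    for step in range(1, max_iters + 1):
        x = x - lr * grad(x)
        cur_loss = loss(x)
        if abs(prev_loss - cur_loss) < tol:  # condition 2: loss barely changes
            return x, step
        prev_loss = cur_loss
    return x, max_iters  # condition 1: iteration budget exhausted

# Hypothetical objective (x - 3)^2 with its minimum at x = 3.
x_min, steps = gradient_descent(
    grad=lambda x: 2 * (x - 3),
    loss=lambda x: (x - 3) ** 2,
    x0=0.0,
)
print(x_min, steps)
```

Setting `max_iters` generously and letting the loss-difference test end the loop is one common combination; the threshold has to be tuned to the scale of the loss.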

Stochastic gradient descent

1、 Where is the "stochastic" in stochastic gradient descent?
A batch of sample data is drawn to update the parameter values.
One parameter update: parameter value = parameter value - step size * gradient with respect to that parameter (assuming the function's partial derivative with respect to the parameter has been computed)

The randomness lies in the range of samples used in each parameter update (each iteration): not the full data set, but a sample drawn from it.
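
A minimal sketch of this idea fits a single slope $w$ in $y \approx w x$ using a randomly shuffled mini-batch per update. The function name `sgd_fit_slope`, the batch size, and the synthetic noise-free data are assumptions for illustration only.

```python
import random

def sgd_fit_slope(xs, ys, lr=0.01, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # the random part: which samples form each batch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch's mean squared error with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w = w - lr * grad  # parameter = parameter - step size * gradient
    return w

xs = [0.1 * i for i in range(1, 21)]  # inputs 0.1 .. 2.0
ys = [2.0 * x for x in xs]            # noise-free targets, true slope 2
w = sgd_fit_slope(xs, ys)
print(w)
```

Each update touches only `batch_size` samples, so the cost per step does not grow with the size of the data set.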


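Adagrad

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} \, g_{t,i}$, where $\eta$ is the base learning rate and $\epsilon$ is a small constant that avoids division by zero.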

This formula resembles the gradient descent update; only the learning rate coefficient is different.

$g_{t,i}$ represents the gradient of parameter $i$ at step $t$.
$G_{t,i}$ denotes the sum of the squares of that parameter's gradients accumulated up to step $t$.


1. Principle: the learning rate is adjusted automatically for each parameter:

1.1. Parameters that are updated infrequently get a higher learning rate,

1.2. Parameters that are updated frequently get a lower learning rate.

2. Advantages:

2.1. The learning rate is adjusted automatically during training.

2.2. It makes use of the information in sparse gradients.

3. Disadvantages:

3.1. Adagrad's learning rate keeps shrinking over time, so progress can stall prematurely.
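
The per-parameter scaling can be sketched in a few lines. This is a minimal version of the standard Adagrad update, not a library implementation; the names `adagrad`, `lr`, and `eps`, and the test objective, are assumptions for illustration.

```python
import math

def adagrad(grad, theta0, lr=0.5, eps=1e-8, steps=500):
    """Adagrad: each parameter i uses the step size lr / sqrt(G[i] + eps)."""
    theta = list(theta0)
    G = [0.0] * len(theta)  # running sum of squared gradients, per parameter
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            G[i] += g[i] ** 2
            theta[i] -= lr / math.sqrt(G[i] + eps) * g[i]
    return theta, G

# Hypothetical objective f(a, b) = a^2 + 10 * b^2, minimized at (0, 0).
theta, G = adagrad(lambda t: [2 * t[0], 20 * t[1]], [1.0, 1.0])
print(theta, G)
```

Because `G` only ever grows, the effective step size `lr / sqrt(G[i] + eps)` can only shrink, which is the source of the stalling behavior noted above.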

Momentum method

1. Formula:
On top of the current gradient, the momentum method adds the previous update multiplied by a decay rate.
The decay rate is generally set to 0.9 or less.
2. Principle:
When SGD with a momentum term updates the model parameters, parameters whose current gradient direction agrees with the previous gradient direction receive a stronger update;
parameters whose current gradient direction differs from the previous one receive a weaker update, that is, the update along the current gradient direction is slowed down.

3. Advantages:
In this way, past gradient values are accumulated continuously.
Therefore, compared with plain SGD, the momentum method can converge faster and oscillate less.
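
One common way to write the momentum update keeps a velocity `v` per parameter. The sketch below uses full gradients rather than mini-batches to keep it short; the names `sgd_momentum`, `decay`, and the test objective are assumptions for illustration.

```python
def sgd_momentum(grad, x0, lr=0.01, decay=0.9, steps=300):
    """Gradient descent with a momentum (velocity) term."""
    x = list(x0)
    v = [0.0] * len(x)  # accumulated update, one entry per parameter
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            # Previous update scaled by the decay rate, plus the new gradient step.
            v[i] = decay * v[i] + lr * g[i]
            x[i] -= v[i]
    return x

# Hypothetical objective f(a, b) = a^2 + 10 * b^2, minimized at (0, 0).
x_final = sgd_momentum(lambda p: [2 * p[0], 20 * p[1]], [1.0, 1.0])
print(x_final)
```

When successive gradients agree in sign, `v` grows and the steps lengthen; when they alternate, the contributions partly cancel and the oscillation is damped.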

Newton method

1. Hessian matrix:
Newton's method uses the inverse of the Hessian matrix in place of a hand-set learning rate, so each step takes the local curvature into account. It still only finds a local stationary point: it does not guarantee escaping a local minimum, and when the Hessian is not positive definite the step may not be a descent direction.

2. Second derivative:
It is a second-order optimization method

1. Computing the inverse matrix takes roughly $O(n^3)$ time, which is too expensive for large-scale problems.
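
For a two-variable function the Newton update $x_{t+1} = x_t - H^{-1} g$ can be written out with the closed-form $2 \times 2$ inverse. The sketch below is illustrative only; `newton_2d` and the convex test function $f(x, y) = e^{x+y} - x - y + (x - y)^2$, whose minimum is at $(0, 0)$, are assumptions.

```python
import math

def newton_2d(grad, hess, x, y, steps=10):
    """Newton iteration (x, y) <- (x, y) - H^{-1} g for a 2-variable function."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        (a, b), (c, d) = hess(x, y)
        det = a * d - b * c
        if abs(det) < 1e-12:
            raise ValueError("Hessian is (nearly) singular; Newton step undefined")
        # H^{-1} g using the closed-form inverse of a 2x2 matrix.
        dx = (d * gx - b * gy) / det
        dy = (-c * gx + a * gy) / det
        x, y = x - dx, y - dy
    return x, y

def grad(x, y):
    e = math.exp(x + y)
    return e - 1 + 2 * (x - y), e - 1 - 2 * (x - y)

def hess(x, y):
    e = math.exp(x + y)
    return (e + 2, e - 2), (e - 2, e + 2)

x_star, y_star = newton_2d(grad, hess, 1.0, 0.5)
print(x_star, y_star)
```

No learning rate appears anywhere: the Hessian sets the step length. For $n$ parameters the $2 \times 2$ inverse becomes a general linear solve, which is where the cubic cost comes from.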


Adadelta is an improved version of Adagrad; its update rule borrows an idea from Newton's method to approximate second-order information.

Least angle regression

Used to solve lasso regression

The above classification is relatively rough; I will continue sorting it out when I have time.

In addition, online learning (optimization) algorithms and ordinary optimization algorithms address the same problem; they differ in the number of samples used and in the characteristics of the learned model.

Engineering practice

Optimization algorithms implemented in common toolkits
