### Numerical computation

#### 1. Overflow and underflow

The first problem in numerical computation is overflow and underflow. Overflow occurs when a value's magnitude is too large to be represented: it becomes inf, which usually turns into NaN in later arithmetic. Underflow occurs when a value very close to zero is rounded to 0. Both can break later computations, so standard workarounds have been developed.

##### Underflow

When a value underflows to 0, the first thing to worry about is division by zero. If a denominator underflows to 0, the computation fails. A well-known example is softmax combined with the cross-entropy loss. Backpropagating through the cross-entropy loss involves computing $\frac{1}{y}$. When the softmax output $y$ underflows to 0, this step fails. The solution is to **merge softmax and the cross-entropy loss into a single operation, which avoids the division**.
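To see why merging removes the division, here is a short derivation sketch. It assumes a target vector $t$ with $\sum_i t_i = 1$ (for example one-hot labels), logits $z$, and $y=\mathrm{softmax}(z)$; the symbols $t$, $L$ and $\delta_{ik}$ are notation introduced here for illustration only.

$$L=-\sum_i t_i\log y_i,\qquad \frac{\partial L}{\partial y_i}=-\frac{t_i}{y_i},\qquad \frac{\partial y_i}{\partial z_k}=y_i(\delta_{ik}-y_k)$$

$$\frac{\partial L}{\partial z_k}=\sum_i\left(-\frac{t_i}{y_i}\right)y_i(\delta_{ik}-y_k)=-t_k+y_k\sum_i t_i=y_k-t_k$$

The factor $\frac{1}{y_i}$ cancels against the $y_i$ in the softmax derivative, so the merged gradient $y_k-t_k$ never divides by a value that may have underflowed to 0.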

##### Overflow

When a value exceeds the largest number the computer can represent, it overflows to inf (and typically to NaN downstream), and the whole computation fails. The softmax formula is:

$$y_i=\dfrac{e^{z_i}}{\sum\limits_{j}e^{z_j}}$$

When $z_i$ in the numerator is large, $e^{z_i}$ grows rapidly and is likely to overflow to an invalid value. The solution is to use $z_{i,\text{new}}=z_i-\max\limits_j z_j$. The largest exponent is then 0, so the numerator is at most 1, which avoids overflow in the numerator. This does introduce underflow (the exponent can be a very large negative number), but when the numerator underflows to 0, the output $y_i=0$ is still a meaningful value (**although this 0 is exactly why the merge operation mentioned above is necessary**). The shift also guarantees that the denominator contains a term equal to 1, which avoids division by zero.
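The max-subtraction trick and the merged loss can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names `softmax` and `cross_entropy_from_logits`, the example logits, and the integer class index `target` are all assumptions made for this sketch.

```python
import math

def softmax(z):
    """Softmax with the maximum subtracted from every logit first."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # every exponent is <= 0, so exp cannot overflow
    total = sum(exps)                    # >= 1, because the largest logit contributes exp(0) = 1
    return [e / total for e in exps]

def cross_entropy_from_logits(z, target):
    """Merged softmax + cross-entropy: -log(softmax(z)[target]) via log-sum-exp."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[target]       # no division and no log(0)

logits = [1000.0, 0.0, -1000.0]          # hypothetical extreme logits
probs = softmax(logits)                  # small entries underflow to exactly 0.0
loss = cross_entropy_from_logits(logits, target=2)
```

With these logits, a naive `math.exp(1000.0)` would raise `OverflowError`, and taking `math.log` of the underflowed probability `probs[2]` would fail as well, while the merged loss stays finite.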

#### 2. Ill-conditioning

The condition number measures how rapidly a function changes in response to small changes in its input. For a matrix, the condition number is the ratio of the largest to the smallest eigenvalue magnitude: $\max\limits_{i,j}\left|\dfrac{\lambda_i}{\lambda_j}\right|$. **When the condition number of a matrix is large, matrix inversion is very sensitive to errors in the input.** This is an intrinsic property of the matrix, not a rounding-error problem, and it is called ill-conditioning. (As a quiet aside, PCA amounts to discarding the small $\lambda$.)
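A small numerical sketch of this sensitivity, restricted to symmetric 2x2 matrices so that plain Python suffices. The matrix entries, the right-hand sides, and the helper names `sym2x2_eigenvalues` and `solve2x2` are assumptions chosen for illustration.

```python
import math

def sym2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean + radius, mean - radius

def solve2x2(a, b, d, rhs):
    """Solve [[a, b], [b, d]] x = rhs with Cramer's rule."""
    det = a * d - b * b
    x1 = (rhs[0] * d - b * rhs[1]) / det
    x2 = (a * rhs[1] - b * rhs[0]) / det
    return [x1, x2]

a, b, d = 1.0, 1.0, 1.0001               # nearly singular matrix
lam_max, lam_min = sym2x2_eigenvalues(a, b, d)
cond = abs(lam_max) / abs(lam_min)        # ratio of eigenvalue magnitudes, roughly 4e4

x = solve2x2(a, b, d, [2.0, 2.0001])            # close to [1, 1]
x_perturbed = solve2x2(a, b, d, [2.0, 2.0002])  # close to [0, 2]
```

Changing one entry of the right-hand side by 0.0001 moves the solution from about `[1, 1]` to about `[0, 2]`, which is the kind of amplification a large condition number predicts.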

#### 3. Second derivative

Gradient descent uses only the first derivative, which points in the direction of steepest descent. This is a simplification: it takes only a first-order Taylor expansion of the objective function, and under that approximation the negative gradient really is the fastest-decreasing direction. But what if we use a second-order Taylor expansion? **The second derivative describes how the first derivative changes with the input, and so tells us whether a gradient step will produce the improvement we expect.**

The second derivative along a unit direction $d$ is $d^THd$, where $H$ is the Hessian matrix. When $d$ is an eigenvector of $H$, the second derivative in that direction is the corresponding eigenvalue. The second-order Taylor expansion of $f(x)$ around $x_0$ is:

$$f(x)\approx f(x_0)+\nabla f(x_0)^T(x-x_0)+\frac{1}{2}(x-x_0)^TH(x-x_0)$$

With learning rate $\epsilon$ and gradient $g=\nabla f(x_0)$, the new point is $x_0-\epsilon g$, so:

$$f(x_0-\epsilon g)\approx f(x_0)-\epsilon g^Tg+\frac{1}{2}\epsilon^2g^THg$$

The formula shows that when the second-order term $g^THg$ is zero or negative, the approximation predicts that the function decreases, but $\epsilon$ must be small enough for the prediction to be accurate (a Taylor expansion is only a local approximation). When the second-order term is positive and large enough that $\frac{1}{2}\epsilon^2g^THg>\epsilon g^Tg$, $f(x)$ actually increases and the gradient descent step fails.
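The effect of the step size can be checked on a small quadratic, for which the second-order expansion is exact. This is a sketch under assumptions: the objective $f(x)=\frac{1}{2}x^THx$, the matrix `H`, the starting point `x0`, and the helper names are all made up for illustration. It also computes the step size that minimizes the second-order approximation along the gradient when $g^THg>0$, namely $\epsilon^*=\frac{g^Tg}{g^THg}$.

```python
def quad_form(H, u, v):
    """Compute u^T H v for small dense matrices given as nested lists."""
    return sum(u[i] * H[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def f(x, H):
    """Quadratic objective f(x) = 0.5 * x^T H x."""
    return 0.5 * quad_form(H, x, x)

H = [[1.0, 0.0], [0.0, 10.0]]            # hypothetical Hessian
x0 = [1.0, 1.0]
g = [sum(H[i][j] * x0[j] for j in range(2)) for i in range(2)]  # gradient H x0

gTg = sum(gi * gi for gi in g)
gHg = quad_form(H, g, g)
eps_star = gTg / gHg                     # minimizer of the second-order model along -g

def predicted(eps):
    """Second-order Taylor prediction of f(x0 - eps * g)."""
    return f(x0, H) - eps * gTg + 0.5 * eps ** 2 * gHg

def actual(eps):
    """True value of f after one gradient step of size eps."""
    x_new = [x0[i] - eps * g[i] for i in range(2)]
    return f(x_new, H)
```

Here `actual(eps_star)` is below `f(x0, H)`, while a larger step such as `actual(0.25)` ends up above `f(x0, H)`: the positive second-order term outweighs the first-order decrease.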

In addition, **if the Hessian matrix has a poor (large) condition number, gradient descent performs poorly**. The derivative increases rapidly in some directions and slowly in others, so SGD oscillates.
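The oscillation can be reproduced with plain gradient descent on a diagonal quadratic. This is only a sketch: the curvatures (Hessian eigenvalues 1 and 100), the learning rate, and the function name `gradient_descent_path` are assumptions for illustration, and the updates are deterministic rather than stochastic.

```python
def gradient_descent_path(x0, curvatures, lr, steps):
    """Gradient descent on f(x) = 0.5 * sum(c_i * x_i**2); returns every iterate."""
    path = [list(x0)]
    x = list(x0)
    for _ in range(steps):
        # The gradient of f in coordinate i is c_i * x_i.
        x = [xi - lr * c * xi for xi, c in zip(x, curvatures)]
        path.append(x)
    return path

curvatures = [1.0, 100.0]                # Hessian eigenvalues, condition number 100
path = gradient_descent_path([1.0, 1.0], curvatures, lr=0.019, steps=50)
flat = [p[0] for p in path]              # low-curvature coordinate
steep = [p[1] for p in path]             # high-curvature coordinate
```

The learning rate has to stay small to keep the steep coordinate stable, yet that coordinate still flips sign on every step, while the flat coordinate has covered only part of the distance to the minimum after 50 steps.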

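When the gradient is zero and the second derivative is 0, the second-derivative test is inconclusive and the point may be a saddle point. In multiple dimensions, a critical point whose Hessian has both positive and negative eigenvalues is a saddle point. Newton's method is easily attracted to saddle points.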

### Supplement

#### 1. Radial basis function

The radial basis function (RBF) is commonly used as a kernel function in SVMs:

$$K(x,x')=e^{-\dfrac{\|x-x'\|_2^2}{2\sigma^2}}$$

where $\sigma$ is a free parameter.

Let $\gamma=\dfrac{1}{2\sigma^2}$; then $K=e^{-\gamma\|x-x'\|_2^2}$. With the RBF kernel, a sample point is activated only by nearby inputs, and the kernel has fewer parameters than the polynomial kernel. In addition, RBF networks use the radial basis function as their activation function.
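A minimal sketch of the kernel in plain Python, using the $\gamma=\frac{1}{2\sigma^2}$ form. The function name `rbf_kernel`, the default `sigma=1.0`, and the sample points are assumptions for illustration.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) with gamma = 1 / (2 * sigma^2)."""
    gamma = 1.0 / (2.0 * sigma ** 2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel([0.0, 0.0], [0.0, 0.0])   # identical points give the maximum value 1
near = rbf_kernel([0.0, 0.0], [0.1, 0.0])   # nearby point, value close to 1
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])    # distant point, value close to 0
```

The kernel value decays quickly with distance, which is the sense in which a point is only activated by nearby inputs; a larger `sigma` widens that neighborhood.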

#### 2. Jensen inequality

For a convex function $f$, any set of points $\{x_i\}$, and weights $\lambda_i\ge0$ with $\sum_i\lambda_i=1$, mathematical induction shows that $f\left(\sum\limits_{i=1}^{M}\lambda_ix_i\right)\leq\sum\limits_{i=1}^{M}\lambda_if(x_i)$. In probability theory this reads $f(E(x))\leq E(f(x))$.

Jensen's inequality is useful in the proof of the EM algorithm. In addition, whether a function is convex can be judged from its second derivative or Hessian: if $f''(x)\ge0$ everywhere, or the Hessian is positive semidefinite everywhere, the function is convex. Jensen's inequality gives another characterization: if the function is convex, then $f(E(x))\leq E(f(x))$.
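A quick numerical check of the inequality for one convex function. The choice $f(x)=x^2$, the sample points, and the weights are assumptions for illustration; a single example like this illustrates the statement but proves nothing.

```python
def expectation(values, weights):
    """Weighted average sum(lambda_i * v_i), with weights summing to 1."""
    return sum(w * v for v, w in zip(values, weights))

def f(x):
    """A convex function: f''(x) = 2 >= 0 everywhere."""
    return x * x

xs = [1.0, 2.0, 6.0]
weights = [0.5, 0.25, 0.25]

lhs = f(expectation(xs, weights))                # f(E(x))
rhs = expectation([f(x) for x in xs], weights)   # E(f(x))
```

For $f(x)=x^2$ the gap `rhs - lhs` is exactly the variance of the weighted sample, so it is zero only when all points with nonzero weight coincide.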

#### 3. Global optimum and local optimum

One day Plato asked his teacher Socrates, "What is love?" Socrates told him to walk through a wheat field once and pick the biggest ear of wheat. He was not allowed to turn back, and he could pick only once. Plato came out empty-handed. His explanation was that whenever he saw a good ear he could not tell whether it was the best, so he kept hoping for a better one; only at the end did he realize that what remained was not as good as what he had passed, so he gave up. Socrates told him, "This is love." The story illustrates a point: because of the uncertainties in life, the global optimum is hard to find, or may not exist at all. We should set some constraints and then find the best solution within that range, that is, the local optimum. Gaining something is better than returning empty-handed, even if it is just an interesting experience.

Another day Plato asked, "What is marriage?" Socrates told him to walk through the woods once and choose the best tree to use as a Christmas tree, again without turning back and choosing only once. This time he came back exhausted, dragging a fir tree that looked straight and green but a little sparse. His explanation was that, having learned from last time, when he finally saw a decent-looking fir and realized his time and strength were running out, he took it back whether or not it was the best. Socrates told him, "This is marriage."

Optimization problems are generally divided into local optimization and global optimization:

- Local optimization finds the minimum within a limited region of the function's domain, while global optimization finds the minimum over the whole domain.
- A local minimum of a function is a point whose function value is less than or equal to the value at all nearby points, though it may be larger than the value at distant points.
- A global minimum is a point whose function value is less than or equal to the value at every feasible point.
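The difference between the two kinds of minimum shows up when gradient descent is started from different points on a nonconvex function. The function $f(x)=x^4-3x^2+x$, the learning rate, and the starting points below are assumptions chosen for illustration.

```python
def f(x):
    """A nonconvex function with two minima of different depth."""
    return x ** 4 - 3 * x ** 2 + x

def df(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=2000):
    """Plain gradient descent; converges to whichever minimum is downhill from the start."""
    for _ in range(steps):
        x -= lr * df(x)
    return x

x_right = gradient_descent(2.0)    # ends near x = 1.13, a local minimum
x_left = gradient_descent(-2.0)    # ends near x = -1.30, the global minimum
```

Both runs stop at a point where the derivative is zero and every nearby point is higher, but `f(x_left)` is lower than `f(x_right)`, so only the left one is the global minimum.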

#### 4. Use standard deviation instead of variance

Standard deviation is used instead of variance to describe the dispersion of data because it has three advantages:

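- It has the same order of magnitude as the data.
- Its unit is the same as the unit of the data.
- It is convenient for calculation. Under a normal distribution, about 99% of the data (more precisely, about 99.7%) lies within three standard deviations of the mean: $P(|x-\mu|\le3\sigma)\approx99.7\%$.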
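The three-sigma figure can be checked with the standard library's error function, using the identity $P(|x-\mu|\le k\sigma)=\operatorname{erf}\left(\frac{k}{\sqrt{2}}\right)$ for a normal distribution. The function name `prob_within` and the dictionary `coverage` are names made up for this sketch.

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

# Coverage for one, two and three standard deviations.
coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} sigma: {p:.4f}")
```

Because the standard deviation is in the same unit as the data, these intervals can be read off directly as ranges around the mean, which is not possible with the variance.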

#### 5. Quadratic programming

A quadratic programming problem with $n$ variables and $m$ constraints has the following form:

$$\begin{aligned}\min_X\ & f(X)=\frac{1}{2}X^TQX+C^TX \\ \text{s.t.}\ & AX\leq b\end{aligned}$$

When $Q$ is positive semidefinite, the problem is a convex quadratic program; if in addition the feasible region is nonempty and the objective is bounded below on it, a global optimum exists. If $Q$ is not positive semidefinite, the problem is NP-hard and may have multiple stationary points. When $Q=0$, the problem degenerates to linear programming.

If a point $X$ is a global minimum, it satisfies the KKT conditions. If $f(X)$ is convex, the KKT conditions become necessary and sufficient: any point that satisfies them is a global minimum.
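As a sketch of what checking the KKT conditions looks like, consider a one-variable convex instance of the problem above: minimize $\frac{1}{2}x^2-2x$ subject to $x\le1$, i.e. $Q=1$, $C=-2$, $A=1$, $b=1$. The instance, the multiplier name `mu`, and the function `kkt_satisfied` are assumptions made for illustration.

```python
def kkt_satisfied(x, mu, Q=1.0, C=-2.0, A=1.0, b=1.0, tol=1e-9):
    """Check the KKT conditions of min 0.5*Q*x^2 + C*x s.t. A*x <= b (scalar case)."""
    stationarity = abs(Q * x + C + A * mu) <= tol   # gradient of the Lagrangian is zero
    primal = A * x - b <= tol                       # constraint holds
    dual = mu >= -tol                               # multiplier is nonnegative
    slackness = abs(mu * (A * x - b)) <= tol        # complementary slackness
    return stationarity and primal and dual and slackness

at_optimum = kkt_satisfied(1.0, 1.0)       # constrained minimum x = 1 with mu = 1
interior_point = kkt_satisfied(0.5, 0.0)   # feasible but not stationary
unconstrained = kkt_satisfied(2.0, 0.0)    # stationary but infeasible
```

Since this instance is convex, the only point passing all four checks, $x=1$ with $\mu=1$, is the global minimum.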

Regarding duality, the dual of a quadratic program is also a quadratic program, and the dual of a convex quadratic program is also a convex quadratic program.

Methods for solving convex quadratic programs include the interior point method, the conjugate gradient method, and the ellipsoid method.

In Python, the cvxopt package can be used to solve quadratic programming problems.