Machine learning study notes 03: the least squares method, maximum likelihood, and cross entropy

Time: 2022-6-3

Loss function

A loss function is a quantitative expression of how far the neural network's standard differs from the human brain's standard.

Least squares method

First, we need to understand how to compare two probability models. There are three ideas: least squares, maximum likelihood estimation, and cross entropy.

When the human brain judges a picture as \(x_{1}\) and the neural network's judgment is \(y_{1}\), subtracting them directly, \(\left|x_{1}-y_{1}\right|\), measures their difference. We judge many pictures and sum the differences; when the total reaches its minimum, \(\min \sum_{i=1}^{n}\left|x_{i}-y_{i}\right|\), the two models can be considered approximately the same.

However, the absolute value is not differentiable everywhere on its domain, so we square the differences instead: \(\min \sum_{i=1}^{n}\left(x_{i}-y_{i}\right)^{2}\)

That is the least squares method. But it is hard to use it to judge the difference between two probability models as a loss function, so we introduce maximum likelihood estimation.
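A minimal sketch of this loss in Python (the function name `least_squares_loss` and the example arrays are just illustrative, assuming labels \(x_{i}\) and network outputs \(y_{i}\) as above):

```python
import numpy as np

def least_squares_loss(x: np.ndarray, y: np.ndarray) -> float:
    """Sum of squared differences between the human-brain labels x_i and the network outputs y_i."""
    return float(np.sum((x - y) ** 2))

# Labels x_i are 0 or 1; network outputs y_i are probabilities in [0, 1].
x = np.array([1.0, 0.0, 1.0])
y = np.array([0.9, 0.2, 0.6])
print(least_squares_loss(x, y))  # approximately 0.21
```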

Maximum likelihood estimation

Likelihood: given that the real outcome has already happened, and assuming there are many candidate models, the probability that a particular probability model assigns to that outcome is called its likelihood.

The model with the largest likelihood assigns the highest probability to what actually happened, so it should be the closest to the standard model.

\[\begin{array}{l}
P\left(C_{1}, C_{2}, C_{3}, \ldots, C_{10} \mid \theta\right) \\
P\left(x_{1}, x_{2}, x_{3}, x_{4}, \ldots, x_{n} \mid W, b\right)
\end{array}
\]

\(\theta\) is the probability model for coin tossing, and \(W, b\) is the probability model of the neural network. The former's outcomes are whether the coin comes up heads or tails; the latter's outcomes are whether a picture is a cat.

\[\begin{array}{l}
P\left(x_{1}, x_{2}, x_{3}, x_{4}, \ldots, x_{n} \mid W, b\right) \\
=\prod_{i=1}^{n} P\left(x_{i} \mid W, b\right)
\end{array}
\]

Given such parameters, what probability does the neural network assign to each input picture being a cat, and what probability to it not being a cat? Assuming the pictures are independent, after judging all of them, the value obtained by multiplying these probabilities together is the likelihood; the parameters with the maximum likelihood are the closest.

But during training, \(W, b\) are fixed values no matter what kind of photo you input. If we judge with a photo of a cat, the label is \(1\), and there is no way to train this way: the theory is feasible but not operable. However, we can use what is available when training the neural network: besides \(x_{i}\) we also have \(y_{i}\), and the output \(y_{i}\) depends on \(W, b\). Each input photo is different, so each \(y_{i}\) is different.

\[\begin{array}{l}
=\prod_{i=1}^{n} P\left(x_{i} \mid W, b\right) \\
=\prod_{i=1}^{n} P\left(x_{i} \mid y_{i}\right)
\end{array}
\]

\(x_{i}\) takes the values \(0\) and \(1\), which follows a Bernoulli (two-point) distribution, whose probability distribution is

\[f(x)=p^{x}(1-p)^{1-x}=\left\{\begin{array}{ll}
p, & x=1 \\
1-p, & x=0
\end{array}\right.
\]

\(x=1\) means the picture is a cat, and \(p\) is \(y_{i}\) (the probability that the neural network recognizes a cat). Substituting this into \(P\left(x_{i} \mid y_{i}\right)\) gives

\[=\prod_{i=1}^{n} y_{i}^{x_{i}}\left(1-y_{i}\right)^{1-x_{i}}
\]

We prefer sums to products, so we put a \(\log\) in front and simplify:

\[\begin{array}{l}
\log \left(\prod_{i=1}^{n} y_{i}^{x_{i}}\left(1-y_{i}\right)^{1-x_{i}}\right) \\
=\sum_{i=1}^{n} \log \left(y_{i}^{x_{i}}\left(1-y_{i}\right)^{1-x_{i}}\right) \\
=\sum_{i=1}^{n}\left(x_{i} \cdot \log y_{i}+\left(1-x_{i}\right) \cdot \log \left(1-y_{i}\right)\right)
\end{array}
\]

Therefore, finding the maximum likelihood amounts to solving the following:

\[\begin{array}{l}
\max \left(\sum_{i=1}^{n}\left(x_{i} \cdot \log y_{i}+\left(1-x_{i}\right) \cdot \log \left(1-y_{i}\right)\right)\right) \\
\min -\left(\sum_{i=1}^{n}\left(x_{i} \cdot \log y_{i}+\left(1-x_{i}\right) \cdot \log \left(1-y_{i}\right)\right)\right)
\end{array}
\]
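A minimal sketch of the two equivalent computations in Python (natural logs here; `x` are the labels and `y` the network outputs, as in the example above):

```python
import numpy as np

def likelihood(x: np.ndarray, y: np.ndarray) -> float:
    """Product of the per-picture Bernoulli probabilities y_i^x_i * (1 - y_i)^(1 - x_i)."""
    return float(np.prod(y ** x * (1 - y) ** (1 - x)))

def log_likelihood(x: np.ndarray, y: np.ndarray) -> float:
    """The same quantity after taking the log: a sum instead of a product."""
    return float(np.sum(x * np.log(y) + (1 - x) * np.log(1 - y)))

x = np.array([1.0, 0.0, 1.0])
y = np.array([0.9, 0.2, 0.6])
print(likelihood(x, y))            # 0.432
print(np.log(likelihood(x, y)))    # equals log_likelihood(x, y)
print(-log_likelihood(x, y))       # the quantity we minimize during training
```

Maximizing the log-likelihood and minimizing its negative give the same parameters; the negative form is what is usually implemented as the loss.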

A review of the rules of logarithms:

  1. \(\log _{a}(1)=0\)
  2. \(\log _{a}(a)=1\)
  3. Negative numbers and zero have no logarithm.
  4. \(\log _{a} b \cdot \log _{b} a=1\)
  5. \(\log _{a}(M N)=\log _{a} M+\log _{a} N\)
  6. \(\log _{a}(M / N)=\log _{a} M-\log _{a} N\)
  7. \(\log _{a} M^{n}=n \log _{a} M \quad(M>0,\ n \in \mathbb{R})\)
  8. \(\log _{a^{n}} M=\frac{1}{n} \log _{a} M\)
  9. \(a^{\log _{a} b}=b\)

Cross entropy

If you want to compare two models directly, they must be of the same type; otherwise they are not commensurable. To measure probability models in a uniform way, we need to introduce entropy (the degree of disorder of a system).

Information content

If we want a function for the amount of information, we need to define it, and find a form that keeps the system self-consistent.

\[f(x):=\text { amount of information }
\]

Substituting the fourth formula above into the third gives the second formula below. To make the definition of the amount of information self-consistent, we define it with a \(\log\), which matches the pattern of multiplication turning into addition.

\[\begin{array}{c}
f(x):=? \log _{?} x \\
f\left(x_{1} \cdot x_{2}\right)=f\left(x_{1}\right)+f\left(x_{2}\right)
\end{array}
\]

To match our intuition that the smaller the probability, the greater the amount of information, and since the \(\log\) function is monotonically increasing, we flip the sign:

\[\begin{array}{c}
f(x):=-\log _{2} x \\
f\left(x_{1} \cdot x_{2}\right)=f\left(x_{1}\right)+f\left(x_{2}\right)
\end{array}
\]

Consider the number of bits of data in a computer: input a 16-bit value, and the probability of any particular value before input is \(1/2^{16}\); after input the probability becomes \(1\). The amount of information is \(16\) bits.

The amount of information can be understood as how hard it is to take an event from its original uncertainty to certainty: the larger the amount of information, the greater the difficulty.
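A quick sketch of that 16-bit example (base-2 log, so the unit is bits; the function name is just illustrative):

```python
import math

def information_content(p: float) -> float:
    """Amount of information, in bits, gained when an event of probability p becomes certain."""
    return -math.log2(p)

# Before input, each of the 2**16 possible 16-bit values has probability 1/2**16.
print(information_content(1 / 2 ** 16))  # 16.0 bits
```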

Entropy measures not a specific event but the whole system: how hard it is to take the system from uncertain to certain.

Both are measures of difficulty, and both can be measured in bits.

Definition of system entropy

The amount of information each event contributes to the system, weighted by its probability, can be computed as an expectation.
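Written out with the information function \(f\) defined above, the entropy of a system whose events have probabilities \(p_{1}, \ldots, p_{m}\) is the expected amount of information:

\[H(P):=\sum_{i=1}^{m} p_{i} \cdot f\left(p_{i}\right)=\sum_{i=1}^{m} p_{i} \cdot\left(-\log _{2} p_{i}\right)
\]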

KL divergence

The KL divergence is always greater than or equal to \(0\): it equals \(0\) when \(Q\) and \(P\) are equal, and must be greater than \(0\) when they are not.

To make the two models \(Q\) and \(P\) close, we must minimize the cross entropy.
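The step connecting the two can be written out explicitly: the KL divergence splits into the cross entropy minus the entropy of \(P\),

\[\begin{array}{l}
D_{K L}(P \| Q)=\sum_{i=1}^{m} p_{i} \cdot \log _{2} \frac{p_{i}}{q_{i}} \\
=\sum_{i=1}^{m} p_{i} \cdot\left(-\log _{2} q_{i}\right)-\sum_{i=1}^{m} p_{i} \cdot\left(-\log _{2} p_{i}\right) \\
=H(P, Q)-H(P)
\end{array}
\]

Since the benchmark \(P\) is fixed, \(H(P)\) is a constant, so minimizing the KL divergence is the same as minimizing the cross entropy \(H(P, Q)\).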

In the cross entropy, \(m\) is the number of events of whichever of the two probability models has more events; here we replace it with \(n\), the number of pictures.

On the choice of \(m\): if \(p\) has \(m\) events and \(q\) has \(n\) events with \(m>n\), then the sum \(\sum\) is written with the larger upper limit \(m\). It can be decomposed as \(\sum_{i=1}^{n}+\sum_{i=n+1}^{m}\); since \(q\) has only \(n\) events, the terms of \(\sum_{i=n+1}^{m}\) corresponding to \(q\) are all \(0\).

\[\begin{array}{l}
\boldsymbol{H}(\boldsymbol{P}, \boldsymbol{Q}) \\
=\sum_{i=1}^{m} p_{i} \cdot\left(-\log _{2} q_{i}\right) \\
=\sum_{i=1}^{n} x_{i} \cdot\left(-\log _{2} q_{i}\right) \\
=-\sum_{i=1}^{n}\left(x_{i} \cdot \log _{2} y_{i}+\left(1-x_{i}\right) \cdot \log _{2}\left(1-y_{i}\right)\right)
\end{array}
\]

\(P\) is the benchmark, the probability model to be compared against; it is the human-brain model, which is either completely cat or completely not cat.

\(x_{i}\) has two cases, while \(y_{i}\) only judges how much the picture looks like a cat, not how much it does not look like a cat. To pair \(x_{i}\) with \(q_{i}\): when \(x_{i}\) is \(1\), use the probability of looking like a cat; when \(x_{i}\) is \(0\), use the probability of not looking like a cat.
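A sketch of this pairing for a single picture in Python (base-2 logs, so the result is in bits; the variable names are just illustrative):

```python
import numpy as np

def cross_entropy_bits(p: np.ndarray, q: np.ndarray) -> float:
    """H(P, Q) = sum_i p_i * (-log2 q_i)."""
    return float(np.sum(p * -np.log2(q)))

x_i, y_i = 1.0, 0.9                # label: it is a cat; network: 90% sure it is a cat
p = np.array([x_i, 1 - x_i])       # benchmark distribution over (cat, not cat)
q = np.array([y_i, 1 - y_i])       # network's distribution over (cat, not cat)

# Same value as -(x_i * log2(y_i) + (1 - x_i) * log2(1 - y_i))
print(cross_entropy_bits(p, q))    # approximately 0.152 bits
```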

In the end, the formula we derive is the same as the one derived from maximum likelihood. But from a physical point of view the two are very different; they only coincide in form.

  • In the maximum likelihood method, the \(\log\) is something we introduced ourselves to turn the product into a sum. In cross entropy, the \(\log\) is written into the definition of the amount of information, with base \(2\), so the result carries a unit: bits.
  • The maximum likelihood method computes a maximum, while we usually compute a minimum; in cross entropy the minus sign is already written into the definition (see the numerical check after this list).
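As a small numerical check of these two points, the base-2 cross entropy and the natural-log negative log-likelihood differ only by the constant factor \(1 / \ln 2 \approx 1.4427\) (because \(\log _{2} a=\ln a / \ln 2\)), so minimizing either one gives the same parameters:

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0])   # labels from the human brain
y = np.array([0.9, 0.2, 0.6])   # outputs from the neural network

# Negative log-likelihood (natural log), the maximum likelihood view
nll = -np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))

# Cross entropy (base-2 log), the information-theory view, measured in bits
ce_bits = -np.sum(x * np.log2(y) + (1 - x) * np.log2(1 - y))

print(ce_bits / nll)  # approximately 1.4427, i.e. 1 / ln(2)
```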