A partial understanding of entropy, relative entropy and cross entropy

Time: 2020-02-10

Entropy was first proposed by Rudolf Julius Emanuel Clausius, a German physicist and one of the main founders of thermodynamics. He defined the change in entropy of a system during a reversible process as the heat transferred to the system divided by the temperature at which the transfer occurs, that is
$$dS = \frac {dQ}{T}$$
The above definition looks quite different from the information entropy we know today, because the entropy defined by Claude E. Shannon in information theory builds on the statistical-physics interpretation of entropy proposed by Ludwig E. Boltzmann. Boltzmann established the relationship between the entropy $S$ and the number of possible microscopic states $W$ corresponding to the macroscopic state of the system, namely $S \propto \ln W$. Max Karl Ernst Ludwig Planck later introduced the proportionality constant $k_{B}$, giving the Planck-Boltzmann formula
$$S = k_{B}\cdot \ln W$$
An equivalent expression of the above formula is
$$S=k_{B}\sum_{i}p_{i}\ln{\frac{1}{p_{i}}}$$
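The two expressions agree because, for $W$ equally likely microscopic states, each probability is $\frac{1}{W}$ and the sum collapses back to the Planck-Boltzmann formula:
$$k_{B}\sum_{i=1}^{W}\frac{1}{W}\ln\frac{1}{1/W}=k_{B}\cdot W\cdot\frac{1}{W}\ln W=k_{B}\ln W$$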
In general, $i$ indexes all possible microscopic states and $p_{i}$ is the probability of the $i$-th microscopic state. As can be seen from the formula, the reciprocal of the probability, $\frac{1}{p_{i}}$, plays the role of the number of possible microscopic states. A less rigorous but intuitive explanation is that a microstate occurring with probability $p_{i}$ can be thought of as one of $\frac{1}{p_{i}}$ equally likely alternatives. From this we can obtain the Shannon entropy of information theory
$$S=k\cdot\sum_{i=1}^{n}p_{i}\log_{2}\frac{1}{p_{i}}$$

As for why the base of the logarithm changes from $e$ to 2: mathematically it makes little difference, but Shannon defines the measure here from another perspective, using coin tosses as an analogy for the number of microscopic states $\frac{1}{p_{i}}$. Specifically, if the number of microscopic states is $2^{3}$, it corresponds to three coins (each coin has two possible outcomes, heads and tails, and in order they give eight combinations in total); the more coins a state corresponds to, the greater the entropy, so a “coin” can be understood as a unit of entropy. In fact, this unit is called the bit, and the logarithm is taken base 2 because information in a computer is stored as “0”s and “1”s: 3 bits of information means that 3 binary digits are needed to determine the state completely. Taking Chinese characters as an example, the whole system contains about 7000 commonly used characters. If all commonly used characters appeared with equal probability, then $S=\log_{2}7000\approx 13$, meaning that each character would carry about 13 bits of information. But equal frequency across 7000 characters is a hypothetical situation; only about 10% of Chinese characters are used with high frequency in daily life, so a character generally carries only about 5 bits. A 100,000-character novel therefore contains about 500,000 bits of information, or roughly 0.0625 MB. (The actual size of the file depends on the specific compression algorithm, and the difference may be large.)
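
As a minimal sketch (assuming plain Python with only the standard library; the numbers are just the back-of-the-envelope values quoted above), the estimate can be reproduced directly:

```python
import math

# If all ~7000 commonly used Chinese characters were equally likely,
# each character would carry log2(7000) bits of information.
bits_uniform = math.log2(7000)
print(f"uniform assumption: {bits_uniform:.2f} bits per character")  # ~12.77, i.e. about 13

# The text above assumes roughly 5 bits per character for real, skewed usage.
# A 100,000-character novel at 5 bits per character:
total_bits = 100_000 * 5
print(f"novel: {total_bits} bits = {total_bits / 8 / 1e6:.4f} MB before compression")
```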

To find the distribution that maximizes the entropy, construct the Lagrangian $L=\sum_{i=1}^{n}p_{i}\log_{2}\frac{1}{p_{i}}-\lambda\left(\sum_{i=1}^{n}p_{i}-1\right)$ and find its solution satisfying the KKT conditions
$$\left\{\begin{matrix} \frac{\partial L}{\partial p_{i}}=\log_{2}\frac{1}{p_{i}}-\frac{1}{\ln 2}-\lambda=0 \quad (i=1,2,\ldots,n) \\\frac{\partial L}{\partial \lambda}=\sum_{i=1}^{n}p_{i}-1=0 \end{matrix} \right.$$
It is easy to verify that $p_{1}=p_{2}=\cdots=p_{n}=\frac{1}{n}$ is a solution. Furthermore, the second-order condition $d^{T}\nabla_{x}^{2}L\,d\leq 0$ holds for all feasible directions $d$, which establishes that the above solution is a maximum of the constrained function.

Therefore, the entropy of a system reaches its maximum when all samples occur with equal probability; moreover, under equal probability, the more samples there are, the larger the entropy ($S=k\cdot\log_{2}n$). This means that the more ordered a system is, the lower its entropy: if the system variable is completely determined, i.e. $p=1$, then $S=0$ and the entropy reaches its minimum; the higher the uncertainty of a system, the higher its entropy.
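
A quick numerical check of these claims, written as a sketch assuming NumPy (the distributions are generated only for illustration):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits; terms with p_i = 0 contribute 0 by convention."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

n = 4
uniform = np.full(n, 1.0 / n)
print(entropy_bits(uniform))                 # log2(4) = 2.0, the maximum for n = 4

# Any other distribution over the same 4 outcomes has lower entropy.
rng = np.random.default_rng(0)
for _ in range(3):
    q = rng.dirichlet(np.ones(n))
    print(round(entropy_bits(q), 4), "<", entropy_bits(uniform))

# A completely determined variable (p = 1 for one outcome) has entropy 0.
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))    # 0.0
```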

Back to the most essential question: what is information entropy (Shannon entropy), or how can the concept be explained intuitively? In a narrow sense, we can think of information entropy as a measure of the information contained in an event. The more uncertain an event is, the more information we need to obtain the exact answer; in other words, the amount of information of an event is directly tied to its uncertainty, and the entropy of an event is numerically equal to its information content. The purpose of obtaining information is to reduce uncertainty, that is, to reduce the entropy of the event; the larger the entropy of an event, the more information is needed to remove its uncertainty.

Relative entropy

Starting from the basic concept of entropy, relative entropy can be derived. Relative entropy, also known as KL divergence, is used to measure the difference between two distributions.

Let $p(x)$ and $q(x)$ be two probability distributions of a discrete random variable $x$; then the relative entropy of $p$ with respect to $q$ is
$$D_{KL}(p||q)=\sum_{x}p(x)\log_{2}\frac{p(x)}{q(x)}=\mathbb{E}_{x\sim p}\left(\log_{2}p(x)-\log_{2}q(x)\right)$$

It is worth noting that relative entropy is not symmetric, that is, $D_{KL}(p||q)=D_{KL}(q||p)$ does not hold in general.
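
The following sketch (assuming NumPy; the two distributions are made up for illustration) computes the KL divergence in both directions and shows the asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits, assuming p and q are strictly positive and sum to 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl_divergence(p, q))   # ~0.531 bits
print(kl_divergence(q, p))   # ~0.737 bits -- a different value, so D_KL is not symmetric
```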

Cross entropy

Cross entropy measures how closely a simulated distribution approximates the real distribution, and it is often used as a loss function in machine learning and deep learning tasks. If $p(x)$ is the real data distribution and $q(x)$ is the simulated data distribution, then the cross entropy is
$$S_{CE}=\sum_{x}p(x)\log_{2}\frac{1}{q(x)}$$

As the simulated distribution approximates the real distribution better and better, the cross entropy decreases, reaching its minimum when $q(x)=p(x)$. In multi-class semantic segmentation in deep learning, for a given pixel the real class is assigned label value 1 and all other classes 0, and the network outputs $n$ result maps normalized by softmax. When the softmax output at the position corresponding to label 1 equals 1, the cross entropy loss attains its minimum value 0, that is, $loss=-\log\frac{e^{x_{label}}}{\sum_{i=1}^{n}e^{x_{i}}}=0$.
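
A minimal sketch of this loss for a single pixel, assuming NumPy (the class count, logits, and label index are made-up illustration values); as in the expression above, the natural logarithm is used:

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())              # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy_loss(logits, label):
    """With a one-hot target, the loss reduces to -log(softmax(logits)[label])."""
    return float(-np.log(softmax(logits)[label]))

# Raw network logits for one pixel of a hypothetical 4-class segmentation output.
logits = np.array([2.0, 0.5, -1.0, 0.1])
print(cross_entropy_loss(logits, label=0))                            # small positive value

# As the logit of the true class dominates, the softmax output approaches 1
# and the loss approaches its minimum value 0.
print(cross_entropy_loss(np.array([20.0, 0.5, -1.0, 0.1]), label=0))  # ~0
```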

At the same time, we note the following relationship among the entropy $S$, the relative entropy $D_{KL}$, and the cross entropy $S_{CE}$:
$$D_{KL}(p||q)=S_{CE}(p,q)-S(p)$$
Since the entropy $S(p)$ of the real distribution does not depend on the model, we can see that there is no essential difference between using the relative entropy $D_{KL}$ or the cross entropy $S_{CE}$ as the loss function in deep-learning tasks.
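
A quick numerical confirmation of this relationship, again as a sketch assuming NumPy ($p$ and $q$ are made-up distributions):

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log2(q)))

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

p = np.array([0.7, 0.2, 0.1])   # "real" distribution (e.g. the label distribution)
q = np.array([0.5, 0.3, 0.2])   # "simulated" distribution (e.g. the model output)

print(kl(p, q))                           # D_KL(p || q)
print(cross_entropy(p, q) - entropy(p))   # S_CE - S: the same value

# S(p) does not depend on q, so minimizing S_CE over q is equivalent to
# minimizing D_KL over q -- which is why either can serve as the loss.
```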