[Course notes] Information Theory, University of Science and Technology of China (6)

Time:2022-6-1

Information inference

This part lies at the intersection of information theory and statistics. It resembles the inference of “hidden variables” in machine learning, i.e., recovering the true underlying information from observed values. Whereas machine learning tends to propose specific inference methods, information theory pays more attention to the nature of inference and the best achievable inference accuracy.

Hypothesis testing

Decide which hypothesis is true from observations.

In probabilistic terms: select the probability distribution that best matches the observed random variable.

  • Problem description

    \[\begin{aligned}
    &\mathcal{H}_{0}: \quad X \sim p_{0}(x) \quad (\text{“null”})\\
    &\mathcal{H}_{1}: \quad X \sim p_{1}(x) \quad (\text{“alternative”})
    \end{aligned}
    \]

  • Decision rule (indicator variable) \(\delta: X \mapsto\{0,1\}\): based on the observed value \(x\), decide which distribution it comes from (a short code sketch of both kinds of rules follows this list)

    • Deterministic

      \[\begin{aligned}
      \delta(x) &=1 \quad \text { if } x \in X_{1} \\
      &=0 \quad \text { if } x \in X \backslash X_{1}=X_{1}^{c}
      \end{aligned}
      \]

    • Randomized

      \[\tilde{\delta}(x)=P(\delta=1 \mid X=x)
      \]
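
As a concrete illustration (not from the notes; the alphabet, region \(X_{1}\), and probabilities are made up), here is a minimal Python sketch of the two kinds of decision rules:

```python
# Minimal sketch: deterministic vs. randomized decision rules over a small
# observation alphabet.  All values here are hypothetical.

alphabet = [0, 1, 2, 3]
X1 = {2, 3}                      # region where we decide H1

def delta(x):
    """Deterministic rule: 1 if x is in X1, else 0."""
    return 1 if x in X1 else 0

def delta_tilde(x):
    """Randomized rule: returns P(decide H1 | X = x)."""
    return {0: 0.0, 1: 0.3, 2: 0.8, 3: 1.0}[x]   # hypothetical values

print([delta(x) for x in alphabet])        # [0, 0, 1, 1]
print([delta_tilde(x) for x in alphabet])  # [0.0, 0.3, 0.8, 1.0]
```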

Next, how should the decision criterion be designed? Depending on whether a prior probability over the hypotheses is assumed, the test is either a Bayesian or a Neyman-Pearson hypothesis test.

Bayes

Setup

  • Each hypothesis has a prior probability

    \[\begin{aligned}
    \pi_{0} &=P\left(X \sim p_{0}\right) \\
    \pi_{1}=1-\pi_{0} &=P\left(X \sim p_{1}\right)
    \end{aligned}
    \]

  • Each decision incurs a cost: deciding \(\mathcal{H}_{i}\) when the true hypothesis is \(\mathcal{H}_{j}\) costs \(C_{i,j}\), \(i,j=0,1\)

  • Bayesian risk (deterministic decision)

    • Conditional risk when the true hypothesis is \(\mathcal{H}_{j}\)

      \[R_{j}(\delta)=C_{1, j} p_{j}\left(X_{1}\right)+C_{0, j} p_{j}\left(X_{1}^{c}\right)
      \]

      where \(p_{j}\left(X_{1}\right)\) is the probability of deciding 1 when the true distribution is \(p_{j}\)

    • Averaging over the prior probabilities gives the overall Bayesian risk

      \[r(\delta)=\pi_{0} R_{0}(\delta)+\pi_{1} R_{1}(\delta) \label{1}
      \]

  • Bayesian risk (randomized decision); a numerical sketch of these risk computations follows this list

    • Conditional risk

      \[R_{j}(\tilde{\delta})=C_{1, j} \sum_{x \in \mathcal{X}} \tilde{\delta}(x) p_{j}(x)+C_{0, j} \sum_{x \in X}[1-\tilde{\delta}(x)] p_{j}(x)
      \]

    • Bayesian risk

      \[r(\tilde{\delta})=\pi_{0} R_{0}(\tilde{\delta})+\pi_{1} R_{1}(\tilde{\delta}) \label{2}
      \]
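
The following Python sketch evaluates these risk definitions numerically; all priors, costs, distributions, and the rule itself are hypothetical values chosen only for illustration:

```python
# Sketch: conditional risks R_j and the Bayesian risk r for a randomized
# rule.  All numbers (priors, costs, distributions, rule) are hypothetical.

alphabet = [0, 1, 2, 3]
p0 = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}          # p_0(x)
p1 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}          # p_1(x)
pi0, pi1 = 0.6, 0.4                            # priors
C = {(0, 0): 0.0, (1, 0): 1.0,                 # C[i, j]: decide i, truth j
     (0, 1): 1.0, (1, 1): 0.0}
delta_tilde = {0: 0.0, 1: 0.2, 2: 0.7, 3: 1.0} # P(decide H1 | x)

def conditional_risk(j, p):
    """Conditional risk R_j for the randomized rule, per the formula above."""
    return sum(C[1, j] * delta_tilde[x] * p[x]
               + C[0, j] * (1 - delta_tilde[x]) * p[x] for x in alphabet)

R0 = conditional_risk(0, p0)
R1 = conditional_risk(1, p1)
r = pi0 * R0 + pi1 * R1                        # Bayesian risk, eq. (2)
print(R0, R1, r)                               # 0.30, 0.35, 0.32 here
```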

Optimal solution

Deterministic decision

The goal is to design the decision rule \(\delta\) (equivalently, the region \(X_{1}\)) so as to minimize the Bayesian risk.

To this end, expand and simplify the Bayesian risk \(\eqref{1}\):

\[\begin{aligned}
r(\delta)&= \pi_{0} C_{1,0} p_{0}\left(X_{1}\right)+\pi_{0} C_{0,0} p_{0}\left(X_{1}^{c}\right) \\
& \quad+\pi_{1} C_{1,1} p_{1}\left(X_{1}\right)+\pi_{1} C_{0,1} p_{1}\left(X_{1}^{c}\right) \\
&= \pi_{0} C_{0,0}+\pi_{1} C_{0,1} \\
& \quad+\pi_{0}\left(C_{1,0}-C_{0,0}\right) p_{0}\left(X_{1}\right)+\pi_{1}\left(C_{1,1}-C_{0,1}\right) p_{1}\left(X_{1}\right) \\
&= \text { constant }+\sum_{x \in X_{1}}\left[\pi_{0}\left(C_{1,0}-C_{0,0}\right) p_{0}(x)+\pi_{1}\left(C_{1,1}-C_{0,1}\right) p_{1}(x)\right]
\end{aligned}
\]

where \(p_{0}\left(X_{1}^{c}\right)=1-p_{0}\left(X_{1}\right)\) is used; the first line after the second equality is a constant, and the third equality follows from \(p_{1}\left(X_{1}\right)=\sum_{x \in X_{1}}p_{1}(x)\).
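
As a quick sanity check (with hypothetical priors, costs, distributions, and region), the direct form of \(\eqref{1}\) and the expanded form above agree numerically:

```python
# Sketch: numerically verify the expansion of the Bayesian risk (1).
# Priors, costs, distributions, and the region X1 are hypothetical.

alphabet = [0, 1, 2, 3]
p0 = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
p1 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
pi0, pi1 = 0.6, 0.4
C = {(0, 0): 0.1, (1, 0): 1.0, (0, 1): 0.9, (1, 1): 0.0}   # C[i, j]
X1 = {2, 3}
X1c = set(alphabet) - X1

def P(p, A):                       # p(A): total probability of the set A
    return sum(p[x] for x in A)

# Direct form of (1): r = pi0 * R0 + pi1 * R1.
direct = (pi0 * (C[1, 0] * P(p0, X1) + C[0, 0] * P(p0, X1c))
          + pi1 * (C[1, 1] * P(p1, X1) + C[0, 1] * P(p1, X1c)))
# Expanded form: constant + sum over X1 of the bracketed term.
constant = pi0 * C[0, 0] + pi1 * C[0, 1]
expanded = constant + sum(pi0 * (C[1, 0] - C[0, 0]) * p0[x]
                          + pi1 * (C[1, 1] - C[0, 1]) * p1[x] for x in X1)
print(direct, expanded, abs(direct - expanded) < 1e-12)   # ..., ..., True
```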

So all that remains is to choose the summation region \(X_{1}\) so as to minimize the sum on the right.

Since the value of each term cannot be changed, the best we can do is to include in \(X_{1}\) exactly those \(x\) whose term is non-positive, i.e. those satisfying

\[\pi_{0}\left(C_{1,0}-C_{0,0}\right) p_{0}(x)+\pi_{1}\left(C_{1,1}-C_{0,1}\right) p_{1}(x) \leq 0 \quad \text { if } x \in X_{1}
\]

Assuming the natural ordering of the costs, \(C_{1,0}>C_{0,0}\) and \(C_{0,1}>C_{1,1}\) (a wrong decision costs more than the corresponding correct one), this yields the decision region in terms of the likelihood ratio \(L(x)=\frac{p_{1}(x)}{p_{0}(x)}\) (likelihood ratio test):

\[X_{1}=\left\{x \in X: \frac{p_{1}(x)}{p_{0}(x)} \geq \frac{\pi_{0}}{\pi_{1}} \frac{C_{1,0}-C_{0,0}}{C_{0,1}-C_{1,1}}\right\}
\]

For the special 0-1 cost \(C_{0,0}=C_{1,1}=0\), \(C_{1,0}=C_{0,1}=1\), this simplifies to

\[X_{1}=\left\{x \in X: \frac{p_{1}(x)}{p_{0}(x)} \geq \frac{\pi_{0}}{\pi_{1}}\right\}
\]

This is equivalent to weighing the prior probability of each hypothesis against the probability of the observation under it: decide hypothesis 1 whenever \(\pi_{1} p_{1}(x) \geq \pi_{0} p_{0}(x)\), i.e. whenever the prior of distribution 1 times the probability of \(x\) under it is the larger product, distribution 1 is the more probable explanation (the maximum a posteriori decision).
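
A small check of this equivalence (hypothetical distributions and priors): under the 0-1 cost, the likelihood-ratio region and the rule "pick the larger \(\pi_{i} p_{i}(x)\)" coincide:

```python
# Sketch: Bayes-optimal decision region via the likelihood ratio test,
# checked against the MAP rule pi_1 p_1(x) >= pi_0 p_0(x).
# Distributions and priors are hypothetical.

alphabet = [0, 1, 2, 3]
p0 = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
p1 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
pi0, pi1 = 0.6, 0.4

# 0-1 costs: C00 = C11 = 0, C10 = C01 = 1  =>  threshold pi0 / pi1.
threshold = pi0 / pi1
X1_lrt = {x for x in alphabet if p1[x] / p0[x] >= threshold}
X1_map = {x for x in alphabet if pi1 * p1[x] >= pi0 * p0[x]}

print(X1_lrt, X1_map, X1_lrt == X1_map)   # {2, 3}, {2, 3}, True
```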

Randomized decision

Following the same idea, substituting into \(\eqref{2}\), the Bayesian risk is

\[\begin{aligned}
r(\tilde{\delta})&= \pi_{0} R_{0}(\tilde{\delta})+\pi_{1} R_{1}(\tilde{\delta}) \\
&= \pi_{0} C_{0,0}+\pi_{1} C_{0,1} \\
&+\sum_{x \in X} \tilde{\delta}(x)\left[\pi_{0}\left(C_{1,0}-C_{0,0}\right) p_{0}(x)+\pi_{1}\left(C_{1,1}-C_{0,1}\right) p_{1}(x)\right]
\end{aligned}
\]

To minimize this value, we should again include a point whenever its bracket is negative: set \(\tilde{\delta}(x)=1\) when the bracket is negative and \(\tilde{\delta}(x)=0\) otherwise. The optimal \(\tilde{\delta}(x)\) therefore only takes the values 0 and 1, so randomization gives the same result as the deterministic decision.

Neyman-Pearson

Here neither prior probabilities on the hypotheses nor decision costs are assumed; the goal is simply to minimize the error probabilities.

Specifically, there are two kinds of errors: false alarm and missed detection.

  • \(\mathcal{H}_{0}\) decided as \(\mathcal{H}_{1}\) (false alarm), with probability denoted \(P_{\mathrm{F}}(\tilde{\delta})\).
  • \(\mathcal{H}_{1}\) decided as \(\mathcal{H}_{0}\) (missed detection), with probability denoted \(P_{\mathrm{M}}(\tilde{\delta})\); equivalently, one studies the detection probability \(P_{\mathrm{D}}(\tilde{\delta})=1-P_{\mathrm{M}}(\tilde{\delta})\).

Since both cannot be made small at the same time, one usually constrains one quantity and optimizes the other, i.e.

\[\begin{aligned}
& \max _{\tilde{\delta}} P_{\mathrm{D}}(\tilde{\delta}) \\
\text { s.t. } \quad & P_{\mathrm{F}}(\tilde{\delta}) \leq \alpha
\end{aligned}
\]

The bound \(\alpha\) on the false-alarm probability is also called the significance level (e.g., 0.05 is customary in the life sciences).

Both quantities in the optimization problem can be written as inner products of the randomized decision rule with the corresponding probability distribution:

\[\begin{aligned}
P_{\mathrm{F}}(\tilde{\delta})&=p_{0}(\delta=1)\\
&=\sum_{x \in X} P(\delta=1 \mid X=x) p_{0}(x)\\
&=\sum_{x \in X} \tilde{\delta}(x) p_{0}(x) .\\
P_{\mathrm{D}}(\tilde{\delta})&=p_{1}(\delta=1)\\
&=\sum_{x \in X} P(\delta=1 \mid X=x) p_{1}(x)\\
&=\sum_{x \in X} \tilde{\delta}(x) p_{1}(x) .
\end{aligned}
\]
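
These are just weighted sums over the alphabet; a short sketch with hypothetical numbers:

```python
# Sketch: false-alarm and detection probabilities as inner products of the
# randomized rule with p0 and p1.  Numbers are hypothetical.

alphabet = [0, 1, 2, 3]
p0 = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
p1 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
delta_tilde = {0: 0.0, 1: 0.2, 2: 0.7, 3: 1.0}   # P(decide H1 | x)

P_F = sum(delta_tilde[x] * p0[x] for x in alphabet)   # false alarm
P_D = sum(delta_tilde[x] * p1[x] for x in alphabet)   # detection
print(P_F, P_D)   # 0.30 and 0.65 for these numbers
```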

Optimal solution

Neyman-Pearson Lemma

Under the Neyman-Pearson criterion, the optimal decision rule has the form

\[\begin{aligned}
\tilde{\delta}(x) &=1 \quad \text { if } L(x)>\eta \\
&=\gamma(x) \quad \text { if } L(x)=\eta \\
&=0 \quad \text { if } L(x)<\eta
\end{aligned}\label{3}
\]

where \(\eta \geq 0\) is chosen so that \(P_{\mathrm{F}}(\tilde{\delta})=\alpha\), and \(\gamma(x) \in[0,1]\) can be taken to be a constant.
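
For a discrete alphabet, one simple way to find \(\eta\) and a constant \(\gamma\) meeting \(P_{\mathrm{F}}=\alpha\) exactly is to scan the distinct likelihood-ratio values from the largest down; the sketch below (hypothetical \(p_{0}\), \(p_{1}\), \(\alpha\)) is one such construction, not the only one:

```python
# Sketch: choose eta and a constant gamma so that the Neyman-Pearson rule
# delta_tilde(x) = 1 if L(x) > eta, gamma if L(x) == eta, 0 if L(x) < eta
# meets P_F = alpha exactly.  Distributions and alpha are hypothetical.

alphabet = [0, 1, 2, 3]
p0 = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
p1 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
alpha = 0.2
L = {x: p1[x] / p0[x] for x in alphabet}          # likelihood ratio

# Candidate thresholds: the distinct values of L(x), from largest down.
best = None
for eta in sorted(set(L.values()), reverse=True):
    pf_above = sum(p0[x] for x in alphabet if L[x] > eta)   # always decide 1
    pf_at = sum(p0[x] for x in alphabet if L[x] == eta)     # randomize here
    if pf_above <= alpha <= pf_above + pf_at:
        gamma = (alpha - pf_above) / pf_at
        best = (eta, gamma)
        break

eta, gamma = best
P_F = sum(p0[x] * (1.0 if L[x] > eta else gamma if L[x] == eta else 0.0)
          for x in alphabet)
P_D = sum(p1[x] * (1.0 if L[x] > eta else gamma if L[x] == eta else 0.0)
          for x in alphabet)
print(eta, gamma, P_F, P_D)    # P_F equals alpha = 0.2 here
```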

Proof:

Proof idea: optimality means that any other decision rule \(\tilde{\delta}^{\prime}\) that also satisfies the false-alarm constraint cannot have a larger detection probability, i.e. \(P_{\mathrm{D}}(\tilde{\delta}) \geq P_{\mathrm{D}}\left(\tilde{\delta}^{\prime}\right)\).

Take the difference of the detection probabilities:

\[\begin{aligned}
P_{\mathrm{D}}(\tilde{\delta})-P_{\mathrm{D}}\left(\tilde{\delta}^{\prime}\right) &=\sum_{x \in \mathcal{X}} \tilde{\delta}(x) p_{1}(x)-\sum_{x \in \mathcal{X}} \tilde{\delta}^{\prime}(x) p_{1}(x) \\
&=\sum_{x \in \mathcal{X}}\left[\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x)\right] p_{1}(x) .
\end{aligned}\label{4}
\]

From the form of \(\eqref{3}\), for every \(x\):

When \(p_{1}(x)>\eta p_{0}(x)\), \(\tilde{\delta}(x)=1\), \(\Rightarrow \tilde{\delta}(x)-\tilde{\delta}^{\prime}(x) \geq 0\);

When \(p_{1}(x)<\eta p_{0}(x)\), \(\tilde{\delta}(x)=0\), \(\Rightarrow \tilde{\delta}(x)-\tilde{\delta}^{\prime}(x) \leq 0\);

When \(p_{1}(x)=\eta p_{0}(x)\), the factor \(p_{1}(x)-\eta p_{0}(x)\) is zero, so the sign of \(\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x)\) does not matter.

Combining these cases, we obtain the inequality

\[\left[p_{1}(x)-\eta p_{0}(x)\right]\left[\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x)\right] \geq 0, \quad \forall x \in X
\]

Substituting this into \(\eqref{4}\) gives

\[\begin{aligned}
P_{\mathrm{D}}(\tilde{\delta})-P_{\mathrm{D}}\left(\tilde{\delta}^{\prime}\right) & \geq \eta \sum_{x \in X}\left[\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x)\right] p_{0}(x) \\
&=\eta[\underbrace{P_{\mathrm{F}}(\tilde{\delta})}_{=\alpha}-\underbrace{P_{\mathrm{F}}\left(\tilde{\delta}^{\prime}\right)}_{\leq \alpha}] \geq 0
\end{aligned}
\]

So this form is optimal.

  • On what "any other" rule means in the optimality statement: any competing rule \(\tilde{\delta}^{\prime}\) is still restricted by certain properties, here \(\tilde{\delta}^{\prime}(x)\in [0,1]\) and the false-alarm constraint \(P_{\mathrm{F}}\left(\tilde{\delta}^{\prime}\right)\le\alpha\) (a numerical check of this claim follows below)
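
A numerical check of the lemma's claim, continuing the hypothetical numbers from the earlier sketch (which produces the rule hard-coded as np_rule below): randomly sampled rules with \(P_{\mathrm{F}} \le \alpha\) never achieve a larger detection probability than the Neyman-Pearson rule:

```python
# Sketch: numerically check that no rule with P_F <= alpha beats the
# detection probability of the Neyman-Pearson rule.  p0, p1, alpha are the
# same hypothetical numbers as before; np_rule is the rule the earlier
# sketch produces for them.
import random

alphabet = [0, 1, 2, 3]
p0 = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
p1 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
alpha = 0.2
np_rule = {0: 0.0, 1: 0.0, 2: 0.5, 3: 1.0}            # P(decide H1 | x)

P_F_np = sum(np_rule[x] * p0[x] for x in alphabet)    # equals alpha = 0.2
P_D_np = sum(np_rule[x] * p1[x] for x in alphabet)    # 0.55 here

random.seed(0)
for _ in range(10000):
    rule = {x: random.random() for x in alphabet}         # arbitrary rule in [0, 1]
    if sum(rule[x] * p0[x] for x in alphabet) <= alpha:   # respects P_F <= alpha
        assert sum(rule[x] * p1[x] for x in alphabet) <= P_D_np + 1e-9
print("no sampled feasible rule exceeded P_D =", P_D_np)
```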

Significance

  • This is another clever constructive proof.
  • Whether Bayesian or Neyman-Pearson, the core quantity is the likelihood ratio.