Information inference
This part belongs to the intersection of information theory and statistics, similar to the inference of "hidden variables" in machine learning, i.e., inferring the true underlying information from observed values. Whereas machine learning tends to propose specific inference methods, information theory pays more attention to the nature of inference and the fundamental limits of inference accuracy.
Hypothesis testing
Judging the truth from observations: in probabilistic terms, selecting the probability distribution that best matches the observed random variable.

Problem description
\[\begin{aligned}
&\mathcal{H}_{0}: \quad X \sim p_{0}(x) \text { ( “null”) }\\
&\mathcal{H}_{1}: \quad X \sim p_{1}(x) \quad(\text { “alternative”) }
\end{aligned}
\] 
A decision rule (indicator variable) \(\delta: \mathcal{X} \mapsto\{0,1\}\) decides, based on the observed value \(x\), which distribution the sample comes from.

Deterministic
\[\begin{aligned}
\delta(x) &=1 \quad \text { if } x \in X_{1} \\
&=0 \quad \text { if } x \in X \backslash X_{1}=X_{1}^{c}
\end{aligned}
\] 
Randomized
\[\tilde{\delta}(x)=P(\delta=1 \mid X=x)
\]

How should the decision criterion be designed? Depending on whether a prior probability is assumed, the test is either a Bayesian or a Neyman–Pearson hypothesis test.
Bayes
Assumptions

Each hypothesis has a prior probability:
\[\begin{aligned}
\pi_{0} &=P\left(X \sim p_{0}\right) \\
\pi_{1}=1-\pi_{0} &=P\left(X \sim p_{1}\right)
\end{aligned}
\] 
Each decision incurs a cost: deciding \(\mathcal{H}_{i}\) when the true hypothesis is \(\mathcal{H}_{j}\) costs \(C_{i,j}\), \(i,j=0,1\).

Bayesian risk (deterministic decision)

The conditional risk when the true hypothesis is \(\mathcal{H}_{j}\):
\[R_{j}(\delta)=C_{1, j} p_{j}\left(X_{1}\right)+C_{0, j} p_{j}\left(X_{1}^{c}\right)
\]
where \(p_{j}\left(X_{1}\right)\) denotes the probability of deciding 1 when the true hypothesis is \(\mathcal{H}_{j}\)

Averaging over the prior gives the Bayesian risk
\[r(\delta)=\pi_{0} R_{0}(\delta)+\pi_{1} R_{1}(\delta) \label{1}
\]
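As a concrete illustration, the conditional risks and the Bayesian risk of a deterministic rule on a small alphabet can be computed directly. This is only a sketch: the distributions, priors, costs, and decision region below are made up for the example.

```python
# Hypothetical example: Bayesian risk of a deterministic rule
# on the finite alphabet {0, 1, 2}.
p0 = [0.6, 0.3, 0.1]          # p_0(x), assumed for illustration
p1 = [0.1, 0.3, 0.6]          # p_1(x), assumed for illustration
pi0, pi1 = 0.5, 0.5           # priors pi_0, pi_1
C = {(0, 0): 0, (0, 1): 1,    # C[(i, j)]: cost of deciding i when j is true
     (1, 0): 1, (1, 1): 0}
X1 = {2}                      # decision region: decide H1 when x is in X1

def conditional_risk(j, pj):
    """R_j(delta) = C_{1,j} p_j(X1) + C_{0,j} p_j(X1^c)."""
    p_X1 = sum(pj[x] for x in X1)
    return C[(1, j)] * p_X1 + C[(0, j)] * (1 - p_X1)

# r(delta) = pi0 R_0(delta) + pi1 R_1(delta)
# = 0.5 * 0.1 + 0.5 * (1 - 0.6) = 0.25
r = pi0 * conditional_risk(0, p0) + pi1 * conditional_risk(1, p1)
```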


Bayesian risk (randomized decision)

Conditional risk
\[R_{j}(\tilde{\delta})=C_{1, j} \sum_{x \in \mathcal{X}} \tilde{\delta}(x) p_{j}(x)+C_{0, j} \sum_{x \in \mathcal{X}}\left[1-\tilde{\delta}(x)\right] p_{j}(x)
\] 
Bayesian risk
\[r(\tilde{\delta})=\pi_{0} R_{0}(\tilde{\delta})+\pi_{1} R_{1}(\tilde{\delta}) \label{2}
\]

Optimal solution
Deterministic decision
The core objective is to design the decision rule \(\delta\) so as to minimize the Bayesian risk.
Therefore, expand and simplify the Bayesian risk \(\eqref{1}\):
\[\begin{aligned}
r(\delta)&= \pi_{0} C_{1,0} p_{0}\left(X_{1}\right)+\pi_{0} C_{0,0} p_{0}\left(X_{1}^{c}\right) \\
& \quad+\pi_{1} C_{1,1} p_{1}\left(X_{1}\right)+\pi_{1} C_{0,1} p_{1}\left(X_{1}^{c}\right) \\
&= \pi_{0} C_{0,0}+\pi_{1} C_{0,1} \\
& \quad+\pi_{0}\left(C_{1,0}-C_{0,0}\right) p_{0}\left(X_{1}\right)+\pi_{1}\left(C_{1,1}-C_{0,1}\right) p_{1}\left(X_{1}\right) \\
&= \text { constant }+\sum_{x \in X_{1}}\left[\pi_{0}\left(C_{1,0}-C_{0,0}\right) p_{0}(x)+\pi_{1}\left(C_{1,1}-C_{0,1}\right) p_{1}(x)\right]
\end{aligned}
\]
where \(p_{0}\left(X_{1}^{c}\right)=1-p_{0}\left(X_{1}\right)\); the first line after the second equality is a constant, and the third equality uses \(p_{1}\left(X_{1}\right)=\sum_{x \in X_{1}}p_{1}(x)\)
Thus the task is to choose the decision region \(X_{1}\) so as to minimize the sum on the right. Since the value of each summand is fixed, the only freedom is to include exactly those \(x\) whose summand is negative, i.e., those satisfying
\[
\pi_{0}\left(C_{1,0}-C_{0,0}\right) p_{0}(x)+\pi_{1}\left(C_{1,1}-C_{0,1}\right) p_{1}(x)<0
\]
Assuming the natural cost ordering \(C_{1,0}>C_{0,0}\) and \(C_{0,1}>C_{1,1}\) (a wrong decision costs more than a correct one), the decision region can be written as a likelihood-ratio test with \(L(x)=\frac{p_{1}(x)}{p_{0}(x)}\):
\[
L(x)=\frac{p_{1}(x)}{p_{0}(x)} \geq \frac{\pi_{0}\left(C_{1,0}-C_{0,0}\right)}{\pi_{1}\left(C_{0,1}-C_{1,1}\right)}=\eta
\]
With the special 0-1 cost \(C_{0,0}=C_{1,1}=0\), \(C_{1,0}=C_{0,1}=1\), this simplifies to
\[
\pi_{1} p_{1}(x) \geq \pi_{0} p_{0}(x)
\]
Equivalently: weigh the prior probability together with the likelihood under each distribution. If the prior of hypothesis 1 multiplied by the probability of observing \(x\) under \(p_{1}\) gives the larger product, then distribution 1 is the more plausible source (the MAP rule).
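A minimal sketch of this MAP comparison, with hypothetical distributions and priors on a three-symbol alphabet:

```python
# Hypothetical example: the Bayes-optimal rule under 0-1 costs
# reduces to comparing pi1*p1(x) against pi0*p0(x).
p0 = [0.6, 0.3, 0.1]   # p_0(x), assumed for illustration
p1 = [0.1, 0.3, 0.6]   # p_1(x), assumed for illustration
pi0, pi1 = 0.7, 0.3    # priors

def delta(x):
    # Equivalent to the likelihood-ratio test L(x) >= pi0/pi1
    return 1 if pi1 * p1[x] >= pi0 * p0[x] else 0

print([delta(x) for x in range(3)])  # -> [0, 0, 1]
```

Only \(x=2\) is assigned to \(\mathcal{H}_1\): there \(\pi_1 p_1(2)=0.18\) exceeds \(\pi_0 p_0(2)=0.07\).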
Randomized decision
Following the same idea, substitute into \(\eqref{2}\); the Bayesian risk is
\[\begin{aligned}
r(\tilde{\delta})&= \pi_{0} R_{0}(\tilde{\delta})+\pi_{1} R_{1}(\tilde{\delta}) \\
&= \pi_{0} C_{0,0}+\pi_{1} C_{0,1} \\
&\quad+\sum_{x \in X} \tilde{\delta}(x)\left[\pi_{0}\left(C_{1,0}-C_{0,0}\right) p_{0}(x)+\pi_{1}\left(C_{1,1}-C_{0,1}\right) p_{1}(x)\right]
\end{aligned}
\]
To minimize this value, set \(\tilde{\delta}(x)=1\) exactly when the bracket is negative; since the optimum takes only the values 0 and 1, the randomized rule reduces to the same result as the deterministic one.
Neyman–Pearson
No prior probabilities or decision costs are assumed; the goal is simply to control the error probabilities.
Specifically, there are two kinds of error, false alarm and missed detection:
 False alarm: \(\mathcal{H}_{0}\) is true but \(\mathcal{H}_{1}\) is decided; its probability is denoted \(P_{\mathrm{F}}(\tilde{\delta})\).
 Missed detection: \(\mathcal{H}_{1}\) is true but \(\mathcal{H}_{0}\) is decided; its probability is denoted \(P_{\mathrm{M}}(\tilde{\delta})\). Equivalently, one studies the detection probability \(P_{\mathrm{D}}(\tilde{\delta})=1-P_{\mathrm{M}}(\tilde{\delta})\).
Since both cannot be made small simultaneously, one usually constrains one indicator and optimizes the other:
\[\begin{aligned}
& \max _{\tilde{\delta}} P_{\mathrm{D}}(\tilde{\delta}) \\
\text { s.t. } \quad & P_{\mathrm{F}}(\tilde{\delta}) \leq \alpha
\end{aligned}
\]
The bound \(\alpha\) on the false-alarm probability is also called the significance level (i.e., no raising alarms recklessly; 0.05 is the common choice in the life sciences).
The quantities in this optimization problem can be written as inner products of the decision rule with the two distributions:
\[\begin{aligned}
P_{\mathrm{F}}(\tilde{\delta})&=p_{0}(\delta=1)\\
&=\sum_{x \in X} P(\delta=1 \mid X=x) p_{0}(x)\\
&=\sum_{x \in X} \tilde{\delta}(x) p_{0}(x), \\
P_{\mathrm{D}}(\tilde{\delta})&=p_{1}(\delta=1)\\
&=\sum_{x \in X} P(\delta=1 \mid X=x) p_{1}(x)\\
&=\sum_{x \in X} \tilde{\delta}(x) p_{1}(x) .
\end{aligned}
\]
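These inner products are straightforward to evaluate numerically. A sketch with hypothetical distributions and a hand-picked randomized rule:

```python
# Hypothetical example: P_F and P_D as inner products of the
# randomized rule delta_tilde with p_0 and p_1 respectively.
p0 = [0.6, 0.3, 0.1]           # p_0(x), assumed for illustration
p1 = [0.1, 0.3, 0.6]           # p_1(x), assumed for illustration
delta_tilde = [0.0, 0.5, 1.0]  # delta_tilde(x) = P(decide 1 | X = x)

# P_F = sum_x delta_tilde(x) p_0(x) = 0.5*0.3 + 1.0*0.1 = 0.25
PF = sum(d * p for d, p in zip(delta_tilde, p0))
# P_D = sum_x delta_tilde(x) p_1(x) = 0.5*0.3 + 1.0*0.6 = 0.75
PD = sum(d * p for d, p in zip(delta_tilde, p1))
```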
Optimal solution
Neyman–Pearson Lemma
Under the Neyman–Pearson criterion, the optimal decision rule has the form
\[\begin{aligned}
\tilde{\delta}(x) &=1 \quad \text { if } L(x)>\eta \\
&=\gamma(x) \quad \text { if } L(x)=\eta \\
&=0 \quad \text { if } L(x)<\eta
\end{aligned}\label{3}
\]
where \(\eta \geq 0\) is chosen so that \(P_{\mathrm{F}}(\tilde{\delta})=\alpha\), and \(\gamma(x) \in[0,1]\) may be taken to be a constant
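On a finite alphabet, \(\eta\) and the randomization at the threshold can be found by a simple greedy construction: include symbols in decreasing likelihood-ratio order until the false-alarm budget \(\alpha\) is exhausted, randomizing on the boundary symbol. A sketch with hypothetical distributions (assumes \(p_0(x)>0\) for all \(x\)):

```python
# Hypothetical example: constructing the Neyman-Pearson rule on a
# finite alphabet so that P_F equals alpha exactly.
p0 = [0.6, 0.3, 0.1]   # p_0(x), assumed for illustration
p1 = [0.1, 0.3, 0.6]   # p_1(x), assumed for illustration
alpha = 0.2            # false-alarm constraint

# Sort symbols by decreasing likelihood ratio L(x) = p1(x)/p0(x)
order = sorted(range(len(p0)), key=lambda x: p1[x] / p0[x], reverse=True)

delta_tilde = [0.0] * len(p0)
budget = alpha
for x in order:
    if p0[x] <= budget:              # including x fully keeps P_F <= alpha
        delta_tilde[x] = 1.0
        budget -= p0[x]
    else:                            # randomize on the threshold symbol
        delta_tilde[x] = budget / p0[x]
        break

PF = sum(d * p for d, p in zip(delta_tilde, p0))  # equals alpha = 0.2
PD = sum(d * p for d, p in zip(delta_tilde, p1))
```

Here the symbol \(x=2\) (largest ratio) is always accepted and \(x=1\) is accepted with probability \(1/3\), giving \(P_{\mathrm{F}}=0.2\) exactly.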
Proof:
Idea: optimality means that any other rule \(\tilde{\delta}^{\prime}\) that also satisfies the false-alarm constraint cannot achieve a higher detection probability, i.e., \(P_{\mathrm{D}}(\tilde{\delta}) \geq P_{\mathrm{D}}\left(\tilde{\delta}^{\prime}\right)\).
Take the difference:
\[\begin{aligned}
P_{\mathrm{D}}(\tilde{\delta})-P_{\mathrm{D}}\left(\tilde{\delta}^{\prime}\right) &=\sum_{x \in \mathcal{X}} \tilde{\delta}(x) p_{1}(x)-\sum_{x \in \mathcal{X}} \tilde{\delta}^{\prime}(x) p_{1}(x) \\
&=\sum_{x \in \mathcal{X}}\left[\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x)\right] p_{1}(x) .
\end{aligned}\label{4}
\]
For the rule \(\eqref{3}\), in every case:
 when \(p_{1}(x)>\eta p_{0}(x)\), \(\tilde{\delta}(x)=1\), so \(\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x) \geq 0\);
 when \(p_{1}(x)<\eta p_{0}(x)\), \(\tilde{\delta}(x)=0\), so \(\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x) \leq 0\).
Combining the cases gives the pointwise inequality
\[
\left[\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x)\right] p_{1}(x) \geq \eta\left[\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x)\right] p_{0}(x)
\]
Substituting into \(\eqref{4}\) gives
\[\begin{aligned}
P_{\mathrm{D}}(\tilde{\delta})-P_{\mathrm{D}}\left(\tilde{\delta}^{\prime}\right) & \geq \eta \sum_{x \in X}\left[\tilde{\delta}(x)-\tilde{\delta}^{\prime}(x)\right] p_{0}(x) \\
&=\eta[\underbrace{P_{\mathrm{F}}(\tilde{\delta})}_{=\alpha}-\underbrace{P_{\mathrm{F}}\left(\tilde{\delta}^{\prime}\right)}_{\leq \alpha}] \geq 0
\end{aligned}
\]
So this form is optimal.
 Regarding "any other" rule: it must still satisfy the stated constraints, namely \(\tilde{\delta}^{\prime}(x)\in [0,1]\) and the false-alarm bound \(P_{\mathrm{F}}\left(\tilde{\delta}^{\prime}\right)\le\alpha\).
Significance
 It is a clever constructive proof.
 Whether Bayesian or Neyman–Pearson, the core quantity is the likelihood ratio.