Algorithm Engineer 5. The Exponential Family of Distributions


1. Definition

The exponential family is a class of distributions whose density (or mass function) has the following specific form:
$$p(y|\eta)=b(y)e^{\eta^TT(y)-a(\eta)}=\dfrac{b(y)e^{\eta^TT(y)}}{e^{a(\eta)}}$$
$$\begin{cases} \eta: \text{parameter vector / natural parameter, usually real-valued} \\ a(\eta): \text{log partition function / log normalizer} \\ T(y): \text{sufficient statistic, usually } T(y)=y \\ b(y): \text{base measure} \end{cases}$$
Once $a$, $b$ and $T$ are fixed, this form defines a set of probability distributions parameterized by η.
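To make the roles of $b$, $T$ and $a$ concrete, here is a minimal Python sketch that evaluates the form above for a scalar η. The helper name `exp_family_prob` is hypothetical, and the Bernoulli instance (with $a(\eta)=\log(1+e^{\eta})$ and $b(y)=1$) is an assumption chosen for illustration; it is one possible instance, not part of the definition.

```python
import math

def exp_family_prob(y, eta, b, T, a):
    """Evaluate p(y | eta) = b(y) * exp(eta * T(y) - a(eta)) for scalar eta."""
    return b(y) * math.exp(eta * T(y) - a(eta))

# Illustrative instance (an assumption): Bernoulli with success probability phi.
phi = 0.3
eta = math.log(phi / (1.0 - phi))            # natural parameter
b = lambda y: 1.0                            # base measure
T = lambda y: y                              # sufficient statistic
a = lambda eta: math.log1p(math.exp(eta))    # log normalizer log(1 + e^eta)

p0 = exp_family_prob(0, eta, b, T, a)
p1 = exp_family_prob(1, eta, b, T, a)
print(p0, p1, p0 + p1)
```

With these choices `p1` recovers `phi` and the two probabilities sum to 1, which is exactly what subtracting $a(\eta)$ in the exponent is for.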

2. Log normalizer

Multiplying both sides of the definition by $e^{a(\eta)}$ gives:
$$P(y|\eta)e^{a(\eta)}=b(y)e^{\eta^TT(y)}$$
Integrating both sides over y:
$$\int P(y|\eta)e^{a(\eta)}dy=\int b(y)e^{\eta^TT(y)}dy$$
On the left, $e^{a(\eta)}$ does not depend on y and the conditional density integrates to exactly 1, so this reduces to:
$$e^{a(\eta)}=\int b(y)e^{\eta^TT(y)}dy$$
$$a(\eta)=\ln\int b(y)e^{\eta^TT(y)}dy$$
This makes the name clear: $a(\eta)$ is the logarithm of the normalizing constant.
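The integral formula for $a(\eta)$ can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a zero-mean Gaussian with $b(y)=1/\sqrt{2\pi}$, $T(y)=y^2$ and $\eta=-1/(2\sigma^2)$, for which the integral should equal $\log\sigma$. The helper `log_normalizer`, the trapezoid rule, and the integration range are all choices made for this example.

```python
import math

def log_normalizer(eta, b, T, lo, hi, n=20_000):
    """Approximate a(eta) = ln of the integral of b(y) * exp(eta * T(y)) (trapezoid rule)."""
    h = (hi - lo) / n
    f = lambda y: b(y) * math.exp(eta * T(y))
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return math.log(total * h)

# Assumed example: zero-mean Gaussian with standard deviation sigma.
sigma = 2.0
eta = -1.0 / (2.0 * sigma ** 2)
b = lambda y: 1.0 / math.sqrt(2.0 * math.pi)
T = lambda y: y * y

a_numeric = log_normalizer(eta, b, T, lo=-40.0, hi=40.0)
a_exact = math.log(sigma)
print(a_numeric, a_exact)
```

The two printed values should agree closely, since the integration range covers twenty standard deviations on each side.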

3. Common members of the exponential family

Normal distribution: aggregate noise
Bernoulli distribution: LR (0/1 outcomes)
Beta distribution
Dirichlet distribution

4. Example derivations for the exponential family

Gaussian distribution

For zero mean and variance $\sigma^2$, the density is:
$$P(y|\eta)=\dfrac{1}{\sqrt{2\pi}\sigma}e^{-\dfrac{y^2}{2\sigma^2}}=\dfrac{1}{\sqrt{2\pi}}e^{-\log\sigma}\cdot e^{-\dfrac{y^2}{2\sigma^2}}=\dfrac{1}{\sqrt{2\pi}}e^{-\dfrac{1}{2\sigma^2}y^2-\log\sigma}$$
This is the exponential family form, with $\eta=-\dfrac{1}{2\sigma^2}$, $T(y)=y^2$, $a(\eta)=\log\sigma$ and $b(y)=\dfrac{1}{\sqrt{2\pi}}$.

Bernoulli distribution

$$\begin{aligned} P(y|\eta) & = \phi^y(1-\phi)^{1-y} \\ & = e^{y\log\phi+(1-y)\log(1-\phi)} \\ & = e^{y\log\frac{\phi}{1-\phi}+\log(1-\phi)} \end{aligned}$$
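Matching the last line against $b(y)e^{\eta^TT(y)-a(\eta)}$ term by term gives the components below. This is a sketch of the identification; the expressions for $a(\eta)$ and $b(y)$ are not written out in the original and follow from $1-\phi=\dfrac{1}{1+e^{\eta}}$:

$$T(y)=y,\quad \eta=\log\frac{\phi}{1-\phi},\quad a(\eta)=-\log(1-\phi)=\log(1+e^{\eta}),\quad b(y)=1$$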

5. Maximum entropy

The exponential family satisfies the maximum entropy principle: the maximum entropy distribution subject to constraints taken from the empirical distribution belongs to the exponential family.
For any function $f(x)$, let its expectation under the empirical distribution be $E_{\tilde{P}}(f(x))=\Delta$. The maximum entropy problem is then:
$$\max\{H(P)\}\Leftrightarrow\min\{\sum\limits_{k=1}^{K}p_k\log p_k\},\quad s.t.\ \sum\limits_{k=1}^{K}p_k=1,\ E_{P}(f(x))=\Delta$$
Construct the generalized Lagrangian:
$$L=\sum\limits_{k=1}^{K}p_k\log p_k+\lambda_0(1-\sum\limits_{k=1}^{K}p_k)+\lambda^T(\Delta-E_Pf(x))$$
Taking the derivative with respect to $P(x)$ and setting it to zero:
$$\frac{\partial L}{\partial P(x)}=\log P(x)+1-\lambda_0-\lambda^Tf(x)=0$$
Solving for $P(x)$:
$$P(x)=e^{\lambda^Tf(x)-(1-\lambda_0)}$$
This is the exponential family form with $\eta=\lambda$, $T(x)=f(x)$, $a=1-\lambda_0$ and $b(x)=1$.
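Because the stationary point of the Lagrangian has the form $p_k\propto e^{\lambda^Tf(x_k)}$, it can be found numerically by searching over λ alone. The sketch below is a hypothetical illustration, not a general solver: it assumes a six-sided die with $f(x)=x$ and a target mean of 4.5, and the function names `tilted` and `maxent_with_mean` are invented for this example. It finds λ by bisection, relying on the mean being increasing in λ.

```python
import math

def tilted(values, lam):
    """Return p_k proportional to exp(lam * x_k), normalized to sum to 1."""
    m = max(lam * x for x in values)              # shift for numerical stability
    w = [math.exp(lam * x - m) for x in values]
    z = sum(w)
    return [wi / z for wi in w]

def maxent_with_mean(values, target_mean, lo=-20.0, hi=20.0, iters=100):
    """Bisection on lam so that the mean under p matches target_mean."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        p = tilted(values, mid)
        mean = sum(x * pk for x, pk in zip(values, p))
        if mean < target_mean:
            lo = mid      # the mean increases with lam, so search to the right
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, tilted(values, lam)

# Assumed example: die faces 1..6 with a constrained mean of 4.5.
faces = [1, 2, 3, 4, 5, 6]
lam, p = maxent_with_mean(faces, target_mean=4.5)
print(lam, p)
```

With a target mean of 3.5 the same search returns λ close to 0 and the uniform distribution, which is the maximum entropy distribution when the mean constraint is not binding.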

6. Generalized linear model (GLM)

Generalized linear models include the linear model, LR and softmax regression. They are mentioned here because they are derived from the exponential family.

  • Assume that y, given x and parameters θ, follows an exponential family distribution with natural parameter η
  • Learning target: $h(x)=E(T(y)|x)$
  • The natural parameter is linear in x: $\eta=w^Tx$
Deriving LR from the Bernoulli distribution

The regression model $w^Tx$ gives the natural parameter $\eta=w^Tx$; applying the inverse link (response) function $g^{-1}(\eta)$ then yields the generalized linear model $h(x)=g^{-1}(\eta)$.

$$\begin{aligned} P(y|\eta) & = \phi^y(1-\phi)^{1-y} \\ & = e^{y\log\phi+(1-y)\log(1-\phi)} \\ & = e^{y\log\frac{\phi}{1-\phi}+\log(1-\phi)} \end{aligned}$$

$$\begin{aligned} & \Rightarrow T(y)=y,\ \eta=\log\frac{\phi}{1-\phi} \\ & \Rightarrow \phi=\dfrac{1}{1+e^{-\eta}} \\ & \Rightarrow h(x)=E(T(y)|x)=E(y|x)=\phi=\dfrac{1}{1+e^{-w^Tx}} \end{aligned}$$
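The chain from linear score to prediction, $\eta=w^Tx$ followed by $h(x)=g^{-1}(\eta)$, can be sketched in a few lines of Python. The function name `glm_predict` and the example weights are hypothetical. The identity response corresponds to the linear model and the sigmoid to LR as derived above; softmax is left out to keep the sketch scalar-valued.

```python
import math

def identity(eta):
    """Response function of the linear model: h(x) = eta."""
    return eta

def sigmoid(eta):
    """Response function of LR: phi = 1 / (1 + e^{-eta})."""
    return 1.0 / (1.0 + math.exp(-eta))

def glm_predict(w, x, response):
    """Compute h(x) = g^{-1}(eta) with natural parameter eta = w^T x."""
    eta = sum(wi * xi for wi, xi in zip(w, x))
    return response(eta)

# Hypothetical weights and input, chosen so that eta = 0.
w = [0.5, -1.0]
x = [2.0, 1.0]
print(glm_predict(w, x, identity))   # linear model prediction
print(glm_predict(w, x, sigmoid))    # LR prediction, a probability in (0, 1)
```

Only the response function changes between the two models; the linear part $w^Tx$ is shared, which is the sense in which both belong to the same family of models.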