The frequentist school and the Bayesian school are two different schools of statistics. Frequentists hold that the probability of an event is completely determined by the observed data; **Bayesians hold that the parameter governing the event itself follows some probability distribution, which is chosen subjectively and is therefore called the prior distribution.** The representative frequentist method is maximum likelihood estimation (MLE), while the representative Bayesian methods are maximum a posteriori (MAP) estimation and Bayesian estimation.

Let’s first review Bayes’ formula:

$$P(\theta|X)=\dfrac{P(X|\theta)P(\theta)}{\sum_iP(X|\theta_i)P(\theta_i)}$$

- $P(X|\theta)$: the likelihood term, which MLE maximizes
- $P(\theta)$: the prior probability, required by Bayesian estimation / MAP estimation
- $\sum_i P(X|\theta_i)P(\theta_i)$: a normalization factor that makes the result a probability; intuitively, it is the law of total probability
- $P(\theta|X)$: the posterior probability

#### 1. Frequentist school: maximum likelihood estimation (MLE)

MLE maximizes the likelihood $L(\theta|X)=\prod\limits_{i=1}^{n} P(x_i|\theta)$, relying only on the observed data, without any prior knowledge. **When there is a large amount of data, the estimate $\hat{\theta}_{MLE}$ will be very good, but when the amount of data is very small, the result may be unreasonable and unreliable.**

##### Example: coin toss

Let $\theta$ be the probability of the coin landing heads, with $x_i=1$ for heads and $x_i=0$ for tails. Then the probability distribution of a single toss is $P(X=x_i|\theta)=\begin{cases} \theta, & x_i=1 \text{ (heads)} \\ 1-\theta, & x_i=0 \text{ (tails)} \end{cases}$, which can be written compactly as:

$$P(X=x_i|\theta)=\theta^{x_i}(1-\theta)^{1-x_i}$$

Assuming n coin tosses, MLE maximizes the following likelihood function:

$$\hat{\theta}_{MLE}=\argmax\limits_{\theta}\prod\limits_{i=1}^{n} P(x_i|\theta)=\argmin\limits_{\theta}-\sum\limits_{i=1}^{n}\log P(x_i|\theta)$$

Taking the derivative of the log-likelihood with respect to $\theta$ and setting it to zero gives:

$$\sum\limits_{i=1}^{n}(\dfrac{x_i}{\theta}-\dfrac{1-x_i}{1-\theta})=0$$

Solving this equation, we obtain:

$$\hat{\theta}_{MLE}=\dfrac{\sum\limits_{i=1}^{n}x_i}{n}$$

If the coin is flipped twice and lands tails both times, MLE estimates the probability of heads as 0, which is clearly unreasonable.

However, as more data accumulate, the estimate becomes increasingly accurate.
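This failure mode is easy to reproduce; below is a minimal Python sketch (the helper name `mle_heads` is mine):

```python
import random

def mle_heads(tosses):
    """tosses: list of 0/1 outcomes (1 = heads). Returns the MLE of theta."""
    return sum(tosses) / len(tosses)

# Two tosses, both tails: MLE concludes the coin can never land heads.
print(mle_heads([0, 0]))  # 0.0

# With plenty of data from a fair coin, the estimate approaches 0.5.
random.seed(0)
sample = [random.randint(0, 1) for _ in range(100_000)]
print(mle_heads(sample))  # close to 0.5
```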

#### 2. The Bayesian school

Bayesians differ from frequentists in their use of a **prior probability**: the parameter values are assumed to follow some probability distribution, which is chosen subjectively, hence "prior". There are two Bayesian approaches: maximum a posteriori estimation and Bayesian estimation. **Bayesian estimation works out the full posterior distribution and obtains the optimal parameter estimate through Bayesian decision theory, minimizing the overall expected risk.**

##### Maximum a posteriori estimation (MAP)

Maximum a posteriori estimation, like MLE, is a point estimate; the likelihood term is simply multiplied by the prior probability:

$$\hat{\theta}_{MAP}=\argmax\limits_{\theta}\prod\limits_{i=1}^{n} P(x_i|\theta)\cdot P(\theta)=\argmin\limits_{\theta}-\sum\limits_{i=1}^{n}\log P(x_i|\theta)-\log P(\theta)$$

##### Bayesian estimation

Bayesian estimation works out the full parameter distribution, that is, the posterior probability $P(\theta|X)$ in Bayes' formula, and then applies Bayesian decision theory. Under this framework, the Bayesian parameter estimate is:

$$\hat{\theta}_{Bayes}=\int\theta P(\theta|X)d\theta$$

Of course, this assumes the loss function is the mean-squared loss, for which the posterior mean is optimal.
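As a sanity check, the integral can be evaluated numerically and compared with the known Beta mean (a sketch with an assumed posterior Beta(5, 3); midpoint-rule integration):

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b); log-gamma form avoids overflow for large a, b."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta) - log_norm)

a, b = 5, 3          # a hypothetical posterior Beta(5, 3)
steps = 100_000
# Midpoint-rule approximation of the integral of theta * P(theta|X) over (0, 1).
mean = sum(t * beta_pdf(t, a, b)
           for t in ((i + 0.5) / steps for i in range(steps))) / steps
print(round(mean, 4))  # 0.625, i.e. a / (a + b)
```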

#### 3. Comparison

We have now estimated the parameter $\theta$ in three ways. When a new data point $x^{new}$ arrives, the three methods predict as follows:

$$\begin{cases} \text{MLE: predict } P(x=x^{new}|\hat{\theta}) \text{ directly; the parameter is known} \\ \text{MAP: predict } P(x=x^{new}|\hat{\theta}) \text{ directly; the parameter is known} \\ \text{Bayes: } P(x^{new}|X)=\int P(\theta|X)P(x^{new}|X,\theta)d\theta=\int P(x^{new},\theta|X)d\theta \text{; first find the posterior of } \theta \text{ from } X \text{, then integrate} \end{cases}$$

#### 4. Probability distributions over probabilities

A probability distribution over a probability is simply a distribution over values in [0, 1]. It is worth discussing for two reasons: first, the parameter, itself a probability, can be modeled as being drawn from such a distribution; **second, the prior can form a conjugate pair with the likelihood, so that the posterior has the same distributional form as the prior.**

##### Beta distribution

$$P(\theta|\alpha,\beta)=\dfrac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$$

where α and β are two shape parameters and the denominator $B(\alpha,\beta)$ is a normalization constant that makes it a proper probability distribution. The Beta distribution has the following advantages:

- Its shape varies with α and β, so the location of the probability peak can be controlled by the parameters
- The expectation of the random variable is $\frac{\alpha}{\alpha+\beta}$: with α = β = 100 the distribution peaks around 0.5; with α = 100, β = 200 it peaks around 0.333
- When the likelihood is a Bernoulli distribution, the Beta distribution is its conjugate prior
- The Beta distribution is well suited to modeling binary-classification parameters
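These properties follow from the closed-form mean and variance of the Beta distribution; a quick numeric check (helper names are mine):

```python
def beta_mean(a, b):
    """Expectation of Beta(a, b): a / (a + b)."""
    return a / (a + b)

def beta_var(a, b):
    """Variance of Beta(a, b): ab / ((a + b)^2 (a + b + 1))."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

# The expectation matches the examples in the text:
print(beta_mean(100, 100))            # 0.5
print(round(beta_mean(100, 200), 3))  # 0.333

# Larger alpha, beta concentrate the mass around the mean (sharper peak):
print(beta_var(2, 2) > beta_var(100, 100))  # True
```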

##### Dirichlet distribution

The Dirichlet distribution is similar to the Beta distribution, but it models the probability distribution of multi-class parameters and forms a conjugate pair with the multinomial distribution:

$$P(\theta_1,\theta_2,…,\theta_k|\alpha_1,\alpha_2,…,\alpha_k)=\dfrac{1}{B(\alpha_1,\alpha_2,…,\alpha_k)}\prod\limits_{i=1}^{k}\theta_i^{\alpha_i-1}$$

#### 5. Conjugate distribution and conjugate prior

As mentioned before, suppose we toss a coin m times, with $m_\alpha$ heads and $m_\beta$ tails; then the likelihood term is $P(X|\theta)=\theta^{m_\alpha}(1-\theta)^{m_\beta}$. If at this point we believe the probability of heads should **not** simply be estimated by **counting frequencies**, but should instead **follow some probability distribution**, and we choose a Beta distribution as the prior, then by Bayes' formula the posterior probability $P(\theta|X)$ satisfies:

$$P(\theta|X)\propto P(X|\theta)\cdot P(\theta)$$

With a Beta prior $B(\alpha,\beta)$, the posterior probability can be expressed as:

$$\begin{aligned} P(\theta|X) & \propto \theta^{m_\alpha}(1-\theta)^{m_\beta}\cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} \\ & \propto \theta^{m_\alpha+\alpha-1}(1-\theta)^{m_\beta+\beta-1} \end{aligned}$$

We find that the posterior also follows a Beta distribution, the same family as the prior, only with updated parameters. **In this case the prior and posterior are called conjugate distributions, and the prior is called the conjugate prior of the likelihood.**

Conjugate distributions have many advantages, the most important of which is computational convenience: **when new observations arrive, we only need to update the parameters α and β to obtain the new prior distribution.**
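Computationally, the conjugate update is just count addition; a minimal sketch (the function name is mine):

```python
def update_beta(alpha, beta, tosses):
    """Return posterior (alpha', beta') after observing 0/1 tosses (1 = heads)."""
    heads = sum(tosses)
    tails = len(tosses) - heads
    return alpha + heads, beta + tails

# Start from Beta(2, 2); observe 3 heads and 1 tail.
a, b = update_beta(2, 2, [1, 1, 0, 1])
print(a, b)  # 5 3

# New data fold in incrementally: the posterior becomes the new prior.
a, b = update_beta(a, b, [0, 0])
print(a, b)  # 5 5
```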

#### 6. The function of priors

##### Parameters of beta distribution

With a Beta prior, the estimate of θ is no longer a simple frequency count but the Bayesian estimate, obtained by integration. As mentioned earlier, the posterior is again a Beta distribution, whose expectation is $\dfrac{\alpha}{\alpha+\beta}$. For coin tossing, the parameter estimate (the posterior expectation) becomes $\dfrac{\alpha+m_\alpha}{\alpha+m_\alpha+\beta+m_\beta}$. Suppose we set α = β = 100, i.e. we believe the probability of heads is most likely 0.5; **then even if the first few tosses all come up tails, our conclusion will not be far off.**

In addition, α and β can come from historical data or be set manually as initial values. In other words, α and β act as the prior strength. When we set them large (say 10000), a great deal of data is needed to correct the prior (if the prior is wrong); when we set them small (say 10), a small amount of data suffices to revise the prior, which is appropriate when we are unsure about the parameter's distribution.
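The effect of prior strength can be seen directly from the posterior-mean formula above (a small sketch; the scenario of ten straight tails is made up):

```python
def posterior_mean(alpha, beta, heads, tails):
    """Posterior expectation (alpha + m_a) / (alpha + m_a + beta + m_b)."""
    return (alpha + heads) / (alpha + heads + beta + tails)

# Ten tosses, all tails. A weak prior (alpha = beta = 10) moves quickly:
print(round(posterior_mean(10, 10, 0, 10), 3))        # 0.333

# A strong prior (alpha = beta = 10000) barely budges:
print(round(posterior_mean(10000, 10000, 0, 10), 3))  # 0.5
```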

##### The function of priors

**The prior prevents us from inferring wildly unreliable results when data are scarce; in other words, the prior corrects the bias of small samples.** Concretely, if we use Bayesian estimation for parameter estimation or prediction (predicting the probability that a new toss lands heads is equivalent to estimating the heads-probability parameter), then:

$$\hat{\theta}_{Bayes}=\int\theta P(\theta|X)d\theta=\dfrac{\alpha+m_\alpha}{\alpha+m_\alpha+\beta+m_\beta}$$

When MLE is used for prediction, the likelihood function is as follows:

$$\prod\limits_{i=1}^{n} P(x_i|\theta)=\theta^{m_\alpha}(1-\theta)^{m_\beta}$$

The parameters are estimated as follows:

$$\hat{\theta}_{MLE}=\argmax\limits_{\theta}\prod\limits_{i=1}^{n} P(x_i|\theta)=\argmin\limits_{\theta}-\sum\limits_{i=1}^{n}\log P(x_i|\theta)$$

Differentiating and solving, the optimal solution is $\hat{\theta}_{MLE}=\dfrac{m_\alpha}{m_\alpha+m_\beta}$.

Comparing the two, we find that **when data are scarce early on, the prior dominates the parameter estimate, keeping predictions from going badly wrong; as data accumulate, the prior's influence gradually weakens and the data gradually dominate. Hence, when the amount of data is relatively small, a prior often brings benefits.**
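The shift from prior-dominated to data-dominated estimates can be watched directly (a sketch with a simulated fair coin; the Beta(100, 100) prior matches the earlier example):

```python
import random

def mle(heads, tails):
    return heads / (heads + tails)

def bayes(heads, tails, alpha=100, beta=100):
    return (alpha + heads) / (alpha + heads + beta + tails)

random.seed(1)
heads = tails = 0
for n in (2, 10, 100, 100_000):
    while heads + tails < n:
        if random.random() < 0.5:   # simulate a fair coin
            heads += 1
        else:
            tails += 1
    # Early on the prior pins the Bayes estimate near 0.5 while MLE may swing;
    # with enough data the two estimates converge.
    print(n, round(mle(heads, tails), 3), round(bayes(heads, tails), 3))
```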

#### 7. Laplace smoothing

Laplace smoothing is a technique used in naive Bayes. Since the theory behind it is exactly what this article covers, it is worth writing down here.

In the multi-class scenario, we need to estimate the probability of each category. The Bayesian estimate is:

$$P(\theta_1,\theta_2,…,\theta_k|X)\propto P(X|\theta_1,\theta_2,…,\theta_k)P(\theta_1,\theta_2,…,\theta_k)$$

Here the likelihood term is a multinomial distribution, $P(X|\theta_1,\theta_2,…,\theta_k)=\theta_1^{m_1}\theta_2^{m_2}\cdots\theta_k^{m_k}$, and the prior is a Dirichlet distribution, $P(\theta_1,\theta_2,…,\theta_k)\propto\theta_1^{\alpha_1-1}\theta_2^{\alpha_2-1}\cdots\theta_k^{\alpha_k-1}$, so the posterior and the prior form a conjugate pair:

$$P(\theta_1,\theta_2,…,\theta_k|X)\propto \theta_1^{m_1+\alpha_1-1}\theta_2^{m_2+\alpha_2-1}…\theta_k^{m_k+\alpha_k-1}$$

For prediction, the probability that X belongs to category $c_j$ is:

$$\begin{aligned}P(Y=c_j|X) & =\int P(Y=c_j,\theta_j|X)d\theta_j \\ & =\int P(Y=c_j|X,\theta_j)P(\theta_j|X)d\theta_j \\ &=\int \theta_jP(\theta_j|X)d\theta_j\end{aligned}$$

The step from the second line to the third holds because $\theta_j$ itself represents the probability of belonging to class $c_j$.

Therefore, prediction amounts to taking an expectation, and the Dirichlet expectation gives:

$$P(Y=c_j|X)=\dfrac{m_j+\alpha_j}{\sum\limits_{i=1}^{k}(m_i+\alpha_i)}=\dfrac{m_j+\alpha_j}{N+\sum\limits_{i=1}^{k}\alpha_i}$$

If we simply assume every category has the same prior weight, i.e. every $\alpha_i$ equals λ, the formula reduces to $P(Y=c_j|X)=\dfrac{m_j+\lambda}{N+k\lambda}$. When λ = 1 this is called Laplace smoothing; when λ = 0 it reduces to MLE.
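A tiny sketch of this formula (the class counts are made up):

```python
def smoothed_probs(counts, lam=1.0):
    """P(Y=c_j) = (m_j + lam) / (N + k*lam); lam=1 is Laplace smoothing, lam=0 is MLE."""
    N, k = sum(counts), len(counts)
    return [(m + lam) / (N + k * lam) for m in counts]

# Three classes, one never observed in the training data:
counts = [3, 7, 0]
print(smoothed_probs(counts, lam=0))  # MLE: [0.3, 0.7, 0.0], a hard zero
print(smoothed_probs(counts, lam=1))  # Laplace: [4/13, 8/13, 1/13]
```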

Note that **naive Bayes estimates its parameters with Bayesian estimation, while the label prediction is a maximum a posteriori estimate.**

#### 8. Bayesian inference

Bayesian inference is the process of updating our beliefs with data: when new data arrive, the posterior is computed, and that posterior then serves as the prior for the next round.

In other words, Bayesian inference repeatedly applies Bayes' theorem to continually refine our knowledge of the unknown variables.

#### 9. Others

- Empirical risk minimization corresponds to MLE
- Structural risk minimization (empirical risk plus a regularizer) corresponds to MAP
- Expected risk minimization corresponds to Bayesian estimation