The basic principle of naive Bayes is to combine the prior probability with the class-conditional likelihood, via Bayes' theorem, to compute the posterior probability. For background on prior and posterior probability, see the post "the simplest understanding of prior probability and posterior probability".

1. Suppose there is a data set X containing n samples, X_1 to X_n. Each sample has m features (m dimensions), so a sample can be written as:

$$

X_1: X_1^1, X_1^2, X_1^3, \dots, X_1^m

$$

2. The output y of the data set has K categories in total, labeled 1 to K.

3. From frequency statistics on the training set, we can estimate the probabilities P(y = 1), P(y = 2), ..., P(y = K) of the output y.
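As a quick sketch of step 3, the class priors can be estimated with simple frequency counts (the toy labels below are hypothetical):

```python
import numpy as np

# Hypothetical toy labels: 6 training samples, K = 3 classes labeled 1..3
y = np.array([1, 2, 1, 3, 2, 1])

K = 3
# P(y = k) estimated by frequency: count(y == k) / n
priors = {k: np.mean(y == k) for k in range(1, K + 1)}
print(priors[1])  # 0.5  (class 1 appears in 3 of 6 samples)
```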

4. Our problem is to find the class of each sample in the test set (suppose there are n test samples, X_{test1} to X_{testn}), i.e. to find

$$

\max\bigl[P(y=1|X_{test1}), P(y=2|X_{test1}), P(y=3|X_{test1}), \dots, P(y=K|X_{test1})\bigr]

$$

through

$$

\max\bigl[P(y=1|X_{testn}), P(y=2|X_{testn}), P(y=3|X_{testn}), \dots, P(y=K|X_{testn})\bigr]

$$

If, for test sample 1, the probability is highest when y = 1, then the output for sample 1 is class 1; the other samples are handled the same way.

5. Each of the above posterior probabilities can be computed by applying Bayes' theorem, e.g.:

$$

P(y=1|X_{test1}) = \frac{P(X_{test1}|y=1)\,P(y=1)}{P(X_{test1})}

$$

We already know the value of P(y = 1) from step 3. As for the denominator P(X_{test1}), it appears in every one of the terms

$$

P(y=1|X_{test1}), P(y=2|X_{test1}), P(y=3|X_{test1}), \dots, P(y=K|X_{test1})

$$

so it can be omitted without affecting the comparison. The key quantity left to compute is therefore

$$

P(X_{test1}|y=1)=P(X_{test1}^1,X_{test1}^2,X_{test1}^3,\dots,X_{test1}^m|y=1)

$$
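A quick numeric illustration of why the shared denominator can be dropped: dividing every class's numerator by the same P(X_{test1}) does not change which class attains the maximum (the numerator values below are made up):

```python
# Hypothetical numerators P(X_test1|y=k) * P(y=k) for K = 3 classes
numerators = {1: 0.02, 2: 0.05, 3: 0.01}
evidence = sum(numerators.values())  # P(X_test1), the shared denominator

# True posteriors and denominator-free scores pick the same class
posteriors = {k: v / evidence for k, v in numerators.items()}
assert max(posteriors, key=posteriors.get) == max(numerators, key=numerators.get)
print(max(numerators, key=numerators.get))  # 2
```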

6. The naive Bayes algorithm assumes that the m features are mutually independent given the class (in reality they are usually not independent; the assumption is made purely for computational convenience, so this step introduces some error). The formula can then be factorized as follows:

$$

P(X_{test1}^1,X_{test1}^2,X_{test1}^3,\dots,X_{test1}^m|y=1)=P(X_{test1}^1|y=1)\,P(X_{test1}^2|y=1)\,P(X_{test1}^3|y=1)\cdots P(X_{test1}^m|y=1)

$$
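Under the independence assumption, the joint likelihood is just the product of the per-feature conditional probabilities. A minimal sketch (the per-feature probabilities below are hypothetical values, as if already estimated from training data):

```python
import numpy as np

# Hypothetical P(X_test1^j | y=1) for j = 1..m, with m = 4
per_feature_probs = np.array([0.5, 0.2, 0.8, 0.1])

# Independence assumption: joint likelihood = product of the factors
joint = np.prod(per_feature_probs)
print(joint)  # 0.5 * 0.2 * 0.8 * 0.1 = 0.008
```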

7. The problem has now been reduced to computing terms of the form

$$

P(X_{test1}^1|y=1)

$$

Solving this term splits into three cases. In the first case, the feature of data set X (taking the first-dimension feature as an example) takes discrete values; then the feature follows a multinomial distribution, and the conditional probability is estimated by counting:

$$

P(X_{test1}^1|y=1)=\frac{\sum_{i=1}^{n} I(X_i^1=X_{test1}^1,\, y_i=1)}{\sum_{i=1}^{n} I(y_i=1)}

$$

where $I(\cdot)$ is the indicator function.

But if the numerator is 0, then:

$$

P(X_{test1}^1|y=1)=0

$$

$$

P(X_{test1}^1,X_{test1}^2,X_{test1}^3,\dots,X_{test1}^m|y=1)=0\cdot P(X_{test1}^2|y=1)\,P(X_{test1}^3|y=1)\cdots P(X_{test1}^m|y=1)=0

$$

It is unreasonable to conclude that the probability of the current category is 0 just because one feature value never appears in the training set, so Laplace smoothing is introduced here:

$$

P(X_{test1}^1|y=1)=\frac{\sum_{i=1}^{n} I(X_i^1=X_{test1}^1,\, y_i=1)+\lambda}{\sum_{i=1}^{n} I(y_i=1)+S_1\lambda}

$$

where $S_1$ is the number of distinct values the first feature can take and $\lambda$ is the smoothing parameter, usually taken as 1.
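Laplace smoothing in step 7 can be sketched with toy counts (the feature values, λ = 1, and S₁ = 3 below are hypothetical):

```python
import numpy as np

# Feature-1 values of the training samples with y = 1 (toy data)
X1 = np.array(['a', 'b', 'a', 'c'])
lam = 1.0  # smoothing parameter lambda
S1 = 3     # number of distinct values feature 1 can take: 'a', 'b', 'c'

def smoothed_prob(v):
    # (count of value v in class 1 + lambda) / (class-1 count + S1 * lambda)
    count = np.sum(X1 == v)
    return (count + lam) / (len(X1) + S1 * lam)

print(smoothed_prob('a'))  # (2 + 1) / (4 + 3) = 3/7
print(smoothed_prob('d'))  # unseen value: (0 + 1) / (4 + 3) = 1/7, not 0
```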

In the second case, the feature of data set X (again taking the first-dimension feature as an example) is discrete and sparse, so we only care whether the feature value exists: record it as 1 if present and 0 otherwise. The first feature then follows a Bernoulli distribution:

$$

P(X_{test1}^1|y=1)=P(X^1=1|y=1)\,X_{test1}^1+\bigl(1-P(X^1=1|y=1)\bigr)\bigl(1-X_{test1}^1\bigr)

$$

In the third case, the feature of data set X (taking the first-dimension feature as an example) is continuous. Naive Bayes assumes the class-conditional distribution of the feature is normal, with $\mu_1$ and $\sigma_1^2$ estimated as the sample mean and variance of that feature over the class-1 training samples:

$$

P(X_{test1}^1|y=1) = \frac{1}{\sqrt{2\pi\sigma^2_1}}\exp\Bigl(-\frac{(X_{test1}^1-\mu_1)^2}{2\sigma^2_1}\Bigr)

$$
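The Gaussian likelihood above is straightforward to evaluate directly (μ₁ = 2.0 and σ₁ = 0.5 below are toy values standing in for the class-1 sample statistics):

```python
import math

# Toy class-1 parameters for the first feature
mu1, sigma1 = 2.0, 0.5

def gaussian_likelihood(x, mu, sigma):
    # 1 / sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# At x = mu the density is at its peak, 1 / (sigma * sqrt(2*pi))
print(round(gaussian_likelihood(2.0, mu1, sigma1), 4))  # 0.7979
```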


8. After computing the case y = 1 for $X_{test1}$, the cases y = 2 through K follow in the same way, and the final output for the sample is the value of y with the highest probability. The outputs of the other test samples are obtained analogously, and the algorithm is complete.
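Steps 1–8 for the continuous (Gaussian) case can be tied together in a short end-to-end sketch. This is only a minimal illustration on made-up data; logs are used so that the product of many small probabilities does not underflow, which does not change the argmax:

```python
import numpy as np

# Toy training data: 4 samples, 2 features, 2 classes
X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.0], [3.2, 4.1]])
y = np.array([1, 1, 2, 2])

# Estimate P(y=k) by frequency, and per-feature mean/variance per class
params = {}
for k in np.unique(y):
    Xk = X[y == k]
    # small floor on the variance for numerical stability
    params[k] = (np.mean(y == k), Xk.mean(axis=0), Xk.var(axis=0) + 1e-9)

def predict(x):
    best_k, best_score = None, -np.inf
    for k, (prior, mu, var) in params.items():
        # log P(y=k) + sum_j log N(x^j; mu_j, var_j)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = np.log(prior) + log_lik
        if score > best_score:
            best_k, best_score = k, score
    return best_k

print(predict(np.array([1.1, 2.0])))  # 1
print(predict(np.array([3.1, 4.0])))  # 2
```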

Reference blog: https://www.cnblogs.com/pinar