As mentioned in the previous article, many models assume that the data are normally distributed. This article covers how to judge whether data follow a normal distribution, using two families of methods: descriptive statistical methods and statistical tests.
01. Descriptive statistical methods
Descriptive statistics uses figures or charts to judge whether data follow a normal distribution. The common tools are the Q-Q plot, the P-P plot, the histogram, and the stem-and-leaf plot.
1.1 Q-Q plot
The Q here has nothing to do with the QQ chat app; it is short for quantile. Quantiles are obtained by sorting the data from smallest to largest and reading off the values at given positions; for example, the median is the value in the middle, and percentiles divide the sorted data into 100 equal parts.
The x-axis of a Q-Q plot is the quantile and the y-axis is the sample value corresponding to that quantile. The points form a scatter plot through which a straight line can be fitted. If the points fall close to a straight line running from the lower left to the upper right, the data can be judged to follow a normal distribution; otherwise, they cannot.
What is the relationship between the fitted line and the normal distribution? Why can this straight line tell us whether the data are normal?
First consider the shape of the normal distribution. Its x-axis is the sample value; from left to right x increases, and the y-axis is the probability density of each value, which first rises and then falls, peaking in the middle.
The y-axis of the Q-Q plot plays the role of the x-axis of the normal distribution. If the fitted line sits at 45 degrees, the values are distributed symmetrically on both sides of the median, which matches the normal distribution's symmetry about its center.
In Python, you can draw a Q-Q plot with the following code:
import matplotlib.pyplot as plt
from scipy import stats

fig = plt.figure()
res = stats.probplot(x, plot=plt)
plt.show()
1.2 P-P plot
The P-P plot is similar to the Q-Q plot; the difference is that the Q-Q plot compares quantiles (sample values), while the P-P plot compares cumulative probabilities.
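scipy and matplotlib have no one-call P-P plot function, but one can be sketched by plotting the empirical cumulative probabilities against the theoretical ones. A minimal sketch on synthetic data (the sample x and the normal parameters estimated from it are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # illustrative sample

xs = np.sort(x)
n = len(xs)
ecdf = np.arange(1, n + 1) / n  # empirical cumulative probabilities
tcdf = stats.norm.cdf(xs, loc=x.mean(), scale=x.std())  # theoretical probabilities

plt.scatter(tcdf, ecdf, s=10)
plt.plot([0, 1], [0, 1], "r--")  # 45-degree reference line
plt.xlabel("theoretical cumulative probability")
plt.ylabel("empirical cumulative probability")
plt.show()
```

If the scatter hugs the 45-degree line, the empirical and theoretical cumulative probabilities agree, which is the P-P analogue of the straight line in a Q-Q plot.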
1.3 Histogram
Histograms come in two kinds: the frequency (count) histogram and the relative-frequency histogram. Frequency is the number of times a sample value occurs; relative frequency is that count divided by the total number of sample values.
In Python, the frequency histogram can be drawn with:
import matplotlib.pyplot as plt

plt.hist(x, bins=10)
The relative-frequency histogram (with a fitted density curve) can be drawn with:
import seaborn as sns

sns.distplot(x)

Note that distplot is deprecated in newer versions of seaborn; sns.histplot(x, kde=True) is the current equivalent.
1.4 Stem-and-leaf plot
Similar to the histogram is the stem-and-leaf plot, which shows the frequency of each value in a table-like form.
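Python's common plotting libraries have no built-in stem-and-leaf plot, but a minimal sketch is easy to write. The helper below is hypothetical and assumes non-negative integer data, using the tens digit as the stem and the units digit as the leaf:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group each value's units digit (leaf) under its tens digit (stem).
    Hypothetical helper; assumes non-negative integer data."""
    stems = defaultdict(list)
    for v in sorted(int(v) for v in data):
        stems[v // 10].append(v % 10)
    lines = []
    for stem in sorted(stems):
        leaves = " ".join(str(leaf) for leaf in stems[stem])
        lines.append(f"{stem:>3} | {leaves}")
    return "\n".join(lines)

print(stem_and_leaf([12, 15, 21, 24, 24, 31, 38, 45]))
```

Each row shows one stem and all of its leaves, so row length doubles as a frequency bar, which is why the plot resembles a sideways histogram.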
02. Statistical test methods
After the descriptive methods, let's look at the statistical tests. The main ones are the SW test, the KS test, the AD test, and the W test.
In the SW test, S stands for skewness and W for kurtosis. The relationship between skewness, kurtosis, and normality was discussed in the previous article.
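scipy implements a closely related skewness-kurtosis test as scipy.stats.normaltest (the D'Agostino-Pearson test), which combines both measures into a single statistic. A sketch on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)  # illustrative sample

# normaltest combines skewness and kurtosis into one chi-squared statistic
stat, p = stats.normaltest(x)
# a small p-value (e.g. < 0.05) would indicate departure from normality
```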
2.1 KS test
The KS test is based on the cumulative distribution function of the sample. It can be used to judge whether a sample follows a known distribution, or to test whether two samples differ significantly.
To judge whether a sample follows a known distribution such as the normal distribution, first compute the cumulative distribution function of the standard normal distribution, then compute the empirical cumulative distribution function of the sample. The two functions differ by different amounts at different values; we take the point with the largest difference, D. Then, based on the sample size and the significance level, a critical value is looked up (similar to the critical value in a t-test). If D is smaller than the critical value, the sample can be considered to follow the known distribution; otherwise, it cannot.
PDF (probability density function): gives the probability density at each value.
CDF (cumulative distribution function): the integral of the probability density function.
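The statistic D described above can also be computed by hand. A sketch assuming an illustrative sample tested against the standard normal, checking the gap between the empirical CDF and the theoretical CDF just above and just below each sample point:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # illustrative sample

xs = np.sort(x)
n = len(xs)
cdf = stats.norm.cdf(xs)  # CDF of the known (standard normal) distribution

# largest deviation of the empirical CDF above and below the theoretical CDF
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
d = max(d_plus, d_minus)
```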
Python has a ready-made function that can be used directly for the KS test:
from scipy.stats import kstest

kstest(x, cdf="norm")
x is the sample to be tested, and cdf indicates the known distribution to test against; possible values include 'norm', 'expon', 'logistic', 'gumbel', 'gumbel_l', 'gumbel_r', and 'extreme1', where 'norm' means a normality test.
kstest returns two values: the statistic D and the corresponding p-value.
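One caveat worth noting: cdf="norm" compares against the standard normal distribution, so data with a different mean or scale should be standardized first. A sketch on synthetic data (estimating the parameters from the sample itself makes the test somewhat conservative):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(42)
x = rng.normal(loc=5, scale=2, size=500)  # illustrative non-standard sample

# standardize before comparing against the standard normal
z = (x - x.mean()) / x.std()
d, p = kstest(z, cdf="norm")
# a large p-value (e.g. > 0.05) gives no evidence against normality
```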
2.2 AD test
The AD test is a refinement of the KS test. The KS test considers only the single point where the two distributions differ most, so it is easily affected by outliers; the AD test takes the difference at every point of the distribution into account.
In Python, you can use the following code:
from scipy.stats import anderson

anderson(x, dist='norm')
x is the sample to be tested, and dist indicates the known distribution; the possible values are the same as those listed for the KS test.
The above code returns three results: the test statistic, an array of critical values, and the significance levels corresponding to those critical values.
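The three returned values can be combined into a decision at a chosen significance level. A sketch on synthetic data, using the 5% level:

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
x = rng.normal(size=300)  # illustrative sample

result = anderson(x, dist="norm")

# find the critical value at the 5% significance level
idx = list(result.significance_level).index(5.0)
looks_normal = result.statistic < result.critical_values[idx]
```

If the statistic stays below the critical value at that level, the normality assumption is not rejected, mirroring the D-versus-critical-value comparison in the KS test.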
What does the AD test have to do with the name Anderson? The test is named after its authors, Anderson and Darling.
2.3 W test
The W test (the Shapiro-Wilk test) judges normality from the correlation between the ordered sample and the values expected under a normal distribution, yielding a statistic similar to a Pearson correlation coefficient: the closer W is to 1, the better the sample matches the normal distribution.
The implementation code in Python is as follows:
from scipy.stats import shapiro

shapiro(x)
The above code returns two results: the W statistic and its corresponding p-value.
The Shapiro-Wilk test is designed specifically for normality, so there is no need to specify a distribution type. It is not suitable for samples with more than 5,000 observations.
03. Handling non-normal data
In practice, data are usually either normal or skewed. If the skew is mild, a square-root transformation can bring the data closer to normal; if the skew is severe, a log transformation can be used instead.
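As an illustration on synthetic right-skewed (log-normal) data, both transformations reduce the sample skewness, with the log transform doing so more aggressively:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=1000)  # strongly right-skewed sample

skew_raw = stats.skew(x)             # large positive skew
skew_sqrt = stats.skew(np.sqrt(x))   # milder transform, skew reduced
skew_log = stats.skew(np.log(x))     # stronger transform; here near zero,
                                     # since the log of log-normal data is normal
```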