Today we will talk about skewed distributions in statistics. Before that, let us first look at the normal distribution. The figure below, which has appeared on this official account many times, shows the legendary normal distribution.
The horizontal axis in this figure is the value of the random variable x. The center of the normal distribution is the mean μ of x, and the values spread out from the mean to both sides. Since μ is the mean, there must be values larger than it and values smaller than it. We use the standard deviation σ to describe how dispersed the data are, that is, how far the values typically lie from the mean μ.
The vertical axis is the probability density corresponding to x. Probability describes how likely a value or an event is. The probability density is, roughly, the probability of a narrow interval (a range of values of the random variable x) divided by the length of that interval.
The area under the curve over an interval on the horizontal axis is the probability that x falls in that interval.
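The relationship between probability, area, and density can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the mean of 170 and standard deviation of 10 are made-up values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical example: a normal distribution with mean 170 and standard deviation 10.
dist = NormalDist(mu=170, sigma=10)

# Probability of an interval = area under the curve = difference of two CDF values.
a, b = 170.0, 170.1
prob = dist.cdf(b) - dist.cdf(a)

# For a narrow interval, probability / length is close to the density inside it.
approx_density = prob / (b - a)
exact_density = dist.pdf((a + b) / 2)  # density at the midpoint of the interval

print(f"P({a} <= X <= {b}) = {prob:.6f}")
print(f"probability / length = {approx_density:.6f}")
print(f"density at midpoint  = {exact_density:.6f}")
```

The narrower the interval, the closer the ratio gets to the density at that point.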
Seven points are marked on the x-axis: μ, μ+σ, μ-σ, μ+2σ, μ-2σ, μ+3σ, and μ-3σ. They split the axis into regions that lie within 1, 2, or 3 standard deviations of the mean, or more than 3 standard deviations away from it.
It can be seen that 68.2% (34.1 + 34.1) of the data are concentrated in (μ-σ, μ+σ), 27.2% of the data lie between one and two standard deviations from the mean (between μ±σ and μ±2σ), 4.2% lie between two and three standard deviations from the mean (between μ±2σ and μ±3σ), and the small remainder lies more than three standard deviations away (beyond μ±3σ). This shows that most of the data are concentrated near the mean. Many things in our lives roughly follow the normal distribution, which is one reason why the mean can be used to represent the overall level, as in average height, average salary, and so on.
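The shares of data in each band can be recomputed from the standard normal distribution. This is a small sketch with the standard-library `statistics.NormalDist`; the helper name `band` is made up for this example. The labels in the figure are rounded, so the computed values may differ from them in the last digit.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def band(lo, hi):
    """Probability that a standard normal value falls between lo and hi."""
    return z.cdf(hi) - z.cdf(lo)

within_1 = band(-1, 1)        # within 1 standard deviation of the mean
between_1_2 = 2 * band(1, 2)  # 1 to 2 standard deviations away, both sides
between_2_3 = 2 * band(2, 3)  # 2 to 3 standard deviations away, both sides
beyond_3 = 1 - band(-3, 3)    # more than 3 standard deviations away

print(f"within 1 sigma:      {within_1:.1%}")
print(f"between 1 and 2:     {between_1_2:.1%}")
print(f"between 2 and 3:     {between_2_3:.1%}")
print(f"beyond 3 sigma:      {beyond_3:.1%}")
```

Because any normal distribution can be rescaled to the standard one, these shares are the same for every μ and σ.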
Although data are normally distributed in many cases, there are also cases where they are not. One such case is a skewed distribution. There are two kinds of skew, left and right, named after the side on which the long tail lies. In the first picture below the long tail is on the left (left-skewed), and in the last picture the long tail is on the right (right-skewed).
If the distribution is left-skewed, most of the data are concentrated on the right, and typically mode > median > mean. If it is right-skewed, most of the data are concentrated on the left, and typically mode < median < mean.
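The ordering of mode, median, and mean can be seen on a tiny data set. The sketch below uses made-up numbers: a sample with a long right tail, and its mirror image, which has a long left tail.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample: most values are small, a few are large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

# Mirroring the values moves the long tail to the left.
left_skewed = [16 - v for v in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name, "mode =", mode(data), "median =", median(data), "mean =", mean(data))
```

In the right-skewed sample the few large values pull the mean above the median, while in the mirrored sample the few small values pull it below.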
We can use the skewness coefficient to measure how strong the skew is. If the skewness coefficient is greater than 0, the distribution is right-skewed, and if it is less than 0, the distribution is left-skewed. The larger its absolute value, the stronger the skew.
In Python, to calculate the skewness coefficient of a column, you can use the following code:
# Calculate the skewness coefficient of the "col" column (df is a pandas DataFrame)
df["col"].skew()
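To make the coefficient less of a black box, here is a minimal from-scratch sketch of the plain moment-based skewness (third central moment divided by the 1.5th power of the second). The function name `skewness` and the sample data are made up for this example. Library functions may apply a small-sample correction (as far as I know, pandas' `skew()` does), so their results can differ slightly from this version, although the sign should agree.

```python
from statistics import fmean

def skewness(values):
    """Moment-based skewness m3 / m2**1.5, with no small-sample correction."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n  # second central moment (variance)
    m3 = sum((v - mu) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_tail = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]  # hypothetical data, long tail on the right
left_tail = [16 - v for v in right_tail]      # mirrored data, long tail on the left
symmetric = [1, 2, 3, 4, 5]

print("right tail:", skewness(right_tail))  # greater than 0
print("left tail: ", skewness(left_tail))   # less than 0
print("symmetric: ", skewness(symmetric))   # 0 for perfectly symmetric data
```

Cubing the deviations keeps their sign, which is why a long right tail gives a positive value and a long left tail gives a negative one.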
Because many real-world data sets are approximately normal, many models assume that the data follow a normal distribution. Analysis of variance, for example, makes this assumption. If your data are skewed, you can transform them so that they are closer to normal. A common transformation, suitable for positive right-skewed data, is to take the logarithm of the original data.
In Python, you can use the following code to take the logarithm of a value.
# Natural logarithm of x (x must be a positive number)
import math
math.log(x)
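The effect of the transformation on skewness can be checked directly. The sketch below uses made-up data in which each value is double the previous one, so the logarithms are evenly spaced. The helper `skewness` is the plain moment-based version and is defined here only for illustration.

```python
import math
from statistics import fmean

def skewness(values):
    """Moment-based skewness m3 / m2**1.5, with no small-sample correction."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data: each value is twice the previous one.
values = [1, 2, 4, 8, 16, 32, 64]

# math.log works on one number at a time, so apply it to each element.
# For a whole pandas column, np.log(df["col"]) is a common alternative.
logged = [math.log(v) for v in values]

print("skewness before log:", skewness(values))
print("skewness after log: ", skewness(logged))
```

Real data will rarely become perfectly symmetric after the transformation, but for positive data with a long right tail the skewness usually moves closer to 0.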
Why take the logarithm of variables in statistics: