Data science statistics: what is skewness?

Time:2020-11-25

By Abhishek Sharma
Compile | VK
Source | Analytics Vidhya

Summary

  • Skewness is an important statistical concept in data science and analysis

  • Understand what skewness is and why it’s important for you as a data science professional

Introduction

The concept of skewness is baked into our way of thinking. When we look at a chart, our brain intuitively picks out the pattern in it.

As you may already know, more than 50% of India’s population is under the age of 25, and more than 65% of the population is under 35.

If you plot the age distribution of India’s population, you will find a bulge on the left side of the distribution and a relatively flat, long tail on the right side. In other words, the distribution is skewed towards the right end.

So even if you have never read up on skewness as a data science or analytics professional, you have certainly interacted with the concept informally.

In statistics, this is actually a fairly simple topic, yet many people skim over it in their rush to learn other, seemingly more complex data science concepts. To me, that is a mistake.

Skewness is a basic statistical concept that everyone in data science and analytics needs to know. It is something we simply can’t escape. I’m sure you’ll agree by the end of this article.

Here, we will discuss the concept of skewness in the simplest way possible. You’ll learn about skewness, its types, and its importance in the field of data science.

So, fasten your seat belt, because you’ll learn a concept that you’ll value throughout your career in data science.

Table of contents

  • What is skewness?

  • Why is skewness important?

  • What is symmetric / normal distribution?

  • Understanding positively skewed distribution

  • Understanding negatively skewed distribution

  • How do we convert skewed data?

What is skewness?

Skewness is a measure of the asymmetry of an ideally symmetric probability distribution, and it is given by the third standardized moment. If that sounds too complicated, don’t worry! I’ll explain it to you.

In short, skewness measures how much the probability distribution of a random variable deviates from the normal distribution. Now, you might wonder, why am I talking about the normal distribution here?

The normal distribution is a probability distribution without any skewness. You can look at the picture below, which shows a symmetric distribution, basically a normal distribution, and you can see that it is symmetrical on both sides of the dotted line. Apart from this, there are two types of skewness:

  • Positive skewness

  • Negative skewness

A probability distribution with its tail on the right is a positively skewed distribution, and one with its tail on the left is a negatively skewed distribution. If you find the figures above confusing, that’s OK. We will learn more about this later.

Before that, it’s important for us as professionals to understand why skewness is such an important concept.
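As a rough illustration of the "third standardized moment" mentioned above, here is a minimal sketch that computes population skewness by hand using only the Python standard library. The function name `skewness` and the three small datasets are made up for this example; they are not from the original article.

```python
import math


def skewness(values):
    """Population skewness: the mean of ((x - mean) / std) ** 3."""
    n = len(values)
    mean = sum(values) / n
    # Population variance (divide by n, not n - 1).
    variance = sum((x - mean) ** 2 for x in values) / n
    std = math.sqrt(variance)
    return sum(((x - mean) / std) ** 3 for x in values) / n


# Hypothetical toy datasets, chosen only to show the sign of the result.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]         # long tail on the right
left_tailed = [-x for x in right_tailed]   # mirror image: long tail on the left

print(skewness(symmetric))     # approximately zero
print(skewness(right_tailed))  # greater than zero (positive skew)
print(skewness(left_tailed))   # less than zero (negative skew)
```

The sign of the result tells you which side the tail is on; the magnitude tells you how pronounced the asymmetry is.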

Why is skewness important?

Now, we know that skewness is a measure of asymmetry, and its types are distinguished by the side on which the tail of the probability distribution lies. But why is it important to know the skewness of the data?

First, linear models work on the assumption that the distributions of the independent variables and the target variable are similar. Therefore, understanding the skewness of the data helps us create better linear models.

Second, let’s look at the distribution below. It’s the horsepower distribution of the car:

You can clearly see that the distribution above is positively skewed. Now, suppose you want to use this as a feature of the model, which predicts the mpg of the car.

Our data is positively skewed here, which means it contains more low-valued data points, that is, more cars with low horsepower.

Therefore, when we train our model on this data, it will predict the mpg of low-horsepower cars better than the mpg of high-horsepower cars.

In addition, skewness tells us the direction of the outliers. You can see that our distribution is positively skewed, and most of the outliers are on the right side of the distribution.

Note: skewness does not tell us the number of outliers. It just tells us the direction.
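To make the point about outlier direction concrete, here is a small sketch that counts outliers on each side of a distribution. It assumes the common 1.5 × IQR rule used by boxplots (the article does not name a specific outlier rule), and the horsepower values are invented for illustration; they are not the dataset from the figure.

```python
import statistics


def outliers_by_side(values):
    """Count points outside the 1.5 * IQR fences on the low and high side."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    low = sum(1 for x in values if x < lower_fence)
    high = sum(1 for x in values if x > upper_fence)
    return low, high


# Made-up horsepower values (assumption: NOT the article's dataset).
horsepower = [65, 70, 72, 75, 78, 80, 85, 88, 90, 95,
              100, 105, 110, 150, 200, 230]

low, high = outliers_by_side(horsepower)
print("outliers below:", low)
print("outliers above:", high)
```

For this positively skewed sample all flagged points sit on the high side, which matches the direction of the tail. The count itself comes from the IQR rule, not from the skewness value.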

Now that we know why skewness is important, let’s take a look at the distribution I showed you.

What is symmetric / normal distribution?

Yes, we’re back to the normal distribution.

The normal distribution is used as a reference for determining the skewness of a distribution. As I mentioned earlier, the ideal normal distribution is perfectly symmetric, so its skewness is exactly zero. In practice, though, data that we call normally distributed is only almost perfectly symmetric.

Why almost rather than absolutely symmetric?

Because no real-world data fits a normal distribution perfectly. The skewness computed from such data is therefore not exactly zero; it is nearly zero. The value zero is still used as the reference for judging the skewness of a distribution.

As you can see in the figure above, the same line represents the mean, the median and the mode. This is because the mean, median and mode of a perfect normal distribution are equal.

So far, we have looked at the skewness of the normal distribution through a probability or frequency distribution. Now, let’s look at it through a boxplot, because that is the most common way to inspect a distribution in data science.
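A quick way to see this "nearly zero" behaviour is to draw a finite sample from a normal distribution and measure its skewness. The sketch below uses only the standard library; the seed, the sample size and the helper name `skewness` are arbitrary choices made for this example.

```python
import math
import random


def skewness(values):
    """Population skewness: the mean of ((x - mean) / std) ** 3."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / std) ** 3 for x in values) / n


# Arbitrary seed and sample size, fixed only so the run is repeatable.
rng = random.Random(42)
sample = [rng.gauss(0, 1) for _ in range(10_000)]

sample_skew = skewness(sample)
print(sample_skew)  # close to zero, but not exactly zero
```

Increasing the sample size generally pulls the value closer to zero, but a finite sample should not be expected to hit zero exactly.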

The figure above is a boxplot of a symmetric distribution. Notice that the distance between Q1 and Q2 equals the distance between Q2 and Q3, that is:

Q3 - Q2 = Q2 - Q1

But this alone is not enough to conclude that the distribution is not skewed. We also look at the lengths of the two whiskers (the lines extending from the box); if they are equal, then we can say that the distribution is symmetric, that is, it is not skewed.

Now that we’ve discussed skewness in a normal distribution, it’s time to look at the two types of skewness that we discussed earlier. Let’s start with positive skewness.

Understanding positively skewed distribution

A positively skewed distribution is a distribution with its tail on the right. Its skewness is greater than zero. You may have noticed from the figure that the mean is the largest value, followed by the median and then the mode.

Why is that?

Well, the answer is that the tail of the distribution is on the right; it pulls the mean to the right, making it greater than the median. The mode sits at the highest frequency of the distribution, which is to the left of the median. So, mode < median < mean.

In the boxplot above, you can see that Q2 is closer to Q1 than to Q3. This represents a positively skewed distribution. In terms of the quartiles, it is given by:

Q3 - Q2 > Q2 - Q1
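The ordering of mode, median and mean can be checked on a small sample. The sketch below uses Python’s `statistics` module on made-up right-tailed data; the ordering holds for this sample, but treat it as a rule of thumb rather than a guarantee for every skewed dataset.

```python
import statistics

# Made-up data with a long tail on the right (illustration only).
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mode = statistics.mode(data)
median = statistics.median(data)
mean = statistics.mean(data)
print(mode, median, mean)  # 2 3.0 4.5

# Mirroring the data moves the tail to the left and reverses the ordering.
mirrored = [-x for x in data]
print(statistics.mean(mirrored),
      statistics.median(mirrored),
      statistics.mode(mirrored))  # -4.5 -3.0 -2
```

The few large values on the right drag the mean upward while barely moving the median, and the mode stays at the peak.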

In this case, it is easy to determine whether the data is skewed. But what if we have such a picture:

Here, Q2 - Q1 and Q3 - Q2 are equal, yet the distribution is positively skewed. Those of you with a keen eye will notice that the right whisker is longer than the left whisker. From this, we can conclude that the data is positively skewed.

Therefore, the first step is always to check whether Q2 - Q1 and Q3 - Q2 are equal. If they are, then we compare the lengths of the whiskers.

Understanding negatively skewed distribution

As you may have guessed, a negatively skewed distribution is a distribution with its tail on the left. Its skewness is less than zero. You can also see in the figure above that mean < median < mode.

In the boxplot, the relationship between the quartiles for negative skewness is given by:

Q3 - Q2 < Q2 - Q1

Similar to what we did before, if Q3 - Q2 and Q2 - Q1 are equal, then we compare the lengths of the whiskers. If the left whisker is longer than the right whisker, we can say that the data is negatively skewed.
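The two-step boxplot check described in the last two sections can be sketched as a small function. This is an illustration under stated assumptions, not a standard library routine: the name `boxplot_skew_direction` and the `tol` parameter are made up, and the whiskers are simplified to run to the minimum and maximum, whereas many plotting tools cap them at 1.5 × IQR.

```python
import statistics


def boxplot_skew_direction(values, tol=1e-9):
    """Compare the two halves of the box first, then the whisker lengths."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    lower_box, upper_box = q2 - q1, q3 - q2
    # Step 1: is the median closer to one edge of the box?
    if abs(upper_box - lower_box) > tol:
        return "positive" if upper_box > lower_box else "negative"
    # Step 2: box halves are equal, so compare the whiskers.
    # Simplification: whiskers are taken to reach the min and max.
    left_whisker = q1 - min(values)
    right_whisker = max(values) - q3
    if abs(right_whisker - left_whisker) > tol:
        return "positive" if right_whisker > left_whisker else "negative"
    return "symmetric"


# Made-up samples for illustration.
print(boxplot_skew_direction([1, 2, 2, 3, 3, 4, 6, 9, 15]))         # positive
print(boxplot_skew_direction([-15, -9, -6, -4, -3, -3, -2, -2, -1]))  # negative
print(boxplot_skew_direction([1, 2, 3, 4, 5, 6, 7, 8, 20]))         # positive
print(boxplot_skew_direction([1, 2, 3, 4, 5, 6, 7, 8, 9]))          # symmetric
```

The third sample is the interesting case: its box halves are equal, so only the longer right whisker reveals the positive skew.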

How do we convert skewed data?

Now that you know how much skewed data can affect the predictive power of a machine learning model, it is better to transform skewed data into normally distributed data. Here are some ways to transform skewed data:

  • Power transformation

  • Log transform

  • Exponential transformation

Note: the choice of transformation depends on the statistical characteristics of the data.

Conclusion

In this article, we discussed the concept of skewness, its types and its importance in the field of data science. We covered skewness at a conceptual level, but if you want to go deeper, you can explore its mathematical side next.
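As a minimal sketch of how a transformation can reduce positive skew, the example below applies a square-root transform (a power transform with exponent 0.5) and a log transform to made-up, strictly positive data, then compares the skewness before and after. The data and the helper name `skewness` are assumptions for illustration; which transform suits real data depends on that data.

```python
import math


def skewness(values):
    """Population skewness: the mean of ((x - mean) / std) ** 3."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / std) ** 3 for x in values) / n


# Made-up, strictly positive, right-skewed values (log needs x > 0).
raw = [1, 2, 2, 3, 4, 5, 8, 13, 40, 120]

rooted = [math.sqrt(x) for x in raw]  # power transform, exponent 0.5
logged = [math.log(x) for x in raw]   # log transform

print("raw:   ", skewness(raw))
print("sqrt:  ", skewness(rooted))
print("log:   ", skewness(logged))
```

On this sample both transforms pull the long right tail in, with the log transform reducing the skewness the most. Neither makes the data exactly normal.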

Link to the original text: https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/
