Data visualization: 4. The secret of scatter diagram

Time:2020-2-13

Reference source: vitu.ai

In the previous article, you learned how to draw histograms and heatmaps. Next, let's learn about the scatter plot, one of the best tools for studying the relationship between two variables.

Set up your notebook

As usual, we start by setting up the environment

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
print("setup complete")

Select data set

In this article, we will use insurance data for different groups of people and study why some people pay more for insurance than others

Click here to download the dataset

Opened in Excel, the file shows one row per person, with the columns described below.

What the columns mean

age: age of the main beneficiary

sex: gender of the insurance contractor

bmi: body mass index, a measure of whether a person's weight is high or low relative to their height

children: number of children

smoker: whether the person smokes

region: the region where the person lives

charges: the insurance charges (premium)

Let’s upload the CSV file to VITU’s dataset space


Next, we use pandas to load this file:

# Path of the file to read
insurance_filepath = "insurance.csv"

# Read the file into a variable insurance_data
insurance_data = pd.read_csv(insurance_filepath)

It's time to check the data

I usually start by printing the first five rows of the dataset

insurance_data.head()
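Beyond head(), a few other quick checks are often useful. The sketch below is only an illustration: it builds a tiny in-memory CSV with made-up rows (not taken from insurance.csv) that has the same column names, then prints its shape, column types, and missing-value counts. The names sample_csv and sample_data are hypothetical.

```python
import io

import pandas as pd

# Made-up rows with the same column names as insurance.csv (illustration only)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.5,0,no,southwest,3000.0\n"
    "40,male,31.0,2,yes,northeast,28000.0\n"
    "33,male,27.4,1,no,southeast,5200.0\n"
)
sample_data = pd.read_csv(sample_csv)

# Number of rows and columns
print(sample_data.shape)

# Data type of each column
print(sample_data.dtypes)

# Count of missing values in each column
missing_per_column = sample_data.isnull().sum()
print(missing_per_column)
```

The same three calls should work unchanged on insurance_data once the real file is loaded.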

The scatter plot takes the stage

Let’s use sns.scatterplot to create a new scatter plot

sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'])


The scatter plot suggests that BMI and insurance charges are positively correlated: people with a higher BMI tend to pay higher charges. This is easy to understand, since people with a high BMI have a higher risk of disease.
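One way to put a number on how closely two columns move together is the Pearson correlation coefficient, which pandas exposes as Series.corr. The sketch below uses a small made-up table (toy_data is hypothetical and its values are not from insurance.csv) just to show the call.

```python
import pandas as pd

# Made-up values for illustration only, not taken from insurance.csv
toy_data = pd.DataFrame({
    "bmi": [20.0, 24.0, 28.0, 32.0, 36.0],
    "charges": [2500.0, 4100.0, 3900.0, 7000.0, 6500.0],
})

# Pearson correlation coefficient between the two columns (between -1 and 1)
r = toy_data["bmi"].corr(toy_data["charges"])
print(f"correlation between bmi and charges: {r:.2f}")
```

On the real data, the equivalent call would be insurance_data['bmi'].corr(insurance_data['charges']). A value close to 1 means the points lie close to a rising straight line, and a value near 0 means there is little linear relationship.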

To quantify this relationship, let's add a regression line by changing the command to sns.regplot

sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'])
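By default, sns.regplot draws a straight line fitted by linear regression, but it does not print the line's slope or intercept. If you want those numbers, one option is to fit a least-squares line yourself with numpy.polyfit. The sketch below is an illustration on made-up points that lie exactly on a known line, so the result is easy to check. The values are not from insurance.csv.

```python
import numpy as np

# Made-up points lying exactly on charges = 200 * bmi + 1000 (illustration only)
bmi = np.array([20.0, 25.0, 30.0, 35.0])
charges = 200.0 * bmi + 1000.0

# Fit a degree-1 polynomial (a straight line) by least squares;
# polyfit returns the coefficients from highest degree to lowest
slope, intercept = np.polyfit(bmi, charges, deg=1)
print(f"charges ~ {slope:.1f} * bmi + {intercept:.1f}")
```

On the real data you could pass insurance_data['bmi'] and insurance_data['charges'] to np.polyfit in the same way. The slope then reads as the average change in charges for each additional BMI point.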


Color-coded scatter plots

We can also color-code the points of a scatter plot to show the relationship among three variables

Using the same dataset, let's see whether smoking affects the relationship between BMI and charges

sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])


Interestingly, smokers pay more than non-smokers. To look closer, let's use sns.lmplot, which draws a separate regression line for each group

sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance_data)


You will notice that the regression line for smokers is much steeper than the one for non-smokers. This means that charges rise faster with BMI among smokers: the higher a smoker's BMI, the more they pay.
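To compare the two slopes as numbers rather than by eye, you can fit one least-squares line per group. The sketch below is an illustration on made-up rows (toy_data is hypothetical and not taken from insurance.csv), using pandas groupby together with numpy.polyfit.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only, not taken from insurance.csv
toy_data = pd.DataFrame({
    "bmi": [20.0, 25.0, 30.0, 35.0, 20.0, 25.0, 30.0, 35.0],
    "smoker": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "charges": [3000.0, 3500.0, 4000.0, 4500.0,
                15000.0, 22000.0, 29000.0, 36000.0],
})

# Fit a separate least-squares line for each smoker group and keep its slope
slopes = {}
for group, rows in toy_data.groupby("smoker"):
    slope, intercept = np.polyfit(rows["bmi"], rows["charges"], deg=1)
    slopes[group] = slope
    print(f"smoker={group}: charges rise by about {slope:.0f} per BMI point")
```

Replacing toy_data with insurance_data should give the slopes of the two lines that sns.lmplot draws, assuming the smoker column holds the values yes and no.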

Original article: Data visualization [from programming novice to plotting expert]: 4. The secret of scatter diagram