A cheat sheet for pandas data visualization

Time:2021-4-11

By Rashida Nasrin Sucky
Compiled by VK
Source: Towards Data Science

We use Python’s pandas library mainly for data manipulation in data analysis, but we can also use it for data visualization. You don’t even need to import the Matplotlib library for this (it only has to be installed).

Pandas uses Matplotlib as its back end and draws the plots for you. That makes it very easy to plot the columns of a DataFrame. Because pandas wraps Matplotlib in a higher-level API, it can produce a plot with fewer lines of code.

I’ll start with basic plots using random data, and then move on to more advanced plots with a real dataset.

In this tutorial, I’ll use the Jupyter Notebook environment. If you don’t have it installed, you can simply use a Google Colab notebook. You don’t even need to install pandas there; it comes preinstalled.

Installing Jupyter Notebook is also a good idea if you want a local setup.

It’s a great tool for data scientists, and it’s free.

To install pandas, use:

pip install pandas

Or, if you use Anaconda:

conda install pandas

Now you’re ready.

Pandas visualization

We’ll start with the basics.

Line plot

First, import pandas. Then let’s create a basic Series and draw a line plot.

import pandas as pd
a = pd.Series([40, 34, 30, 22, 28, 17, 19, 20, 13, 9, 15, 10, 7, 3])
a.plot()

The most basic plot is ready! Look how easy it is. We can improve it a little.

I will:

Change the size of the figure to make it larger,

Change the default blue color,

Add a title,

Change the default font size of the numbers on the axes.

a.plot(figsize=(8, 6), color='green', title = 'Line Plot', fontsize=12)
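The call above returns a Matplotlib Axes object, so further tweaks can be applied to it afterwards. Here is a minimal sketch, assuming a headless "Agg" backend so it also runs outside a notebook; the axis label texts are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import pandas as pd

a = pd.Series([40, 34, 30, 22, 28, 17, 19, 20, 13, 9, 15, 10, 7, 3])

# .plot() hands back the Matplotlib Axes it drew on
ax = a.plot(figsize=(8, 6), color='green', title='Line Plot', fontsize=12)

# Hypothetical label texts, added through the returned Axes
ax.set_xlabel('index')
ax.set_ylabel('value')
```

Keeping the returned Axes around is handy whenever a pandas keyword does not cover the tweak you need.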

We’ll pick up more styling techniques later in this tutorial.

Area plot

I’m going to draw an area plot here with the same data, a.

I can use the .plot() method and pass the kind parameter to specify the type of plot I want, for example:

a.plot(kind='area')

Or I can write it like this:

a.plot.area()

Both of the methods above create the same plot.

Area plots are more meaningful and look better when there are multiple variables in them. So I’m going to make two more Series, build a DataFrame, and draw an area plot from it.

b = pd.Series([45, 22, 12, 9, 20, 34, 28, 19, 26, 38, 41, 24, 14, 32])
c = pd.Series([25, 38, 33, 38, 23, 12, 30, 37, 34, 22, 16, 24, 12, 9])
d = pd.DataFrame({'a':a, 'b': b, 'c': c})

Let’s plot the DataFrame d as an area plot:

d.plot.area(figsize=(8, 6), title='Area Plot')

You don’t have to accept these default colors. Let’s change these colors and add some styles.

d.plot.area(alpha=0.4, color=['coral', 'purple', 'lightgreen'],figsize=(8, 6), title='Area Plot', fontsize=12)

The alpha parameter makes the plot translucent.

This is very useful when we have overlapping area plots, histograms, or dense scatter plots.
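By default the areas of a DataFrame are stacked, so they do not actually overlap. A small sketch of the overlapping case, where alpha matters most, might look like this. It reuses the same three series and assumes a headless "Agg" backend.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import pandas as pd

a = pd.Series([40, 34, 30, 22, 28, 17, 19, 20, 13, 9, 15, 10, 7, 3])
b = pd.Series([45, 22, 12, 9, 20, 34, 28, 19, 26, 38, 41, 24, 14, 32])
c = pd.Series([25, 38, 33, 38, 23, 12, 30, 37, 34, 22, 16, 24, 12, 9])
d = pd.DataFrame({'a': a, 'b': b, 'c': c})

# stacked=False draws each area from zero, so the areas overlap
ax = d.plot.area(stacked=False, alpha=0.4,
                 color=['coral', 'purple', 'lightgreen'],
                 figsize=(8, 6), title='Unstacked Area Plot')

legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```

With opaque colors the last column would hide the others, which is why a translucent alpha is the usual choice here.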

The .plot() method accepts eleven values for the kind parameter (‘density’ is an alias for ‘kde’):

  1. line
  2. area
  3. bar
  4. barh
  5. pie
  6. box
  7. hexbin
  8. hist
  9. kde
  10. density
  11. scatter
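As a rough illustration of the kind parameter, the sketch below loops over the kinds that work on a single Series without extra arguments. The values are made up, Matplotlib is imported only to lay out the grid of subplots, and ‘kde’/‘density’ (which need SciPy) as well as ‘scatter’ and ‘hexbin’ (which need x and y columns) are left out on purpose.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([4, 7, 1, 8, 5])  # hypothetical values
kinds = ['line', 'area', 'bar', 'barh', 'box', 'hist', 'pie']

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, kind in zip(axes.ravel(), kinds):
    # Same data, different kind; each plot goes on its own subplot
    s.plot(kind=kind, ax=ax, title=kind)

axes.ravel()[-1].set_visible(False)  # hide the unused eighth subplot
```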

I want to show the usage of all these different plots. To do this, I will use the NHANES dataset from the Centers for Disease Control and Prevention. I downloaded the dataset and put it in the same folder as the Jupyter notebook. Feel free to download the dataset and follow along: https://github.com/rashida048/Datasets/blob/master/nhanes_2015_2016.csv

Import the dataset here:

df = pd.read_csv('nhanes_2015_2016.csv')
df.head()

This dataset has 30 columns and 5735 rows.

Before you start plotting, it’s important to check the columns of the dataset:

df.columns

Output:

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR', 'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR', 'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC', 'BMXWAIST', 'HIQ210', 'DMDEDUC2x', 'DMDMARTLx'], dtype='object')

The column names may look strange, but don’t worry. I’ll explain what each column means as we go. We won’t use all the columns; we’ll practice these plots with some of them.

Histogram

I’m going to use the weight of the population to make a basic histogram:

df['BMXWT'].hist()

As a reminder, a histogram shows a frequency distribution. The figure above shows that about 1825 people have a weight of around 75, and most people weigh between 49 and 99.
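To read the exact numbers behind a histogram instead of estimating them from the picture, the bin counts can be computed directly. This is a minimal sketch with made-up weights, not the NHANES data; it uses 10 bins, which is the default of Series.hist().

```python
import numpy as np
import pandas as pd

# Hypothetical weights, for illustration only
weights = pd.Series([52, 58, 61, 64, 67, 70, 72, 75, 75, 78, 81, 85, 90, 98, 110])

# Same binning idea as the histogram: equal-width bins from min to max
counts, edges = np.histogram(weights, bins=10)

# One row per bin: its left edge, right edge, and how many values fall in it
table = pd.DataFrame({'left': edges[:-1], 'right': edges[1:], 'count': counts})
print(table)
```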

What if I want to put several histograms on one graph?

I’ll use weight, height and body mass index (BMI) to draw three histograms in one graph.

df[['BMXWT', 'BMXHT', 'BMXBMI']].plot.hist(stacked=True, bins=20, fontsize=12, figsize=(10, 8))

But if you want three separate histograms, you can also use just one line of code, like this:

df[['BMXWT', 'BMXHT', 'BMXBMI']].hist(bins=20,figsize=(10, 8))

It can be more dynamic!

We have blood pressure data in the ‘BPXSY1’ column and education data in the ‘DMDEDUC2’ column. If we want to check the blood pressure distribution of people at each education level, we can also do that in one line of code.

But before that, I want to replace the values of the ‘DMDEDUC2’ column with more meaningful strings:

df["DMDEDUC2x"] = df.DMDEDUC2.replace({1: "less than 9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College", 7: "Refused", 9: "Don't know"})

Now draw the histograms:

df[['DMDEDUC2x', 'BPXSY1']].hist(by='DMDEDUC2x', figsize=(18, 12))

Look! We only need one line of code to get the blood pressure distribution for each education level!

Bar chart

Now let’s look at how blood pressure changes with marital status. This time I’m going to make a bar chart. As before, I’ll replace the values of the ‘DMDMARTL’ column with more meaningful strings.

df["DMDMARTLx"] = df.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married", 6: "Living w/partner", 77: "Refused"})

To draw the bar chart, we need to preprocess the data: group it by marital status and take the mean of each group. Here I do the data processing and the plotting in the same line of code.

df.groupby('DMDMARTLx')['BPXSY1'].mean().plot(kind='bar', rot=45, fontsize=10, figsize=(8, 6))

Here we use the ‘rot’ parameter to rotate the x-tick labels by 45 degrees. Otherwise, they would be too cluttered.
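The group-then-average step can be checked on a tiny table before plotting. The sketch below uses made-up readings and hypothetical column names (‘status’, ‘bp’), and sorts the means so the bars come out in ascending order.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import pandas as pd

# Hypothetical data, for illustration only
df_demo = pd.DataFrame({
    'status': ['Married', 'Married', 'Widowed', 'Widowed',
               'Never married', 'Never married'],
    'bp': [120, 130, 140, 150, 110, 114],
})

# One mean per group, sorted so the bar heights increase left to right
means = df_demo.groupby('status')['bp'].mean().sort_values()
print(means)

ax = means.plot(kind='bar', rot=45, fontsize=10, figsize=(8, 6))
```

Printing the grouped result first makes it easy to confirm that the bars show what you expect.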

You can also make it horizontal if you want. Here the data are grouped by education level instead:

df.groupby('DMDEDUC2x')['BPXSY1'].mean().plot(kind='barh', rot=45, fontsize=10, figsize=(8, 6))

I want to draw a bar chart with multiple variables. We have a column with the ethnic origin of the population. It would be interesting to see whether people’s weight, height, and BMI vary with ethnic origin.

To draw this plot, we need to group the three columns (weight, height, and BMI) by ethnic origin and take the mean of each.

df_bmx = df.groupby('RIDRETH1')[['BMXWT', 'BMXHT', 'BMXBMI']].mean().reset_index()

This time I did not replace the ethnic origin codes with names; I kept the numeric codes as they are. Let’s plot it now:

df_bmx.plot(x = 'RIDRETH1', 
            y=['BMXWT', 'BMXHT', 'BMXBMI'], 
            kind = 'bar', 
            color = ['lightblue', 'red', 'yellow'], 
            fontsize=10)

It seems that the fourth ethnic group is a little higher than the others, but the differences between the groups are small.

We can also stack the different measures (weight, height, and body mass index) on top of each other.

df_bmx.plot(x = 'RIDRETH1', 
            y=['BMXWT', 'BMXHT', 'BMXBMI'], 
            kind = 'bar', stacked=True,
            color = ['lightblue', 'red', 'yellow'], 
            fontsize=10)

Pie chart

I want to see if there is a relationship between marital status and education.

I need to group marital status by education level, and count the population in each marital status group by education level. That sounds too wordy, doesn’t it? Let’s see:

df_edu_marit = df.groupby('DMDEDUC2x')['DMDMARTL'].count()
pd.Series(df_edu_marit)

Using this series, you can easily draw a pie chart:

# Slice labels are taken from the group index, so each label matches its count
ax = pd.Series(df_edu_marit).plot.pie(subplots=True, label='',
     figsize = (8, 6),
     colors = ['lightgreen', 'violet', 'coral', 'skyblue', 'yellow', 'purple'], autopct = '%.2f')

Here I add some style parameters. Please feel free to try more style parameters.
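The count used for the pie chart gives one number per education level. To look at marital status within each education level, a cross-tabulation is one option. This is a minimal sketch on made-up rows with hypothetical column names, not the author's method.

```python
import pandas as pd

# Hypothetical rows, for illustration only
df_demo = pd.DataFrame({
    'education': ['College', 'College', 'HS/GED', 'HS/GED', 'HS/GED', '9-11'],
    'marital': ['Married', 'Never married', 'Married', 'Married',
                'Divorced', 'Married'],
})

# Rows: education level, columns: marital status, cells: number of people
table = pd.crosstab(df_demo['education'], df_demo['marital'])

# Same table as row-wise shares, so each education level sums to 1
shares = pd.crosstab(df_demo['education'], df_demo['marital'], normalize='index')
print(table)
print(shares)
```

Each row of the shares table could then be drawn as its own pie or as one stacked bar.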

Box plot

For example, I’ll use body mass index, leg length, and arm length data to create a box plot.

color = {'boxes': 'DarkBlue', 'whiskers': 'coral', 
         'medians': 'Black', 'caps': 'Green'}
df[['BMXBMI', 'BMXLEG', 'BMXARML']].plot.box(figsize=(8, 6),color=color)

Scatter plot

For a simple scatter plot, I want to see whether there is any relationship between body mass index (‘BMXBMI’) and blood pressure (‘BPXSY1’).

df.head(300).plot(x='BMXBMI', y= 'BPXSY1', kind = 'scatter')

I only use the first 300 rows because, with all the data, the scatter plot becomes too dense to read. Alternatively, you can use the alpha parameter to make the points translucent.
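A sketch of the alpha approach on a full-size table could look like this. The data are randomly generated stand-ins with hypothetical column names, and the alpha and marker size values are just starting points to tune.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 5000 random points
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'bmi': rng.normal(28, 6, 5000),
    'bp': rng.normal(125, 18, 5000),
})

# Small, translucent markers keep dense regions readable
ax = df_demo.plot.scatter(x='bmi', y='bp', alpha=0.1, s=10, figsize=(8, 6))
```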

Now, let’s draw a slightly more advanced scatter plot, still with a single line of code.

This time I’ll add color shading. I’m going to draw a scatter plot with weight on the x-axis and height on the y-axis.

I’ll also add leg length, shown as the shade of each point. The longer the leg, the darker the shade.

df.head(500).plot.scatter(x= 'BMXWT', y = 'BMXHT', c ='BMXLEG', s=50, figsize=(8, 6))

It shows the relationship between weight and height. You can see if there is any relationship between leg length and height and weight.

Another way to add a third variable is to vary the marker size. Here, I put height on the x-axis, weight on the y-axis, and use body mass index to set the marker size.

df.head(200).plot.scatter(x= 'BMXHT', y = 'BMXWT', 
                          s =df['BMXBMI'][:200] * 7, 
                          alpha=0.5, color='purple',
                         figsize=(8, 6))

The small dots here indicate a lower BMI, and the larger dots indicate a higher BMI.

Hexbin

This is another beautiful type of visualization, in which the points are hexagons. When the data is too dense, it’s useful to put it into bins. As you can see, in the previous two plots I only used 500 and 200 rows, because if I plot the whole dataset, the plot becomes too dense to understand or to extract any information from.

In this case, binning the data spatially is very useful. I’m using hexbin, which represents the data in hexagons. Each hexagon is a bin, and its color represents the density of points in that bin. Here is a basic hexbin example.

df.plot.hexbin(x='BMXARMC', y='BMXLEG', gridsize= 20)

Here, darker colors represent higher data density, while lighter colors represent lower data density.

Does that sound like a histogram? It is one, except that the counts are represented by color rather than by bar height.

If we add the extra parameter ‘C’, the meaning of the colors changes. It’s no longer like a histogram.

The parameter ‘C’ specifies the value at each (x, y) coordinate. These values are accumulated for each hexagonal bin and then reduced using reduce_C_function. If reduce_C_function is not specified, np.mean is used by default. You can set it to np.mean, np.max, np.sum, np.std, and so on.

For more information, see the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hexbin.html

Here is an example:

import numpy as np

df.plot.hexbin(x='BMXARMC', y='BMXLEG', C = 'BMXHT',
                         reduce_C_function=np.max,
                         gridsize=15,
                        figsize=(8,6))

Here, a darker hexagon means a higher np.max value, that is, a higher maximum height (‘BMXHT’) among the points in that bin, since I use np.max as the reduce_C_function. We can also use a colormap instead of the default color shading:

df.plot.hexbin(x='BMXARMC', y='BMXLEG', C = 'BMXHT',
                         reduce_C_function=np.max,
                         gridsize=15,
                        figsize=(8,6),
                        cmap = 'viridis')

It looks beautiful, doesn’t it? And there’s a lot of information.
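To make the role of ‘C’ and reduce_C_function concrete, here is a toy sketch with five made-up points in two far-apart clusters, so each cluster lands in its own hexagon. It reads the per-hexagon values back from the Matplotlib collection that the plot creates; that read-back step is only for inspection and is not something the tutorial itself relies on.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import numpy as np
import pandas as pd

# Hypothetical points: three at (0, 0) and two at (10, 10)
pts = pd.DataFrame({
    'x': [0, 0, 0, 10, 10],
    'y': [0, 0, 0, 10, 10],
    'c': [1, 2, 3, 20, 50],
})

# Same points, two different reductions of the 'c' values per hexagon
ax_max = pts.plot.hexbin(x='x', y='y', C='c', reduce_C_function=np.max, gridsize=5)
ax_sum = pts.plot.hexbin(x='x', y='y', C='c', reduce_C_function=np.sum, gridsize=5)

# One value per non-empty hexagon
max_vals = sorted(ax_max.collections[0].get_array().tolist())
sum_vals = sorted(ax_sum.collections[0].get_array().tolist())
print(max_vals, sum_vals)
```

The two plots have the same hexagons but different colors, because the color now encodes the reduced ‘c’ value rather than the number of points.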

Some advanced visualizations

Above, I covered some of the basic plots people use in everyday work with data. But data scientists need more. The pandas library also has some more advanced visualizations that can provide a lot more information in a single line of code.

Scatter matrix

A scatter matrix is very useful. It provides a lot of information in one plot. It can be used for general data analysis or for feature engineering in machine learning. Let’s start with an example; I’ll explain it afterwards.

from pandas.plotting import scatter_matrix

scatter_matrix(df[['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML']], alpha = 0.2, figsize=(10, 8), diagonal = 'kde')

I use five features here, and I get the pairwise relationships between all five variables. On the diagonal, it gives the density plot of each individual feature. We’ll talk more about density plots in the next example.
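The scatter matrix is a grid with one row and one column per feature, and it pairs naturally with the correlation table of the same columns. Here is a small sketch on randomly generated stand-in data with hypothetical column names; it uses diagonal='hist' rather than 'kde' so that it runs without SciPy.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical data: weight depends on height, noise is unrelated
rng = np.random.default_rng(1)
height = rng.normal(166, 10, 300)
weight = 0.9 * (height - 100) + rng.normal(0, 8, 300)
demo = pd.DataFrame({'height': height, 'weight': weight,
                     'noise': rng.normal(0, 1, 300)})

# Returns a 2-D array of Axes: one panel per pair of columns
axes = scatter_matrix(demo, alpha=0.2, figsize=(8, 8), diagonal='hist')

# Numeric companion to the picture: pairwise Pearson correlations
corr = demo.corr()
print(corr.round(2))
```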

KDE or density plot

A KDE plot, or kernel density plot, estimates the probability distribution of a Series or of a column in a DataFrame. Let’s look at the probability distribution of the weight variable (‘BMXWT’).

df['BMXWT'].plot.kde()

You can also show several probability distributions in one plot. Here, I show the distributions of height, weight, and BMI in the same plot:

df[['BMXWT', 'BMXHT', 'BMXBMI']].plot.kde(figsize = (8, 6))

You can also use the other style parameters described earlier. I like to keep it simple.
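For intuition about what a kernel density estimate computes, here is a hand-rolled sketch with NumPy only: it places a small Gaussian bump on every data point and averages them. The data are random stand-ins, the helper name gaussian_kde_1d is hypothetical, and the bandwidth uses Scott's rule as an assumption; pandas itself delegates the real computation to SciPy.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical stand-in for a weight column
rng = np.random.default_rng(2)
samples = rng.normal(80, 15, 500)

# Assumed bandwidth choice: Scott's rule for one dimension
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(samples.min() - 4 * bandwidth,
                   samples.max() + 4 * bandwidth, 400)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to roughly 1 over its range
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```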

Parallel_coordinates

This is a good way to display multidimensional data. It clearly shows the clusters, if any. For example, I want to see if there are any differences in height, weight and body mass index between men and women. Let’s check it.

from pandas.plotting import parallel_coordinates

parallel_coordinates(df[['BMXWT', 'BMXHT', 'BMXBMI', 'RIAGENDR']].dropna().head(200), 'RIAGENDR', color=['blue', 'violet'])

You can see clear differences in weight, height, and BMI between men and women. Here, 1 is male and 2 is female.

Bootstrap_plot

This is a very important plot for research and statistical analysis, and it saves a lot of time. bootstrap_plot is used to assess the uncertainty of a statistic for a given dataset.

This function draws a random sample of the specified size and then calculates the mean, median, and midrange of that sample. The process is repeated the specified number of times.

Here I’ve created a bootstrap_plot from the BMI data:

from pandas.plotting import bootstrap_plot

bootstrap_plot(df['BMXBMI'], size=100, samples=1000, color='skyblue')

Here, the sample size is 100 and the number of samples is 1000. So a random sample of 100 values is drawn to calculate the mean, median, and midrange, and this process is repeated 1000 times.
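The resampling loop behind this kind of plot can be sketched by hand. The example below uses randomly generated stand-in values, not the NHANES BMI column, and it samples with replacement as in the classic bootstrap; that detail is an assumption and may differ from what pandas does internally.

```python
import numpy as np

# Hypothetical stand-in for the BMI column
rng = np.random.default_rng(3)
bmi = rng.normal(29, 7, 5000)

size, samples = 100, 1000

# Each row is one random sample of 100 values; there are 1000 rows
draws = rng.choice(bmi, size=(samples, size), replace=True)

# One statistic per sample
means = draws.mean(axis=1)
medians = np.median(draws, axis=1)
midranges = (draws.min(axis=1) + draws.max(axis=1)) / 2

# The spread of each statistic across samples reflects its uncertainty
print(means.std(), medians.std(), midranges.std())
```

Histograms of these three arrays correspond to the panels that summarize the mean, median, and midrange.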

For statisticians and researchers, this is an extremely important process, but also a time-saving process.

Conclusion

I wanted to put together a cheat sheet for data visualization in pandas. If you use Matplotlib and Seaborn, there are many more options and plot types. But these basic plot types are the ones we use in everyday work with data, and using pandas for them makes your code simpler and saves a lot of lines.

Link to the original text: https://towardsdatascience.com/an-ultimate-cheat-sheet-for-data-visualization-in-pandas-4010e1b16b5c
