The pandas crosstab() function

Time:2021-7-27

By Bex T.
Compiled by VK
Source: Towards Data Science

Introduction

I really like the course “Intermediate Data Visualization with Seaborn” on DataCamp. It teaches beginners great charts and methods. But when it came to heat maps, the instructor suddenly introduced a new pandas function, crosstab, and then quickly said, “crosstab is a useful function for calculating cross-tabulation tables…”

I was lost right there. Naturally, my first reaction was to look at the function’s documentation. I had thought that after Matplotlib’s I could handle any documentation, but… I was wrong.

After practicing with it, I knew that others would struggle too, so I wrote a whole article about it.

In the previous part of this series, I discussed why some courses don’t teach advanced functions such as crosstab: it is hard to use such a function without a specific context while keeping the examples at a beginner level.

In addition, most courses use small or toy datasets. The benefits of these more complex functions are more obvious in more complex data science settings, where they are often used by more experienced pandas users.

In this article, I will teach you how to use crosstab and when to choose it over other similar functions.

Contents

  • Introduction

  • Setup

  • crosstab() basics

  • Pandas crosstab() compared with pivot_table() and groupby()

  • Further customization of pandas crosstab()

  • Pandas crosstab() with multiple groups

You can download the notebook for this article from this GitHub repo: https://github.com/BexTuychiev/medium_stories/tree/master/hardest_of_pandas2

Setup

# Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multi-cell output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

For the sample data, I’ll use Seaborn’s built-in diamonds dataset. It is large enough and has some variables that can be used with crosstab():

diamonds = sns.load_dataset('diamonds')
diamonds.head()

crosstab() basics

Like many functions that calculate group summary statistics, crosstab() works with categorical data. It can be used to group two or more variables and perform a calculation on a given value for each group. Of course, groupby() or pivot_table() can do this too, but as we’ll see later, crosstab() brings many benefits to your daily workflow.

The function accepts two or more lists, pandas Series, or DataFrame columns, and by default returns the frequency of each combination. I always like to start with an example so that you can better understand the definition, and then I will explain the syntax.

crosstab() always returns a DataFrame. Here is an example: a cross-tabulation of two variables in diamonds, cut and color. Cross-tabulation means taking one variable and displaying its groups as the index, then taking another variable and displaying its groups as the columns.

pd.crosstab(index=diamonds['cut'], columns=diamonds['color'])

The syntax is quite simple. index takes the variable whose groups are displayed as the index (rows), and columns does the same for the columns. If no aggregation function is given, each cell contains the number of observations in that combination. For example, the cell in the upper left corner tells us that there are 2834 diamonds with color code D and an Ideal cut.
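To make the counting behaviour concrete without loading the diamonds data, here is a minimal sketch on a tiny hand-made sample. The Series cut and color and their values are invented for illustration only; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy sample (not the real diamonds data)
cut = pd.Series(['Ideal', 'Ideal', 'Good', 'Ideal', 'Good'], name='cut')
color = pd.Series(['D', 'E', 'D', 'D', 'E'], name='color')

# With no aggfunc, each cell counts the rows in that (cut, color) combination
counts = pd.crosstab(index=cut, columns=color)
print(counts)
```

In this toy sample the Ideal/D cell is 2, because two of the five rows have that combination, and all cells together add up to 5.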

Next, we’ll look at the average price of each combination. crosstab() provides the values parameter to introduce a third, numeric variable to aggregate:

pd.crosstab(index=diamonds['cut'],
            columns=diamonds['color'],
            values=diamonds['price'],
            aggfunc=np.mean).round(0)

Now, each cell contains the average price for that cut and color combination. To indicate that we want the average price, we pass the price column to values and a mean function to aggfunc. Note that values and aggfunc must always be used together; otherwise, you will get an error. I also use round() to round the results.
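The rule that values and aggfunc go together can be checked on a small sample. The sketch below uses an invented DataFrame called toy and passes the string 'mean' instead of np.mean; I am assuming the two behave the same for this purpose.

```python
import pandas as pd

# Hypothetical toy sample (not the real diamonds data)
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Good', 'Good'],
    'color': ['D', 'D', 'E', 'E'],
    'price': [100, 300, 200, 400],
})

# values and aggfunc together: mean price per (cut, color) combination
mean_price = pd.crosstab(index=toy['cut'], columns=toy['color'],
                         values=toy['price'], aggfunc='mean')

# values without aggfunc: pandas refuses to guess how to aggregate
try:
    pd.crosstab(index=toy['cut'], columns=toy['color'], values=toy['price'])
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)
```

The first call returns a table of means, while the second one raises a ValueError instead of silently falling back to counts.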

Although it is a bit advanced, you get the most out of crosstab() tables when you pass them to Seaborn’s heatmap(). Let’s see the table above as a heat map:

cross = pd.crosstab(index=diamonds['cut'],
                    columns=diamonds['color'],
                    values=diamonds['price'],
                    aggfunc=np.mean).round(0)
sns.heatmap(cross, cmap='rocket_r', annot=True, fmt='g');

Seaborn can automatically convert a crosstab() table into a heat map. I set annot to True and displayed the heat map with a color bar. Seaborn also styles the column and index names (fmt='g' displays numbers as integers instead of scientific notation).

Heat maps are easier to interpret. You don’t want your end users to stare at a table full of numbers, so I’ll put each crosstab() result into a heat map where needed. To avoid repetition, I created a helper function:

def plot_heatmap(cross_table, fmt='g'):
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.heatmap(cross_table,
                annot=True,
                fmt=fmt,
                cmap='rocket_r',
                linewidths=.5,
                ax=ax)
    plt.show();

Pandas crosstab() compared with pivot_table() and groupby()

Before we move on to more interesting things, I think I need to clarify the difference between the three functions that calculate group summary statistics.

I covered the differences between pivot_table() and groupby() in the first part of this series. With crosstab() added, the differences among the three lie in the syntax and the shape of the result. Let’s compute the same statistic with all three:

# Using groupby()
>>> diamonds.groupby(['cut', 'color'])['price'].mean().round(0)

cut        color
Ideal      D        2629.0
           E        2598.0
           F        3375.0
           G        3721.0
           H        3889.0
           I        4452.0
           J        4918.0
Premium    D        3631.0
           E        3539.0
           F        4325.0
           G        4501.0
           H        5217.0
           I        5946.0
           J        6295.0
Very Good  D        3470.0
           E        3215.0
           F        3779.0
           G        3873.0
           H        4535.0
           I        5256.0
           J        5104.0
Good       D        3405.0
           E        3424.0
           F        3496.0
           G        4123.0
           H        4276.0
           I        5079.0
           J        4574.0
Fair       D        4291.0
           E        3682.0
           F        3827.0
           G        4239.0
           H        5136.0
           I        4685.0
           J        4976.0
Name: price, dtype: float64

# Using pivot_table()
diamonds.pivot_table(values='price',
                     index='cut',
                     columns='color',
                     aggfunc=np.mean).round(0)

# Using crosstab()
pd.crosstab(index=diamonds['cut'],
            columns=diamonds['color'],
            values=diamonds['price'],
            aggfunc=np.mean).round(0)

[Figure: output of pivot_table()]

[Figure: output of crosstab()]

I think you already know your favorite. groupby() returns a Series, while the other two return identical DataFrames. However, the groupby() Series can be converted to the same DataFrame, as shown below:

grouped = diamonds.groupby(['cut', 'color'])['price'].mean().round(0)
grouped.unstack()
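As a quick check that the three routes agree, here is a sketch on a small invented DataFrame named toy (not the diamonds data). It compares only the cell values and uses the string 'mean' as the aggregation, which is my choice for this sketch rather than something from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample in which every (cut, color) combination occurs
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Good', 'Good', 'Ideal', 'Good'],
    'color': ['D', 'E', 'D', 'E', 'D', 'E'],
    'price': [100, 200, 300, 400, 500, 600],
})

via_groupby = toy.groupby(['cut', 'color'])['price'].mean().unstack()
via_pivot = toy.pivot_table(values='price', index='cut',
                            columns='color', aggfunc='mean')
via_crosstab = pd.crosstab(index=toy['cut'], columns=toy['color'],
                           values=toy['price'], aggfunc='mean')

# Compare the underlying numbers of the three results
same_values = (np.allclose(via_groupby.to_numpy(), via_pivot.to_numpy())
               and np.allclose(via_pivot.to_numpy(), via_crosstab.to_numpy()))
print(same_values)
```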

If you are not familiar with the syntax of pivot_table() and unstack(), I strongly recommend that you read the first part of this series.

When it comes to speed, crosstab() is faster than pivot_table(), but both are much slower than groupby():

%%timeit
diamonds.pivot_table(values='price',
                     index='cut',
                     columns='color',
                     aggfunc=np.mean)
11.5 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
pd.crosstab(index=diamonds['cut'],
            columns=diamonds['color'],
            values=diamonds['price'],
            aggfunc=np.mean)
10.8 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
diamonds.groupby(['cut', 'color'])['price'].mean().unstack()
4.13 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As you can see, even with unstack() chained, groupby() is nearly three times faster than the other two. This means that if you only want to group and compute summary statistics, you should stick with groupby(). When I chain other methods, such as a simple round(), the speed difference becomes even greater.

The rest of the comparison is mainly about pivot_table() and crosstab(). As you saw, the two functions return results of the same shape. The first difference between them is that crosstab() can work with any array-like input.

It accepts any array-like object, such as lists, NumPy arrays, and pandas Series. pivot_table(), however, only works on DataFrames. In a helpful Stack Overflow answer, I found that if crosstab() is used on a DataFrame, it calls pivot_table() under the hood.
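To illustrate the point about array-like inputs, the following sketch passes a NumPy array and a plain Python list straight to crosstab(), with no DataFrame involved. The names shape and grade and their values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical labels held in a NumPy array and a plain Python list
shape = np.array(['round', 'round', 'oval', 'oval', 'round'])
grade = ['A', 'B', 'A', 'A', 'A']

# No DataFrame needed: crosstab() builds the frequency table directly
table = pd.crosstab(shape, grade)
print(table)
```

Because these inputs carry no names, pandas labels the axes generically; if you want meaningful names, pass named Series or use rownames and colnames.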

Next are the parameters. Some parameters exist in only one of the two functions. The first and most popular is crosstab()’s normalize. It accepts the following options (from the documentation):

  • If 'all' or True is passed, normalization is applied over all values.

  • If 'index' is passed, normalization is applied over each row.

  • If 'columns' is passed, normalization is applied over each column.

Let’s take a simple example:

cross = pd.crosstab(index=diamonds['cut'],
                    columns=diamonds['color'],
                    normalize='all')
plot_heatmap(cross, fmt='.2%')

If 'all' is passed, pandas calculates each cell as a share of the grand total:

# Verify that all values add up to about 1
>>> pd.crosstab(diamonds['cut'],
                diamonds['color'],
                normalize='all').values.sum()

1.0000000000000002

If you pass 'index' or 'columns', the same is done by row or by column, respectively:

cross = pd.crosstab(diamonds['cut'], 
                    diamonds['color'], 
                    normalize='index')
plot_heatmap(cross, fmt='.2%')

[Figure: heat map normalized by row]

cross = pd.crosstab(diamonds['cut'], diamonds['color'], normalize='columns')
plot_heatmap(cross, fmt='.2%')

[Figure: heat map normalized by column]
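The row-wise and column-wise options can be verified the same way as the 'all' case. This is a small sketch on invented Series (not the diamonds data) that checks each row sums to 1 under normalize='index' and each column sums to 1 under normalize='columns'.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample (not the real diamonds data)
cut = pd.Series(['Ideal', 'Ideal', 'Good', 'Ideal', 'Good', 'Good'], name='cut')
color = pd.Series(['D', 'E', 'D', 'D', 'E', 'E'], name='color')

by_row = pd.crosstab(cut, color, normalize='index')    # each row sums to 1
by_col = pd.crosstab(cut, color, normalize='columns')  # each column sums to 1

row_sums = by_row.sum(axis=1)
col_sums = by_col.sum(axis=0)
print(row_sums)
print(col_sums)
```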

In crosstab(), you can also use rownames and colnames to change the index and column names directly within the function, so you don’t have to do it manually afterwards. These two parameters are very useful when grouping by multiple variables at once, as you will see later.

The fill_value parameter exists only in pivot_table(). Sometimes, when you group by many variables, some combinations will inevitably be missing and show up as NaN. In pivot_table(), you can use fill_value to replace them with a custom value:

diamonds.pivot_table(index='color', 
                     columns='cut', 
                     fill_value=0)

However, if you use crosstab(), you can achieve the same effect by chaining fillna() on the resulting DataFrame:

pd.crosstab(diamonds['cut'], diamonds['color']).fillna(0)
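Here is a minimal sketch of the two approaches side by side, on an invented DataFrame toy in which one combination (Good/E) never occurs. I use a mean aggregation here because that is where an absent combination shows up as NaN.

```python
import pandas as pd

# Hypothetical toy sample with no Good/E row
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Good'],
    'color': ['D', 'E', 'D'],
    'price': [100, 200, 300],
})

# crosstab(): the missing combination is NaN until fillna() is chained
raw = pd.crosstab(toy['cut'], toy['color'],
                  values=toy['price'], aggfunc='mean')
filled = raw.fillna(0)

# pivot_table(): fill_value does the replacement inside the call
pivoted = toy.pivot_table(values='price', index='cut', columns='color',
                          aggfunc='mean', fill_value=0)
print(filled)
print(pivoted)
```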

Further customization of pandas crosstab()

Two other useful parameters of crosstab() are margins and margins_name (both also exist in pivot_table()). When margins is set to True, pandas computes the totals of each row and column. Let’s look at an example:

pd.crosstab(index=diamonds['cut'], 
            columns=diamonds['clarity'],  
            margins=True)

Pandas automatically adds a last row and a last column, named All by default. margins_name controls that name:

pd.crosstab(index=diamonds['cut'],
            columns=diamonds['clarity'],
            margins=True,
            margins_name='Total Number')

The lower right cell will always contain the total number of observations, or 1 if normalize is set to True:

pd.crosstab(index=diamonds['cut'],
            columns=diamonds['clarity'],
            margins=True,
            margins_name='Total Percentage',
            normalize=True)

Note that a heat map is of little use when margins is set to True.
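The claim about the lower right cell is easy to check on a small sample. The sketch below uses invented Series and the default margin name All; it is only an illustration, not output from the diamonds data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample (not the real diamonds data)
cut = pd.Series(['Ideal', 'Ideal', 'Good', 'Ideal', 'Good'], name='cut')
clarity = pd.Series(['SI1', 'VS1', 'SI1', 'SI1', 'VS1'], name='clarity')

# Counts with totals: the All/All cell holds the number of observations
with_totals = pd.crosstab(cut, clarity, margins=True)
total = with_totals.loc['All', 'All']

# Shares with totals: the All/All cell becomes 1
with_share = pd.crosstab(cut, clarity, margins=True, normalize=True)
share_total = with_share.loc['All', 'All']
print(total, share_total)
```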

Pandas crosstab() with multiple groups

For the index and columns parameters, you can pass multiple variables. The result will be a DataFrame with a MultiIndex. This time, let’s use all the categorical variables:

pd.crosstab(index=[diamonds['cut'], diamonds['clarity']],
            columns=diamonds['color'])

For index, I passed cut and clarity. Had I passed them to columns instead, the result would be a DataFrame with 40 columns. Notice that the levels of the MultiIndex are named cut and clarity, as expected. For a MultiIndex on the rows or columns, crosstab() has convenient parameters to change the level names:

pd.crosstab(index=[diamonds['cut'], diamonds['clarity']],
            columns=diamonds['color'], 
            rownames=['Diamond Cut', 'Clarity']).head()

Pass a list of names to rownames to change the index names. The process is the same for colnames, which controls the column names.
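For completeness, here is a sketch of the colnames side, with two variables on the columns. The DataFrame toy and the display names 'Color', 'Diamond Cut' and 'Clarity' are chosen for this illustration only.

```python
import pandas as pd

# Hypothetical toy sample (not the real diamonds data)
toy = pd.DataFrame({
    'cut':     ['Ideal', 'Ideal', 'Good', 'Good'],
    'clarity': ['SI1', 'VS1', 'SI1', 'VS1'],
    'color':   ['D', 'E', 'D', 'E'],
})

# Two variables on the columns give a two-level column MultiIndex;
# rownames and colnames rename the levels inside the same call
table = pd.crosstab(index=toy['color'],
                    columns=[toy['cut'], toy['clarity']],
                    rownames=['Color'],
                    colnames=['Diamond Cut', 'Clarity'])
print(table)
```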

One thing that surprised me is that pandas throws an error if you pass multiple functions to aggfunc. Again, people on Stack Overflow consider this a bug that has gone unresolved for more than 6 years.

Finally, note that both pivot_table() and crosstab() have a dropna parameter. If it is set to True, columns or rows whose entries are all NaN are dropped.

Original link: https://towardsdatascience.com/meet-the-hardest-functions-of-pandas-part-ii-f8029a2b0c9b
