Reshaping the dataframe with pandas melt()

Time:2022-6-12

Reshaping dataframe is an important and indispensable skill in data science. In this article, we will explore pandas melt() and how to use it for data processing.

The simplest melt

The simplest melt () does not require any parameters. It turns all columns into rows (displayed as column variables) and lists all associated values in the new column values.

df_wide.melt()

Reshaping the dataframe with pandas melt()

This output usually doesn’t make much sense, so the general use case at least specifies the ID_ Vars parameter. For example, ID_ Vars =’country’tells pandas to keep country as one column and convert all other columns to rows.

df_wide.melt(
    id_vars='Country',
)

Reshaping the dataframe with pandas melt()

The number of rows is now 15, because each value in the country column has five values (3 x 5 = 15).

Show custom name

Variables and values are column names. We can use VaR_ Name and value_ The name parameter specifies a custom name:

df_wide.melt(
    id_vars='Country',
    var_name='Date',
    value_name='Cases'
)

Reshaping the dataframe with pandas melt()

Specify multiple IDS

One of the most useful features of melt () is that we can specify multiple IDs to keep them as columns. For example, if we want to keep country, lat, and long as columns for better reference:

df_wide.melt(
    id_vars=['Country', 'Lat', 'Long'],
    var_name='Date',
    value_name='Cases'
)

Reshaping the dataframe with pandas melt()

Specify the column for melt

Pandas’ melt() function converts all other columns (except those specified in id_vars) to rows by default. In the actual project, we may only care about some columns. For example, if we only want to view the values on “24/01/2020” and “25/01/2020”:

df_wide.melt(
    id_vars=['Country', 'Lat', 'Long'],
    value_vars=["24/01//2020", "25/01//2020"],
    var_name='Date',
    value_name='Cases'
)

Reshaping the dataframe with pandas melt()

Pandas melt

We can also call melt () directly from the pandas module rather than from the dataframe. But these are the same.

Reshaping the dataframe with pandas melt()

Reshaping covid-19 time series data

With what we have learned so far, let’s take a look at a real-world problem: covid-19 time series data provided by CSSE GitHub of Johns Hopkins University.

Reshaping the dataframe with pandas melt()

There are two problems:

  • Confirmation, death and recovery are saved in different CSV files. It is not easy to draw them in a single graph.
  • The dates are displayed as column names, which make it difficult to perform daily calculations, such as calculating new cases, new deaths and new rehabilitation per day.

Let’s reshape three datasets and merge them into one dataframe.

Read dataset

confirmed_df = pd
    .read_csv('time_series_covid19_confirmed_global.csv')
deaths_df = pd
    .read_csv('time_series_covid19_deaths_global.csv')
recovered_df = pd
    .read_csv('time_series_covid19_recovered_global.csv')

Transform them from wide format to long format

By running confirmed_ df. columns、deaths_ df. Columns and recovered_ df. Columns, which should output the same results as below:

Reshaping the dataframe with pandas melt()

Note that the columns are all dates starting from column 4, and get the confirmed date list df columns [4:]

Reshaping the dataframe with pandas melt()

Before merging, we need to use melt() to reverse perspective the dataframes from the current wide format to the long format. In other words, we convert all date columns to values. Use province / state, country / region, latitude, longitude as identifier variables. We will merge them later.

confirmed_df_long = confirmed_df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Confirmed'
)
deaths_df_long = deaths_df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Deaths'
)
recovered_df_long = recovered_df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Recovered'
)

All results are in the new long format. All of these are sorted by date and country because the original data has been sorted by country and the date columns have been sorted in ASC order.

This is confirmed_ df_ Long example

Reshaping the dataframe with pandas melt()

Finally, we use merge() to merge the three dataframes one by one:

full_table = confirmed_df_long.merge(
  right=deaths_df_long, 
  how='left',
  on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)

full_table = full_table.merge(
  right=recovered_df_long, 
  how='left',
  on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)

Now we have a complete table with confirmed, deaths, and recovered columns:

Reshaping the dataframe with pandas melt()

summary

In this article, we introduced five use cases and a practical example, which used pandas’ melt () method to reshape the dataframe from a wide format to a long format. It is very convenient and is one of the most popular methods in data preprocessing and exploratory data analysis.

Reshaping data is an important and indispensable skill in data science. I hope you enjoy this article and learn something new and useful.

Article code:https://github.com/BindiChen/…

Author: B. Chen