Reshaping dataframe is an important and indispensable skill in data science. In this article, we will explore pandas melt() and how to use it for data processing.
The simplest melt
The simplest melt () does not require any parameters. It turns all columns into rows (displayed as column variables) and lists all associated values in the new column values.
This output usually doesn’t make much sense, so the general use case at least specifies the ID_ Vars parameter. For example, ID_ Vars =’country’tells pandas to keep country as one column and convert all other columns to rows.
df_wide.melt( id_vars='Country', )
The number of rows is now 15, because each value in the country column has five values (3 x 5 = 15).
Show custom name
Variables and values are column names. We can use VaR_ Name and value_ The name parameter specifies a custom name:
df_wide.melt( id_vars='Country', var_name='Date', value_name='Cases' )
Specify multiple IDS
One of the most useful features of melt () is that we can specify multiple IDs to keep them as columns. For example, if we want to keep country, lat, and long as columns for better reference:
df_wide.melt( id_vars=['Country', 'Lat', 'Long'], var_name='Date', value_name='Cases' )
Specify the column for melt
Pandas’ melt() function converts all other columns (except those specified in id_vars) to rows by default. In the actual project, we may only care about some columns. For example, if we only want to view the values on “24/01/2020” and “25/01/2020”:
df_wide.melt( id_vars=['Country', 'Lat', 'Long'], value_vars=["24/01//2020", "25/01//2020"], var_name='Date', value_name='Cases' )
We can also call melt () directly from the pandas module rather than from the dataframe. But these are the same.
Reshaping covid-19 time series data
With what we have learned so far, let’s take a look at a real-world problem: covid-19 time series data provided by CSSE GitHub of Johns Hopkins University.
There are two problems:
- Confirmation, death and recovery are saved in different CSV files. It is not easy to draw them in a single graph.
- The dates are displayed as column names, which make it difficult to perform daily calculations, such as calculating new cases, new deaths and new rehabilitation per day.
Let’s reshape three datasets and merge them into one dataframe.
confirmed_df = pd .read_csv('time_series_covid19_confirmed_global.csv') deaths_df = pd .read_csv('time_series_covid19_deaths_global.csv') recovered_df = pd .read_csv('time_series_covid19_recovered_global.csv')
Transform them from wide format to long format
By running confirmed_ df. columns、deaths_ df. Columns and recovered_ df. Columns, which should output the same results as below:
Note that the columns are all dates starting from column 4, and get the confirmed date list df columns [4:]
Before merging, we need to use melt() to reverse perspective the dataframes from the current wide format to the long format. In other words, we convert all date columns to values. Use province / state, country / region, latitude, longitude as identifier variables. We will merge them later.
confirmed_df_long = confirmed_df.melt( id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], value_vars=dates, var_name='Date', value_name='Confirmed' ) deaths_df_long = deaths_df.melt( id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], value_vars=dates, var_name='Date', value_name='Deaths' ) recovered_df_long = recovered_df.melt( id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], value_vars=dates, var_name='Date', value_name='Recovered' )
All results are in the new long format. All of these are sorted by date and country because the original data has been sorted by country and the date columns have been sorted in ASC order.
This is confirmed_ df_ Long example
Finally, we use merge() to merge the three dataframes one by one:
full_table = confirmed_df_long.merge( right=deaths_df_long, how='left', on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long'] ) full_table = full_table.merge( right=recovered_df_long, how='left', on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long'] )
Now we have a complete table with confirmed, deaths, and recovered columns:
In this article, we introduced five use cases and a practical example, which used pandas’ melt () method to reshape the dataframe from a wide format to a long format. It is very convenient and is one of the most popular methods in data preprocessing and exploratory data analysis.
Reshaping data is an important and indispensable skill in data science. I hope you enjoy this article and learn something new and useful.
Author: B. Chen