Exploration and visualization of shared bicycle data set

Time:2021-2-13

Analysis background

Analysis purpose

  • This article analyzes how the attributes of Mobike orders affect riding duration, as a reference for optimizing business operations. The original data set contains no order amounts, and Mobike charges by riding duration, so duration is the main factor determining the order amount; the analysis therefore focuses on riding duration.
  • It concentrates on the effect of three factors on riding duration: time of ride (weekdays / weekends, peak / off-peak hours), riding location, and user value.

Dataset summary

  • The original data set is a random sample, provided by Udacity, of data from one million users in the Shanghai urban area in August 2016. It contains 102361 order records, each with a starting point, destination, rental time, return time, user ID, vehicle ID, transaction number and route track.
  • After cleaning and information extraction, 22 new variables were added. The main variables used in the analysis are ttl_min (riding duration in minutes), distance, daytype, hour, ring_stage (inside inner ring / inside middle ring / inside outer ring / outside outer ring) and rate (high-value user / middle-value user / low-value user).
  • Before the analysis, a small number of records with abnormal riding speed, distance or duration were removed, leaving 102338 order records.
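
The source does not show how the extracted variables were built. As a rough sketch only, variables such as ttl_min, hour, daytype and hourtype could be derived from the rental and return timestamps as below. The column names start_time and end_time, the function add_time_features and the rush-hour windows are all assumptions made for illustration, not the definitions used in the cleaned data set.

```python
import pandas as pd

# Hypothetical rush-hour windows as (start hour inclusive, end hour exclusive).
# The source does not state the definition it used.
RUSH_WINDOWS = [(7, 10), (17, 20)]

def add_time_features(df, start_col="start_time", end_col="end_time"):
    """Return a copy of df with ttl_min, hour, daytype and hourtype added."""
    out = df.copy()
    start = pd.to_datetime(out[start_col])
    end = pd.to_datetime(out[end_col])
    # Riding duration in minutes
    out["ttl_min"] = (end - start).dt.total_seconds() / 60
    out["hour"] = start.dt.hour
    # dayofweek runs from Monday=0 to Sunday=6
    out["daytype"] = ["weekends" if d >= 5 else "weekdays" for d in start.dt.dayofweek]
    out["hourtype"] = [
        "rush hours" if any(lo <= h < hi for lo, hi in RUSH_WINDOWS) else "non-rush hours"
        for h in out["hour"]
    ]
    return out

# Three invented orders (a Monday, a Saturday and a Sunday in August 2016)
orders = pd.DataFrame({
    "start_time": ["2016-08-01 08:15", "2016-08-06 14:30", "2016-08-07 18:05"],
    "end_time":   ["2016-08-01 08:27", "2016-08-06 15:00", "2016-08-07 18:12"],
})
features = add_time_features(orders)
print(features[["ttl_min", "hour", "daytype", "hourtype"]])
```

The category labels match the ones used later in the analysis code, so the result could be converted to ordered categories in the same way.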

Analysis conclusion

User behavior summary

Data exploration shows that time of ride (weekdays / weekends, peak / off-peak hours), riding area and user value all affect riding duration. Holding these four variables fixed in turn shows that:

  • For a given riding location or user value, the effect of time of ride on average riding duration is clear: the average duration in peak hours and on weekends is higher than in off-peak hours and on weekdays;
  • In general, the higher the user value, the longer the riding duration, and the farther the riding area is from the city center, the longer the riding duration. However, the effect of user value is far smaller than that of riding area.

Beyond the effect of the categorical variables on riding duration, the analysis also found that:

  • Across riding locations and user value types, the data points for weekdays and for peak hours are distributed very similarly, which indicates that riding behavior on weekdays resembles riding behavior in peak hours;
  • High-value users are concentrated more in the inner ring.

Summary of optimization suggestions

Based on the user behavior analysis above, the following optimizations are suggested:

  • Because time of ride (weekdays / weekends, peak / off-peak hours) strongly affects riding duration, it can serve as a segmentation criterion for riding packages and marketing campaigns that raise order frequency and amount (for example, riding badges awarded for different time periods, free rides within a limited time window, or measures to meet peak-hour vehicle demand).
  • Users far from the urban center show high spending per ride (long riding duration) but low riding frequency (few high-value users). Riding packages tailored to riding location could raise the riding frequency of users in these areas and their reliance on Mobike.
  • Because user behavior on weekdays and in peak hours is similar, the two can be considered together when designing operational campaigns.

Other notes

  • The original data set contains too little information, so the scope of the analysis is limited. With a richer and broader data set, further analyses would be possible, including but not limited to:
  • Analyzing the conversion rate along the critical path from users' in-app click data, to judge whether bicycle supply and demand are balanced in a given area and so optimize bicycle numbers and dispatch efficiency
  • Analyzing the monthly and quarterly periodicity of bicycle use, as a reference for designing and optimizing marketing schemes
  • Analyzing the effect of promotions such as riding coupons, riding packages and recharge cashback on users' riding behavior, as a reference for refined user operations or marketing scheme design

Analysis process

The analysis focuses on the effect of four categorical variables on riding duration. First, the distributions of riding duration and riding distance are introduced. Then violin plots show that the two measures vary in a very similar way across the categorical variables, so only the key metric, riding duration, needs to be related to the four categorical variables. Finally, point plots show how riding duration varies with the remaining variables while one categorical variable is held fixed.

#Import all the necessary libraries and set the chart to display directly
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

#Clear the warning in the output
import warnings
warnings.simplefilter("ignore")

#Import cleaned data set
df_e = pd.read_csv('mobike_master.csv')

#Convert the corresponding column to a category variable
order_dict = {'ring_stage': ['inside inner ring', 'inside middle ring', 'inside outer ring', 'outside outer ring'],
              'rate': ['high-value user', 'middle-value user', 'low-value user'],
              'daytype': ['weekdays', 'weekends'],
              'hourtype': ['rush hours', 'non-rush hours']}
for var in order_dict:
    order = pd.api.types.CategoricalDtype(ordered = True, categories = order_dict[var])
    df_e[var] = df_e[var].astype(order)

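#Data cleaning: remove a small number of records with abnormal riding speed, riding time or riding distance
df_e['speed'] = df_e['distance'] / (df_e['ttl_min'] / 60)    # speed in km/h
#A record is dropped only if its speed is outside 12-20 km/h AND it is extremely long (> 720 min) or far (> 50 km); ~ negates the boolean mask
df_e = df_e[~(((df_e['speed'] < 12) | (df_e['speed'] > 20)) & ((df_e['ttl_min'] > 720) | (df_e['distance'] > 50)))]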
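
To make the combined condition in the cleaning step easier to read, here is a small sketch that applies the same thresholds to five invented rides. The rows and their labels are made up for illustration; only the thresholds come from the code above.

```python
import pandas as pd

# Five invented rides; the index labels describe each case
toy = pd.DataFrame(
    {
        "ttl_min":  [10, 20, 900, 30, 120],
        "distance": [2.5, 1.0, 3.0, 60.0, 32.0],
    },
    index=["typical", "slow but short", "very long and slow",
           "very far and fast", "long but plausible"],
)
toy["speed"] = toy["distance"] / (toy["ttl_min"] / 60)  # km/h

# Same two conditions as the cleaning step, named for readability
speed_out_of_range = (toy["speed"] < 12) | (toy["speed"] > 20)
extreme_trip = (toy["ttl_min"] > 720) | (toy["distance"] > 50)

kept = toy[~(speed_out_of_range & extreme_trip)]
removed = toy[speed_out_of_range & extreme_trip]
print(kept.index.tolist())
print(removed.index.tolist())
```

Note that a ride with an unusual speed is kept as long as it is neither extremely long nor extremely far, which is why only a small number of records are removed.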

Riding time distribution

Riding duration spans a very wide range, from a minimum of 1 minute to a maximum of 666 minutes, and has a long-tailed distribution: most rides are short. With a log-scaled x-axis, the distribution is right-skewed, with the peak between 7 and 10 minutes.

bins = 10 ** np.arange(0, np.log10(df_e.ttl_min.max()) + 0.15, 0.15)
plt.hist(data = df_e, x = 'ttl_min', bins = bins);
plt.xscale('log')
xticks = (1, 2, 5, 10, 20, 50, 100, 200, 500)
plt.xticks(xticks, xticks);
plt.xlabel('Riding Duration (min)');
plt.title('Distribution of Riding Duration');
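
The bin edges in the histogram code are equally spaced in log10 units, which is what makes the bars equally wide on a log-scaled axis. The sketch below illustrates this on synthetic durations; the log-normal sample is an assumption and is not the Mobike data.

```python
import numpy as np

# Synthetic, long-tailed durations in minutes (not the real data)
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=np.log(9), sigma=0.8, size=5000)
durations = np.clip(durations, 1, None)  # mimic the 1-minute minimum

step = 0.15  # bin width in log10 units, as in the histogram code
edges = 10 ** np.arange(0, np.log10(durations.max()) + step, step)

counts, _ = np.histogram(durations, bins=edges)

# Each edge is the previous one multiplied by the same factor, 10 ** step
ratios = edges[1:] / edges[:-1]
print(edges[:5].round(2))
print(ratios[:3].round(3))
```

Because the ratio between consecutive edges is constant, bins on the right cover far more minutes than bins on the left, which keeps the long tail from being squeezed into a few bars.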


Riding distance distribution

Riding distance also spans a wide range, from a minimum of 0.146 km to a maximum of 32.497 km, and likewise has a long-tailed distribution: the vast majority of rides are short and only a few are long. With a log-scaled x-axis, the distribution is right-skewed, with the peak between 0.7 km and 1.3 km.

bins = 10 ** np.arange(np.log10(df_e.distance.min()), np.log10(df_e.distance.max()) + 0.08, 0.08)
plt.hist(data = df_e, x = 'distance', bins = bins);
plt.xscale('log')
xticks = (0.1, 0.2, 0.5, 1, 2, 5, 10, 20)
plt.xticks(xticks, xticks);
plt.xlabel('Riding Distance (km)');
plt.title('Distribution of Riding Distance');


Relationship of riding duration and distance to other variables

  • Time of ride: the medians of riding duration and distance on weekends and in peak hours are higher than on weekdays and in off-peak hours (except that the median riding distance on weekends is slightly lower than on weekdays);
  • Riding area: outside the inner ring, the medians of riding duration and distance increase with distance from the city center, possibly because the distance between a user's starting point and destination grows toward the suburbs;
  • User value: the medians of riding duration and distance decrease as user value decreases.

The comparison shows that the two numerical variables, riding duration and riding distance, vary in almost the same way across the categories. Riding distance is not analyzed further for two reasons. First, riding duration is real data from the original data set, whereas riding distance is a straight-line distance estimated from the start and end points of each ride. Second, Mobike charges by riding duration. Given their highly similar behavior, riding duration, which has the higher data quality and business value, is chosen as the metric for the follow-up analysis.
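
The source does not say how the straight-line distance was estimated. One common choice, shown here purely as an assumption, is the haversine (great-circle) formula applied to the start and end coordinates. The function name haversine_km and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Two illustrative points roughly in central Shanghai (made-up coordinates)
start = (31.230, 121.470)
end = (31.240, 121.480)
print(round(haversine_km(*start, *end), 3))
```

Whatever formula was used, a distance of this kind ignores the actual route, so it is a lower bound on the distance really ridden, which is consistent with treating it as the lower-quality variable.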

#Because the data of riding time and distance show a very long tail, log transformation is carried out for the two data first, so as to observe the data characteristics more clearly
df_e['log_ttl_min'] = np.log10(df_e['ttl_min'])
df_e['log_distance'] = np.log10(df_e['distance'])
cat_vars = ['daytype', 'hourtype', 'ring_stage', 'rate']
fig, ax = plt.subplots(ncols = 4, nrows = 2, figsize = [20,10])
color = sb.color_palette()[0]
for i in range(len(cat_vars)):
    var = cat_vars[i]
    #Draw the first line
    sb.violinplot(data = df_e, x = var, y = 'log_ttl_min', ax = ax[0, i], color = color);
    ttl_min_ticks = [1, 2, 5, 10, 20, 50, 100, 200, 500]
    ax[0, i].set_yticks(np.log10(ttl_min_ticks));
    ax[0, i].set_yticklabels(ttl_min_ticks);
    ax[0, i].set_ylabel('Riding Duration (min)');
    if i == 2:
        xlabels = ax[0, i].get_xticklabels()
        ax[0, i].set_xticklabels(xlabels, rotation = 10);
    #Draw the second line
    sb.violinplot(data = df_e, x = var, y = 'log_distance', ax = ax[1, i], color = color);
    distance_ticks = [0.1, 0.2, 0.5, 1, 2, 5, 10, 20]
    ax[1, i].set_yticks(np.log10(distance_ticks));
    ax[1, i].set_yticklabels(distance_ticks);
    ax[1, i].set_ylabel('Riding Distance (km)');
    if i == 2:
        xlabels = ax[1, i].get_xticklabels()
        ax[1, i].set_xticklabels(xlabels, rotation = 10);
plt.suptitle('riding duration and distance by other features', fontsize = 'xx-large');
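
The median comparisons read off the violin plots can also be checked numerically with a grouped median. The sketch below uses synthetic data and a hypothetical helper median_table; with the real data, df_e and the four categorical variables would be passed instead.

```python
import numpy as np
import pandas as pd

def median_table(df, cat_var, value_cols=("ttl_min", "distance")):
    """Median of each value column within each level of cat_var."""
    return df.groupby(cat_var, observed=True)[list(value_cols)].median()

# Synthetic stand-in for the cleaned data set (values are random, not real)
rng = np.random.default_rng(1)
n = 2000
demo = pd.DataFrame({
    "daytype": rng.choice(["weekdays", "weekends"], size=n),
    "hourtype": rng.choice(["rush hours", "non-rush hours"], size=n),
    "ttl_min": rng.lognormal(mean=np.log(10), sigma=0.7, size=n),
    "distance": rng.lognormal(mean=np.log(1.0), sigma=0.6, size=n),
})

for var in ["daytype", "hourtype"]:
    print(median_table(demo, var).round(2))
```

Medians are used here rather than means because both variables are long-tailed, so a few very long rides would otherwise dominate the comparison.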


How riding duration varies with riding area and user value, for a given time of ride

  • In general, except within the inner ring, average riding duration increases with the distance of the starting point from the city center
  • Except on weekends and in off-peak hours outside the outer ring, high-value users have the highest average riding duration
  • In general, average riding duration in peak hours and on weekends is higher than in off-peak hours and on weekdays
  • The upper and lower plots in the first column show that average riding duration varies with user value and riding area in very similar ways on weekdays and in peak hours. This may be because office workers make up the majority of users on weekdays and in peak hours, and office workers have similar riding behavior

#User-defined function: draw point plots with one control variable held fixed
def ppltgrid(row_dict):
    firstplot = list(row_dict.keys())[0]    # key of the first subplot, whose y-axis range is reused by the others
    for var in row_dict:
        #Each key is 'rows,cols,index', as expected by plt.subplot
        a0,b0,c0 = var.split(',')
        a,b,c = int(a0), int(b0), int(c0)
        plt.subplot(a,b,c)
        flagid, flag, hue, x = row_dict[var]['flagid'], row_dict[var]['flag'], row_dict[var]['hue'], row_dict[var]['x']
        ax = sb.pointplot(data = df_e[df_e[flagid] == flag], x = x, y = 'log_ttl_min', hue = hue,
                          palette = 'Blues_r', linestyles = '', dodge = 0.1);
        ax.set_title("{}'s riding duration across {} and {}".format(flag, x, hue), fontsize = 'small');
        #Ticks are placed in log10 units and labelled in minutes
        ylocs = np.arange(1, 1.25, 0.025)
        ylabels = np.round(np.power(10, ylocs), 2)
        ax.set_yticks(ylocs);
        ax.set_yticklabels(ylabels);
        ax.set_yticklabels([], minor = True);    # hide the minor tick labels
        if c % b == 1:    # label the y-axis only on the first plot of each row, so labels do not cover neighboring plots
            ax.set_ylabel('Mean Riding Duration (min)');
        else:
            ax.set_ylabel('');
        if x == 'ring_stage' or x == 'rate':    # the category names of ring_stage and rate are long, so use a smaller font
            xlabels = ax.get_xticklabels()
            ax.set_xticklabels(xlabels, fontsize = 'small');
        if var == firstplot:
            ylim = ax.get_ylim()    # y-axis range of the first plot
        else:
            plt.ylim(ylim);    # keep the y-axis range of every later plot consistent with the first
plt.figure(figsize = [15, 10])
row_dict = {'2,2,1': {'flagid': 'daytype', 'flag': 'weekdays', 'hue': 'rate', 'x': 'ring_stage'},
            '2,2,2': {'flagid': 'daytype', 'flag': 'weekends', 'hue': 'rate', 'x': 'ring_stage'},
            '2,2,3': {'flagid': 'hourtype', 'flag': 'rush hours', 'hue': 'rate', 'x': 'ring_stage'},
            '2,2,4': {'flagid': 'hourtype', 'flag': 'non-rush hours', 'hue': 'rate', 'x': 'ring_stage'}}
ppltgrid(row_dict)
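
One detail worth keeping in mind when reading these point plots: the points are means of log_ttl_min, and the tick labels convert them back with a power of 10. A back-transformed mean of logarithms is the geometric mean, which for long-tailed data is lower than the ordinary arithmetic mean. A minimal sketch with four invented durations:

```python
import math
import statistics

durations = [5, 10, 20, 80]  # invented riding durations in minutes

# What the point plot shows: mean of log10 values, labelled as 10 ** mean
log_mean = statistics.mean([math.log10(d) for d in durations])
back_transformed = 10 ** log_mean

geometric = statistics.geometric_mean(durations)   # about 16.82
arithmetic = statistics.mean(durations)            # 28.75

print(round(back_transformed, 2), round(geometric, 2), arithmetic)
```

The comparisons between groups remain meaningful, but the minute values on the y-axis are best read as typical (geometric-mean) durations rather than as arithmetic averages.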


How riding duration varies with time of ride, for a given riding area and user value

  • The relative positions of the data points are highly similar across the plots, indicating a clear effect of time of ride on average riding duration: for the same riding location or user value, average riding duration in peak hours and on weekends is higher than in off-peak hours and on weekdays.
  • Comparing the first and second rows, the vertical spread of the data points as user value goes from high to low (second row) is much smaller than the spread across riding locations (first row), which indicates that user value affects average riding duration less than riding location does.
plt.figure(figsize = [20,10])
row_dict = {'2,4,1': {'flagid':'ring_stage', 'flag': 'inside inner ring', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,2': {'flagid':'ring_stage', 'flag': 'inside middle ring', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,3': {'flagid':'ring_stage', 'flag': 'inside outer ring', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,4': {'flagid':'ring_stage', 'flag': 'outside outer ring', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,5': {'flagid':'rate', 'flag': 'high-value user', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,6': {'flagid':'rate', 'flag': 'middle-value user', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,7': {'flagid':'rate', 'flag': 'low-value user', 'hue': 'daytype', 'x': 'hourtype'}}
ppltgrid(row_dict)
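
The visual comparison of vertical spread can be turned into a number: the range of the group means of log_ttl_min across the levels of each variable. The sketch below runs on synthetic data in which the effect sizes (RING_EFFECT and RATE_EFFECT) are assumptions chosen only to mimic the pattern described above; the helper mean_range is hypothetical and could be applied to df_e in the same way.

```python
import numpy as np
import pandas as pd

# Assumed shifts in log10(minutes) per category, for illustration only
RING_EFFECT = {"inside inner ring": 0.00, "inside middle ring": 0.02,
               "inside outer ring": 0.06, "outside outer ring": 0.12}
RATE_EFFECT = {"high-value user": 0.02, "middle-value user": 0.01,
               "low-value user": 0.00}

rng = np.random.default_rng(2)
n = 4000
demo = pd.DataFrame({
    "ring_stage": rng.choice(list(RING_EFFECT), size=n),
    "rate": rng.choice(list(RATE_EFFECT), size=n),
})
demo["log_ttl_min"] = (1.05
                       + demo["ring_stage"].map(RING_EFFECT)
                       + demo["rate"].map(RATE_EFFECT)
                       + rng.normal(0, 0.1, size=n))

def mean_range(df, cat_var, value="log_ttl_min"):
    """Largest minus smallest group mean of value across levels of cat_var."""
    means = df.groupby(cat_var)[value].mean()
    return means.max() - means.min()

spread = {var: mean_range(demo, var) for var in ["ring_stage", "rate"]}
print({k: round(v, 3) for k, v in spread.items()})
```

A larger range for ring_stage than for rate corresponds to the larger vertical spread seen in the first row of plots.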


Code submitted to GitHub