Since the pandemic, more and more parents and students have come to accept online teaching, whether actively or passively. The online education industry has seen a wave of traffic growth and has accumulated ever more data. At the same time, more entrepreneurs are entering the field, and competition is becoming fierce. Using this data efficiently has become a key factor in whether a company can strap on a turbine engine and take off.
1、 Business background
The conversion funnel of online education generally runs: user registration, experience (trial) class, then paid class purchase. An experience class usually consists of 3–5 sessions, one session per day. Each session contains several modules with various forms of interactive questions. The questions are closed-ended, each has a correct answer, and every answer a student gives is saved.
2、 Data dimension and data structure
Division of data dimensions

User dimension
- Parent information: gender, age, registration time, geographic information, registration channel
- Child information: gender, age, and whether the family purchased any of the company's other courses before registration

User-course dimension
- Whether the user has finished a given lesson, module, or question
- Whether a given question was answered correctly

Temporal dimension
- The response time for each question (the time from the end of listening/reading to the final answer)
- Whether the first answer is correct, and likewise the second and third attempts (wrong answers can be retried)
3、 Data collection and collation
Designing the rectangular data fields
For ease of statistics, the most common rectangular (wide) data layout is used as the basis of analysis. Concretely, each user gets one row per experience class, and each column is a feature. Because an experience class may run for n days, with n varying by class, day-specific fields are disambiguated by appending a number. For example, d1s (day-one state) records the completion state of the first day's session: the field is 2 when the session is finished, 0 when it has not been started, and 1 when it has been started but not finished. The remaining days follow the same pattern. The resulting table has 40 columns.
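As an illustration, the wide layout can be assembled with pandas. The raw log below is a hypothetical example; only the d1s/d2s column naming and the 0/1/2 state encoding come from the scheme above.

```python
import pandas as pd

# Hypothetical raw completion log: one row per (user, day), with
# status 0 = not started, 1 = started but unfinished, 2 = finished.
log = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'day':     [1, 2, 1, 2],
    'status':  [2, 1, 2, 2],
})

# Pivot to the rectangular layout: one row per user, one d{n}s column
# per course day, missing days filled with 0 (not started).
wide = log.pivot(index='user_id', columns='day', values='status')
wide = wide.fillna(0).astype(int)
wide.columns = ['d{}s'.format(d) for d in wide.columns]
print(wide)
```

The same pivot generalizes to the other per-day fields; each new field family just adds another block of numbered columns.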
Data collection
Generally there are two ways to acquire the data: push and pull. Their trade-offs are not discussed here; interested readers can draw a profit-and-loss matrix and choose for themselves. We used the pull approach, running a script whenever business traffic is low. The script first pulls the target user group, then loops over it to extract each user's specific data. During extraction we avoided heavy database table joins and complex WHERE conditions, pushing the computation and logic into a local offline job instead. Thanks to Go's powerful GMP scheduler, pure CPU computation did not become a bottleneck.
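A minimal sketch of the pull pattern (shown in Python for consistency with the rest of the article; the fetch functions are hypothetical stand-ins for the real data-access layer, and the actual script was written in Go with the same shape):

```python
# Hypothetical stand-ins for the real data-access layer.
def fetch_target_users():
    """Pull the target user group with one cheap query."""
    return [101, 102, 103]

def fetch_user_rows(user_id):
    """Pull one user's raw records; no joins, no heavy WHERE clauses."""
    return [{'user_id': user_id, 'day': 1, 'status': 2}]

def collect():
    # Loop over users and do all derivation and aggregation locally,
    # keeping each database query as simple as possible.
    rows = []
    for uid in fetch_target_users():
        rows.extend(fetch_user_rows(uid))
    return rows

dataset = collect()
print(len(dataset))
```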
Data cleaning
- For each analysis, drop rows where the target dimension is missing or invalid. For example, when analyzing a child's age, drop rows where the age is missing, 0, or above 15. pandas' dropna handles the missing-value part easily:
df.dropna(subset=['column_a', 'column_b'])
- Drop outliers, again with pandas:
clean_df = df.loc[df['column_a'] > column_a_bound]
where column_a_bound can be derived from quartiles, the variance, the mean, and so on; pick a value appropriate to the data.
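For example, a bound from the interquartile range (the 1.5x multiplier is the usual Tukey rule of thumb, not something prescribed by the article; the column name is a made-up example):

```python
import pandas as pd

# Hypothetical response times in seconds; 120 s is an obvious outlier.
df = pd.DataFrame({'answer_seconds': [3, 4, 5, 4, 6, 5, 120]})

q1 = df['answer_seconds'].quantile(0.25)
q3 = df['answer_seconds'].quantile(0.75)
iqr = q3 - q1

# Keep only rows inside the Tukey fences [q1 - 1.5*iqr, q3 + 1.5*iqr].
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean_df = df.loc[(df['answer_seconds'] >= lower) & (df['answer_seconds'] <= upper)]
print(len(clean_df))
```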
Visual exploration
- Exploring single-dimension data
Gender is a typical single-dimension variable and is well suited to a pie chart.
Corresponding code:
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 5))
vc = df['gender'].value_counts()
y = [vc.values[0], vc.values[1]]
patches, l_text, p_text = plt.pie(y, labels=['female', 'male'],
                                  explode=(0.1, 0),
                                  colors=['pink', 'yellowgreen'],
                                  labeldistance=1.1,
                                  autopct='%3.1f%%',
                                  shadow=True,
                                  startangle=90,
                                  pctdistance=0.6)
for t in l_text:
    t.set_size(30)
for t in p_text:
    t.set_size(20)
# Keep the x and y axes at the same scale so the pie stays round
plt.axis('equal')
plt.legend()
plt.show()
Similarly, we can compare the gender split among all sampled users with the split among users who made a purchase. A grouped bar chart is well suited to showing several proportions side by side.
The code is as follows:
import numpy as np

df_gender_deal = df.loc[df['deal'] == 1]
gender_rate = df['gender'].value_counts()
gender_deal_rate = df_gender_deal['gender'].value_counts()
gender_dist = pd.DataFrame({'all': gender_rate, 'deal': gender_deal_rate})
# Divide by the respective population sizes, not the number of genders
gender_dist['all_rate'] = gender_dist['all'] / df.shape[0] * 100
gender_dist['deal_rate'] = gender_dist['deal'] / df_gender_deal.shape[0] * 100
x = np.arange(len(gender_dist.index))
bar_width = 0.35
plt.figure(figsize=(10, 5))
plt.bar(x, gender_dist['all_rate'], bar_width, align='center', color='c', label='sample proportion', alpha=0.5)
plt.bar(x + bar_width, gender_dist['deal_rate'], bar_width, align='center', color='b', label='deal proportion', alpha=0.5)
plt.xlabel("Gender")
plt.ylabel("Percentage")
plt.xticks(x + bar_width / 2, gender_dist.index)
plt.legend()
plt.show()
The comparison suggests that the share of male users is slightly higher among deal users than in the overall sample, but the increase is small, so gender appears to have no obvious effect on conversion.
- Joint exploration of multiple dimensions
A single dimension carries limited information, and some data must be observed along the time dimension. For example, to measure how many users stick with the class into the next day, we compare the share of users who show up on day one with the share who show up on day two.
Here the completion ratio is the share of attending users who finish all of that day's course content; it measures whether the content or the teaching flow has problems. The chart shows that the completion ratio on the first day is low, and the cause needs to be investigated.
A line chart suits this kind of trend, and overlaying several lines on the same chart makes the comparison more intuitive.
def course_brief(df):
    active_rate = []
    finish_rate = []
    finish_at_active = []
    for i in range(4):
        vc = df['d{}'.format(i + 1)].value_counts()
        n = sum(vc.values)
        active_rate.append((vc[1] + vc[2]) / n * 100)
        finish_rate.append(vc[2] / n * 100)
        finish_at_active.append(vc[2] / (vc[1] + vc[2]) * 100)
    plt.figure(figsize=(10, 5))
    dayx = ['Day_1', 'Day_2', 'Day_3', 'Day_4']
    plt.plot(dayx, active_rate, color='r', label='attendance rate', linestyle='solid', alpha=0.5)
    plt.plot(dayx, finish_rate, color='b', label='completion rate', alpha=0.5, linestyle='dotted')
    plt.plot(dayx, finish_at_active, color='g', label='completion ratio among attendees', alpha=0.5, linestyle='dashed')
    plt.legend()
    plt.show()

course_brief(df)
- Feature construction for dimensionality reduction
Some information is inherently multidimensional and visualizes poorly as-is, so we construct features to reduce the dimensionality. For example, we want to check whether the trend of the correct-answer rate on closed-ended questions is related to conversion; in other words, whether users who perform better in class, or those who perform worse, are more willing to buy.
Starting from the raw data, we abstract a "first correct answer rate" feature: for each session, count the questions answered correctly on the first attempt and divide by the number of questions answered. Computing this rate for every session gives the following data:
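A sketch of that aggregation, assuming a hypothetical long-format answer log with one row per (user, day, question, attempt); the column names are illustrative, not from the article:

```python
import pandas as pd

# Hypothetical answer log; 'correct' is 1/0 for each attempt.
answers = pd.DataFrame({
    'user_id':  [1, 1, 1, 1, 1],
    'day':      [1, 1, 1, 2, 2],
    'question': [1, 2, 2, 1, 2],
    'attempt':  [1, 1, 2, 1, 1],
    'correct':  [1, 0, 1, 1, 1],
})

# Keep only first attempts, then compute the per-user, per-day
# "first correct answer rate" = right first answers / questions answered.
first = answers.loc[answers['attempt'] == 1]
fr = first.groupby(['user_id', 'day'])['correct'].mean() * 100
print(fr)
```

Pivoting this result by day then yields the d1fr ... d5fr columns used below.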
To measure the trend of this rate, we construct two new columns, startFR and endFR, where startFR = (d1fr + d2fr + d3fr) / 3 and endFR = (d3fr + d4fr + d5fr) / 3. Both are easy to compute with pandas:
df['startFR'] = (df['d1fr'] + df['d2fr'] + df['d3fr']) / 3
df['endFR'] = (df['d3fr'] + df['d4fr'] + df['d5fr']) / 3
This reduces the 5-dimensional data to 2 dimensions, but we can go further.
# Positive when endFR > startFR, i.e. the rate is rising
df['upRate'] = (df['endFR'] - df['startFR']) * 100 / df['startFR']
The new feature, upRate, captures the trend of the first-correct rate: positive means the rate is rising, negative means it is falling, and the magnitude reflects the slope of the change; the larger the number, the more pronounced the change. Combined with other columns, the following scatter plot helps explore the data:
In the figure, the gray x markers are users who did not buy and the blue dots are users who did. The x-axis is upRate, the y-axis is the first-correct rate of the first session (av1), and the dotted lines mark the medians. Scatter plots are well suited to exploring distributions and spotting correlations.
df_deal = df.loc[df['deal'] == 1]
df_none = df.loc[df['deal'] == 0]
plt.scatter(df_deal['upRate'], df_deal['av1'], marker='o', color='blue', label='Deal')
plt.scatter(df_none['upRate'], df_none['av1'], marker='x', color='gray', label='No deal')
plt.axhline(y=df['av1'].median(), ls=':', c='black')    # horizontal line at the median of av1
plt.axvline(x=df['upRate'].median(), ls=':', c='green')  # vertical line at the median of upRate
plt.legend()
plt.show()
4、 Summary and questions
- Data visualization shows data trends more intuitively and helps mine the value in the data;
- Refined operations require collecting data from every business touchpoint; for example, the data above lacks anything that effectively reflects customers' purchasing power.
5、 Prospect
We can mine and use the data in the following directions to let its value play out.
Product side:
- Data visualization is the basis of refined operations: visualizing the data surfaces problems in product design, content, and operating strategy in time;
- Comparing visualized data provides decision-making input for strategy adjustments;
- Data visualization against a baseline can verify the effect of product changes;
Sales side:
- Machine-learning models built on the data can predict and identify high-potential customers and improve sales staff efficiency;
- Mining course data can supply conversation hooks for sales and ease the dialogue between users and sales;
User side:
- Data and algorithms can externalize the learning effect;
- Provide personalized difficulty curves, course content, and learning paths.