Implementation and Practice of Real-Time Computing in VIPKID's Online Education Scenario


Introduction to VIPKID

VIPKID is an online English education platform for children. In the seven years since it was founded, the company has held to its mission of empowering education and enlightening the future. It focuses on one-to-one online teaching, employs only North American teachers, and has students in 63 countries and regions.

To date, the number of paying students has exceeded 700,000, the number of one-to-one lessons per day has exceeded 100,000, and the peak number of concurrent lessons has reached 35,000. The company operates 5 dedicated cross-border lines covering 35 countries, has deployed data center transmission nodes in 55 cities across 16 countries, and can switch lines intelligently within one minute based on real-time network conditions [1].

Core business scenarios

Introduction to the main scenario

In the one-to-one (one teacher, one student) teaching mode, the teacher teaches through a live broadcast with courseware as a teaching aid. Interaction includes not only audio and video but also a chat room and writing, drawing, and dragging on the courseware. A whole lesson involves multiple component modules.

The modules depend on one another to provide the service, and an event in any link must be visible to, and synchronized between, teacher and student. For example, the teacher can see the student in the classroom before class starts, the student can hear the teacher speak and see the teacher turn the courseware pages, and the class proceeds normally until it ends.

In large-scale online teaching, real-time interactive live streaming and real-time message transmission depend heavily on user devices and networks. The data volume is large, cross-border transmission is especially difficult, and the requirements for network stability are very strict.

Compared with large live classes, 1v1 teaching places more emphasis on interaction, so tolerance for problems is very low, and a problem on either side affects the class experience. In one such scenario, when the network or another component misbehaves, the user clicks the “help” button. The supervisor (hereinafter “FM”, short for fireman) must then intervene immediately, which places heavy demands on the size of the service team and on how quickly it can respond.

Current business pain points

At present, help requests are handled only manually. The number of daily help requests is large (about 10% of all lessons), each supervisor monitors many lessons, and several steps lie between receiving a request and the supervisor’s intervention. This leads to the following problems:

  1. If a problem is not handled in time, users are left waiting and the class is blocked, which makes for a poor user experience;
  2. Manual processing is inefficient, so growth in class volume and large-scale incidents force the FM team to grow and require more manpower;
  3. If users run into problems but do not contact the supervisor, the problems stay hidden.

Technology implementation

To address these pain points, we extracted and sorted out the business characteristics of each link and designed a scheme that produces business tags through real-time computing and uses the tag data to monitor courses automatically and handle user help requests. The rest of this article covers the technical implementation of the scheme: the construction of the data system, the construction of the automation business system, the core problems and optimizations, and the final benefits:

  1. Data system construction: introduces the VLink data platform that supports all of the real-time computing, together with the business data collection and business tag calculation for this scenario, which underpin the business implementation;
  2. Automation business system: introduces how real-time data streams are applied to solve the current business pain points;
  3. Problems and optimization: introduces the business and technical problems encountered during implementation and their solutions;
  4. Benefits: introduces the results finally achieved.

Data system construction

The data system was built to answer several questions: where the data comes from, what its business logic is, how to compute it, and how to manage it in a unified way so that it can serve more scenarios and solve more business problems.

  1. VLink data platform: introduces the one-stop data platform, which provides data access details:

    a. Data sources;

    b. The business meaning of the data;

    c. Data management rules, which make development and access more efficient and resolve unclear upstream and downstream relationships;

  2. Business data collection: introduces the business data collection in the current scenario;
  3. Business data calculation: introduces how Flink is used to compute business data with complex logic.

■ VLink data platform

The VLink data platform grew out of reflection on problems we met while developing Flink streaming jobs. Borrowing from the release process used for server-side development, it is designed around the developer, with the goals of improving development efficiency and reducing maintenance cost. It also supports data collection management, tracking-point access management, tracking-point test integration, and other functions.

  • Main function points

1. Run the job interactively

Apart from Flink SQL, the usual way to submit a streaming job in the industry is the officially provided one of uploading a jar package: package, wait, upload, wait, run. Working with the operations team, we provide a one-click package-and-deploy function, and AutoRun can be set so that the job runs automatically once deployment succeeds.

2. Batch operation:

Some scenarios require restarting some or all jobs. When there are many jobs, this is time-consuming, laborious, and error-prone. With batch build, stop, and run, the following become very easy:

  1. Updating the logic of a class of jobs;
  2. Upgrading third-party dependency libraries;
  3. Upgrading the cluster.

3. SP function: create savepoints and run jobs from them interactively.

4. Data lineage diagram: shows the upstream and downstream relationships of the data, from tracking points to final output.

The input and output of processors P1, P2 and P3 can be clearly seen from the figure.

5. Other functions:

  1. Version control;
  2. Support for interactive development of Flink SQL jobs (Kafka only);
  3. Data schema query.

  • Development constraints

While developing Flink jobs, we found that the core logic lives in the functions of the pipeline, while a large amount of logic is repeated from job to job, such as job context configuration, adding sources, and setting watermarks. We therefore extracted the logic of each layer, encapsulated it into components, and set some development constraints so that developers can focus only on the core logic.

1. Provide AbstractJobModel to unify the schema of input data:

    private[garlic] trait AbstractJobModel extends Serializable {
      def tm: Long           // event time
      def ingestion: Long    // ingestion time
      def f: Boolean         // marks useless data to be filtered out
      def unnatural: Boolean // marks future ("supernatural") data to be filtered out
    }

Unnatural: because the system clocks at the different ends are not synchronized, some data carries a timestamp later than the current time. We call this “supernatural” data, and it needs special attention when processing with event-time semantics.
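To make the four fields of the model concrete, here is a minimal sketch of an event type that follows this contract. It is an illustration under stated assumptions, not VLink code: JobModel is a stand-in trait redeclared without the package-private modifier so that the snippet compiles on its own, the second field is named ingestion on the assumption that it holds the ingestion time, and ClassroomEvent, studentId, and the 5-second skew tolerance are hypothetical.

```scala
object UnnaturalDataSketch {
  // Stand-in for AbstractJobModel, redeclared so the snippet is self-contained.
  trait JobModel extends Serializable {
    def tm: Long            // event time (epoch millis)
    def ingestion: Long     // ingestion time (epoch millis), assumed field name
    def f: Boolean          // true if the record is useless and should be filtered
    def unnatural: Boolean  // true if the event time lies in the future
  }

  // Hypothetical tolerance for clock skew between clients and servers.
  val MaxSkewMs: Long = 5000L

  // Hypothetical event type; studentId is an assumed field.
  final case class ClassroomEvent(studentId: String, tm: Long, ingestion: Long)
      extends JobModel {
    // A record without a student id carries no usable information.
    def f: Boolean = studentId.isEmpty
    // The ingestion time serves as the "current time" the event time is compared with.
    def unnatural: Boolean = tm > ingestion + MaxSkewMs
  }
}
```

Flagging such records on the model itself lets every job route them away before they reach an event-time window, where a timestamp far in the future could otherwise push the watermark ahead too early.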

2. Provide a unified and flexible way to initialize Kafka sources:

    // Initialize with a start timestamp for consumption
    def initSourceWithTm[T](deserializer: AbstractDeserializationSchema[T], topics: Array[String], tm: Long): SourceFunction[T]

    // Initialize with a start timestamp for consumption and the Kafka servers
    def initSourceWithServerAndTm[T](deserializer: AbstractDeserializationSchema[T], topics: Array[String], servers: String, tm: Long): SourceFunction[T]

    // General initialization method
    def initSource[T](implicit deserializer: AbstractDeserializationSchema[T], topics: Array[String], servers: String, tm: Long = 0L): SourceFunction[T]

3. Multiple sink functions:

  1. sinkFilteredDataToKafka: data that is non-compliant or raises exceptions and is filtered out;
  2. sinkUnnaturalDataToKafka: “supernatural” data;
  3. sinkLateDataToKafka: out-of-order data that arrives late and is discarded by the window function;
  4. sinkDataInAndProcessToKafka: the ingestion time and processing time of each record.
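The first three sinks listed above amount to classifying each record before it reaches the core logic. The sketch below shows one way such a classification could look in plain Scala. Record, Route, and route are hypothetical names, and the snippet only returns a label instead of writing to Kafka. The lateness check follows the rule Flink applies to windows with zero allowed lateness, which is an assumption about how the job is configured.

```scala
object SinkRoutingSketch {
  // Hypothetical record: event time, ingestion time, and a validity flag.
  final case class Record(tm: Long, ingestion: Long, valid: Boolean)

  sealed trait Route
  case object Filtered extends Route   // non-compliant data or parsing exceptions
  case object Unnatural extends Route  // event time ahead of ingestion time
  case object Late extends Route       // its window has already been closed
  case object Normal extends Route     // continues to the core logic

  // windowEnd: exclusive end of the window the record belongs to.
  // watermark: current event-time progress of the job.
  def route(r: Record, windowEnd: Long, watermark: Long): Route =
    if (!r.valid) Filtered
    else if (r.tm > r.ingestion) Unnatural
    else if (windowEnd - 1 <= watermark) Late
    else Normal
}
```

Keeping the side outputs separate makes it possible to answer later which records were dropped and why, which is what the accuracy requirements in this scenario call for.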

4. Support for common third-party connectors:

  1. Kafka
  2. HBase
  3. Elasticsearch (ES)
  4. JDBC

■ Business data collection

Data collection is the foundation of the whole data processing architecture, and its timeliness and accuracy directly affect the business built on top of it. Data is collected either indirectly, through uploaded files, or directly, through tracking points reported over HTTP.

Event tracking points cover the mobile, PC, and server sides. The key event is entering the classroom:

  1. Entering the classroom: after loading the SDK, the client requests the service and the gateway and then initializes the service components, such as streaming media, the message channel, and dynamic courseware. When all components are normal, the user has entered the classroom successfully; otherwise the retry logic continues until entry either succeeds or fails;
  2. After entering the classroom: while the lesson proceeds normally, the service components keep providing services and report data in real time.
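The entry flow in step 1 can be sketched as a bounded retry loop. This is a minimal illustration rather than the client's actual code: EnterClassroomSketch, initComponents, and the maxAttempts limit are assumptions introduced here, since the article does not say how many retries are made.

```scala
object EnterClassroomSketch {
  sealed trait Result
  case object Entered extends Result
  final case class Failed(component: String) extends Result

  // initComponents performs one initialization attempt and returns the name of
  // the first failing component, or None when every component is normal.
  @scala.annotation.tailrec
  def enter(initComponents: () => Option[String], maxAttempts: Int): Result =
    initComponents() match {
      case None                        => Entered
      case Some(c) if maxAttempts <= 1 => Failed(c)
      case Some(_)                     => enter(initComponents, maxAttempts - 1)
    }
}
```

Each attempt, whether it succeeds or fails, is the natural place to report a classroom entry record, which is where the entry tags described below come from.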

Overall, the tracking points carry both problem tags and normal tags. By classroom entry event and component type, they are divided from coarse to fine into level 1, level 2, and level 3.

  1. Classroom entry tags: a user has zero or more classroom entry records, covering failure to enter because a component failed to initialize, entry that took too long, and successful entry.
  2. Streaming media tags: mainly audio and video stutter, being unable to hear or see the other party, and normal audio and video data, reported at hundred-millisecond granularity.
  3. Dynamic courseware tags: mainly courseware loading failures, courseware actions out of sync, and being unable to draw.

■ Business data calculation

This calculation has demanding real-time requirements. For technology selection, Flink is the main choice [2], and Spark is used mainly for day-level offline data analysis.

Tag calculation is the key to the whole automated processing: how fast the indicators are computed determines how fast the system can respond. The data comes from multiple business streams. In the current business scenario, the typical computing patterns are as follows:

a. Multi-stream union based on event time

    // Source stream with event-time timestamps and watermarks assigned
    val stream = env.addSource(singleSource).name("signal")
      .assignTimestampsAndWatermarks(new DummyEventTimePunctuWaterMarks[InlineInputEventForm](6 * 1000))
      .filter(m => *** ).name("***")

    // Key the filtered stream and aggregate it in 30-second tumbling event-time windows
    val ***Stream = stream
      .filter(f => *** )
      .keyBy(key => *** )
      .window(TumblingEventTimeWindows.of(Time.milliseconds(30 * 1000L)))
      .apply( ***WindowFunction)

    // Write the windowed result and the auxiliary records to Kafka
    sink***ToKafka(***Stream, ***name, recordFilter60s, ***kafkaSink, recordTmKafkaSink)

Note: *** marks business details that have been masked (the same below).
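To show what a 30-second tumbling event-time window does to the data, here is a small stand-alone sketch in plain Scala, without Flink. Event, windowStart, and countPerWindow are hypothetical names, the count stands in for the masked window function, and watermarks and late data are ignored.

```scala
object TumblingWindowSketch {
  // Hypothetical event: a key (for example a classroom id) and an event time in millis.
  final case class Event(key: String, tm: Long)

  // Start of the tumbling window of length sizeMs that contains tm
  // (assumes tm >= 0 and a zero window offset).
  def windowStart(tm: Long, sizeMs: Long): Long = tm - (tm % sizeMs)

  // Count events per (key, window start), as a keyed tumbling window would.
  def countPerWindow(events: Seq[Event], sizeMs: Long): Map[(String, Long), Int] =
    events
      .groupBy(e => (e.key, windowStart(e.tm, sizeMs)))
      .map { case (k, es) => k -> es.size }
}
```

Because every event falls into exactly one window, each key produces at most one result per 30 seconds, which bounds how stale a tag can be.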

b. Multi-stream join

    // *** part of the logic is omitted ***

    val ppt***JoinStream = ***Stream
      .where(lb => ***)
      .equalTo(lb => ***)
      // 30-second sliding event-time windows that advance every 15 seconds
      .window(SlidingEventTimeWindows.of(Time.milliseconds(30000), Time.milliseconds(15000)))

    sink***StreamToKafka(ppt***JoinStream, ***name, recordFilter60s, ***kafkaSink, recordTmKafkaSink)

In the current Flink version (1.7.2 and above), the coGroup operator does not support emitting late data; a related JIRA issue [3] has been submitted to the community.
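The effect of joining two streams over sliding windows of 30 seconds that advance every 15 seconds can be illustrated without Flink. In the sketch below, Event, windowStarts, and join are hypothetical names, and the payload strings are placeholders. The window assignment follows the usual rule that an event belongs to every window whose start lies within one window length before its timestamp.

```scala
object SlidingJoinSketch {
  // Hypothetical event: join key, event time in millis, and a payload.
  final case class Event(key: String, tm: Long, payload: String)

  // Starts of all sliding windows (length sizeMs, step slideMs) that contain tm
  // (assumes tm >= 0 and a zero window offset).
  def windowStarts(tm: Long, sizeMs: Long, slideMs: Long): List[Long] = {
    val lastStart = tm - (tm % slideMs)
    Iterator.iterate(lastStart)(_ - slideMs).takeWhile(_ > tm - sizeMs).toList
  }

  // Pair events from both sides that share a key and fall into the same window.
  // Each result is (window start, left payload, right payload).
  def join(left: Seq[Event], right: Seq[Event],
           sizeMs: Long, slideMs: Long): Seq[(Long, String, String)] =
    for {
      l <- left
      w <- windowStarts(l.tm, sizeMs, slideMs)
      r <- right
      if r.key == l.key && windowStarts(r.tm, sizeMs, slideMs).contains(w)
    } yield (w, l.payload, r.payload)
}
```

With these parameters each event belongs to two windows, so two events that are close together in time are paired once per shared window. Downstream logic therefore has to tolerate or deduplicate repeated pairs.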

c. Loading dimension data asynchronously



In addition, when computing with dimension data, hot data is cached with Guava's CacheBuilder [4], with expiry set according to how long the data stays valid.
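The combination of asynchronous loading and a time-limited cache can be sketched as follows. This is not the job's actual code: it uses Scala futures and a hand-rolled TTL map instead of Guava so that the snippet has no external dependencies, and CourseDim, CachedLoader, load, ttlMs, and now are all hypothetical names. In a Flink job, such a loader would typically be called from the Async I/O operator, which the sketch leaves out.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

object AsyncDimensionSketch {
  // Hypothetical dimension record for a course.
  final case class CourseDim(courseId: String, teacherId: String)

  // Cache with a time-to-live in front of an asynchronous loader.
  // load stands in for a non-blocking database or service call;
  // now supplies the current time so that expiry can be tested.
  final class CachedLoader(load: String => Future[CourseDim], ttlMs: Long, now: () => Long)
                          (implicit ec: ExecutionContext) {
    // course id -> (dimension record, expiry time in millis)
    private val cache = TrieMap.empty[String, (CourseDim, Long)]

    def get(courseId: String): Future[CourseDim] =
      cache.get(courseId) match {
        case Some((dim, expiresAt)) if expiresAt > now() =>
          Future.successful(dim)        // fresh hit: no remote call
        case _ =>
          load(courseId).map { dim =>   // miss or expired entry: load asynchronously
            cache.put(courseId, (dim, now() + ttlMs))
            dim
          }
      }
  }
}
```

The time-to-live is the knob that trades database load against staleness: dimension data that rarely changes during a lesson can be cached for longer than data that changes from minute to minute.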

Automation business system

After sorting out the problems that occur at the key points of a lesson, we proposed a business solution: add a layer of real-time automated service after the user asks for help and before FM intervenes.

Technically, the automation business system is built on top of the data system. The data system produces real-time tag data during class, and the tag system then applies this tag data stream, through pre-inspection, self-inspection, and other means, to handle problems automatically or semi-automatically. Problems that the system cannot handle are handled manually.

First, problems in a lesson are reported in two ways:

  1. Passively, when a user initiates a help request;
  2. Actively, when the problem tag stream detects an issue.

The verification module then filters out invalid cases, such as invalid help requests, repeated requests, expired requests, requests where FM has already intervened, and special problems. Problems that the tag system cannot cover (such as noise) are transferred directly to FM for manual handling.

If a request passes the verification module and the system can handle it automatically, the self-inspection system attempts to switch lines, verifies the switch, and puts the switch record into the pending queue. During the pending verification phase, the system reads feedback from the normal tag stream in real time to check whether the class has returned to normal.
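The flow from verification to pending confirmation can be summarized in a small decision sketch. Everything named here is an assumption made for illustration: HelpRequest and its fields, the three decisions, the set of covered problems, and the timeouts. In particular, the split between requests that are dropped and requests that go to FM is one plausible reading of the verification rules, not a documented one.

```scala
object HelpFlowSketch {
  // Hypothetical help request; the field names are illustrative assumptions.
  final case class HelpRequest(classId: String, problem: String, createdAt: Long,
                               fmAlreadyInvolved: Boolean)

  sealed trait Decision
  case object Drop extends Decision        // repeated or expired request
  case object ToFm extends Decision        // must be handled manually by FM
  case object AutoHandle extends Decision  // system acts, then waits in the pending queue

  // Problems the tag system is assumed to cover; anything else (e.g. noise) goes to FM.
  val coveredProblems: Set[String] = Set("audio", "video", "courseware")

  // seen holds the class ids that already have a request in progress.
  def verify(req: HelpRequest, now: Long, expiryMs: Long, seen: Set[String]): Decision =
    if (now - req.createdAt > expiryMs || seen.contains(req.classId)) Drop
    else if (req.fmAlreadyInvolved || !coveredProblems.contains(req.problem)) ToFm
    else AutoHandle

  // Pending check: automatic handling counts as successful if a normal tag for
  // the class arrives after the action and before the timeout.
  def pendingResolved(handledAt: Long, normalTagTimes: Seq[Long], timeoutMs: Long): Boolean =
    normalTagTimes.exists(t => t > handledAt && t <= handledAt + timeoutMs)
}
```

A request whose pending check does not resolve within the timeout would fall back to FM, so that automatic handling never leaves a user without support.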

Problems and optimization

The whole business scenario demands high real-time performance, but it must also guarantee accuracy and make the context of every record known. When a result is computed incorrectly, the specific details of the calculation should be available. This raises questions such as: at which layer the data takes too long to reach the processing engine, which link spends too much time processing, which data is lost because of disorder, how to speed up the loading of dimension information, how to raise the system's processing capacity while using fewer computing resources, and how to deal with “supernatural” data (see the “VLink data platform” section for details).

  1. Data quality is uneven and indicators are inconsistent: the tracking points involve 3 departments and 11 teams, with no unified definitions. Through the VLink data platform, data indicators, client version control, and the verification process are managed by business level;
  2. Fetching dimension information during real-time computing puts pressure on the database: (a) where the business allows, aggregate data in small windows to reduce the number of queries; (b) add caches according to the timeliness of the data;
  3. Data volume doubles when there is no course data: in the serial logic, several windows are placed in front of the core logic, each the same size as the core window. This pre-processing stage runs with a partition count of twice the number of TaskManagers, obtains the course dimension information, and then shuffles the data to the downstream core window for processing.
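The saving from item 2(a) can be estimated with a small calculation: if events are first grouped into small windows, each distinct key needs only one dimension query per window rather than one per event. The sketch below assumes hypothetical names (Event, lookupsPerEvent, lookupsPerWindow) and a course id as the lookup key.

```scala
object LookupBatchingSketch {
  // Hypothetical event carrying the key used for the dimension lookup.
  final case class Event(courseId: String, tm: Long)

  // Number of dimension queries if every event triggers its own query.
  def lookupsPerEvent(events: Seq[Event]): Int = events.size

  // Number of queries if events are first grouped into small tumbling windows
  // and each distinct course id is queried once per window (assumes tm >= 0).
  def lookupsPerWindow(events: Seq[Event], windowMs: Long): Int =
    events.map(e => (e.tm - e.tm % windowMs, e.courseId)).distinct.size
}
```

The window adds up to one window length of latency, so it is only an option where the business can tolerate that delay, as the item above notes.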

Benefits

To date, help requests across all lessons have fallen by nearly 3% without causing other business indicators to rise. The system has effectively improved the efficiency of course supervisors, handles requests with low latency, supports concurrent processing of multiple requests, and has improved the class experience.

  1. Nearly 60% of help requests are handled automatically, and the number of supervisors has been reduced by nearly 40%;
  2. A request is handled within 20 seconds of the user asking for help, faster than manual handling and with a high success rate;
  3. Customer satisfaction is high, and the complaint rate has fallen by two thirds.


Starting from the goal of improving course quality and the in-class user experience, applying real-time computing to build the basic tag data system has achieved strong business results and has been highly recognized within the company. There were also some unexpected gains, such as a better in-class experience and higher staff efficiency, and the tag system built for this business can be applied to other services as well, such as the full-link fault engine and the class-ending type center.

Because the two defining technologies of online education, real-time interactive live streaming and real-time message transmission, are real-time by nature, real-time computing has many applications across business scenarios, such as real-time courses and real-time follow-up services.

Related references


About the author

Zhen Guoyou, Senior Data Engineer at the VIPKID online classroom technology R&D center, is responsible for implementing the online classroom real-time computing system and applying it to business scenarios, with a focus on the construction and architecture of data systems.
