Lin Xiaobo, a senior development engineer at NetEase Games, introduces how NetEase Games built its streaming ETL on Flink. The contents include:
- Business background
- Dedicated ETL
- Entryx universal ETL
- Tuning practice
- Future planning
1. Business background
Overview of NetEase Games' ETL services
NetEase Games' basic data is mainly collected through logs, which are usually unstructured or semi-structured and can only be stored into the real-time or offline data warehouse after ETL data integration. After that, business users can conveniently complete most data computation with SQL, whether real-time Flink SQL or offline Hive or Spark.
The data flow of NetEase Games' data integration is similar to that of most companies. It mainly includes game client logs, game server logs and other peripheral basic logs, such as Nginx access logs, database logs and so on. These logs are collected into a unified Kafka data pipeline and then written to the Hive offline data warehouse or the Kafka real-time data warehouse through the ETL warehousing service.
This is a very common architecture, but we have some special situations in terms of requirements.
NetEase Games' streaming ETL requirements
First of all, unlike the relational databases such as MySQL and PostgreSQL commonly used in the Internet, finance and other industries, the game industry often uses schema-free document databases such as MongoDB. The problem this brings to our ETL service is that there is no accurate schema for online business to rely on. In actual data processing, extra fields, missing fields, or even a field changing to a completely different format because of gameplay iteration are all possible. This kind of heterogeneous data makes ETL data cleaning relatively expensive for us.
Secondly, because of this database selection, most business database schemas follow a denormalized design and deliberately use complex nested fields to avoid joins between tables. One advantage of this is that in the data integration stage we do not need to join multiple data streams in real time. The disadvantage is that the data structure can be very complex, and multi-layer nesting is very common.
Then, with the popularity of real-time data warehouses in recent years, we are also gradually building our own, so reusing the existing ETL pipeline, extracting and transforming once, and loading into both the real-time and offline data warehouses has become a very natural direction.
Finally, we have many log types and frequent changes. For example, a game with complex gameplay may have 1,000 or even more than 2,000 log types, and a new version may be released every two weeks. In this context, abnormal data in ETL is inevitable. Therefore, we need to provide complete exception handling, so that the business can learn about data anomalies in time and repair the data through a defined process.
Log classification and characteristics
In order to better optimize for different business usage patterns, we provide different services for different log types of business. Our logs are usually divided into three types: operation log, business log and program log.
The operation log records player behavior events, such as logging into an account or receiving a gift package. This is the most important kind of log and has a fixed format: a specific header plus a JSON text body. The data is mainly used for data reports, data analysis and in-game recommendations, such as player team-matching recommendation.
Business logs record business events other than player behavior and are quite varied, such as Nginx access logs, CDN download logs, etc. These logs have no fixed format and may be binary or text. Their main purpose is similar to that of operation logs, but richer and more customized.
Program logs record the running state of the program, i.e., the INFO and ERROR logs we usually print through logging frameworks. Their main purpose is to search for and locate runtime problems. They are usually written to Elasticsearch, but sometimes written to the data warehouse when the volume is too large or indicators need to be extracted for analysis.
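Since the operation log format is a fixed header plus a JSON body, parsing it can be sketched as below. The exact header layout is not given in the talk, so this pipe-separated header (event name and timestamp) is a hypothetical format for illustration only.

```python
import json
import re

# Assumed layout: "<event>|<unix_ts>|<json body>" -- illustrative, not the real format.
LOG_PATTERN = re.compile(r"^(?P<event>\w+)\|(?P<ts>\d+)\|(?P<body>\{.*\})$")

def parse_operation_log(line: str) -> dict:
    """Split the fixed header from the JSON payload and merge them into one record."""
    m = LOG_PATTERN.match(line)
    if m is None:
        raise ValueError("malformed operation log: %r" % line)
    record = json.loads(m.group("body"))
    record["event"] = m.group("event")
    record["timestamp"] = int(m.group("ts"))
    return record

record = parse_operation_log('Login|1612345678|{"player_id": "p42", "server": "s1"}')
```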
Breakdown of NetEase Games' ETL services
For these log categories, we provide three types of ETL warehousing services. The first is the ETL dedicated to operation logs, which is customized for the operation log format. Then there is Entryx, the general text-log-oriented ETL service, which serves all logs except operation logs. Finally, there are special ETL requirements that Entryx cannot support, such as data that needs encryption or special conversion; in these cases we develop ad-hoc jobs to handle them.
2. ETL for operation logs
Development of ETL
Our ETL service has a long history. Around 2013, NetEase Games built the first version of the offline ETL framework based on Hadoop Streaming plus Python preprocessing/post-processing. This framework ran smoothly for many years.
In 2017, with the emergence of Spark Streaming, we developed a second version based on it, which was essentially a PoC. However, because micro-batch tuning was difficult and it produced a large number of small files, we did not put it into production.
In 2018, when Flink had become relatively mature, we decided to move our business onto Flink, so we naturally developed the third version of the operation log ETL service based on the Flink DataStream API. What is special is that our business had accumulated many Python ETL scripts over the years, and the most important requirement for the new version was to support seamless migration of these Python UDFs.
ETL architecture of operation log
Next, let’s look at the architecture comparison of the two versions.
In the early Hadoop Streaming version, data was first dumped to HDFS, then Hadoop Streaming started mappers to read the data and pass it to the Python script through standard input. The Python script is divided into three modules. First is the preprocessing UDF, which usually normalizes data through string-based replacement; for example, the time format used by some overseas partners may differ from ours and can be unified here. The preprocessed data then enters the general parsing/conversion module, which parses the data according to the operation log format and performs general conversions, such as filtering out test-server data. After the general module comes a post-processing module for field conversion, such as common exchange-rate conversion. Finally, the data is returned to the mapper through standard output, and the mapper batch-writes it to the Hive directory.
After the refactoring on Flink, the data source changes from HDFS directly to Kafka, and the IO modules are replaced by Flink's source/sink operators. The intermediate general module can be rewritten directly in Java, leaving the preprocessing and post-processing modules as the places where we need to support Python UDFs.
Python UDF implementation
In the concrete implementation, we add a runner layer to the Flink ProcessFunction, which is responsible for cross-language execution. We chose Jython instead of Py4J in the technology selection, mainly because Jython can complete the computation directly inside the JVM without starting an extra Python process, so the development, operation and maintenance costs are relatively low. Jython's limitations, such as not supporting C-based libraries like pandas, are acceptable for our Python UDFs.
In the whole call chain, the ProcessFunction defers the initialization of the runner to the open() method on the TaskManager, because Jython is not serializable. When the runner initializes, it prepares resources, including adding the dependent modules to the PYTHONPATH, and then loads the UDF functions by reflection according to the configuration.
At invocation time, for the preprocessing UDF the runner converts the input string to Jython's PyUnicode type, while for the post-processing UDF it converts the parsed map object to Jython's PyDictionary type; these serve as the respective inputs. The UDF can call other modules for computation and finally returns a PyObject, which the runner converts back to a Java String or Map and returns to the ProcessFunction for output.
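The shape of the two UDF kinds the runner invokes can be sketched in plain Python. The function names, the time-format fix and the exchange rate are all illustrative; the contract is simply that preprocessing takes and returns a raw string, while post-processing takes and returns a dict (the parsed record).

```python
def preprocess(raw_line):
    """Preprocessing UDF: normalize by string replacement before parsing,
    e.g. unify an assumed partner date format 'YYYY/MM/DD' to 'YYYY-MM-DD'."""
    return raw_line.replace("/", "-", 2)

def postprocess(record):
    """Post-processing UDF: field-level conversion after parsing,
    e.g. exchange-rate conversion (rate 6.5 is a placeholder)."""
    if "price_usd" in record:
        record["price_cny"] = round(record["price_usd"] * 6.5, 2)
    return record
```

Note that Jython runs Python 2 syntax and pure-Python modules only, so UDFs written for this runner must stay within those limits.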
Operation log ETL runtime
That was a partial view of the UDF module; now let's look at the overall view of an ETL job. First, we provide a general Flink JAR. When an ETL job is generated and submitted to the job platform, the scheduler executes the general main function to build the Flink JobGraph, pulling the ETL configuration from our configuration center, ConfigServer. The ETL configuration contains the Python modules used; the back-end service scans all modules referenced in the ETL configuration and uploads them to HDFS as resource files through YARN's file distribution feature. When the Flink JobManager and TaskManagers start, these Python resources are automatically synchronized by YARN to the working directory, ready for use. This is the whole job initialization process.
Then, because small changes to ETL rules are frequent, such as adding a new field or changing filter conditions, restarting the job for every change would cause downtime that gives our downstream users a bad experience. Therefore, we classify the changes and support hot updates for lightweight changes that do not affect the Flink JobGraph. The implementation is that each TaskManager starts a hot-update thread that periodically polls the configuration center to synchronize the configuration.
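The hot-update thread can be sketched as a simple versioned poll loop. This is a minimal sketch under assumptions: the real implementation runs inside each TaskManager, and `fetch_config` stands in for the call to the configuration center, returning a `(version, config)` pair.

```python
import threading

class ConfigHotUpdater:
    """Periodically poll the config center and swap in lightweight rule
    changes that do not affect the Flink JobGraph. Illustrative only."""

    def __init__(self, fetch_config, interval_seconds=30):
        self._fetch = fetch_config          # assumed: returns (version, config)
        self._interval = interval_seconds
        self._version = -1
        self.config = None
        self._stop = threading.Event()

    def _poll_once(self):
        version, config = self._fetch()
        if version > self._version:         # only apply strictly newer configs
            self._version, self.config = version, config

    def run_forever(self):
        # wait() doubles as a sleep that can be interrupted by stop()
        while not self._stop.wait(self._interval):
            self._poll_once()

    def stop(self):
        self._stop.set()
```

Comparing versions before swapping ensures a stale or repeated response from the config center never rolls the rules backward.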
3. Entryx universal ETL
Next, we introduce our general ETL service, Entryx. "Universal" here has two meanings. The first is universal data formats: it supports converting all kinds of unstructured text data into structured data. The second is universal user groups: it covers traditional users such as data analysts and data developers, as well as users with weaker data backgrounds such as business programmers and game designers.
Entryx basic concepts
Entryx has three basic concepts: Source, Streaming Table and Sink. Users configure these three modules, and the system automatically generates the ETL job from them.
The Source is the input of the ETL job, usually a raw log topic collected from the business side, or a topic after distribution and filtering. These topics may contain only one type of log, but more often contain multiple heterogeneous log types.
Next is the Streaming Table, commonly called the stream table. The stream table defines the main metadata of the ETL pipeline, including how to transform the data and the stream table schema that the transformed data should conform to. The stream table schema is the most critical concept; it is equivalent to a table DDL and mainly includes field names, field data types, field constraints and table properties. To connect upstream and downstream more conveniently, the stream table schema uses a self-developed SQL-like type system that supports some extended data types, such as a JSON type.
Finally, the Sink is responsible for mapping the stream table to the physical table in target storage, such as a target Hive table. What is mainly needed here is the schema mapping relationship: which stream table field maps to which target table field, and which stream table field serves as the partition field of the target Hive table. Underneath, the system automatically extracts the fields according to the schema mapping, converts the data into the target table's storage format, and loads it into the target table.
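An Entryx job configuration could look roughly like the following. The talk describes the three modules but not their concrete syntax, so every field name here is an assumption made for illustration.

```python
# Hypothetical Entryx job config: one Source, one Streaming Table, two Sinks.
job_config = {
    "source": {
        "type": "kafka",
        "topic": "mobile_game_raw_logs",       # raw topic, may mix log types
    },
    "streaming_table": {
        "name": "player_login",
        "filter_keyword": "PlayerLogin",       # keep only this log type
        "parser": "json",                      # json / regex / csv
        "schema": [                            # (field name, logical type)
            ("player_id", "STRING"),
            ("login_time", "TIMESTAMP"),
            ("device", "JSON"),                # extended logical type
        ],
    },
    "sinks": [
        {
            "type": "hive",
            "table": "ods.player_login",
            "format": "parquet",
            "partition_field": "login_time",   # stream field -> Hive partition
        },
        {"type": "kafka", "topic": "dw_player_login", "format": "avro"},
    ],
}
```

Having two sinks on one streaming table expresses the "extract and transform once, load into both real-time and offline warehouses" idea from the requirements section.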
Entryx ETL pipeline
Let's look at the implementation of the Entryx ETL pipeline. The blue parts are external storage systems, and the green parts are Entryx's internal modules.
First, data flows in from the raw collected topic and enters the Filter. The Filter is responsible for filtering data by keyword, and generally we require the filtered data to have the same schema. After these two steps, the extraction is done and the data enters transformation.
The first step of the transformation is parsing the data, which is the Parser. The Parser supports three kinds of parsing: JSON, regex and CSV, which basically covers all cases. The second step is transforming the data, which is the responsibility of the Extender. The Extender computes derived fields through built-in functions or UDFs; the most common case is flattening JSON objects to extract nested fields. Last is the Formatter, which converts each field's value to the physical type corresponding to the logical type the user defined earlier. For example, a field with logical type BIGINT is converted to a Java long as its physical type.
After the data is transformed, it reaches the final load stage. The first step of loading is deciding which table the data should be loaded into: the Splitter module splits the data according to each table's storage condition (an expression). The second step is the Loader, which writes the data to the specific external storage system. At present we support Hive and Kafka: Hive supports Text, Parquet and JSON formats, and Kafka supports JSON and Avro.
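The Filter → Parser → Extender → Formatter chain can be sketched as plain functions. This is a minimal sketch, assuming JSON parsing and a one-level flatten of a nested `device` field; the real modules are Flink operators driven by the stream table configuration.

```python
import json

def filter_stage(line, keyword):
    # Filter: keep only lines containing the configured keyword.
    return keyword in line

def parse_stage(line):
    # Parser: JSON parsing (regex and CSV are the other supported modes).
    return json.loads(line)

def extend_stage(record):
    # Extender: flatten a nested JSON object into derived top-level fields.
    flat = dict(record)
    for k, v in flat.pop("device", {}).items():
        flat["device_" + k] = v
    return flat

def format_stage(record, schema):
    # Formatter: cast each field from its logical type to a physical type.
    casts = {"BIGINT": int, "STRING": str, "DOUBLE": float}
    return {f: casts[t](record[f]) for f, t in schema if f in record}

line = '{"event": "Login", "level": "10", "device": {"os": "android"}}'
schema = [("event", "STRING"), ("level", "BIGINT"), ("device_os", "STRING")]
if filter_stage(line, "Login"):
    row = format_stage(extend_stage(parse_stage(line)), schema)
# row == {"event": "Login", "level": 10, "device_os": "android"}
```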
Unified real-time/offline schema
In Entryx's design, data can be written into both the real-time and offline data warehouses; that is, the same data may be expressed in different formats in different storage systems. From the perspective of Flink SQL, these are two tables with the same schema but different connectors and formats. The schema part often changes with the business, while the connector and format (that is, the storage system and storage format) are relatively stable. So a natural question is: can we extract the schema part and maintain it independently? In fact, this abstract schema already exists; it is the stream table schema we extracted in ETL.
In Entryx, the stream table schema is a schema independent of serializers and storage systems, serving as the single source of truth. Based on the stream table schema, plus the storage system and storage format information, we can derive the concrete DDL of each physical table. At present we mainly support Hive and Kafka, and it will be very convenient to extend to ES or HBase tables in the future.
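Deriving physical DDL from the single stream table schema can be sketched as two small generators. The type mappings and DDL templates are illustrative assumptions, not Entryx's actual output.

```python
# Assumed mapping from logical types to Hive types; JSON degrades to STRING.
LOGICAL_TO_HIVE = {"STRING": "STRING", "BIGINT": "BIGINT", "JSON": "STRING"}

def hive_ddl(table, schema):
    """Derive a Hive DDL (offline warehouse) from the stream table schema."""
    cols = ",\n  ".join("`%s` %s" % (n, LOGICAL_TO_HIVE[t]) for n, t in schema)
    return "CREATE TABLE %s (\n  %s\n) STORED AS PARQUET" % (table, cols)

def flink_kafka_ddl(table, topic, schema):
    """Derive a Flink SQL DDL for the Kafka table (real-time warehouse)."""
    cols = ",\n  ".join("`%s` %s" % (n, LOGICAL_TO_HIVE[t]) for n, t in schema)
    return ("CREATE TABLE %s (\n  %s\n) WITH (\n"
            "  'connector' = 'kafka',\n  'topic' = '%s',\n"
            "  'format' = 'avro'\n)" % (table, cols, topic))

schema = [("player_id", "STRING"), ("login_time", "BIGINT")]
hive = hive_ddl("ods.player_login", schema)
kafka = flink_kafka_ddl("player_login", "dw_player_login", schema)
```

The point of the design is that only these generators know about connectors and formats; the schema itself is maintained once.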
Real-time data warehouse integration
An important positioning of Entryx is to serve as the unified entrance of the real-time data warehouse. We have mentioned Kafka tables many times, but not yet how the real-time data warehouse is built. The common problem is that Kafka does not support schema metadata persistence. The community's current mainstream solution is to store Kafka table metadata in the Hive Metastore and reuse HiveCatalog to connect directly to Flink SQL.
But for us, using the Hive Metastore has several problems. First, it introduces a Hive dependency into real-time jobs and couples them with Hive, which is a heavy dependency; as a result, tables defined this way are hard to reuse by other components, including Flink DataStream users. Second, we already have Avatar, a Kafka SaaS platform, managing physical schemas such as Avro schemas; introducing the Hive Metastore would fragment the metadata. Therefore, we extended the Avatar platform's schema registry to support both logical and physical schemas.
The integration between the real-time data warehouse and Entryx works as follows: we start from Entryx's stream table schema; when creating a sink, we call Avatar's schema interface to generate a logical schema according to the mapping relationship, and Avatar then generates the topic's physical schema according to the mapping between Flink SQL types and physical types.
Paired with the Avatar schema registry is our self-developed KafkaCatalog, which reads a topic's logical and physical schemas to generate the TableSource or TableSink for Flink SQL. Users outside Flink SQL, such as Flink DataStream API users, can also read the physical schema directly and enjoy the convenience of the data warehouse.
Similar to the operation log ETL, at runtime Entryx generates Flink jobs from a common JAR plus configuration, but two situations need special handling.
First, a Kafka topic often carries dozens or even hundreds of log types, which means dozens or even hundreds of stream tables. If each stream table ran as a separate job, one topic might be read hundreds of times, which is a great waste. Therefore, the runtime provides an optimization strategy: different stream tables with the same source can be merged into one job. For example, in the figure, a mobile game uploads three kinds of logs to Kafka, and the user configures three stream tables: player registration, player login, and gift-package collection. We can merge these three stream tables into one job sharing the same Kafka source.
The other optimization concerns the earlier idea of "extract and transform once, load twice": in general we can write data to Hive and Kafka simultaneously. However, Hive (or HDFS) is after all an offline system with poor real-time performance, and writing to old, heavily loaded HDFS clusters often causes backpressure that blocks the upstream, so the writes to Kafka are also affected. In this case, we usually split the real-time and offline loading into separate ETL pipelines, depending on the business SLA and HDFS performance.
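The merge optimization amounts to grouping stream tables by their source topic so each topic is read once per job. A minimal sketch, with table and topic names made up for the example:

```python
from collections import defaultdict

# Hypothetical stream table configs; names mirror the mobile-game example.
stream_tables = [
    {"name": "player_register", "source_topic": "mobile_game_logs"},
    {"name": "player_login",    "source_topic": "mobile_game_logs"},
    {"name": "receive_gift",    "source_topic": "mobile_game_logs"},
    {"name": "nginx_access",    "source_topic": "web_logs"},
]

def plan_jobs(tables):
    """Group stream tables by source topic: one job (and one Kafka
    source) per topic, instead of one job per stream table."""
    jobs = defaultdict(list)
    for t in tables:
        jobs[t["source_topic"]].append(t["name"])
    return dict(jobs)

jobs = plan_jobs(stream_tables)
# The game topic is read once for three tables instead of three times.
```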
4. Tuning practice
Next, let’s share our practical experience in ETL construction.
HDFS write tuning
The first is tuning HDFS writes. A common problem in streaming-to-HDFS scenarios is too many small files. Generally speaking, few small files and low latency cannot both be achieved: if latency must be low, files have to be rolled frequently to commit data, which inevitably produces many small files.
Too many small files cause two main problems. First, from the perspective of HDFS cluster management, small files occupy a large number of file and block entries, wasting NameNode memory. Second, from the user's perspective, read and write efficiency drops: writing calls RPC and flushes data more frequently, causing more blocking and sometimes even checkpoint timeouts, while reading requires opening more files.
HDFS write tuning: pre-partitioning the data stream
To alleviate the small-file problem, we pre-partition the data stream. Specifically, in the Flink job we keyBy on the target Hive table, so that data for the same table is concentrated on as few subtasks as possible.
For example, suppose the Flink job parallelism is n and the number of target Hive table partitions is m. Because each subtask may read data belonging to any partition, by default every subtask writes to all partitions when the subtasks are fully parallel, so the total number of files written is n * m. With n = 100 and m = 1000, rolling files every 10 minutes produces 14.4 million files per day, which is great pressure for many old clusters.
With data stream pre-partitioning, we can decouple the file count from the Flink parallelism. For example, if we keyBy on the Hive table field plus a random salt in the integer range [0, s) to avoid data skew, each partition is read and written by at most s subtasks. With s = 5, compared with the original n = 100, we reduce the number of files to one twentieth of the original.
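The file-count arithmetic above can be checked directly:

```python
ROLLS_PER_DAY = 24 * 60 // 10    # rolling files every 10 minutes -> 144 rolls/day

def files_per_day(writers_per_partition, partitions):
    return writers_per_partition * partitions * ROLLS_PER_DAY

# Default: all n = 100 subtasks may write every one of m = 1000 partitions.
naive = files_per_day(100, 1000)     # 14,400,000 files per day

# After keyBy(table, salt in [0, s)): at most s = 5 subtasks per partition.
salted = files_per_day(5, 1000)      # 720,000 files per day

assert naive // salted == 20         # the "one twentieth" reduction
```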
SLA statistics based on operator state
The second thing to share is our SLA statistics tool. The background is that our users often debug and investigate problems through the Web UI, for example by checking the input and output counts of different subtasks, but these metrics are reset by job restarts or failovers. Therefore, we developed the SLA-Utils tool based on operator state to count data input and classify the output. This tool is designed to be very lightweight and can easily be integrated into our own services or users' jobs.
SLA-Utils supports three kinds of metrics. The first is standard metrics: recordsIn / recordsOut / recordsDropped / recordsErrored, corresponding to the number of input records, normally output records, filtered records and error records. Generally, recordsIn equals the sum of the other three. The second kind, user-defined metrics, is usually used to record more detailed reasons; for example, recordsEventTimeDropped means the data was filtered because of its event time.
These two kinds of metrics are static; that is, the metric keys must be determined before the job runs. In addition, SLA-Utils supports TTL metrics registered dynamically at runtime. Such a metric usually has a dynamically generated date as its key prefix and is automatically cleaned up after the TTL expires, so TTL metrics can be used for day-level time-window statistics. The special point here is that operator state does not support TTL, so SLA-Utils filters out expired metrics on every checkpoint snapshot, achieving the TTL effect.
After the SLA indicators are saved in state, what remains is exposing them to users. Our current approach is to expose them through accumulators. The advantage is that the Web UI supports them out of the box, and Flink automatically merges the metrics of different subtasks. The disadvantages are that there is no way to push them to the monitoring system via a metric reporter, and since accumulators cannot be dynamically unregistered at runtime, there is a risk of memory leaks when using TTL metrics. Therefore, in the future we are also considering supporting metric groups to avoid these problems.
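The snapshot-time TTL trick can be sketched as follows. This is a minimal sketch under assumptions: the real SLA-Utils keeps its counters in Flink operator state and runs this filtering inside `snapshotState()`; here a plain dict and an injectable clock stand in for both.

```python
import time

class SlaMetrics:
    """TTL metrics without native state TTL: expired keys are dropped on
    every checkpoint snapshot, which approximates a TTL. Illustrative only."""

    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._metrics = {}               # key -> (count, registration_time)

    def inc(self, key, delta=1):
        count, ts = self._metrics.get(key, (0, self._clock()))
        self._metrics[key] = (count + delta, ts)

    def snapshot_state(self):
        """Called on checkpoint: filter out metrics older than the TTL,
        then return the surviving counters."""
        now = self._clock()
        self._metrics = {k: (c, ts) for k, (c, ts) in self._metrics.items()
                         if now - ts < self._ttl}
        return {k: c for k, (c, _) in self._metrics.items()}
```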
Data fault tolerance and recovery
Finally, let’s share our practice in data fault tolerance and recovery.
Like many best practices, we use side outputs to collect error data at each stage of the ETL and aggregate them into a unified error stream. An error record contains our preset error code, the original input data, the error class and the error message. Generally, error data is classified and written to HDFS, and users can tell whether their data is normal by monitoring the HDFS directories.
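The error-record structure and the side-output split can be sketched as below. The error code value and field names are placeholders; in the real job this is a Flink side output rather than two Python lists.

```python
import json

def make_error_record(error_code, raw_input, exc):
    """Build an error record: preset code, untouched original input,
    error class and error message."""
    return {
        "error_code": error_code,
        "raw_input": raw_input,
        "error_class": type(exc).__name__,
        "error_message": str(exc),
    }

def parse_with_side_output(line, main_out, error_out):
    """Route good records to the main stream, failures to the error stream."""
    try:
        main_out.append(json.loads(line))
    except ValueError as exc:              # JSONDecodeError subclasses ValueError
        error_out.append(make_error_record("PARSE_ERROR", line, exc))

main, errors = [], []
parse_with_side_output('{"ok": 1}', main, errors)
parse_with_side_output("truncated{", main, errors)
```

Keeping the raw input inside the error record is what later makes offline repair and backfill possible.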
After storing the abnormal data, the next step is to recover the data. There are usually two cases.
One is that the data format itself is abnormal, for example a log is truncated and incomplete, or a timestamp does not conform to the agreed format. In this case, we usually repair the data with an offline batch job and backfill it into the original data pipeline.
The second is an ETL pipeline exception, for example the actual schema of the data has changed but the stream table configuration has not been updated, so some field becomes null. Our solution is: first, update the online stream table configuration to the latest version, ensuring no more abnormal data is produced; at this point some Hive partitions are still abnormal. Then, we launch an independent backfill job to repair the abnormal data; its output is written to a temporary directory, and we switch the partition's location in the Hive metadata to point at it, replacing the original abnormal directory, so the backfill process is transparent to offline query users. Finally, at an appropriate time, we replace the data of the abnormal partition and restore the location.
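The location-switch step of the backfill can be expressed as two Hive DDL statements. This is a sketch under assumptions: the table, partition and paths are made up, and the real workflow issues these through the Hive metastore rather than generating strings.

```python
def repair_partition_plan(table, partition, tmp_dir, final_dir):
    """Plan the transparent backfill: point the partition at the repaired
    data in a temporary directory, then restore the canonical location
    once the original directory has been replaced."""
    return [
        "ALTER TABLE %s PARTITION (%s) SET LOCATION '%s'" % (table, partition, tmp_dir),
        # ... offline queries now read the repaired data from tmp_dir ...
        "ALTER TABLE %s PARTITION (%s) SET LOCATION '%s'" % (table, partition, final_dir),
    ]

plan = repair_partition_plan(
    "ods.player_login", "dt='2021-02-01'",
    "/tmp/backfill/player_login/dt=2021-02-01",
    "/warehouse/ods.db/player_login/dt=2021-02-01",
)
```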
5. Future planning
Finally, we introduce our future plans.
- The first is data lake support. At present, most of our logs are append-only, but with the improvement of CDC and Flink SQL services we may have more update and delete requirements, so a data lake is a good choice.
- The second is providing richer additional functions, such as real-time data deduplication and automatic merging of small files. Both are very practical for the business side.
- The last is PyFlink support. At present, our Python support only covers the data integration stage, and we hope to extend Python support into the subsequent data warehouse stages through PyFlink.