Author: Huang Lianghui
This article introduces the real-time data warehouse practice of Shopee's data team in Singapore, covering:
- Background of real time data warehouse construction
- Application scenarios combining Flink with Druid and Hive in real-time data warehouse construction
- Real-time task monitoring
- Streaming SQL platformization
- Streaming job management
- Future planning and optimization directions
Shopee is a leading e-commerce platform in Southeast Asia and Taiwan, covering seven major markets: Singapore, Malaysia, the Philippines, Taiwan, Indonesia, Thailand, and Vietnam. It has also set up cross-border business offices in Shenzhen, Shanghai, and Hong Kong.
- Shopee’s total order volume in the first quarter of 2020 reached 429.8 million, up 111.2% year on year.
- According to App Annie, Shopee ranked among the top three shopping apps worldwide by downloads in the first quarter of 2020.
- At the same time, it took the triple crown among shopping apps in the Southeast Asia and Taiwan markets for total annual downloads, average monthly active users, and total Android usage time, and led the two largest Southeast Asian markets, winning the double crown for annual active users of shopping apps in Indonesia and Vietnam.
Shopee's business spans orders, goods, logistics, payments, digital products, and more. To support these Internet products and cope with growing business challenges, we designed and built a data warehouse.
Data warehouse challenges
At present, with business development, expanding data scale, and the business intelligence team's continuously growing real-time demands, business challenges keep increasing:
Business dimension: business requirements are becoming more and more complex, including detailed data queries, real-time aggregated reports across many dimensions, and real-time label training and query requirements. Meanwhile, many services share some business logic, resulting in high coupling and duplicated development across a large number of services.
Platform architecture: as the number of tasks grows, management and scheduling, resource management, and data-quality anomaly monitoring become more and more important, and real-time processing becomes more urgent. A large number of services still run as offline tasks, creating a huge early-morning load, and services based on a T+1 (day- or hour-level) architecture cannot meet the needs of refined, real-time operations.
Technology implementation: for example, Spark Structured Streaming is widely used for real-time business, but it relies heavily on HBase for stateful requirements and is complex to develop; when an abnormal failure occurs the task fails, and the lack of exactly-once support makes data easy to lose or duplicate.
To solve the above problems, we began exploring a Flink-based real-time data warehouse.
Data warehouse architecture
To support the growing data volume and complex business of these Internet products, Shopee built the data warehouse architecture shown in the figure below:
- At the bottom is the data collection layer. This layer is responsible for real-time data collection, including binlog, service logs, and tracking service logs; after real-time ingestion, the data is collected into Kafka and HBase. The auto ingestion team is responsible for the daily collection of database data to HDFS.
- Above that is the storage layer. In this layer, Kafka stores real-time messages, HDFS stores Hive data, and HBase stores dimension data.
- On top of the storage layer sit the Spark and Flink computing engines and the Presto SQL query engine.
- Then comes resource management: scheduling management, various resource management, task management, and task scheduling for the various Spark and Flink tasks.
- The next level is the OLAP data storage layer. Druid stores time-series data; Phoenix (HBase) stores aggregated report data, dimension table data, and label data; Elasticsearch stores data that needs multi-dimensional field indexes, such as advertising data and user profiles.
- At the top is the application layer: data reports, data business services, user profiles, etc.
Practice of the Flink real-time data warehouse
At present, Shopee's data team mainly synchronizes binlog and tracking service data to Kafka clusters. Flink/Spark computations cover real-time order and sales promotion analysis, order logistics analysis, product and user bidding, user impression behavior analysis, e-commerce activity and game operation analysis, etc. The final results are stored in Druid, HBase, HDFS, etc., and then served to data application products. Many core jobs have already been migrated from Spark Structured Streaming to Flink streaming.
Real-time data warehouse application based on Flink and Druid
In the real-time order and sales analysis product, the order stream is processed through Flink, and the processed detail data is ingested into Druid in real time to enable real-time analysis of the company's operational campaigns.
We use a T-1 (daily) Lambda architecture for analyzing real-time and historical order data. Flink only processes today's real-time order data; an offline task re-indexes yesterday's data into Druid every day to overwrite and correct small errors in the real-time data. The overall Flink processing flow is shown in the following figure; from top to bottom there are three pipelines:
The first pipeline ingests order binlog events through Kafka.
- First, we parse and deserialize order events and filter invalid orders by order time, keeping only today's orders. The stream is keyed by order primary key into a ProcessWindowFunction. Because the upstream data is binlog, there are duplicate order events, so orders are deduplicated through ValueState.
- Then, dimension fields are enriched by querying HBase (Phoenix tables), fetching order product information, categories, user information, etc.
- Finally, we check whether all fields were successfully associated. If so, the message is sent to downstream Kafka and ingested into Druid in real time; if any association fails, the order event is routed through a side output into a separate "slow" Kafka topic for abnormal order handling.
The second pipeline is more complex: multiple real-time tasks synchronize sharded tables from slave binlog to HBase Phoenix tables, which serve as dimension tables for the real-time order stream. There are still problems here, such as binlog delay and data hotspots.
The third pipeline is basically similar to the first, analogous to dead-letter message handling in a message queue. Because of the many dimension table dependencies, dimension data (new orders, new products, new users, new stores, new categories, etc.) may not yet have been synchronized to the Phoenix tables by the time an order is processed. We therefore introduce a real-time backfill flow that repeatedly reprocesses orders that failed in the first mainstream pipeline until all fields are associated successfully, then sends them downstream to Druid.
In addition, to prevent expired messages from entering a dead loop, an event filter window ensures that only today's order events are processed in this pipeline. One difference is that the event types of paid and unpaid orders must be distinguished (an order may have two status events: an order-placed event when the user places the order, and a payment-completed event when the user finishes paying). Therefore, after successful enrichment the order must be marked as processed to avoid reprocessing it.
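The two guards just described (the today-only filter that stops expired events from looping, and the processed mark keyed by event type) can be sketched in plain Python. The field names (`order_id`, `type`, `order_date`) are hypothetical stand-ins for the actual event schema, not Shopee's real field names:

```python
def should_reprocess(event, processed_marks, today):
    """Decide whether a failed order event from the slow Kafka topic
    should be retried by the backfill pipeline.
    processed_marks maps (order_id, event_type) -> bool."""
    # Drop expired events so they cannot loop forever in the retry flow.
    if event["order_date"] != today:
        return False
    # Paid and unpaid events are distinct statuses of the same order,
    # so the "already processed" mark is keyed by (order, event type).
    return not processed_marks.get((event["order_id"], event["type"]), False)

def mark_processed(event, processed_marks):
    """Mark this (order, event type) as successfully enriched."""
    processed_marks[(event["order_id"], event["type"])] = True
```

In the real job these marks live in Flink keyed state rather than a plain dict, but the filtering logic is the same.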
Order event status maintenance
Because the upstream data source is binlog, order status updates produce a large number of duplicate order events.
Using Flink state held in memory (FsStateBackend), a ValueState marks whether each order has been processed, and a TTL ensures order state is kept for 24 hours and then expires. During peak campaign periods the state is about 2 GB, averaging roughly 100 MB per TaskManager; with the checkpoint interval set to 10 seconds, HDFS load is not high. Because the stream also uses windows and custom triggers, the state additionally buffers a small amount of window data; the use of windows is explained in detail in the enrichment optimization section.
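As a rough illustration of this dedup state (a toy stand-in, not the actual Flink ValueState/TTL API), the mark-and-expire behavior looks like this; the 24-hour TTL and injectable clock are the only parameters:

```python
import time

class DedupState:
    """Toy stand-in for a Flink keyed ValueState with a 24-hour TTL,
    used to mark whether an order event was already processed."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # order_id -> timestamp of first processing

    def is_duplicate(self, order_id):
        """Return True if this order was seen within the TTL window;
        otherwise record it (refreshing any expired mark) and return False."""
        seen_at = self._store.get(order_id)
        if seen_at is None or self.clock() - seen_at > self.ttl:
            self._store[order_id] = self.clock()
            return False
        return True
```

In production Flink evicts expired entries itself via the state TTL configuration; here expiry is checked lazily on read for brevity.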
Optimization of enrichment process
In the enrichment step, the business logic is complex and involves heavy I/O, so we made several improvements and optimizations.
- First, when associating fields from HBase tables, we added a local LRU memory cache, and HBase row keys are salt-bucketed to avoid hotspot issues when accessing the order item table.
- Second, the HBase table direct access layer (service) is managed by Google Guice, which facilitates configuration management, memory cache association, etc.
- Third, the synchronization of the product table and order-product data to HBase has some delay, which caused a large number of order events to enter the slow Kafka topic. By setting windows with a custom trigger, window data is processed only when the number of orders reaches a threshold or the window times out. After this optimization, 98% of orders are processed successfully in the mainstream pipeline.
- Finally, we considered using Flink's interval join for the association. However, because an order has multiple order-product records, upstream binlog events and other dimension table data can be delayed, the business logic is complex, and the computed output stored in Druid only supports incremental updates, we chose HBase storage to associate order information and added the slow-message processing flow to handle data delays.
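The first optimization above (a local LRU cache in front of HBase lookups, plus salted row keys) can be sketched as follows. The bucket count, key format, and loader signature are illustrative assumptions, not Shopee's actual scheme:

```python
import hashlib
from collections import OrderedDict

SALT_BUCKETS = 16  # hypothetical bucket count

def salted_row_key(order_id: str) -> str:
    """Prefix the row key with a hash-derived salt bucket so reads and
    writes spread across HBase regions instead of hitting one hot region."""
    bucket = int(hashlib.md5(order_id.encode()).hexdigest(), 16) % SALT_BUCKETS
    return f"{bucket:02d}_{order_id}"

class LruCache:
    """Small local LRU cache placed in front of HBase dimension lookups."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader      # fallback, e.g. a Phoenix/HBase query
        self._data = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)     # mark as most recently used
            return self._data[key]
        value = self.loader(key)            # cache miss: hit the real store
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
        return value
```

Phoenix can also salt tables declaratively (`SALT_BUCKETS`), in which case the client does not compute the prefix itself; the sketch just shows why salting removes the hotspot.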
Data quality assurance and monitoring
Currently, checkpointing is set to exactly-once mode and the Kafka exactly-once producer is enabled; the two-phase commit mechanism ensures data consistency and prevents data loss when tasks fail or jobs restart.
For monitoring, we watch the write/update status of upstream Kafka topics and HBase tables, combined with downstream Druid data delay monitoring, to achieve end-to-end lag monitoring. The performance of specific steps in the Flink job is analyzed by reporting HBase access performance, cache sizes, and the number of delayed orders through Flink metric reporters.
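End-to-end lag monitoring of this kind reduces to comparing the latest event time observed at each monitored point against the clock and alerting when any stage falls too far behind. A minimal sketch, with hypothetical stage names and a seconds-based threshold:

```python
def stage_lags(latest_event_time, now, threshold):
    """latest_event_time: {stage: latest event timestamp (seconds) seen at
    that point, e.g. Kafka topic, HBase table update, Druid segment}.
    Returns per-stage lag and the sorted list of stages over threshold."""
    lags = {stage: now - ts for stage, ts in latest_event_time.items()}
    alerting = sorted(s for s, lag in lags.items() if lag > threshold)
    return lags, alerting
```

The lag at the last stage (Druid) is the end-to-end lag; the per-stage breakdown shows where the delay accumulates.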
Real-time data warehouse application based on Flink and Hive
In the order logistics real-time analysis business, we ingest binlog events to implement logistics analysis with record-level point updates, using Flink's retract stream capability to trigger downstream updates whenever a new order or logistics status change event arrives. An interval join combines the order stream and the logistics stream, with RocksDB state and incremental checkpoints maintaining the last seven days of state. Dimension information is enriched from HBase, with dimension field enrichment queries optimized through a local LRU memory cache layer. Finally, the data is regularly exported from HBase to HDFS.
Currently, the order logistics events generated by the Flink job are saved in HBase to support record-level point updates. The results are exported from HBase to HDFS every hour and analyzed in real time through Presto. To reduce export time, HBase row keys are salt-bucketed to avoid hotspots and the region size (default 10 GB) is tuned. But the data delay is still serious, about an hour and a half, and the pipeline is cumbersome; in the future we are considering introducing Apache Hudi with Presto access to reduce the delay to about half an hour.
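The interval join mentioned above can be illustrated with a toy batch version. Flink performs this incrementally over keyed (here RocksDB-backed) state as events arrive; this sketch only shows the join condition, with hypothetical field names:

```python
from collections import defaultdict

def interval_join(orders, logistics, lower, upper):
    """Toy event-time interval join: a logistics event joins an order with
    the same order_id when its timestamp lies within
    [order_ts + lower, order_ts + upper]."""
    by_key = defaultdict(list)
    for order in orders:
        by_key[order["order_id"]].append(order)
    joined = []
    for ship in logistics:
        # In Flink both sides are buffered in state only for the interval;
        # here everything is held in memory for clarity.
        for order in by_key.get(ship["order_id"], []):
            if order["ts"] + lower <= ship["ts"] <= order["ts"] + upper:
                joined.append((order, ship))
    return joined
```

With a seven-day upper bound, as in the job described here, the state retains a week of order events per key before they drop out of the join window.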
Application and management of streaming SQL
At present, a large number of Shopee's real-time requirements are implemented in SQL, mainly for application-layer real-time summary reports and dimension table updates. The business is supported in two forms: an SDK, where users add a jar dependency for secondary development; and a one-stop website, where users meet business requirements by creating tasks and editing and saving SQL as tasks.
- Task list and group management, with support for restarting, stopping, and disabling tasks.
- Tasks support both crontab-style scheduled execution and streaming mode.
- Jar resource management: tasks can reference custom jars to reuse UDFs.
- Common SQL resource management: tasks can include shared SQL files, avoiding duplicated SQL logic and repeated definitions of views and environment configuration.
- User group permission management.
- Grafana integration for task delay alerts.
The following is the UI form of part of the task organization:
At present, the platform only supports Spark SQL for streaming SQL, uses Hive to store metadata, and performs enrichment through external tables such as Apache Phoenix joins and external services. Comparing Flink SQL with Spark SQL, we found that Spark SQL has many disadvantages:
- Spark SQL offers few window function types and, lacking the flexibility of Flink's support, many aggregation tasks cannot be expressed in SQL on the platform.
- Spark's stateful processing control is weak, without the incremental state support of Flink's RocksDB backend.
- When associating dimension tables, Spark used to load the full dimension table in every micro-batch; this has been changed to get-based lookups, which improved performance considerably, but there is still nothing like Flink's async lookup to improve it further.
- Without Flink's snapshot and two-phase commit features, a restarted task that fails to recover leaves data inconsistent and inaccurate.
Spark SQL support still has many limitations. We are currently in the requirements-evaluation phase for Flink SQL and plan to add Flink SQL support to the streaming SQL platform, to meet the company's increasingly complex user profile labeling and simple real-time business SQL needs and to reduce development costs. At the same time, we need a better UDF management mode, integrated metadata services, and simplified development.
Streaming job management
Shopee's data team has a large number of real-time tasks, published as jar packages. Job management is currently done through a website to reduce maintenance costs, supporting environment management, task management, task application configuration management, and task monitoring and alerting.
At present, the Flink/Spark bin path is configurable to support multiple Flink/Spark versions (handling the multi-version problem caused by Flink upgrades), and color highlighting distinguishes different environments.
Real-time tasks can now be searched by environment, status, name, and so on, and can be restarted, disabled, and have their parameters configured. A task can be recovered from a checkpoint/savepoint; when a task is stopped, a savepoint is saved automatically, and tasks can also start from a Kafka timestamp.
Task configuration management
Real-time tasks also support configuring memory, CPU, and other Flink job runtime parameters, jar dependency configuration, etc. Previewing, editing, and updating are currently supported, and job deployment and upgrades are completed through Jenkins CI/CD integration with manual review.
Task application configuration management
Task application configuration uses the HOCON format. Shared configuration inclusion is currently supported, and the checkpoint path is automatically bound to the configuration through a configuration-name convention. The website supports preview mode, edit mode, configuration highlighting, etc.; in the future it will integrate configuration version rollback and other functions.
Task monitoring alarm
For task monitoring, exception alerts are now supported. Exception handling supports automatically suspending failed tasks and recovering from the latest checkpoint; the Flink REST API is used to detect the status of Flink jobs, avoiding a falsely "active" status when a job is abnormal. When a task restarts or an exception occurs, the task owner is alerted by email. In the future, we plan to integrate monitoring tools such as Grafana/Prometheus into the website to automate task monitoring.
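Checking job status through the Flink REST API comes down to polling `GET /jobs` on the JobManager and flagging anything not in RUNNING state. A minimal sketch; the payload shape follows Flink's documented REST API, while `poll_cluster` and its URL are illustrative:

```python
import json
from urllib.request import urlopen

def unhealthy_jobs(jobs_payload):
    """Parse the response of Flink's REST `GET /jobs` endpoint and return
    the ids of jobs not in RUNNING state (e.g. FAILED, RESTARTING)."""
    return [j["id"] for j in jobs_payload["jobs"] if j["status"] != "RUNNING"]

def poll_cluster(base_url):
    """Hypothetical poller; base_url would be the JobManager web endpoint,
    e.g. http://jobmanager:8081."""
    with urlopen(f"{base_url}/jobs") as resp:
        return unhealthy_jobs(json.load(resp))
```

A monitor built this way catches jobs that YARN or the process supervisor still reports as alive but that Flink itself considers failed, which is exactly the "false active" case described above.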
Overall, Flink research started at Shopee at the end of 2019, and it took less than half a year to put it into production. We have completed a large number of business requirement evaluations and verified a series of features: exactly-once checkpointing, Kafka exactly-once semantics, two-phase commit, interval join, and the RocksDB/FS state backends. In terms of future planning:
- First of all, we will try to move more real-time tasks to Flink SQL to further unify stream and batch processing;
- Secondly, a large number of spark structured streaming jobs will be migrated to Flink, and new businesses will be explored.
- Flink SQL support will also be added to the streaming SQL platform to solve some performance bottlenecks and business support limitations of the current platform.
About the author:
Huang Lianghui joined Shopee in 2019 and is responsible for real-time data business and data product development on the Shopee data team.