How to build a big data system with 10 billion SDK cumulative coverage


As a leader in the push notification industry, Getui has to date reached a cumulative SDK installation coverage of 10 billion (including overseas), with more than 430,000 integrated apps and over 1 billion unique devices covered (including overseas). The Getui system generates a large volume of logs and data every day, and therefore faces many data processing challenges.
First, in terms of data storage, Getui generates more than 10 TB of data every day, and the cumulative volume has reached the PB level. Second, as a push technology service provider, Getui handles many data analysis and statistics requirements from customers and from departments across the company, such as message push statistics and data reports. Although part of the analysis work runs in offline mode, open-source data processing systems are not always highly stable, so keeping the data analysis services highly available is itself a challenge. In addition, the push business is not simple message distribution: it must help customers deliver the right content to the right person in the right scenario through data analysis, which requires the system to support data mining while keeping data fresh in real time. Finally, Getui must respond quickly to new data analysis requirements. In summary, the Getui big data system faces challenges in data storage, log transmission, log analysis and processing, scheduling and management of large numbers of tasks, high availability of data analysis and processing services, massive multi-dimensional reporting, and rapid response to analysis and retrieval requirements.

Evolution of big data system

Facing these challenges, the Getui big data system has kept improving as the business developed. Its evolution can be divided into three stages: first, statistical reporting, i.e. BI in the traditional sense; second, building out the big data infrastructure; third, toolization, servitization, and productization.


The first stage of the evolution of the Getui big data system: statistical report calculation


In the early days, there were no complex data processing requirements, so a few high-performance machines were selected and all data was computed on them independently. Running PHP or shell scripts on these machines was enough to complete the processing and statistics. The data processing focused on questions such as how many messages a customer pushed that day and how many receipts a given push task received, producing relatively simple reports.
The characteristics of the Getui big data system at this stage were: logs only needed regular operation-and-maintenance scripts to be transmitted to designated intermediate nodes; although users were at the hundred-million level, the log types were relatively uniform; only PHP and shell scripts were needed; and data only had to be kept for a short time (result sets were kept long-term, while intermediate and raw data were kept only briefly).
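The stage-one reports described above amounted to little more than counting pushes and receipts per task on a single machine. A minimal Python sketch of that kind of log aggregation follows; the log line format and field names are assumptions, since the original scripts are not shown:

```python
from collections import Counter

def count_push_stats(log_lines):
    """Count pushes and receipts per task from tab-separated log lines.

    Assumed line format: "<task_id>\t<event>" where event is
    "push" or "receipt" -- the real log schema is not given.
    """
    pushes, receipts = Counter(), Counter()
    for line in log_lines:
        task_id, event = line.rstrip("\n").split("\t")
        if event == "push":
            pushes[task_id] += 1
        elif event == "receipt":
            receipts[task_id] += 1
    # One report row per task: pushes, receipts, receipt rate
    return {
        t: (pushes[t], receipts[t],
            receipts[t] / pushes[t] if pushes[t] else 0.0)
        for t in pushes
    }

logs = [
    "task1\tpush", "task1\tpush", "task1\treceipt",
    "task2\tpush",
]
report = count_push_stats(logs)
```

At hundred-million-user scale this still fits on a few machines precisely because the logs are of a single type and only short-lived intermediate data is kept.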

The second stage of the evolution of the Getui big data system: big data infrastructure, offline batch processing system


In 2014, Getui launched its intelligent push solution. Star apps with large user bases were integrated, and the number of users covered by the system increased dramatically. After integrating with the push system, customers raised many new requirements: for example, richer statistical dimensions in reports, which demanded more complex computation just as data volumes multiplied, increasing the computing pressure. Moreover, the essence of intelligent push is deep data mining: the longer the data retention cycle and the more dimensions covered, the better.
In this situation, Getui introduced the Hadoop ecosystem: HDFS largely solved the storage problem, Hive served as the data warehouse and offline analysis engine, and Mahout handled machine learning. Getui completed the transformation from single-machine or multi-machine mode to cluster mode. The overall workflow was similar to before; the difference was that after logs reached the transfer node, HDFS commands were used to put the data onto HDFS and add Hive table partitions, after which the logs were further processed and imported into the data warehouse. Finally, Getui mined the data in the warehouse, labeled users, and stored the results in HBase and the online ES. This was the basic construction of the offline batch processing system.
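The put-to-HDFS-then-add-partition step above can be sketched as statement generation. The paths, table name, and partition layout below are assumptions for illustration, not Getui's actual conventions:

```python
def hive_load_statements(table, log_date, hdfs_root="/data/logs"):
    """Build the shell and HiveQL statements that move one day's log
    from a transfer node into a date-partitioned Hive table."""
    partition_path = f"{hdfs_root}/{table}/dt={log_date}"
    # Shell command run on the transfer node to upload the raw log
    put_cmd = f"hdfs dfs -put /local/{table}.{log_date}.log {partition_path}/"
    # HiveQL to register the new partition so queries can see it
    add_partition = (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{log_date}') LOCATION '{partition_path}';"
    )
    return put_cmd, add_partition

put_cmd, ddl = hive_load_statements("push_log", "2014-06-01")
```

Date-based partitioning keeps daily batch jobs scanning only the day they need, which matters once retention stretches to the longer cycles intelligent push requires.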

The second stage of the evolution of the Getui big data system: big data infrastructure, real-time processing system

As the business continued to develop, demands grew accordingly. For example, many statistical analysis tasks had to be satisfied within T+0: when customers pushed messages in the morning, they wanted data reports reflecting the push effect that same afternoon, rather than waiting until T+1. Such requirements demanded much better real-time data processing. Moreover, many customers asked to retrieve certain data or view tag-related data with fast response times. Getui therefore adjusted the original architecture and introduced an architecture combining offline processing, real-time processing, and data services (including retrieval).

Raw data is saved to HDFS, and offline batch processing is performed with Spark, MR, and similar engines. Kafka was introduced to solve the log collection problem: Flume collects the logs of each business node and writes them to the Kafka cluster, after which hour-level and second-level processing is carried out according to business classification. Finally, the results are delivered to each business line's DB or ES for use.
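The hour- and second-level processing mentioned above boils down to windowed aggregation over the Kafka stream. Below is a self-contained sketch in which an in-memory list stands in for a Kafka consumer fed by Flume; the event shape is an assumption:

```python
from collections import defaultdict

def window_counts(events, window_sec):
    """Aggregate (timestamp, app_id) events into fixed-size time windows.

    events: iterable of (unix_ts, app_id) pairs; in production these
    would arrive from a Kafka consumer, not an in-memory list.
    """
    counts = defaultdict(int)
    for ts, app_id in events:
        window_start = ts - ts % window_sec  # align to window boundary
        counts[(window_start, app_id)] += 1
    return dict(counts)

events = [(100, "app1"), (101, "app1"), (130, "app1"), (100, "app2")]
per_minute = window_counts(events, 60)
```

The same function with `window_sec=3600` gives the hour-level view, so one aggregation path serves both granularities.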
In the infrastructure construction stage, several tasks were completed: adopting the Lambda architecture (batch layer, speed layer, serving layer); introducing Hadoop (HDFS, Hive/MR, HBase, Mahout, etc.); implementing multi-dimensional retrieval with ES and a SolrCloud + HBase scheme; introducing Flume, Kafka, and Camus to optimize log transmission; and introducing and optimizing the domestic open-source Redis cluster solution, Codis.
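In the Lambda architecture listed above, the serving layer answers queries by merging the batch layer's precomputed view with the speed layer's recent deltas. A minimal sketch, where the key and view shapes are assumptions:

```python
def serve_count(key, batch_view, speed_view):
    """Merge the batch layer's precomputed count with the speed
    layer's delta for events not yet covered by a batch run."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"app1": 1000}            # recomputed by Spark/MR over HDFS
speed_view = {"app1": 42, "app2": 7}   # real-time increments since last batch

total = serve_count("app1", batch_view, speed_view)
```

When the next batch run completes, the speed view for the covered period is discarded, so any errors accumulated in real-time processing are corrected by the batch recomputation.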

The third stage of the evolution of the Getui big data system: toolization + servitization + productization


During infrastructure construction, Getui found that although an overall framework was in place, it still could not respond conveniently to the needs of business teams. Getui therefore provided tools to the business side and added a service proxy layer, the red part in the figure above. Batch tasks are abstracted into task templates and configured into the proxy layer, which the business side then calls. With only simple secondary development, business teams can use the computing services of the Getui clusters and speed up their own development.
In this stage, the Getui architecture mainly completed the following work: adding job scheduling management, by introducing Azkaban and adapting it (variable sharing, multi-cluster support, etc.); adding a service proxy layer, by introducing DataService and a job proxy (opening the platform to more product lines and decoupling them); and adding an application layer, by developing tools and data retrieval products on top of the service proxy layer.
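Abstracting batch jobs into templates, as described above, is essentially parameter substitution plus shared variables. The sketch below shows how a proxy layer might render a job for a business line; the template syntax, variable names, and command shape are illustrative assumptions, not Azkaban's actual format:

```python
from string import Template

# Variables shared across all jobs on one cluster (assumed values)
SHARED_VARS = {"cluster": "hdfs://nn1", "queue": "etl"}

JOB_TEMPLATE = Template(
    "spark-submit --queue ${queue} ${cluster}/jobs/${job}.jar "
    "--date ${date} --output ${cluster}/out/${job}/${date}"
)

def render_job(job, date, overrides=None):
    """Fill a job template with shared variables plus per-call parameters."""
    params = {**SHARED_VARS, "job": job, "date": date, **(overrides or {})}
    return JOB_TEMPLATE.substitute(params)

cmd = render_job("daily_report", "2016-01-01")
```

Because business teams only supply the job name and parameters, they get cluster access without knowing submission details, which is the decoupling the proxy layer was built for.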


Experience and lessons from the evolution of the Getui big data system

First, exploring and understanding the data is necessary work before development. Before processing, it is necessary to explore the dirty data and its distribution, and to identify invalid data and default values.
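A quick profiling pass of the kind described might count missing, default, and invalid values per field before any processing. The field name, the default marker, and the validity rule below are assumptions:

```python
from collections import Counter

def profile_field(records, field, is_valid):
    """Classify one field across records as missing/default/invalid/ok."""
    dist = Counter()
    for rec in records:
        value = rec.get(field)
        if value in (None, ""):
            dist["missing"] += 1
        elif value == "unknown":        # assumed default-value marker
            dist["default"] += 1
        elif not is_valid(value):
            dist["invalid"] += 1
        else:
            dist["ok"] += 1
    return dict(dist)

records = [
    {"device_id": "abc123"},
    {"device_id": ""},
    {"device_id": "unknown"},
    {"device_id": "!!"},
]
dist = profile_field(records, "device_id", str.isalnum)
```

Running this over a sample of each new log source makes the dirty-data distribution visible before cleaning rules are written, rather than after jobs start failing.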

Second, the data storage scheme should be kept close to the needs of analysis and computation. Consider indexed file formats such as CarbonData.

Third, data standardization is the primary means of improving downstream processing. Most data needs to be standardized before subsequent use (basic cleaning, unifying internal IDs, and adding necessary attributes). For real-time data, for example, it should first be standardized and then published to Kafka for all other real-time systems to consume. This avoids repeating routine cleaning and conversion across multiple businesses, and a unified ID makes it easy to join with other data.
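A sketch of that standardize-then-publish step: basic cleaning, mapping an external ID to a unified internal ID, and adding a required attribute before the record goes downstream. The field names and the ID mapping are assumptions:

```python
import time

# External -> unified internal ID (assumed mapping for illustration)
ID_MAP = {"vendor-xyz": "gt_000123"}

def standardize(raw):
    """Clean one raw event and unify its ID before publishing.

    Returns None for records that fail basic cleaning so callers
    can drop them instead of pushing dirt downstream.
    """
    device = (raw.get("device_id") or "").strip().lower()
    if not device:
        return None                              # basic cleaning: drop empty IDs
    return {
        "uid": ID_MAP.get(device, device),       # unified internal ID
        "event": raw.get("event", "unknown"),
        "ts": raw.get("ts") or int(time.time()), # add required attribute
    }

clean = standardize({"device_id": " Vendor-XYZ ", "event": "open", "ts": 1700000000})
```

With this done once at the front of the pipeline, every consumer reads the same clean shape from Kafka instead of re-implementing its own cleaning.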

Fourth, toolization, servitization, and productization improve overall efficiency. At the development level, MR and Spark APIs can be encapsulated and sufficient toolkits provided.

Fifth, full-link monitoring of a big data system is very important. Batch processing monitoring mainly includes: daily task run-time monitoring, skew detection, daily result-set curves, abnormal data curves, and GC monitoring. Stream processing monitoring includes: raw data fluctuation monitoring, consumption rate monitoring and alarms, compute node delay monitoring, and so on.
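The stream-side checks listed above, such as consumption-rate alarms and delay monitoring, reduce to comparing produced versus consumed offsets against a threshold. A minimal sketch, where the threshold value is an assumption:

```python
def check_lag(produced_offset, consumed_offset, max_lag=10_000):
    """Alert when a consumer falls too far behind its producer.

    In production the offsets would come from the Kafka cluster's
    metadata; here they are plain integers for illustration.
    """
    lag = produced_offset - consumed_offset
    return {"lag": lag, "alert": lag > max_lag}

status = check_lag(1_050_000, 1_035_000)
```

Feeding the same lag value into a time series also gives the "raw data fluctuation" curve: a sudden drop in produced offsets is as alarming as a growing consumer lag.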