Today, I'd like to share 4Paradigm's practice in large-scale feature engineering for recommendation systems and LLVM-based optimization of Spark, covering the following four topics.
- Introduction to feature engineering for large-scale recommendation systems
- SparkSQL and FESQL architecture design
- LLVM-based performance optimization of Spark
- Summary of recommendation systems and Spark optimization
Introduction to feature engineering for large-scale recommendation systems
Recommendation systems are widely used in news recommendation, search engines, advertising, and the recently popular short-video apps. It is fair to say that most Internet companies, and many traditional enterprises, can improve their business value through a recommendation system.
We divide the common recommendation system architecture into layers. The offline layer is mainly responsible for preprocessing and feature extraction over large-scale data stored in HDFS; a mainstream machine learning framework then trains a model and exports it for online services to load. The stream layer, also known as the near-line layer, sits between offline and online: a stream computing framework such as Flink performs near-real-time feature computation, and the results are saved to a NoSQL or relational database for online services to read. The online layer includes the user-facing UI and the online services; it extracts streaming features in real time and uses the offline-trained model for prediction, realizing the system's online recall and recommendation functions. Prediction results and user feedback can also be written, through an event distributor, to the loss-computation queue and to offline Hadoop storage.
This talk focuses on optimization of the offline layer. In a large-scale recommendation system, offline storage may reach the PB level. The common data-processing stages are ETL (extract, transform, load) and FE (feature extraction), and the main programming tools are SQL and Python. To process data at this scale, distributed computing frameworks such as Hadoop, Spark, and Flink are generally used; Spark is the most widely adopted in industry because it supports both SQL and Python interfaces.
SparkSQL and FESQL architecture design
Spark has just released version 3.0, with great improvements in both performance and ease of use. Compared with Hadoop MapReduce, Spark can be more than 100 times faster. It can handle PB-level data, supports horizontally scalable distributed computing with automatic failure recovery, and offers easy-to-use SQL, Python, and R interfaces. For recommendation systems, the built-in recommendation algorithm models can also be used out of the box.
Many scenarios in industry use Spark as the offline data-processing framework for recommendation systems: Spark loads distributed data sets, Spark UDFs and SQL handle data preprocessing and feature selection, and MLlib trains the recall and ranking models. However, Spark cannot support the online part. The main reason is that Spark does not support long-running services; its driver/executor architecture is only suitable for offline batch computing. Spark 3.0 introduces Hydrogen to support certain pre-launched tasks, but it is still aimed at offline or model computation; for online serving, which demands much lower latency, the support is poor. The Spark RDD programming interface is likewise oriented toward iterative batch computation. We conclude that Spark's strengths are batch processing of large-scale data and standard SQL syntax. Its weaknesses are the lack of online prediction-service support, which means it cannot guarantee offline/online consistency, and the absence of optimizations specific to feature computation in AI scenarios.
FESQL, a self-developed service built on SparkSQL, provides performance optimization for feature-extraction computation in AI scenarios and fundamentally solves the offline/online consistency problem. In the traditional AI deployment workflow, a model is first trained and exported with a machine learning framework in the offline environment, and business developers then build the online service. Because SQL and Python are used offline for preprocessing and feature extraction, a matching online processing framework must be developed separately; two different computing systems easily diverge in functionality, producing offline/online inconsistency. Offline modeling may even use features that leak future information ("time-travel" features), which cannot be reproduced online at all. FESQL uses one unified SQL dialect: in addition to standard SQL, it extends the computation syntax and UDF definitions for AI scenarios, and the same high-performance LLVM JIT code generation is used both offline and online. The computing logic is therefore identical in both environments, guaranteeing offline/online feature consistency for machine learning.
To support online functionality that SparkSQL cannot provide, the online part of FESQL implements a self-developed, high-performance, fully in-memory time-series database. Compared with general-purpose in-memory key-value databases such as Redis and VoltDB, it greatly improves read/write performance and compression for time-series features, and it meets the ultra-low-latency requirements of online serving better than traditional time-series databases such as OpenTSDB. The offline part still relies on Spark's distributed task scheduling, but uses a more efficient native execution engine for SQL parsing and execution. Through LLVM JIT code generation implemented in C++, it can use more intrinsics for modern CPUs, apply instruction-set optimizations such as vectorization, and even accelerate with specialized hardware such as FPGAs and GPUs. With the same SQL execution engine optimized on both sides, execution efficiency improves offline and online, and the feature-extraction logic from offline modeling can be migrated to online services without extra development or verification work.
We compared FESQL's performance with MemSQL, a commercial product that is also fully in-memory. In time-series feature-extraction scenarios for machine learning, the same SQL runs significantly faster on FESQL than on MemSQL.
LLVM-based performance optimization of Spark
Since Spark 2.0, the Catalyst and Tungsten projects have optimized the performance of Spark and SQL tasks. Catalyst performs lexical and syntactic parsing of SQL to produce an unresolved abstract syntax tree, then applies dozens of optimization rules to it; the resulting physical plan can be tens of times faster than naive SQL interpretation. Tungsten manages internal data structures off-heap through Java's Unsafe interface, greatly reducing JVM GC overhead, and implements whole-stage code generation for multiple physical nodes and expressions: Java bytecode is generated directly and compiled by the Janino in-memory compiler. The generated code avoids excessive virtual function calls and improves CPU cache hit rates, running several times faster than traditional Volcano-model interpretation and very close to Java code hand-written by an experienced programmer.
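To make the contrast concrete, here is a toy Python sketch (not Spark's actual generated code; the expression and names are illustrative). It evaluates `(x + 1) * 2` per row first by Volcano-style tree interpretation, with one dispatch per operator per row, and then via "codegen": emitting one flat expression and compiling it once.

```python
def interpret(node, row):
    """Volcano-style evaluation: one dispatch per operator per row."""
    op = node[0]
    if op == "col":
        return row[node[1]]
    if op == "lit":
        return node[1]
    left, right = interpret(node[1], row), interpret(node[2], row)
    return left + right if op == "add" else left * right

def codegen(node):
    """Whole-stage-codegen idea: emit one flat expression, compile once."""
    op = node[0]
    if op == "col":
        return f"row[{node[1]!r}]"
    if op == "lit":
        return repr(node[1])
    left, right = codegen(node[1]), codegen(node[2])
    sym = "+" if op == "add" else "*"
    return f"({left} {sym} {right})"

# (x + 1) * 2 as an expression tree
expr = ("mul", ("add", ("col", "x"), ("lit", 1)), ("lit", 2))
compiled = eval(compile("lambda row: " + codegen(expr), "<gen>", "eval"))

row = {"x": 10}
assert interpret(expr, row) == compiled(row) == 22
```

The compiled version pays the tree walk once, at compile time; per-row work shrinks to a single flat computation, which is the same effect Tungsten's generated bytecode achieves on the JVM.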
Are Spark's Catalyst and Tungsten good enough? We think not. First, Spark is implemented in Scala and Java, and PySpark also calls Java functions through a socket connection to the JVM, so all code runs on the JVM and inevitably pays JVM and GC overhead. Moreover, as CPU hardware and instruction sets evolve, it is increasingly difficult to exploit new hardware features from the JVM, not to mention the increasingly popular FPGAs and GPUs. For a high-performance execution engine, a lower-level C or C++ implementation can improve code performance: parallelizable computation tasks can use loop unrolling and similar optimizations to multiply throughput, contiguous in-memory data structures allow more vectorization, and the thousands of compute cores of a GPU can be used for parallelism; none of these are supported in the latest open-source Spark 3.0. Finally, machine learning scenarios often use SQL window functions to compute temporal features, and the corresponding Spark physical node, WindowExec, does not implement whole-stage code generation. In other words, Tungsten's optimizations do not apply to windowed computation over multiple expressions; every feature is computed by interpretation, so performance is far slower than equivalent hand-written Java code.
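As a concrete (hypothetical) illustration of such a temporal feature, the pure-Python sketch below mimics what a SQL window like `SUM(amount) OVER (PARTITION BY user ORDER BY ts ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` computes; the column names are illustrative, not taken from the talk.

```python
from collections import defaultdict, deque

def rows_window_sum(rows, partition_key, order_key, value_key, preceding):
    """For each row, sum `value_key` over the current row plus up to
    `preceding` prior rows of the same partition, ordered by `order_key`
    (i.e. ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW)."""
    out = []
    # One bounded buffer per partition holds the sliding window.
    buffers = defaultdict(lambda: deque(maxlen=preceding + 1))
    for row in sorted(rows, key=lambda r: (r[partition_key], r[order_key])):
        buf = buffers[row[partition_key]]
        buf.append(row[value_key])
        out.append({**row, "window_sum": sum(buf)})
    return out

events = [
    {"user": "a", "ts": 1, "amount": 10},
    {"user": "a", "ts": 2, "amount": 20},
    {"user": "a", "ts": 3, "amount": 30},
    {"user": "b", "ts": 1, "amount": 5},
]
feats = rows_window_sum(events, "user", "ts", "amount", preceding=2)
assert [f["window_sum"] for f in feats] == [10, 30, 60, 5]
```

In SparkSQL, each such windowed aggregate runs through interpreted WindowExec per row and per expression, which is exactly where the interpretation overhead described above accumulates.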
To solve Spark's performance problems, we implemented a native execution engine for Spark based on LLVM that remains compatible with the Spark SQL interface. Where Spark generates logical plan nodes and Java bytecode and runs on the JVM, the FESQL execution engine also parses SQL into a logical plan, but then uses JIT technology to generate platform-specific machine code directly for execution. Architecturally, this removes the overhead of the JVM layer and brings a large performance improvement.
LLVM is a very popular compiler toolchain whose subprojects include the famous Clang and LLDB. LLVM technology is also used in MLIR, promoted mainly by the TensorFlow team, and in TVM in the machine learning field. It can be understood as a toolkit for building compilers: popular programming languages such as Ada, C, C++, D, Delphi, Fortran, Haskell, Julia, Objective-C, Rust, and Swift all have compilers based on LLVM.
JIT is the counterpart of AOT. AOT (ahead-of-time) compilation happens before the program runs: the C or Java code we usually write is compiled into binaries or bytecode before execution, which is AOT compilation. JIT (just-in-time) compilation optimizes and compiles at runtime. Many interpreted languages, such as Python and PHP, have applied JIT technology: hot code with very high execution frequency is compiled into platform-optimized native binary. This dynamic code-generation technique is called JIT compilation.
LLVM provides a high-quality, modular compilation and linking toolchain. It can easily be used to implement an AOT compiler, or be integrated into a C++ project to JIT-compile user-defined functions. Consider implementing a simple add function: compared with writing the function directly in C, a JIT must define the function signature, parameters, return value, and related data structures in code, and finally the LLVM JIT module generates the platform-specific symbol table and executable format. Because LLVM ships a large number of built-in optimization passes, a self-implemented JIT compiler is not much worse than GCC or Clang. JIT can be used to generate all kinds of UDFs (user-defined functions) and UDAFs (user-defined aggregate functions). LLVM also supports multiple backends: besides common architectures such as x86 and ARM, the PTX backend can generate CUDA code that runs on GPUs. LLVM further exposes low-level intrinsic function interfaces, so programs can use modern CPU instruction sets with performance comparable to hand-written C or even hand-written assembly.
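The original add-function listing is not reproduced here. As an illustrative stand-in (not using LLVM, and assuming a Linux x86-64 environment), the following Python sketch shows the essence of what a JIT does: emit native machine code into executable memory at runtime, then call it as an ordinary function.

```python
import ctypes
import mmap

# x86-64 System V machine code for: int add(int a, int b) { return a + b; }
#   89 f8    mov eax, edi   ; eax = first argument
#   01 f0    add eax, esi   ; eax += second argument
#   c3       ret            ; return value in eax
machine_code = bytes([0x89, 0xF8, 0x01, 0xF0, 0xC3])

# Allocate a readable/writable/executable page and copy the code into it.
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(machine_code)

# Wrap the buffer's address as a C function pointer: int (*)(int, int).
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
add = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_int)(addr)

assert add(2, 3) == 5
```

A real LLVM JIT does the same thing at a higher level: instead of hand-encoded bytes, it builds IR for the function, runs optimization passes over it, and lets the backend emit and relocate machine code for the host platform.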
At Spark + AI Summit 2020, Databricks not only released Spark 3.0 but also mentioned Photon, an internal closed-source project. As Spark's native execution engine, it accelerates the execution of Spark SQL and other workloads. Photon is also implemented in C++. Databricks' experimental data shows that string processing and other expressions implemented in C++ can be several times faster than the Java implementations, with more vectorized instruction-set support; the overall design is very similar to FESQL. However, as a closed-source project, Photon can only be used on Databricks' commercial platform. It is still in the experimental stage and requires contacting support to enable manually. Since no further implementation details have been published, it is unclear whether Photon is based on LLVM JIT, and officially there is no mention of PTX or CUDA support yet.
The native execution engine provided by FESQL also applies many node- and expression-level optimizations. For example, at the project node, SimpleProject prunes unused column data, reducing the number of nodes executed and the amount of data transferred between nodes; and through whole-stage code generation, a window node can be merged directly with the project node, so that all required results are produced in a single iterator pass.
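A hypothetical sketch of the column-pruning idea (function and column names are illustrative, not FESQL's API): given the columns that the output expressions actually reference, a project node can drop everything else before passing rows downstream.

```python
def referenced_columns(expr):
    """Collect column names referenced by a nested ("op", ...) expression tree."""
    if expr[0] == "col":
        return {expr[1]}
    cols = set()
    for child in expr[1:]:
        if isinstance(child, tuple):
            cols |= referenced_columns(child)
    return cols

def simple_project(rows, exprs):
    """Prune each row down to only the columns the expressions need."""
    needed = set().union(*(referenced_columns(e) for e in exprs))
    return [{k: v for k, v in row.items() if k in needed} for row in rows]

rows = [{"a": 1, "b": 2, "unused": 99}]
exprs = [("add", ("col", "a"), ("col", "b"))]
assert simple_project(rows, exprs) == [{"a": 1, "b": 2}]
```

Applied early in the plan, this kind of pruning shrinks every row that later nodes (and the network between them) must handle.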
On the expression side, optimization passes handle limit and where merging, constant folding, and simplification of filter, cast, upper, and lower, generating the simplest possible expression computation and greatly reducing the number of CPU instructions executed. We will not go into the related SQL optimizations in detail here; the point is that only with optimization at the logical-node, expression, instruction-set, and code-generation levels combined can performance approach the hand-written code of top programmers.
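Constant folding is the simplest of these passes to show. The toy Python sketch below (illustrative only, reusing a nested-tuple expression encoding) collapses constant subtrees at optimization time so the runtime executes fewer operations per row.

```python
def fold(expr):
    """Recursively fold constant subtrees:
    ("add", ("lit", 1), ("lit", 2)) -> ("lit", 3)."""
    if expr[0] in ("lit", "col"):
        return expr
    op, left, right = expr[0], fold(expr[1]), fold(expr[2])
    if left[0] == "lit" and right[0] == "lit":
        value = left[1] + right[1] if op == "add" else left[1] * right[1]
        return ("lit", value)
    return (op, left, right)

# x + (2 * 3): the constant multiply disappears at optimization time,
# leaving a single addition to execute per row.
folded = fold(("add", ("col", "x"), ("mul", ("lit", 2), ("lit", 3))))
assert folded == ("add", ("col", "x"), ("lit", 6))
```

In a JIT pipeline, the folded tree is what gets lowered to machine code, so every instruction the optimizer removes here is an instruction the CPU never executes.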
In a common machine learning test scenario of time-series feature extraction, using the same version of the Spark scheduling engine and the same SQL statements and test data, FESQL's native execution engine improves performance by nearly 2x for a single window, and by nearly 6x in more complex multi-window scenarios where the computation is more CPU-intensive; the results are similar in multi-threaded environments.
The results show that the performance improvement from LLVM JIT is very clear. With the same code and SQL, without modifying a single line, simply replacing the execution-engine implementation under SPARK_HOME achieves nearly 6x or even greater speedups. We investigated the cause using the generated execution plans and flame graphs. First, the Spark UI shows that SparkSQL's window node does not implement whole-stage code generation, so this part is interpreted Scala code. Moreover, SparkSQL's physical plan is very long, and each node incurs some overhead in UnsafeRow checking and generation. FESQL, by contrast, has only two nodes: after reading the data, it directly executes the LLVM JIT-compiled binary code, greatly reducing inter-node overhead. Flame-graph analysis shows the Spark scheduler's runTask function at the bottom; when Spark SQL computes aggregate features over a sliding window, the sample counts and elapsed times are relatively high, whereas FESQL executes natively and the basic min, max, sum, and avg take less CPU time after compiler optimization. Although UnsafeRow encoding and decoding time appears on the left of the graph, it is a small proportion, and the overall time is far less than SparkSQL's.
FESQL is one of the few native execution engines that runs several times faster than open-source Spark 3.0 while supporting standard SQL and integrating into Spark. Unlike Photon, which can only be used inside Databricks, we will release an LLVM-enabled Spark distribution integrated with LLVM JIT optimization in the future. Without modifying a single line of code, just pointing SPARK_HOME at it brings a large performance speedup, and it remains compatible with existing Spark applications. For more FESQL use cases, please follow the GitHub project https://github.com/4paradigm/… .
Summary of recommendation systems and Spark optimization
Finally, we summarize our work on recommendation systems and Spark optimization. First, a large-scale recommendation system must rely on frameworks that can handle big-data computation, such as Spark, Flink, ES (Elasticsearch), and FESQL. Spark is the most popular offline big-data processing framework, but it is only applicable to offline batch processing and cannot serve online. FESQL is our self-developed SQL execution engine: integrated with an internal time-series database, it lets SQL go online with one click while guaranteeing offline/online consistency, and its internal LLVM JIT optimizes SQL execution performance to several times that of open-source Spark 3.0.