In the first session of the Recommender System College, Luo Yuanfei, senior researcher at 4Paradigm, introduced how to automatically and efficiently generate and select features within an exponential-scale search space for the high-dimensional sparse data found in recommender systems, and how to combine this with a large-scale distributed machine learning system to reduce computation, storage, and communication costs so that effective combined features can be screened out quickly.
The following is Luo Yuanfei's technology sharing from the first online event of the Recommender System College:
Hello everyone! I am Luo Yuanfei from 4Paradigm.
I'm very glad to have the opportunity to share some of our work on automatic machine learning. My work at 4Paradigm mostly concerns automatic machine learning, and my previous focus was mainly on automatic feature engineering. Although model improvements can bring stable gains, they are harder to obtain. So if you are working on a new business, you can try starting with features, which often brings more obvious benefits.
The background of AutoCross
The automatic machine learning discussed in this talk targets tabular data. Tabular data is a classical data format that generally contains multiple columns, each corresponding to a discrete or continuous feature. We cannot directly use the models designed for images, speech, or NLP; we need optimizations specific to this format.
The feature combination discussed in this talk, specifically feature crossing, is the Cartesian product of two discrete features. Take "restaurants I have been to" as an example: I often go to McDonald's, so "me × McDonald's" can be a combined feature. Likewise, when I go to KFC, "me × KFC" can also be a combined feature.
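As a concrete illustration (this is a minimal sketch, not 4Paradigm's implementation), crossing two discrete feature columns amounts to forming, per row, a single token from the pair of values, so the new feature's vocabulary is the Cartesian product of the two originals:

```python
def cross(col_a, col_b):
    """Row-wise cross of two discrete feature columns:
    each row becomes one combined categorical value."""
    return [f"{a}_x_{b}" for a, b in zip(col_a, col_b)]

users = ["me", "me", "alice"]
restaurants = ["McDonald's", "KFC", "KFC"]
print(cross(users, restaurants))
# ["me_x_McDonald's", 'me_x_KFC', 'alice_x_KFC']
```

The `_x_` separator and function name are illustrative; any scheme that maps value pairs to unique tokens works.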
The automatic feature engineering discussed in this talk refers to automatically discovering such effective combined features from tabular data. For example, I am a software engineer, which is one feature; I work at 4Paradigm, which is another. These two features are stored in two columns. We can cross these two columns into a new feature, which is more indicative and more personalized.
Why do we need automatic feature engineering?
Firstly, features play an important role in model quality. Secondly, there are far more customer scenarios than modeling experts. For example, our first recommendation business involved thousands of media outlets; we cannot assign an expert to model each scenario manually. Finally, even within a single business, the data is variable and the scenarios we face keep changing. Therefore, we need automatic feature engineering; we cannot let the required manpower grow in proportion to our business volume.
Research on AutoCross
There are two main types of automatic feature engineering: explicit feature combination and implicit feature combination.
Explicit feature combination
Explicit feature combination has two representative works, RMI and CMI. The letters "MI" stand for mutual information, a classic feature selection criterion.
<figure data-size="normal"></figure>MI is calculated by counting the occurrence and co-occurrence frequencies of two feature columns in the same data. RMI's approach, however, is to count part of the information in the training set and the other part in reference data, which is also the source of the "R". The above figure comes from RMI's paper; it shows that AUC rises gradually as different combined features are added. CMI is another classic work: it calculates the importance of each feature by analyzing the logistic loss function combined with Newton's method.
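The MI computation described above can be sketched as follows; this is a generic estimator from occurrence and co-occurrence counts, not the exact code from either paper:

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Estimate MI between two discrete feature columns
    from occurrence and co-occurrence frequencies."""
    n = len(col_a)
    count_a = Counter(col_a)              # occurrence counts per value
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b)) # co-occurrence counts per pair
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) )
        mi += p_ab * math.log(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi
```

Perfectly correlated columns give MI = log 2 for two balanced values, while independent columns give 0, which is why MI ranks informative crosses highly.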
Both achieve good results. However, on the one hand, they only consider second-order feature combinations; on the other hand, they are serial algorithms: each time a combined feature is selected, the other features must be retrained, which is O(n^2) complexity, where n is the number of features. In addition, MI itself does not allow multiple values to appear simultaneously under one feature.
Implicit feature combination
<figure data-size="normal"></figure>The other type is implicit feature combination, which you may be more familiar with. FM and FFM enumerate all second-order feature combinations, representing each combination as the inner product of two low-dimensional embedding vectors, and have achieved good results. With the rise of deep learning, combining features implicitly with DNNs has become more popular, but its weak interpretability has drawn criticism.
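FM's implicit crossing can be sketched as follows: each feature i gets a low-dimensional embedding v_i, and the weight of the cross (i, j) is the inner product ⟨v_i, v_j⟩. The function name and shapes here are illustrative, and the code uses the standard O(nk) identity rather than the naive double loop:

```python
import numpy as np

def fm_second_order(x, V):
    """Sum of all second-order FM interactions.
    x: (n,) feature vector; V: (n, k) embedding matrix.
    Uses: sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * ( ||V^T x||^2 - sum_i ||v_i||^2 x_i^2 )."""
    s = V.T @ x                                  # (k,) combined embedding
    return 0.5 * (s @ s - np.sum((V ** 2).T @ (x ** 2)))
```

Because every pair shares the embeddings, FM covers all O(n^2) second-order crosses with only O(nk) parameters, at the cost of interpretability.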
We propose AutoCross, which has strong interpretability, can achieve high-order feature combinations, and has high inference efficiency.
AutoCross overall structure
<figure data-size="normal"></figure>From left to right, the input of AutoCross is the data and the corresponding feature types. Passing through the AutoCross flow produces a feature generator, which can apply the learned feature processing to new data.
The flow has three parts: preprocessing, followed by an iterative process of generating and selecting combined features. For data preprocessing, we propose multi-granularity discretization; to generate composite features effectively from an exponential space, we use beam search; to select features effectively and cheaply, we propose the field-wise LR and successive mini-batch gradient descent methods.
Let's look at the algorithms involved in each step.
The first step is data preprocessing, whose purpose is to fill in missing values and discretize continuous features. We observed that for continuous features, different discretization granularities can produce very different results; even on a single data set, a 10% difference in AUC was observed. Manually tuning the optimal discretization granularity for each data set would be expensive and unrealistic.
Based on this, we propose multi-granularity discretization, which discretizes the same feature at several granularities simultaneously. Take the feature "age": we discretize it once with an interval of 5, once with an interval of 10, and once with an interval of 20, generating several different discretized features at the same time, so that the model can automatically select the most suitable one.
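A minimal sketch of this idea, assuming equal-width bins (the paper's actual binning scheme may differ): the same continuous column is binned at several widths at once, and downstream feature selection keeps whichever granularity works best.

```python
def multi_granularity_discretize(values, widths=(5, 10, 20)):
    """Bin the same continuous feature at several widths at once,
    producing one discrete feature per granularity."""
    return {w: [int(v // w) for v in values] for w in widths}

print(multi_granularity_discretize([23, 37, 58]))
# {5: [4, 7, 11], 10: [2, 3, 5], 20: [1, 1, 2]}
```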
<figure data-size="normal"></figure>Beam search
As mentioned above, given n original features there are O(n^k) possible k-th-order combinations, an exponential growth. How can we search, generate, and combine features effectively in this space? Generating all of them is infeasible in both computation and storage.
We use beam search to solve this problem. It works by first generating part of the second-order combined features, then using the effective second-order combinations to derive third-order combinations. It does not generate all third-order combinations, which makes it a greedy search method.
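The procedure above can be sketched as a generic beam search over crossed feature sets; `score` here is a hypothetical stand-in for the real evaluation (field-wise LR performance in AutoCross), and the beam width and halting rule are illustrative:

```python
def beam_search_crosses(features, score, max_order=3, beam_width=2):
    """Greedy beam search: keep only the top `beam_width` crosses at
    each order and extend those with one more original feature."""
    beam = [(f,) for f in features]               # order-1 "crosses"
    best = max(beam, key=score)
    for _ in range(max_order - 1):
        # extend surviving crosses by one feature; dedupe via sorted tuples
        candidates = {tuple(sorted(set(c + (f,))))
                      for c in beam for f in features if f not in c}
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        best = max([best] + beam, key=score)
    return best
```

Only the children of beam survivors are ever generated, so the search touches O(beam_width · n) candidates per order instead of the full O(n^k) space.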
<figure data-size="normal"></figure>Field-wise LR
We preprocess the data with multi-granularity discretization and then reduce the search space with beam search.
However, the number of generated features is still large. How can we quickly and cheaply select the effective ones? To this end, we propose the field-wise LR algorithm: fix the model parameters corresponding to the already-selected features, then compute which candidate feature, when added, maximizes model performance. This significantly reduces the cost of computation, communication, and storage.
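A minimal sketch of the field-wise idea (names and the plain gradient-descent loop are illustrative, not the paper's exact solver): the logits contributed by the already-selected features are frozen as a constant, and only the candidate field's own weights are fitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_candidate_weights(X_cand, y, fixed_logits, lr=0.1, steps=200):
    """Fit only the candidate field's weights.
    X_cand: (m, d) one-hot columns of the candidate feature;
    fixed_logits: (m,) frozen contribution of already-selected features."""
    w = np.zeros(X_cand.shape[1])
    for _ in range(steps):
        p = sigmoid(fixed_logits + X_cand @ w)    # fixed part is constant
        w -= lr * X_cand.T @ (p - y) / len(y)     # update candidate only
    return w
```

Because `fixed_logits` can be precomputed once, each candidate's evaluation touches only its own d parameters instead of retraining the full model.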
Successive mini-batch GD
To further reduce the cost of feature evaluation, we propose a successive mini-batch gradient descent method. During the iterations of mini-batch gradient descent, candidate features that are not promising are gradually eliminated, and the more important features are given more batches of data to increase evaluation accuracy.
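The elimination schedule can be sketched in a successive-halving style (a simplified stand-in for the paper's procedure; `evaluate` would be a field-wise LR run on one mini-batch, and the halving rule is illustrative):

```python
def successive_mini_batch(candidates, batches, evaluate):
    """Evaluate all candidates on early batches, drop the weakest half
    each round, and give survivors more data."""
    scores = {c: 0.0 for c in candidates}
    alive = list(candidates)
    for batch in batches:
        for c in alive:
            scores[c] += evaluate(c, batch)          # accumulate evidence
        if len(alive) > 1:
            alive.sort(key=lambda c: scores[c], reverse=True)
            alive = alive[: max(1, len(alive) // 2)] # halve the survivors
    return alive[0]
```

Weak candidates consume only a batch or two of compute, while the eventual winner is judged on the most data, which is exactly the cost/accuracy trade-off described above.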
AutoCross system optimization
Here are some of the optimizations we have done on the system.
Cache feature weight
From an algorithmic point of view, our system solves an exponential-space search problem; even with reduced complexity, its computational cost is still very high. Therefore, we sample the data and compress its storage.
After that, when running field-wise logistic regression, the system caches the computed feature weights. With the previous approach, we would first need to fetch the weights of the generated features from the parameter server, incurring network cost; then generate the features and predict, incurring computation cost; and finally store the generated features on disk, incurring further storage cost. By caching the weights of previous features and simply looking them up in a table, we reduce network, computation, and storage costs.
In addition to caching feature weights, we also compute features online: during feature generation, dedicated threads deserialize the data and generate features.
In addition, data parallelism is a common system optimization. Each process in the system holds a computation graph, and the master node and the parameter server ensure that they operate in an orderly manner.
The figure below shows our experimental results.
<figure data-size="normal"></figure>There are two baselines here. First, look at how the features generated by AutoCross help LR: when we feed AutoCross features into LR, the effect improves significantly (rows 1 and 2). We also compared AutoCross with CMI (rows 2 and 4) and found that AutoCross consistently outperforms CMI.
To verify whether AutoCross features also help deep models, we combined them with the W&D (Wide & Deep) model (row 3). When given these features, W&D also achieved good results, comparable to the best deep learning models on the ten data sets.
Luo Yuanfei, Wang Mengshuo, Zhou Hao, Yao Quanming, Tu Weiwei, Chen Yuqiang, Yang Qiang, and Dai Wenyuan. 2019. AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. KDD.
Rómer Rosales, Haibin Cheng, and Eren Manavoglu. 2012. Post-click Conversion Modeling and Analysis for Non-guaranteed Delivery Display Advertising. WSDM.
Olivier Chapelle, Eren Manavoglu, and Rómer Rosales. 2015. Simple and Scalable Response Prediction for Display Advertising. TIST.
Steffen Rendle. 2010. Factorization Machines. ICDM.
Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware Factorization Machines for CTR Prediction. RecSys.
Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine Based Neural Network for CTR Prediction. IJCAI.