Algorithm competitions, with their fast feedback and fierce rivalry, are an important way for algorithm practitioners to improve their technical skills. Competition problems abstracted from core industry problems have strong practical significance. Drawing on the author’s experience of winning seven Kaggle/KDD Cup championships, this article covers three topics: multi-domain modeling optimization, the AutoML technical framework, and how to analyze and model new problems. We hope readers can take away general, efficient modeling methods and ways of understanding problems.

## 1 Background and Introduction

Algorithm competitions, with their fast feedback and fierce rivalry, are an important way for algorithm practitioners to improve their technical skills. Competition problems abstracted from core industry problems have strong practical significance. A competition’s real-time leaderboard pushes participants to keep improving in an attempt to surpass the current best practice. Moreover, winning solutions give a strong push to both industry and academia: for example, the field-aware factorization machine (FFM) algorithm^{[1]} that came out of the KDD Cup competition and the ResNet model^{[2]} that came out of the ImageNet competition are both widely used in industry.

The Meituan in-store advertising quality estimation team won first place in MDD Cup, Meituan’s internal algorithm competition. At the invitation of the competition organizing committee, we would like to share some general competition experience. This article shares the author’s experience from seven Kaggle/KDD Cup championships (as shown in Figure 1 below), in the hope of helping more readers.

As we all know, the Kaggle and KDD Cup competitions are top-level international events with great influence in both the competition community and industry. Kaggle is the world’s largest data mining platform, with hundreds of thousands of users around the world. Through high prize money and a culture of sharing, it has produced a large number of excellent algorithm solutions; the prize for the Heritage Health competition, for example, was as high as $3 million. Kaggle competitions have produced outstanding results in areas such as AIDS research, chess and card rating, and traffic prediction. Thanks to this, the Kaggle platform was later acquired by Google.

ACM SIGKDD (the International Conference on Knowledge Discovery and Data Mining, KDD for short) is a top-level international conference in the field of data mining. KDD Cup is a top international competition in data mining research sponsored by SIGKDD. It has been held once a year since 1997 and is the most influential event in the field. The competition is open to both industry and academia; it brings together top experts, scholars, engineers, and students from the data mining community around the world, and gives data mining practitioners a platform to exchange knowledge and present research results.

Analysis shows that over the past 20 years KDD Cup has stayed closely tied to cutting-edge and hot problems in industry, and its evolution falls into three main stages. The first stage, starting around 2002, focused on Internet recommendation systems, a hot topic at the time, including recommendation, advertising, and behavior prediction. The second stage focused on traditional industries, paying more attention to fields such as education, environment, and medical care. The third stage, since 2019, has focused on non-supervised problems such as AutoML, debiasing, and reinforcement learning; the common feature of these competitions is that the new problems are hard to solve with previous methods. The trends of these three stages also reflect, to some extent, the current difficulties and priorities of industry and academia. In terms of approach, method, and problem dimension alike, they show an evolution from narrow to broad and from standard to non-standard.

This article first introduces the solutions behind the author’s seven KDD Cup/Kaggle championships and what was learned from them. The problems span many fields, including recommendation, advertising, transportation, environment, and AI fairness. It then introduces the AutoML technical framework that played a key role in these competitions, including automated feature engineering, automated model optimization, and automated model fusion, and explains how to model different problems systematically with this framework. Finally, it presents the general methodology distilled from these competitions: how to analyze, understand, and model a new problem and overcome its challenges, so as to optimize the solution in depth.

This article is mainly intended for the following two types of readers; other interested readers are also welcome.

- Algorithm competition enthusiasts who want to understand the methods and logic behind championship solutions in top international data mining competitions and achieve better rankings.
- Engineers and researchers in industry who want to draw on competition methods and apply them in practical work to achieve better results.

## 2 Multi-Domain Modeling Optimization

In this part, we divide the above competitions into three groups. The first group concerns recommendation systems. The second group concerns time series problems; the important difference from the first group is that they predict a multi-point series in the future rather than the single point predicted in a recommendation system. The third group concerns automated machine learning. The input of such a competition is not a single data set but multiple data sets from multiple problems, and the problems behind the B-list data sets used in the final evaluation are unknown, so the solution must be highly robust. As shown in Table 1, the winning solutions for the seven competition tracks are introduced in detail below, merged into five core solutions.

### 2.1 Recommendation System Problems

This section introduces the Kaggle Outbrain Ads Click Prediction competition and the KDD Cup 2020 Debiasing competition. Both tasks predict the user’s next click, but because of their different application scenarios and backgrounds they pose different challenges. The former has a huge data scale, involving billions of browsing records from hundreds of millions of users on thousands of heterogeneous sites, and places strict demands on model optimization and fusion. The latter pays special attention to bias in recommendation systems and requires contestants to propose effective solutions that mitigate selection bias and popularity bias, thereby improving the fairness of the recommendation system. The two competitions are introduced in turn below.

#### Kaggle Outbrain Ads Click Prediction: A Multi-Level, Multi-Factor Model Fusion Scheme

**Competition problem and challenges**: the task is to estimate the user’s next click on a web advertisement on the Outbrain web content discovery platform. For details, see the Kaggle Outbrain competition introduction^{[26]}. Participants face two important challenges:

- **Heterogeneity**: the platform provides demand-side platform (DSP) advertising services, which involves describing user behavior across thousands of heterogeneous sites.
- **Ultra-high-dimensional sparsity**: the features are high-dimensional and sparse, and the data scale is huge, including 700 million users and 2 billion browsing records.

**Multi-level, multi-factor model fusion scheme**: to address the challenges of this competition, our team adopted a multi-level, multi-factor model fusion scheme. On the one hand, a single model cannot easily describe behavior on heterogeneous sites comprehensively. On the other hand, data at the scale of hundreds of millions of records leaves a lot of room for optimizing multiple models separately. Because FFM has strong feature-crossing and generalization ability, it handles high-dimensional sparse features well, so we chose it as the main base model for fusion. Model fusion lets different models learn different content and thus effectively mines users’ heterogeneous behavior on different sites. The key to model fusion is to produce and combine models that are “good but different”^{[3]}. The multi-level, multi-factor scheme first builds diversity among models from the perspectives of model differences and feature differences, and then fuses them over multiple levels with base learners that take multiple feature factors (model pCTR estimates and hidden-layer representations) as input.

Specifically, as shown in Figure 3 above, the first level builds diverse single models: different types of models are trained on the user’s recent behavior, on all behavior data, and on different feature sets. The second level generates further diversity by combining different single models. The added diversity comes from two sources: differences in how models are combined (different models are used to score on top of the single models’ features) and differences in the feature factors used for combination, where the feature factors include the model scores and the hidden-layer parameters of the models. The third level determines how to combine the different fusion results. Because the held-out validation set is small, a complex nonlinear model would overfit easily, so a constrained linear model is used to learn the fusion weights of the second-level models.
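The third-level fusion can be illustrated with a small sketch. It assumes the constraint is that the fusion weights are non-negative and sum to 1, and it finds them with a coarse grid search on validation log loss; the source does not state the exact constraint or solver, so `fit_simplex_weights` and the toy data are hypothetical.

```python
import itertools
import math

def log_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def blend(preds, weights):
    """Weighted average of per-model prediction lists."""
    return [sum(w * p[i] for w, p in zip(weights, preds)) for i in range(len(preds[0]))]

def fit_simplex_weights(preds, y_true, steps=10):
    """Assumed constraint: w_k >= 0 and sum(w_k) == 1, searched on a coarse grid."""
    best_w, best_loss = None, float("inf")
    for combo in itertools.product(range(steps + 1), repeat=len(preds)):
        if sum(combo) != steps:
            continue
        w = [c / steps for c in combo]
        loss = log_loss(y_true, blend(preds, w))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Toy validation labels and second-level model outputs (made up for illustration).
y_val = [1, 0, 1, 0, 1, 0]
model_a = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]
model_b = [0.6, 0.1, 0.9, 0.3, 0.8, 0.4]
weights, loss = fit_simplex_weights([model_a, model_b], y_val)
```

With only one free parameter per extra model, such a fusion layer has little capacity to overfit a small validation set, which is the motivation given above for preferring it over a nonlinear model.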

Compared with the models used in our business, the above scheme uses more model fusion, achieving high accuracy at the cost of higher overhead. In real business settings, more attention must be paid to balancing effectiveness and efficiency.

#### KDD Cup 2020 Debiasing: A Debiasing Scheme Based on i2i Multi-Hop Walks

**Competition problem and challenges**: the competition is set on an e-commerce platform and asks contestants to predict the user’s next click, with a focus on mitigating selection bias and popularity bias in the recommendation system. For details, see the KDD Cup 2020 Debiasing competition introduction^{[27]}. Recommendation systems suffer from many kinds of bias; besides the two above, there are also exposure bias, position bias, and others^{[5]}. Our team has previously studied position bias as well^{[7]}. To better measure how well the system recommends historically unpopular items, contestants are scored mainly on the NDCG@50_half metric. This metric is computed on the half of the clicked items in the evaluation data set that have the least historical exposure. Because these items are unpopular yet were clicked, evaluating on them better addresses the evaluation bias problem. The competition poses the following challenges:

- The competition provides only click data, so selection bias must be taken into account when constructing candidate sets.
- Item popularity varies greatly, and the number of historical clicks per item follows a long-tail distribution. The data therefore carries a serious popularity bias, and the evaluation metric NDCG@50_half specifically measures ranking quality on unpopular items.
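To make the "half" evaluation idea concrete, here is a minimal sketch of an NDCG@k score averaged over the least popular half of the clicked items. It assumes one clicked item per user and uses historical click counts as the exposure proxy; the official metric definition may differ in details, and all names and data are hypothetical.

```python
import math

def ndcg_at_k(ranked_items, clicked_item, k=50):
    """NDCG@k for one user with a single relevant item (the ideal DCG is 1)."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == clicked_item:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def ndcg_half(recs, clicks, item_popularity, k=50):
    """Average NDCG@k over the half of test clicks whose item is least popular."""
    users = sorted(clicks, key=lambda u: item_popularity.get(clicks[u], 0))
    rare_users = users[: max(1, len(users) // 2)]
    return sum(ndcg_at_k(recs[u], clicks[u], k) for u in rare_users) / len(rare_users)

# Toy recommendations, clicks, and popularity counts (made up for illustration).
recs = {"u1": ["a", "b", "c"], "u2": ["c", "a", "b"], "u3": ["b", "c", "a"], "u4": ["a", "c", "b"]}
clicks = {"u1": "a", "u2": "b", "u3": "c", "u4": "d"}
popularity = {"a": 100, "b": 5, "c": 50, "d": 1}
score = ndcg_half(recs, clicks, popularity)
```

In this toy case only the clicks on `d` and `b` count: `d` is never recommended and `b` is ranked third, so a system that only surfaces popular items scores poorly.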

**Debiasing ranking scheme based on i2i walks**: our scheme is a ranking framework built on i2i (item-to-item) modeling. As shown in the figure, the overall process has four stages: i2i graph construction and multi-hop walks, i2i sample construction, i2i modeling, and u2i (user-to-item) ranking. The first two stages address selection bias, and the last two focus on popularity bias.

The first stage builds an i2i graph from user behavior data and multimodal item data, and generates candidate samples through multi-hop walks on the graph. This expands the item candidate set, better approximates the system’s true candidate set, and mitigates selection bias.
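A minimal sketch of this stage, assuming the graph is built only from adjacent clicks within a session and that each extra hop is damped by a constant factor. The actual scheme also uses multimodal item data and several i2i relations; `build_i2i_graph`, `multi_hop_candidates`, and `decay` are hypothetical.

```python
from collections import defaultdict

def build_i2i_graph(sessions):
    """Undirected co-click graph: edge weight = number of adjacent co-occurrences."""
    graph = defaultdict(lambda: defaultdict(float))
    for items in sessions:
        for a, b in zip(items, items[1:]):
            if a != b:
                graph[a][b] += 1.0
                graph[b][a] += 1.0
    return graph

def multi_hop_candidates(graph, seed, hops=2, decay=0.5):
    """Score items reachable within `hops` steps; each hop is damped by `decay`."""
    scores = defaultdict(float)
    frontier = {seed: 1.0}
    for _ in range(hops):
        next_frontier = defaultdict(float)
        for item, weight in frontier.items():
            total = sum(graph[item].values())
            if total == 0:
                continue
            for neighbor, w in graph[item].items():
                next_frontier[neighbor] += weight * decay * w / total
        for item, s in next_frontier.items():
            if item != seed:
                scores[item] += s
        frontier = next_frontier
    return dict(scores)

# Toy click sessions (made up for illustration).
sessions = [["a", "b", "c"], ["b", "c", "d"]]
graph = build_i2i_graph(sessions)
one_hop = multi_hop_candidates(graph, "a", hops=1)
two_hop = multi_hop_candidates(graph, "a", hops=2)
```

Item `c` is never clicked next to `a`, so a one-hop lookup misses it; the second hop brings it into the candidate set, which is how the walk widens the candidates beyond directly observed co-clicks.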

The second stage computes the similarity of i2i candidate samples under each type of i2i relation, uses it to determine how many candidates to keep per relation, and assembles the final candidate set. Using different candidate construction methods surfaces a more diverse set of candidate items, which further mitigates selection bias.

The third stage consists of automated feature engineering on the i2i sample set and modeling with a popularity-weighted loss function to remove popularity bias. The automated feature engineering includes descriptions of the items’ multimodal information, which reflect the competitive relationships between items beyond popularity alone and thus mitigate popularity bias to some extent. The popularity-weighted loss function is defined as follows:

Here, the parameter α is inversely proportional to item popularity; it reduces the weight of popular items and thereby counteracts popularity bias. The parameter β is the positive-sample weight, used to address sample imbalance.
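The roles of α and β can be illustrated with a weighted binary cross-entropy. The sketch below is one plausible form consistent with the description, not necessarily the exact loss used in the competition: it assumes α = 1 / popularity^γ with a hypothetical exponent `gamma`, and applies β to the positive term only.

```python
import math

def popularity_weighted_loss(y_true, y_prob, popularity, beta=2.0, gamma=0.5, eps=1e-12):
    """Weighted BCE: alpha shrinks with item popularity, beta up-weights positives.

    Assumes every popularity value is positive.
    """
    total = 0.0
    for y, p, pop in zip(y_true, y_prob, popularity):
        p = min(max(p, eps), 1.0 - eps)
        alpha = 1.0 / (pop ** gamma)  # assumed form of "inversely proportional"
        total -= alpha * (beta * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# The same prediction error costs less on a popular item than on a rare one.
loss_popular = popularity_weighted_loss([1], [0.5], [100])
loss_rare = popularity_weighted_loss([1], [0.5], [1])
```

Setting `beta=1.0` and `gamma=0.0` recovers the plain binary cross-entropy, which makes it easy to check how much each weight changes training.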

In the fourth stage, the i2i scores are aggregated with a max operation, which highlights the high-score signals of unpopular items in the score set and mitigates popularity bias. The scores of the item list are then further adjusted according to item popularity, again to mitigate popularity bias.
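A minimal sketch of this u2i step: candidate scores from several source items are reduced with max, then divided by a popularity term. The division by `(1 + popularity) ** penalty` is an assumed adjustment chosen for illustration; the source does not give the exact formula.

```python
def aggregate_u2i(i2i_scores, item_popularity, penalty=0.3):
    """Rank candidates for one user.

    i2i_scores: list of (source_item, candidate_item, score) triples.
    """
    best = {}
    for _, cand, score in i2i_scores:
        # Max aggregation keeps the strongest single signal per candidate.
        best[cand] = max(score, best.get(cand, float("-inf")))
    adjusted = {
        c: s / (1.0 + item_popularity.get(c, 0)) ** penalty  # assumed adjustment
        for c, s in best.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Toy scores: "rare" gets one strong signal, "hot" gets two moderate ones.
pairs = [("x", "hot", 0.6), ("y", "hot", 0.5), ("x", "rare", 0.55), ("y", "rare", 0.2)]
popularity = {"hot": 1000, "rare": 3}
ranking = aggregate_u2i(pairs, popularity)
ranking_no_penalty = aggregate_u2i(pairs, popularity, penalty=0.0)
```

With mean aggregation the single strong signal for `rare` would be diluted by its weak second score; max keeps it, and the popularity adjustment then moves `rare` ahead of `hot`.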

For more details about the competition, see the article “KDD Cup 2020 Debiasing Championship Technical Scheme and Its Practice at Meituan”.

### 2.2 Time Series Problems

**Time series problems**: time series problems differ considerably from recommendation system problems. In terms of task, a recommendation system predicts a single point in the future, whereas a time series problem predicts multiple points in the future. In terms of data, a recommendation system usually involves multi-dimensional information such as users, items, and context, whereas a time series usually consists of numerical series that vary over time and space.

**Time series competitions**: this article covers KDD Cup 2018 Fresh Air and KDD Cup 2017 Highway Tollgates Traffic Flow Prediction. Both are time series problems: the former predicts pollutant concentrations and their changes over the next two days, and the latter predicts highway traffic conditions and their changes over the next few hours. They have three things in common. First, both are traditional-industry problems with strong practical significance. Second, both contain many kinds of abrupt changes and have low stability. Third, both involve multiple regions and locations and need joint modeling of time and space. They differ in the nature of the abrupt changes: abrupt changes in pollutant concentration take a short time to develop and the data shows a certain regularity while they happen, whereas abrupt changes in traffic are highly accidental. Roads are vulnerable to occasional traffic accidents, occasional geological disasters, and the like, so the data shows no obvious regularity.

#### KDD Cup 2018 Fresh Air: An Air Quality Prediction Scheme Based on Spatial-Temporal Gated DNN and Seq2Seq

**Competition problem and challenges**: the goal is to predict the changes in PM2.5/PM10/O3 concentration at 48 stations in Beijing and London over the next 48 hours. For details, see the KDD Cup 2018 competition introduction^{[28]}. Contestants need to address the following two challenges:

- **Temporality**: pollutant concentrations must be predicted for the next 48 hours, and actual concentrations change abruptly. As shown in Figure 5, station 2 shows many fluctuations and abrupt changes on 05-05, 05-06, and 05-07.
- **Spatiality**: pollutant concentrations differ markedly between stations and are related to the topology between stations. As shown in the figure, the waveforms of stations 1 and 2 are quite different, yet both show the same bulge on 05-07.

**Model fusion scheme based on Spatial-Temporal Gated DNN and Seq2Seq^{[9]}**: to strengthen the modeling of time series and spatial topology, we introduced two models, Spatial-Temporal Gated DNN and Seq2Seq, and combined them with LightGBM into a model fusion scheme, as follows.

**(1) Spatial-Temporal Gated DNN**: in time series problems, the statistical feature values around neighboring future time points differ very little, so a plain DNN produces predictions that vary little across hours and stations. We therefore introduce a spatial-temporal gate into the DNN to emphasize spatio-temporal information. As shown in Figure 6 below, the Spatial-Temporal Gated DNN has a two-tower structure that separates spatio-temporal information from the other information and uses a gate function to control and emphasize the spatio-temporal information, which ultimately makes the model more sensitive to time and space. In our experiments, introducing the swish activation function f(x) = x · sigmoid(x) improved model accuracy.
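The gating idea and the swish activation can be sketched in a few lines. This is a simplified forward pass under the assumption that the gate tower outputs one sigmoid value per hidden unit of the other tower; the real architecture in Figure 6 is not reproduced here, and `gated_forward` and its weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Swish activation: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)

def dense(inputs, weights, bias):
    """One fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]

def gated_forward(st_features, other_features, gate_w, gate_b, body_w, body_b):
    """Two towers: the spatio-temporal tower gates the hidden units of the other tower."""
    gate = [sigmoid(z) for z in dense(st_features, gate_w, gate_b)]
    hidden = [swish(z) for z in dense(other_features, body_w, body_b)]
    return [g * h for g, h in zip(gate, hidden)]

# Toy weights (made up). The same statistical features pass through different gates.
gate_w, gate_b = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
body_w, body_b = [[1.0], [2.0]], [0.0, 0.0]
out_hour_a = gated_forward([1.0, 0.0], [1.0], gate_w, gate_b, body_w, body_b)
out_hour_b = gated_forward([0.0, 1.0], [1.0], gate_w, gate_b, body_w, body_b)
```

Even though both calls share the same statistical input, the outputs differ because the spatio-temporal features change the gate, which is the effect the gate is meant to produce.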

**(2) Seq2Seq**: although the Spatial-Temporal Gated DNN strengthens spatio-temporal information compared with a plain DNN, both models build their data by copying each sample’s historical data 48 times and labeling the copies with the values for each of the next 48 hours, which amounts to predicting the pollutant concentration of each of the 48 hours separately. This approach departs from the nature of the time series prediction task and loses temporal continuity. Seq2Seq modeling solves this problem naturally and achieved good results. Figure 7 below shows the Seq2Seq model structure we used in this competition. To address the temporal challenge, the historical weather features are arranged in chronological order as a sequence and fed into the encoder; the decoder decodes from the encoding result and the future weather forecast features to produce the 48-hour pollutant concentration sequence. The future weather forecast is aligned hour by hour with the decoder’s decoding steps, so the decoder can use the forecast (wind level, air pressure, and so on) to predict abrupt changes effectively. To address the spatial challenge, the scheme adds station embeddings and spatial topology features to the model to describe spatial information, and concatenates and normalizes them with the weather information inside the model, achieving joint spatio-temporal modeling.
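The difference between the two data layouts can be shown without any model code. The sketch below only builds training samples: `direct_samples` makes 48 independent copies, one per horizon, while `seq2seq_sample` keeps one sample whose target is the whole sequence. The field names and the use of the horizon index as an extra feature are assumptions for illustration.

```python
def direct_samples(history, future, horizon=48):
    """Copy the history once per horizon step; each copy predicts one future hour."""
    return [(history + [h], future[h - 1]) for h in range(1, horizon + 1)]

def seq2seq_sample(history, future_weather, future, horizon=48):
    """One sample: encoder sees the history, decoder sees hourly forecasts,
    and the target is the full future sequence."""
    return {
        "encoder_inputs": history,
        "decoder_inputs": future_weather[:horizon],  # aligned hour by hour
        "targets": future[:horizon],
    }

# Toy series (made up): 3 past readings, 48 future readings, 48 forecast values.
history = [35.0, 40.0, 42.0]
future = [float(40 + h) for h in range(48)]
forecast = [float(h % 5) for h in range(48)]
flat = direct_samples(history, future)
seq = seq2seq_sample(history, forecast, future)
```

In the flat layout nothing links the prediction for hour 10 to the prediction for hour 11, whereas the sequence layout lets a decoder condition each step on the previous ones and on that hour's forecast.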

**(3) Model fusion**: our team uses stacking. The single learners gain diversity through different models, data, and modeling methods. The LightGBM model uses features such as weather quality, historical statistics, and spatial topology. The Spatial-Temporal Gated DNN introduces a gate structure to strengthen spatio-temporal information. Seq2Seq uses sequence-to-sequence modeling to capture the continuity and volatility of the series. Finally, a constrained linear model fuses the different single learners.

For more details, see the SIGKDD conference paper: AccuAir: Winning Solution to Air Quality Prediction for KDD Cup 2018.

#### KDD Cup 2017 Traffic Flow Prediction: A High-Stability Traffic Prediction Scheme Based on Cross-Validation Noise Reduction and Multi-Loss Fusion

**Competition problem and challenges**: using 20-minute time windows, the task is to predict the driving conditions over the next 2 hours given the driving conditions from the highway entrance to the tollgate over the previous 2 hours. For details, see the KDD Cup 2017 competition introduction^{[29]}. The competition has two tracks, one per type of driving condition: travel time prediction and traffic flow prediction. Contestants need to address the following two challenges:

- Little data and a lot of noise. As shown in Figure 8 below, the value distribution in the boxed time period differs markedly from that in other time periods.

- Extreme values have a large impact on the result. The evaluation metric is MAPE, which in its basic form is given by the following formula, where $a_t$ is the actual value and $f_t$ is the predicted value. When the actual value is small (especially a minimum), the corresponding term carries a large weight in the sum.

  $$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{a_t - f_t}{a_t}\right|$$
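A small numeric check shows why small actual values dominate MAPE. The sketch uses the standard per-point definition, mean of |actual - predicted| / |actual|, and made-up numbers.

```python
def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return sum(abs((a - f) / a) for a, f in zip(actual, predicted)) / len(actual)

# The same absolute error of 5 on a large and on a small actual value.
error_on_large = mape([100.0], [105.0])
error_on_small = mape([2.0], [7.0])
```

The identical miss of 5 units counts fifty times more when the actual value is 2 than when it is 100, so a handful of low-traffic windows can decide the ranking.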

**Extreme-point-optimized model fusion scheme based on cross-validation noise reduction:**

**(1) Noise reduction based on cross-validation**: online submissions are limited to one per day, the final evaluation switches from the A-list test set to the B-list test set, and the A-list data set is small, which makes the online metric unstable. A reliable offline validation method is therefore especially important. To make offline iteration trustworthy, we use two complementary validation schemes. The first is next-day same-period validation: on each of the last M days of the training set, we take the data in the same time period as the online data set, which yields M validation sets. The second is N-fold day-level sampling validation, similar to N-fold cross-validation: we take each of the last N days in turn as a validation set, which yields N validation sets. Together, these two schemes guide the offline iteration of the model and ensure robustness on the B list.
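The two validation schemes can be sketched as split generators over `(day, hour, value)` records. Training on all earlier days in the first scheme, and on all other days in the second, are assumptions; the source only specifies how the validation sets are chosen.

```python
def same_period_splits(records, m, period):
    """For each of the last m days, validate on that day's `period` hours
    and train on all earlier days (assumed)."""
    days = sorted({d for d, _, _ in records})
    splits = []
    for day in days[-m:]:
        train = [r for r in records if r[0] < day]
        valid = [r for r in records if r[0] == day and period[0] <= r[1] < period[1]]
        splits.append((train, valid))
    return splits

def day_level_folds(records, n):
    """Hold out each of the last n days in turn; train on every other day (assumed)."""
    days = sorted({d for d, _, _ in records})
    return [
        ([r for r in records if r[0] != day], [r for r in records if r[0] == day])
        for day in days[-n:]
    ]

# Toy data: 7 days of hourly values (made up).
records = [(d, h, float(d * 24 + h)) for d in range(1, 8) for h in range(24)]
splits = same_period_splits(records, m=3, period=(8, 10))
folds = day_level_folds(records, n=3)
```

Averaging a metric over several such validation sets gives a steadier offline signal than a single small leaderboard set.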

**(2) Extreme-point optimization and model fusion**: because MAPE is sensitive to extreme values, we apply a variety of treatments to the labels, the losses, and the sample weights. On the labels, we apply log and Box-Cox transforms; with the log transform, the model is fitted on the log of the label and the prediction is transformed back afterwards, which helps the model focus on small values and makes it more robust. On the losses, we use MAE, MSE, and others. On the sample weights, we weight samples by their labels. We apply these treatments to XGBoost, LightGBM, and DNN to produce multiple different models for fusion, which optimizes the extreme-point problem and yields robust results.
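Two of these treatments are easy to illustrate. The sketch assumes `log1p`/`expm1` as the log transform pair and a weight of 1/label per sample; the source does not specify either choice, so both are assumptions.

```python
import math

def to_log_label(y):
    """Fit the model on log(1 + y) instead of y."""
    return math.log1p(y)

def from_log_prediction(z):
    """Map a prediction in log space back to the original scale."""
    return math.expm1(z)

def mape_sample_weights(labels):
    """Assumed weighting: 1/label, so weighted MAE equals MAPE (labels must be positive)."""
    return [1.0 / y for y in labels]

def weighted_mae(actual, predicted, weights):
    return sum(w * abs(a - f) for a, f, w in zip(actual, predicted, weights)) / len(actual)

actual = [2.0, 100.0]
predicted = [3.0, 90.0]
weights = mape_sample_weights(actual)
score = weighted_mae(actual, predicted, weights)  # same value as MAPE on this data
```

Both tricks push an ordinary regression loss toward the competition metric: the log transform compresses large labels, and the 1/label weights make absolute errors on small labels count more.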

**Remarks**: special thanks to Chenhuan, Yanpeng, Huangpan, and the other colleagues who took part in KDD Cup 2017.

### 2.3 Automated Machine Learning Problems

Automated machine learning problems^{[10]} here mainly cover the KDD Cup 2019 AutoML and KDD Cup 2020 AutoGraph competitions. Such problems generally have the following three characteristics:

- **Strong data diversity**: there are 15+ data sets drawn from problems in different fields, and the data source is not identified. The automated machine learning framework designed by contestants must be compatible with data from multiple fields and adapt to each of them to some degree.
- **Robustness of automation**: the public leaderboard and the private leaderboard are evaluated on different data. The final score is the average rank or score across multiple data sets, so the solution must produce robust results on data sets it has never seen.
- **Performance limits**: mirroring the search-space constraints of real problems, the solution must finish within limited time and memory.

#### KDD Cup 2020 AutoGraph: An Automated Multi-Level Graph Learning Optimization Scheme Based on a Proxy Model

**Competition problem and challenges**: the AutoGraph challenge is the first AutoML challenge applied to graph-structured data. For details, see the introduction to the KDD Cup 2020 AutoGraph competition^{[30]}. The competition uses multi-class node classification to evaluate the quality of the learned representations, and participants need to design an automated graph representation learning^{[11-13]} solution. The solution must efficiently learn a high-quality representation of each node from the graph’s given features, neighborhood, and structural information. The data comes from real business and comprises 15 data sets from fields such as social networks, paper networks, and knowledge graphs. Five data sets are available for download, five feedback data sets are used to score solutions on the public leaderboard, and the remaining five are used to determine the final ranking on the last submission.

Each data set provides the graph’s node IDs and node features, its edges and edge weights, and the data set’s time budget (100-200 seconds) and memory limit (30 GB). Each data set is randomly split, with 40% of the nodes as the training set and 60% as the test set. Participants design an automated graph learning solution to classify the test-set nodes. Solutions are ranked by accuracy on each data set, and the final ranking is the average rank over the last five data sets. In short, the automated graph learning solution has to run directly on five data sets it has never seen. Participants faced the following challenges:

- Graph models have high variance and low stability.
- Each data set comes with a strict time budget and memory limit.
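The average-rank scoring rule rewards consistency across data sets rather than a single large win. A minimal sketch with made-up accuracies follows; tie handling is ignored and `average_ranks` is a hypothetical helper.

```python
def average_ranks(accuracy):
    """accuracy: {dataset: {team: acc}}. Rank 1 is best; returns the mean rank per team."""
    ranks = {}
    for scores in accuracy.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, team in enumerate(ordered, start=1):
            ranks.setdefault(team, []).append(rank)
    return {team: sum(r) / len(r) for team, r in ranks.items()}

# Toy accuracies on three data sets (made up for illustration).
accuracy = {
    "d1": {"A": 0.90, "B": 0.80, "C": 0.70},
    "d2": {"A": 0.60, "B": 0.70, "C": 0.50},
    "d3": {"A": 0.80, "B": 0.85, "C": 0.90},
}
mean_rank = average_ranks(accuracy)
winner = min(mean_rank, key=mean_rank.get)
```

Team B never has the single best result by a wide margin but is never last, so it wins on average rank; this is why high-variance graph models are a liability under this rule.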

**Automated multi-level model optimization based on a proxy model^{[14]}**

**Multi-category hierarchical graph model optimization:**

**(1) Generating candidate graph models**: a real-world graph usually combines multiple properties, and it is difficult to capture all of them with a single approach. We therefore use several types of models, based on the spectral domain, the spatial domain, and attention mechanisms, to capture the different properties. Different models perform differently on different data sets. To keep poorly performing models out of the later model fusion, candidate models such as GCN, GAT, APPNP, TAGC, DNA, GraphSAGE, GraphMix, GRAND, and GCNII are screened quickly to obtain a model pool.

**(2) Hierarchical model ensembling**: this part ensembles along two dimensions. The first level is **model self-ensembling**. Graph models are particularly sensitive to initialization, and the accuracy of the same model can fluctuate by ±1%. To address this, several instances of the same model are trained at the same time and the average of their predictions is taken as that model’s output, which reduces the variance of the model and improves its stability across data sets. The second level is **ensembling of different models**. To make effective use of information from both local and global neighborhoods and to fully capture the different properties of a graph, we take a weighted ensemble of different kinds of graph models to further improve performance. In the parameter search phase, the parameters α inside the models and the ensemble weights β across models must be optimized together; we solve for them by alternating gradient descent on the ensemble weights and the model parameters, which effectively speeds up the search.
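The variance-reduction effect of self-ensembling can be simulated without a real graph model. In the sketch below, `noisy_model` is a hypothetical stand-in for one training run whose output depends on its random seed; averaging five runs shrinks the spread of the result.

```python
import random
import statistics

def noisy_model(seed, true_prob=0.7, noise=0.05):
    """Stand-in for one training run: the output varies with the random initialization."""
    rng = random.Random(seed)
    return true_prob + rng.gauss(0.0, noise)

def self_ensemble(seeds):
    """Average the predictions of the same model trained from several seeds."""
    seeds = list(seeds)
    return sum(noisy_model(s) for s in seeds) / len(seeds)

# Compare the spread of single runs with the spread of 5-run averages.
single_runs = [noisy_model(s) for s in range(200)]
ensembled_runs = [self_ensemble(range(i * 5, i * 5 + 5)) for i in range(200)]
single_std = statistics.pstdev(single_runs)
ensemble_std = statistics.pstdev(ensembled_runs)
```

For independent runs the standard deviation of the average falls roughly with the square root of the number of runs, so the price of the extra stability is training time, which matters under the competition's time budget.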

**Two-stage optimization based on a proxy model and a final model**: first, the data set is sampled, with subgraphs drawn by stratified sampling on the labels to reduce model validation time. Second, a proxy model with bagging is used: the results of several models with small hidden layers are averaged to evaluate a type of model quickly. Kendall rank correlation and speedup are used to balance accuracy against acceleration and select a suitable proxy model. Finally, the optimal hyperparameters are found with the proxy model, and the final large model is trained with the hyperparameters found.
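Proxy selection can be sketched as follows: a proxy is acceptable if it ranks candidate configurations almost the same way as the full model (high Kendall correlation), and among acceptable proxies the fastest wins. The threshold rule `min_tau` and all numbers are assumptions; the source only says that Kendall rank and speedup are balanced.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a) between two equal-length score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs

def pick_proxy(full_scores, full_time, proxies, min_tau=0.8):
    """Among proxies whose ranking agrees with the full model, pick the fastest."""
    best_name, best_speedup = None, 0.0
    for name, (scores, seconds) in proxies.items():
        speedup = full_time / seconds
        if kendall_tau(full_scores, scores) >= min_tau and speedup > best_speedup:
            best_name, best_speedup = name, speedup
    return best_name, best_speedup

# Validation accuracy of four candidate configurations (made up).
full_scores = [0.80, 0.82, 0.78, 0.85]
proxies = {
    "hidden16": ([0.70, 0.69, 0.71, 0.72], 5.0),   # fast, but ranks configs differently
    "hidden64": ([0.76, 0.79, 0.75, 0.81], 20.0),  # slower, same ranking as the full model
}
choice = pick_proxy(full_scores, full_time=100.0, proxies=proxies)
```

The smallest proxy is rejected here despite its 20x speedup because its ranking of configurations disagrees with the full model, so hyperparameters tuned on it would not transfer.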

For details, see the team’s ICDE 2022 paper, AutoHEnsGNN: Winning Solution to AutoGraph Challenge for KDD Cup 2020.

## 3 AutoML Technical Framework

### 3.1 Overview of the Automation Framework

After the above competitions, the team kept summarizing and optimizing its multi-domain modeling work, abstracted the more general modules, and distilled a more general solution for data mining problems: the AutoML framework. The framework has three parts: data preprocessing, automated feature engineering^{[15]}, and automated model optimization^{[16-20]}. Data preprocessing handles common basic operations such as feature classification, data encoding, and missing-value handling, and is not covered further here. The automated feature engineering and automated model optimization parts of the AutoML framework are described in detail below.

### 3.2 Automated Feature Engineering

Feature engineering is a very important part of machine learning; the quality of the features directly determines the upper limit of model accuracy. The common practice today is to combine and transform features manually, but manual feature mining is slow and cannot cover the feature space comprehensively. Automated feature engineering that mines features thoroughly addresses these problems. It has three main parts:

- **First- and second-order feature operators**: more complex higher-order features are obtained by applying basic operations to the data. There are three feature operators. Frequency encoding computes statistics such as count and nunique for categorical features over the samples. Target encoding applies operations such as mean, sum, max, min, and percentile to numeric features. Time difference takes differences of time features. A first-order operator computes over one entity and a second-order operator over two entities; for example, a user’s order count within a category uses two entities, user and category.
- **Fast feature selection**: automated feature engineering applies the feature operators to the Cartesian product of all entities, which produces a large number of useless features, so fast feature selection is needed. A LightGBM model is used to quickly separate effective features from useless ones: useless features are pruned based on metric gain and feature importance, and important features are identified for higher-order combination with other features.
- **Higher-order feature operators**: new features built from first- and second-order operators are combined further with other features. Iterating from order k (k >= 1) to order k+1 produces a large number of higher-order features that would not be thought of manually.
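The three operators can be sketched over plain dictionaries. The order of an operator is simply the number of entity keys it groups by: `["user"]` is first order and `["user", "category"]` is second order. The function names and the toy `orders` table are hypothetical, and the rows are assumed to be sorted by time.

```python
from collections import Counter, defaultdict

def key_of(row, keys):
    return tuple(row[k] for k in keys)

def count_feature(rows, keys):
    """Frequency operator: number of rows sharing the same values for `keys`."""
    counts = Counter(key_of(r, keys) for r in rows)
    return [counts[key_of(r, keys)] for r in rows]

def mean_feature(rows, keys, value):
    """Numeric aggregation operator: mean of `value` within each group."""
    groups = defaultdict(list)
    for r in rows:
        groups[key_of(r, keys)].append(r[value])
    return [sum(groups[key_of(r, keys)]) / len(groups[key_of(r, keys)]) for r in rows]

def time_diff_feature(rows, key, ts):
    """Time-difference operator: gap since the entity's previous event (None if first)."""
    last_seen, gaps = {}, []
    for r in rows:
        prev = last_seen.get(r[key])
        gaps.append(None if prev is None else r[ts] - prev)
        last_seen[r[key]] = r[ts]
    return gaps

# Toy order log, sorted by timestamp (made up for illustration).
orders = [
    {"user": "u1", "category": "food", "price": 20.0, "ts": 100},
    {"user": "u2", "category": "food", "price": 10.0, "ts": 120},
    {"user": "u1", "category": "food", "price": 30.0, "ts": 160},
    {"user": "u1", "category": "hotel", "price": 200.0, "ts": 400},
]
user_count = count_feature(orders, ["user"])                   # first order
user_cat_count = count_feature(orders, ["user", "category"])   # second order
user_cat_price = mean_feature(orders, ["user", "category"], "price")
user_gap = time_diff_feature(orders, "user", "ts")
```

Looping these functions over every combination of entity keys is what produces the Cartesian product of features mentioned above, which is why a fast selection step has to follow.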

Depending on whether all entities are matched, higher-order feature operators have two output modes: Match mode, which matches all entities, and All mode, which matches only some of the entities and returns the results for all values of the remaining entity. As illustrated in the following figure, Match mode matches two entities, the user and the time period, to obtain the user’s average order price in that time period; All mode matches only the user to obtain the user’s average order price in every time period.
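The two output modes can be sketched on the average-order-price example. `match_mode` returns one value for the row's own (user, period) pair, while `all_mode` returns one value per period for the row's user. The helper names and data are hypothetical.

```python
from collections import defaultdict

def group_mean(rows, keys, value):
    """Mean of `value` for each combination of `keys`."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in rows:
        acc = sums[tuple(r[k] for k in keys)]
        acc[0] += r[value]
        acc[1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

def match_mode(rows, row, value="price"):
    """Match both entities: mean price of this user in this row's time period."""
    return group_mean(rows, ["user", "period"], value).get((row["user"], row["period"]))

def all_mode(rows, row, periods, value="price"):
    """Match the user only: one feature per time period, regardless of the row's period."""
    means = group_mean(rows, ["user", "period"], value)
    return {p: means.get((row["user"], p)) for p in periods}

# Toy orders (made up for illustration).
rows = [
    {"user": "u1", "period": "lunch", "price": 20.0},
    {"user": "u1", "period": "lunch", "price": 30.0},
    {"user": "u1", "period": "dinner", "price": 60.0},
    {"user": "u2", "period": "lunch", "price": 10.0},
]
matched = match_mode(rows, rows[0])
expanded = all_mode(rows, rows[0], ["lunch", "dinner"])
sparse = all_mode(rows, rows[3], ["lunch", "dinner"])
```

Match mode yields a single column, whereas All mode yields one column per value of the unmatched entity, with missing values where a user has no orders in a period.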

Compared with algorithms such as DeepFM and DeepFFM, automated feature engineering has three advantages. First, when information is spread over multiple tables, it can easily use data outside the training set. In an advertising scenario, for example, natural (non-advertising) data can be used through features, which is less likely to cause distribution mismatch than training on that data directly. Second, relying only on the model to learn feature crosses automatically is often not enough for strong crosses that manual construction captures directly. Many explicit cross features, such as a user's click-through rate on a product, carry strong business meaning, and a model perceives a ready-made combined feature more easily than it learns the cross by itself. Third, for high-dimensional sparse ID features, such as recommendation or advertising scenarios with more than 100 million IDs, DeepFM and DeepFFM struggle to learn the features fully, whereas automated feature engineering can build strong feature representations for these sparse IDs.

### 3.3 automation model optimization

**Grid search based on importance**: our framework uses a greedy search ordered by global importance to speed up the search. The best result found is then refined with a finer grid search in a small neighbourhood, which alleviates the local optima caused by the greedy strategy. From previous competition experience, the hyperparameters of the different models rank in importance as follows:
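The greedy-then-local strategy can be sketched as follows. The scoring function here is a hypothetical stand-in for training and validating a model, and the candidate values and parameter names are assumptions chosen for illustration.

```python
import itertools

def toy_score(params):
    """Hypothetical validation score (higher is better); stands in for model training."""
    return -((params["learning_rate"] - 0.05) ** 2 * 100
             + (params["num_leaves"] - 48) ** 2 / 1000
             + (params["subsample"] - 0.8) ** 2)

# Candidate values, listed in order of assumed importance.
SEARCH_SPACE = [
    ("learning_rate", [0.01, 0.03, 0.1, 0.3]),
    ("num_leaves", [15, 31, 63, 127]),
    ("subsample", [0.6, 0.8, 1.0]),
]

def greedy_search(score_fn, space, defaults):
    """Tune one parameter at a time, most important first, fixing each winner."""
    best = dict(defaults)
    for name, candidates in space:
        best[name] = max(candidates, key=lambda v: score_fn({**best, name: v}))
    return best

def local_grid(score_fn, center, neighbourhood):
    """Exhaustive grid over a small neighbourhood around the greedy result."""
    names = list(neighbourhood)
    best, best_score = dict(center), score_fn(center)
    for combo in itertools.product(*(neighbourhood[n] for n in names)):
        trial = {**center, **dict(zip(names, combo))}
        score = score_fn(trial)
        if score > best_score:
            best, best_score = trial, score
    return best

defaults = {"learning_rate": 0.1, "num_leaves": 31, "subsample": 1.0}
coarse = greedy_search(toy_score, SEARCH_SPACE, defaults)
fine = local_grid(toy_score, coarse, {
    "learning_rate": [0.03, 0.04, 0.05, 0.06],
    "num_leaves": [40, 48, 56, 63],
})
```

The greedy pass evaluates 4 + 4 + 3 = 11 configurations instead of the 48 a full grid would need, and the local grid then spends its budget only near the greedy optimum.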

**LightGBM**: learning rate > sample imbalance ratio > number of leaves > row and column sampling, etc.

**DNN**: learning rate > embedding dimension > number and size of fully connected layers.

It is worth mentioning that hyperparameter search is run many times over the whole iteration process, and the search strategy differs between the early and late stages. Early on, a larger learning rate, a smaller embedding dimension and smaller fully connected layers are generally chosen, which reduces the number of model parameters and speeds up iteration. In the later stage, more parameters are used to achieve better results.

**Model fusion**: the key to model fusion is building models that differ from one another. lightgbm and DNN models are inherently quite different. Within the same model type, differences come from three sources: data, features and hyperparameters. Data differences come from automated row sampling, which produces models trained on different samples. Feature differences come from automated column sampling, which produces models trained on different feature subsets. Hyperparameter differences come from perturbing the optimal parameters, that is, locally perturbing the parameter grid around the optimum. Fusion methods are generally blending, stacking or simple mean pooling. Before fusion, model-level pruning (removing poorly performing models so that they do not hurt the fused result) and regularization are required.
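As a rough sketch of pruning followed by simple mean pooling, assume each candidate model has a validation score and a list of predicted probabilities. The model names, scores and the `tolerance` threshold below are all hypothetical.

```python
def prune_models(scores, tolerance=0.02):
    """Keep models whose validation score is within `tolerance` of the best one."""
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if s >= best - tolerance)

def mean_fusion(predictions, kept):
    """Average the per-sample predictions of the kept models."""
    columns = zip(*(predictions[name] for name in kept))
    return [sum(col) / len(col) for col in columns]

# Hypothetical validation scores and predictions for three samples.
val_scores = {"lgb_rows_a": 0.801, "lgb_cols_b": 0.795, "dnn": 0.790, "lgb_bad": 0.700}
preds = {
    "lgb_rows_a": [0.9, 0.2, 0.6],
    "lgb_cols_b": [0.8, 0.3, 0.5],
    "dnn":        [0.7, 0.1, 0.7],
    "lgb_bad":    [0.1, 0.9, 0.1],
}

kept = prune_models(val_scores)
fused = mean_fusion(preds, kept)
```

Blending or stacking would replace the plain average with learned weights or a second-level model; the pruning step stays the same.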

### 3.4 recent practice of automl framework: MDD cup 2021 meituan takeout graph recommendation competition champion scheme

In the MDD cup 2021 internal algorithm competition held by meituan from August to September 2021, the quality estimation team of meituan to store advertising platform applied the automl framework and won the championship. Next, the application of the framework in specific problems will be introduced in combination with this competition.

MDD cup 2021 requires participants to predict the next merchant a user will purchase from, based on the attributes of users and merchants in the graph and on users' historical clicks, real-time clicks and orders. The data covers 1.35 million orders over four weeks, involving 200,000 users, 29,000 merchants and 179,000 dishes, with a total of 4.38 million dish records associated with orders, forming a knowledge graph. [email protected] is used as the evaluation metric.

**Data preprocessing stage**: the data mainly involves three types of entity data, namely users (user profile features, etc.), merchants (category, rating, brand, etc.) and dishes (taste, price, ingredients, etc.), and two types of interaction data, clicks and purchases (LBS, price, time, etc.). Common preprocessing operations are applied to the raw data: feature classification, unified encoding, and outlier and missing value handling.

**Automated feature engineering**: first- and second-order feature operators. For the four types of raw features (categorical, numeric, time-series and label), first- and second-order crosses are built over the three abstracted entities and the two types of interaction data, and first- and second-order statistical features are computed over multiple periods with the frequency coding, target coding and timing difference operators. For example, frequency coding can compute the number of times a user clicks on a given merchant, the nunique value of the merchant categories a user has purchased from, and the number of orders a user has placed in a given scenario. Target coding can compute a user's average order price, the merchant category a user clicks most, and so on. Timing difference can compute the average time gap between a user's purchases of dishes of a given taste. Multi-period statistics means that each of these features can be computed over different time periods.

Fast feature selection: more than 1,000 first- and second-order statistical features are generated automatically in the previous step, many of which are invalid. A lightgbm model is therefore used to screen features and identify the important ones, judged by metric improvement and feature importance. For example, the user x dish-taste feature has little effect and is filtered out, while the price range a user purchases in most often is very effective and is marked as an important feature for higher-order combination.
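The screening logic can be illustrated without a trained model. In the sketch below, the absolute Pearson correlation with the label is swapped in for LightGBM feature importance purely so that the example is self-contained; the feature names, values and both thresholds are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def select_features(features, label, drop_below=0.1, important_above=0.5):
    """Drop weak features and flag strong ones for higher-order combination.

    The importance score is a stand-in: the text uses LightGBM importance
    and metric improvement instead of correlation.
    """
    importance = {name: abs(pearson(col, label)) for name, col in features.items()}
    kept = [name for name, s in importance.items() if s >= drop_below]
    important = [name for name, s in importance.items() if s >= important_above]
    return kept, important, importance

label = [1, 0, 1, 0, 1, 0]
features = {
    "user_price_range_mode": [3, 1, 3, 1, 2, 1],
    "user_merchant_clicks": [1, 1, 2, 2, 1, 2],
    "user_x_dish_taste": [1, 1, 2, 2, 3, 3],
}
kept, important, importance = select_features(features, label)
```

Only the features in `important` are passed on to the higher-order combination step, which keeps the Cartesian product from growing out of control.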

Higher-order feature operators: the new features built from first- and second-order operators can be used as input to higher-order feature combination. Two forms of higher-order combination are worth distinguishing. The first is a higher-order combination of raw features, such as a user's favorite dish taste at a given merchant, which combines three entities without any extra processing. The second uses the new first- and second-order features. Frequency coding results can be used directly, while target coding and timing difference results must first be converted to discrete values by numeric bucketing. An example is the count over the combined buckets of the mode of a user's order price range x the merchant's average order price. The final feature set is obtained after feature combination and filtering.
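A small sketch of the bucketing step, assuming fixed price boundaries. `PRICE_EDGES` and the key format are illustrative choices, not values from the competition solution.

```python
from bisect import bisect_right
from collections import Counter

PRICE_EDGES = [20, 40, 80]  # hypothetical bucket boundaries

def bucketize(value, edges=PRICE_EDGES):
    """Map a numeric value to a bucket index in 0..len(edges)."""
    return bisect_right(edges, value)

def mode_bucket(prices):
    """Most frequent price bucket among a user's orders."""
    return Counter(bucketize(p) for p in prices).most_common(1)[0][0]

def cross_key(user_prices, merchant_mean_price):
    """Discrete key combining two bucketed statistics."""
    return f"{mode_bucket(user_prices)}x{bucketize(merchant_mean_price)}"

def cross_counts(pairs):
    """Frequency coding over the combined bucket key."""
    return Counter(cross_key(user_prices, m) for user_prices, m in pairs)

pairs = [([25, 30, 90], 45.0), ([22, 38], 50.0), ([100], 10.0)]
key_counts = cross_counts(pairs)
```

Once both numeric statistics are bucketed, the combined key is an ordinary categorical value, so the frequency coding operator applies to it unchanged.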

**Automated model optimization**: the model part uses a fusion of lightgbm and DIN. Automated hyperparameter search is run many times during iteration. Multiple differentiated models are built through automated row and column sampling and local perturbation of the optimal parameters, and the final result is obtained by fusing them.

## 4 general modeling method and understanding

This section will introduce the general modeling method of the competition, that is, how to design the overall scheme quickly and efficiently in the face of a new problem.

### 4.1 modeling framework and method

When facing a new problem, we divide the technical framework into three stages: exploratory modeling, critical modeling and automated modeling. Each stage deepens and supplements the previous one.

**Exploratory modeling**: in the early stage of the competition, first understand the problem, including the evaluation metric and the data tables, then build a basic model and submit it online to verify consistency. Consistency verification often takes several submissions to find an offline evaluation method that agrees with the online metric. The core goal of exploratory modeling is to find iteration ideas and methods, so the problem needs to be explored from many angles until the right direction emerges.

Generally, for non-temporal problems, the n-fold method is used to build multiple validation sets, and different random seeds can be used to generate different splits. For time series problems, the sliding window method is generally used to build validation sets whose time span is consistent with the online submission, and sliding the window back by K days yields K validation sets. When evaluating on multiple validation sets, the mean, variance, extreme values and other reference indicators can be combined to obtain results that are consistent with the online score.
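Both ways of building validation sets, plus the summary statistics, can be sketched with the standard library. The function names and the day-index convention (days numbered from 1, training data ending the day before validation starts) are assumptions made for the example.

```python
import random

def kfold_indices(n_samples, n_folds, seed):
    """Shuffle indices with `seed` and split them into n_folds validation folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def sliding_windows(last_day, valid_days, k):
    """Return k (train_end, valid_start, valid_end) triples, sliding back one day each."""
    windows = []
    for shift in range(k):
        valid_end = last_day - shift
        valid_start = valid_end - valid_days + 1
        windows.append((valid_start - 1, valid_start, valid_end))
    return windows

def summarize(scores):
    """Mean, variance and extremes of a metric across validation sets."""
    m = sum(scores) / len(scores)
    var = sum((s - m) ** 2 for s in scores) / len(scores)
    return {"mean": m, "var": var, "min": min(scores), "max": max(scores)}

folds = kfold_indices(10, 3, seed=0)
windows = sliding_windows(last_day=30, valid_days=7, k=3)
stats = summarize([0.80, 0.82, 0.78])
```

A change that improves the mean but also raises the variance or lowers the minimum across windows is a warning sign that it may not hold up online.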

**Critical modeling**: in the middle of the competition, we dig deep into the key problems and reach a solution that ranks near the top of the leaderboard. In terms of problem understanding, we try as far as possible to design a customized loss function that follows the evaluation method.

For classification problems, a mixed loss can be designed by combining different loss functions such as logloss, AUC loss^{[21]} and NDCG loss. Loss design for regression problems is more complex. On the one hand, the loss can combine squared error, absolute error and so on. On the other hand, regression outliers can be handled with log transform, Box-Cox transform and similar techniques.
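For the regression case, a minimal sketch of a mixed loss and of the two target transforms follows. The weight `alpha` and the function names are illustrative assumptions, and the mixed loss shown is one simple way to combine the two errors rather than the specific loss used in any competition.

```python
import math

def log_transform(y):
    """log(1 + y), which compresses large outliers; valid for y > -1."""
    return math.log1p(y)

def inverse_log_transform(z):
    """Map a prediction made in log space back to the original scale."""
    return math.expm1(z)

def box_cox(y, lam):
    """Box-Cox transform for y > 0; lam = 0 reduces to the natural log."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def mixed_loss(y_true, y_pred, alpha=0.5):
    """alpha * mean squared error + (1 - alpha) * mean absolute error."""
    n = len(y_true)
    se = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    ae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return alpha * se + (1 - alpha) * ae
```

Training on `log_transform(y)` and applying `inverse_log_transform` to the predictions is a common way to keep a few extreme targets from dominating a squared-error loss.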

**Automated modeling**: in the later stage of the competition, human understanding has blind spots in details and perspectives on the one hand, and abstract relationships are hard to model by hand on the other, so we use automated modeling as a supplement. As shown in Figure 18 below, the relational multi-table input is first joined automatically, a large number of features are then built through generative automated feature engineering, followed by feature selection and iteration. Automated hyperparameter search and model selection are then run on the model input, and finally multiple models are fused automatically, with the generated diverse models being selected and weighted.

The framework shown in Figure 18 is generally used for automated modeling. Multi-table association is performed first, and features are then selected following an expand-then-filter logic. Next, hyperparameter search is run over the selected features and several hyperparameter ranges. Finally, different models such as xgboost^{[22]}, lightgbm, DNN, RNN and FFM are fused automatically.

### 4.2 connection with industry methods

An important difference between algorithm competitions and industry practice is that industry involves online systems, so the engineering performance challenges are greater and the algorithms face more problems of consistency between online and offline results. Competitions, by contrast, can push model complexity and accuracy further. Competitions have also produced models such as ResNet, the field-aware factorization machine (FFM) and xgboost, which are widely used in real industrial systems.

In air quality prediction, we used a spatiotemporal gated DNN that combines time and space for effective modeling, which fits the nature of the air quality problem. In meituan's actual business we face the same problem of joint spatiotemporal modeling. Take user behavior sequence modeling as an example: we fully model and cross users' historical and current spatiotemporal information^{[24]}. We distinguish three kinds of spatiotemporal information in user behavior: the time of the user's click, the geographical location of the user's request, and the geographical location of the merchant the user clicked.

Based on this triple spatiotemporal information, we propose the spatio-temporal activator layer (as shown in Figure 19): a neural network with a trilateral spatiotemporal attention mechanism that models the user's historical behavior. Specifically, it learns from the interaction of the request's longitude and latitude, the merchant's longitude and latitude, and the request time. For crossing spatial information, we combine geographic location hash coding with spherical distance. For crossing time information, we combine absolute and relative time, which gives a trilateral representation of the user behavior sequence under different spatiotemporal conditions. Finally, the spatiotemporal information encoded by this network is fused through an attention network to obtain a personalized representation of the user's ultra-long behavior sequence for each request candidate in the LBS scenario.
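The raw spatial and temporal inputs mentioned above can be illustrated with a few helper functions. This is only a sketch of the input features, not of the attention network itself: the haversine formula is one standard way to compute spherical distance, `grid_cell` is a simplified stand-in for a geographic hash code, and all names and the cell size are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (spherical) distance in kilometres between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def grid_cell(lat, lon, cell_deg=0.01):
    """Coarse grid id for a location; a simple stand-in for a geohash code."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def time_features(click_ts, request_ts):
    """Absolute hour of day of the click (UTC) and relative gap to the request in hours."""
    hour_of_day = (click_ts // 3600) % 24
    gap_hours = (request_ts - click_ts) / 3600
    return hour_of_day, gap_hours
```

The discrete cell id and the continuous distance play complementary roles: the first identifies where a behavior happened, the second how far it is from the current request.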

In comparison, the spatiotemporal gated DNN in the competition focuses more on the effect of fused spatiotemporal information on the predicted value. Because it has to predict time series, it concentrates on modeling the differences across time and space. The spatiotemporal network in meituan's business focuses on a fine-grained depiction of spatial information, because behavior varies with spherical distance and is strongly affected by the block location, which requires deep modeling of multiple kinds of information. For more details, please refer to the team's CIKM paper: Trilateral Spatiotemporal Attention Network for User Behavior Modeling in Location-based Search^{[23]}.

Actual modeling involves more online components than a competition does, whereas a competition mainly pursues the highest possible accuracy on offline data sets. Compared with the debiasing competition, a real online system involves more kinds of bias. Take position bias as an example: in the display data, the click-through rate of high positions is naturally higher than that of low positions, but part of this gap comes from differences in users' browsing habits between high and low positions. Modeling the data directly is therefore not enough to evaluate the click-through rate and quality of advertisements shown at high and low positions. In meituan's advertising system, we designed a position combination prediction framework for this and achieved good results, which will not be detailed here. For details, please refer to the team's SIGIR paper: Deep Position-wise Interaction Network for CTR Prediction^{[7]}.

### 4.3 key understanding of modeling

**A consistent evaluation method is the key to determine the generalization ability of the model**

In the competition mechanism, the private data used for the final evaluation and the public data used for the leaderboard beforehand are usually not the same. When the data is switched, rankings sometimes shift by dozens of places, which affects the final result. Avoiding overfitting to the public data used in routine iteration is therefore the key to winning. So how can a validation set that is consistent with the online evaluation be built? From the perspective of consistency, validation sets with the same time interval are generally constructed. Some problems have noisy data, however, so multiple validation sets can be built with a dynamic sliding window. A consistent validation set determines the direction of subsequent iterations.

**Big data pays attention to the deepening of the model, while small data pays attention to the robustness of the model**

Different data sets call for a different focus. When data is sufficient, the core problem is deepening the model so that it can handle complex problems such as crosses and combinations between features. When data is scarce, noise is high and results are unstable, so the core problem is the robustness of the model. Sensitivity to the data is the key to scheme design.

**The balance of variance and deviation is the key to guide the optimization in the later stage**

From the perspective of error decomposition, squared error can be decomposed into bias and variance^{[25]}. In the early and middle stages, when model complexity is low, increasing model complexity effectively reduces bias. In the later stage, when bias has already been highly optimized, reducing variance becomes the key. Therefore, in the later stage, results are improved by fusing models with methods such as ensemble, while the complexity of each single model is kept constant.
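The effect of fusion on variance can be seen in a small simulation. It assumes, for illustration only, that every model predicts the truth plus a shared bias plus independent Gaussian noise; real models are correlated, so the reduction in practice is smaller than shown here.

```python
import random

def simulate(n_models, n_trials=2000, noise=1.0, bias=0.3, seed=0):
    """Return (bias, variance) of the averaged prediction of n_models noisy models.

    Hypothetical setup: each model predicts truth + bias + independent N(0, noise^2).
    """
    rng = random.Random(seed)
    truth = 5.0
    ensemble_preds = []
    for _ in range(n_trials):
        preds = [truth + bias + rng.gauss(0.0, noise) for _ in range(n_models)]
        ensemble_preds.append(sum(preds) / n_models)
    mean_pred = sum(ensemble_preds) / n_trials
    variance = sum((p - mean_pred) ** 2 for p in ensemble_preds) / n_trials
    return mean_pred - truth, variance

single_bias, single_var = simulate(n_models=1)
ens_bias, ens_var = simulate(n_models=10)
```

Averaging leaves the bias essentially unchanged while the variance shrinks roughly in proportion to the number of independent models, which is why fusion pays off only after single-model bias has been driven down.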

**The key to automl is the continuous reduction of human priors**

Even when the automl framework is used, some hidden human priors remain, such as hyperparameters. Viewed as a model, automl has the same property that the higher its complexity, the more easily it overfits. A key question during iteration is therefore not only whether the evaluation result is good or bad, but whether the scheme contains unnecessary hyperparameters and other information, and whether the automl modeling can be simplified continuously so that it adapts automatically to a wide range of problems.

Finally, I would like to thank the teammates of the revolution team, Nomo team, getmax team, and aister team.

## summary

Based on the author's champion experience in seven algorithm competitions, this paper shares algorithm experience from competitions in different fields such as recommendation systems, time series and automated machine learning, then introduces the automl technical framework through specific problems, and finally summarizes the general modeling scheme for competitions and relates it to industrial solutions. We hope that the competition experience in this article helps algorithm enthusiasts take part in competitions more effectively, offers some ideas, and inspires more engineers and researchers to achieve better results in practical work. In the future, our team will continue to follow international algorithm competitions and actively try to combine competition ideas with industrial solutions. We also welcome you to join our team. The recruitment information is attached at the end of the text. We look forward to your email.

## About the author

Hu Ke, Xingyuan, Mingjian and adamant all come from the quality estimation team of meituan advertising platform.

## References

- [1] Juan Y, Zhuang Y, Chin W S, et al. Field-aware Factorization Machines for CTR Prediction[C]// the 10th ACM Conference. ACM, 2016.
- [2] He K, Zhang X, Ren S, et al. Identity Mappings in Deep Residual Networks[J]. Springer, Cham, 2016.
- [3] Ali, Jehad & Khan, Rehanullah & Ahmad, Nasir & Maqsood, Imran. (2012). Random Forests and Decision Trees. International Journal of Computer Science Issues(IJCSI). 9.
- [4] Robi Polikar. 2006. Ensemble based systems in decision making. IEEE Circuits and systems magazine 6, 3 (2006), 21–45.
- [5] Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. 2020. Bias and Debias in Recommender System: A Survey and Future Directions. arXiv preprint arXiv:2010.03240 (2020).
- [6] H. Abdollahpouri and M. Mansoury, “Multi-sided exposure bias in recommendation,” arXiv preprint arXiv:2006.15772, 2020.
- [7] Huang J, Hu K, Tang Q, et al. Deep Position-wise Interaction Network for CTR Prediction[J]. arXiv preprint arXiv:2106.05482, 2021.
- [8] KDD cup 2020 debiasing championship technical scheme and its practice at meituan.
- [9] Luo Z, Huang J, Hu K, et al. AccuAir: Winning solution to air quality prediction for KDD Cup 2018[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019: 1842-1850.
- [10] He Y, Lin J, Liu Z, et al. Amc: Automl for model compression and acceleration on mobile devices[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 784-800.
- [11] Yang Gao, Hong Yang, Peng Zhang, Chuan Zhou, and Yue Hu. 2020. Graph neural architecture search. In IJCAI, Vol. 20. 1403–1409.
- [12] Matheus Nunes and Gisele L Pappa. 2020. Neural Architecture Search in Graph Neural Networks. In Brazilian Conference on Intelligent Systems. Springer, 302– 317.
- [13] Huan Zhao, Lanning Wei, and Quanming Yao. 2020. Simplifying Architecture Search for Graph Neural Network. arXiv preprint arXiv:2008.11652 (2020).
- [14] Jin Xu, Mingjian Chen, Jianqiang Huang, Xingyuan Tang, Ke Hu, Jian Li, Jia Cheng, Jun Lei: “AutoHEnsGNN: Winning Solution to AutoGraph Challenge for KDD Cup 2020”, 2021; arXiv:2111.12952.
- [15] Selsaas L R, Agrawal B, Rong C, et al. AFFM: auto feature engineering in field-aware factorization machines for predictive analytics[C]//2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, 2015: 1705-1709.
- [16] Yao Shu, Wei Wang, and Shaofeng Cai. 2019. Understanding Architectures Learnt by Cell-based Neural Architecture Search. In International Conference on Learning Representations.
- [17] Kaicheng Yu, Rene Ranftl, and Mathieu Salzmann. 2020. How to Train Your Super-Net: An Analysis of Training Heuristics in Weight-Sharing NAS. arXiv preprint arXiv:2003.04276 (2020).
- [18] Haixun Wang, Wei Fan, Philip S Yu, and Jiawei Han. 2003. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. 226–235.
- [19] Robi Polikar. 2006. Ensemble based systems in decision making. IEEE Circuits and systems magazine 6, 3 (2006), 21–45.
- [20] Chengshuai Zhao, Yang Qiu, Shuang Zhou, Shichao Liu, Wen Zhang, and Yanqing Niu. 2020. Graph embedding ensemble methods based on the heterogeneous network for lncRNA-miRNA interaction prediction. BMC genomics 21, 13 (2020), 1–12.
- [21] Rosenfeld N, Meshi O, Tarlow D, et al. Learning Structured Models with the AUC Loss and Its Generalizations.
- [22] Chen T, Tong H, Benesty M. xgboost: Extreme Gradient Boosting[J]. 2016.
- [23] Qi, Yi, et al. “Trilateral Spatiotemporal Attention Network for User Behavior Modeling in Location-based Search”, CIKM 2021.
- [24] Breakthrough and imagination of advertising depth estimation technology in the meituan to store scenario.
- [25] Geurts P. Bias vs Variance Decomposition for Regression and Classification[J]. Springer US, 2005.
- [26] kaggle outbrain competition link: https://www.kaggle.com/c/outbrain-click-prediction.
- [27] KDD cup 2020 debiasing competition link: https://tianchi.aliyun.com/competition/entrance/231785/introduction.
- [28] KDD Cup 2018 competition link: https://www.biendata.xyz/competition/kdd_2018/.
- [29] KDD cup 2017 competition link: https://tianchi.aliyun.com/competition/entrance/231597/introduction.
- [30] KDD cup 2020 autograph competition link: https://www.automl.ai/competitions/3

## Recruitment Information

Working on advertising scenarios, the algorithm team of the meituan to store advertising platform explores the cutting edge of deep learning, reinforcement learning, artificial intelligence, big data, knowledge graphs, NLP and computer vision, and explores the value of local life service e-commerce. Main work directions include:

- **Trigger strategy**: user intention identification, advertising business data understanding, query rewriting, depth matching, correlation modeling.
- **Quality estimation**: advertising quality degree modeling. Click through rate, conversion rate, customer unit price and transaction volume estimation.
- **Mechanism design**: advertising ranking mechanism, bidding mechanism, bidding suggestion, traffic estimation, budget allocation.
- **Creative optimization**: intelligent creative design. Optimization of advertising pictures, words, group lists, preferential information and other display ideas.

**Job requirements**:

- At least three years of relevant working experience, and at least one application experience in ctr/cvr prediction, NLP, image understanding and mechanism design.
- Familiar with common machine learning, deep learning and reinforcement learning models.
- Have excellent logical thinking ability, be enthusiastic about solving challenging problems, be sensitive to data, and be good at analyzing / solving problems.
- Master degree or above in computer or mathematics.

**The following conditions are preferred**:

- Experience in advertising / search / recommendation.
- Experience in large-scale machine learning.

Interested students can submit their resumes to: [email protected] (please specify in the email title: Guangping algorithm team).

This article is produced by the meituan technical team, and the copyright belongs to meituan. You are welcome to reprint or use this article for non-commercial purposes such as sharing and exchange. Please note that "the content is reprinted from the meituan technical team". This article may not be reproduced or used commercially without permission. For any business activities, please send an email to [email protected] to apply for authorization.