Meituan Delivery KDD 2021 Paper: A Deep Learning Method for Route and Time Prediction in Food Delivery Service
Background
The delivery system contains several key prediction tasks.
 ETA: Before the user places an order, the system gives the user a promised estimated delivery time (ETA).
 ETR: After order A is allocated to a specific rider, the user-side app shows the rider's position and a more accurate, real-time estimated delivery time (ETR).

 Estimated delivery order: In subsequent order allocation, the system must predict the rider's new delivery order (the estimated order) and the new arrival time (ETR) for each user after order B is allocated to the rider. (Because riders are constrained by the ETA, they do not choose the delivery order arbitrarily.)
 Evaluation indicator 1: ranking-based indicators, namely LSD (location square deviation) and the Kendall coefficient
 Evaluation indicator 2: sequence consistency rate, that is, the proportion of the two sequences covered by their common prefix
 Evaluation indicator 3: time indicators, namely ME, MAE, and N-minute confidence
 Recommended delivery order: On the rider side, the system gives a suggested delivery order based on time limits, distance, traffic rules, and so on; it is not binding on the rider. The recommended sequence is produced after order allocation, so it has no reference value for the allocation strategy.
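The two sequence indicators above can be illustrated with a minimal sketch. The exact formulas are not given in the text, so two assumptions are made here: the sequence consistency rate divides the common prefix length by the length of the actual sequence, and LSD is the mean squared difference between each location's position in the predicted and actual sequences. The function names are hypothetical.

```python
from typing import Hashable, Sequence


def prefix_consistency(pred: Sequence[Hashable], actual: Sequence[Hashable]) -> float:
    """Common prefix length divided by the actual sequence length (assumed normalization)."""
    if not actual:
        return 0.0
    common = 0
    for p, a in zip(pred, actual):
        if p != a:
            break
        common += 1
    return common / len(actual)


def location_square_deviation(pred: Sequence[Hashable], actual: Sequence[Hashable]) -> float:
    """Assumed LSD: mean squared gap between a location's predicted and actual position."""
    position_in_pred = {loc: i for i, loc in enumerate(pred)}
    squared = [(position_in_pred[loc] - i) ** 2 for i, loc in enumerate(actual)]
    return sum(squared) / len(actual)
```

For a predicted order a, b, c, d against an actual order a, b, d, c, both functions return 0.5 under these assumptions: half of the sequence is a shared prefix, and two of the four locations are each one position off.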
This paper addresses rider route prediction and time prediction in food delivery, referred to as FDRTP (Food Delivery Route and Time Prediction). The FDRTP problem is complex in three respects:
 Riders usually pick up and deliver multiple orders at the same time (during peak hours, as many as 10 orders at once), so the route solution space is huge. With 10 orders at peak hours it reaches 2.376 × 10^15 candidate routes. In addition, route prediction must account for both the distance traveled and the punctuality of each order.
 Time prediction for the whole route involves three parts: riding time, walking time during pickup and delivery, and the rider's waiting time at the restaurant. The prediction carries considerable uncertainty, and this uncertainty grows with route length.
 As an online algorithm, FDRTP is called frequently by other applications, so it must return results within milliseconds.
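The size of the solution space quoted in the first point can be checked with a short calculation. With n orders there are 2n stops, and each order's pickup must precede its delivery; by symmetry each such constraint keeps exactly half of the orderings, giving (2n)!/2^n feasible routes. This sketch assumes that all n orders still need both pickup and delivery; the function name is hypothetical.

```python
from math import factorial


def count_feasible_routes(n_orders: int) -> int:
    """Number of orderings of 2n stops in which every pickup precedes its delivery."""
    # (2n)! orderings in total; each pickup-before-delivery constraint halves the count.
    return factorial(2 * n_orders) // (2 ** n_orders)


if __name__ == "__main__":
    print(count_feasible_routes(10))  # 2375880867360000, about 2.376e15
```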
Previous work (Food Delivery Route Planning Algorithm: A Two-Stage Fast Heuristic Algorithm) treated the FDRTP task as a PDPTW (pickup and delivery problem with time windows). It focused on minimizing a heuristic cost, made up of time delay and delivery distance, by optimizing the search algorithm, and rarely on improving the accuracy of time prediction. Yet time prediction is a key input for calculating the time delay in PDPTW, so lower prediction accuracy also lowers the quality of the routes found. Under that algorithm structure, the heuristic search for a single FDRTP task runs thousands of rounds before returning a result, and each round calls the time prediction algorithm. This structure forces the time prediction algorithm to return results very quickly, which in turn forces trade-offs in model selection and feature selection: for example, time series models, features of the rider's historical behavior and habits, spatio-temporal relationship features, and environmental features of the rider's location are difficult to include.
Related work
RR (Route Recommendation)
The RR problem is to recommend appropriate routes to users based on historical trajectories, given an origin-destination (OD) pair. Earlier work treated it as a pathfinding problem on graphs, and research focused mainly on designing heuristic search algorithms that reduce the search space. Heuristic search requires a large number of iterations, and designing an appropriate cost function becomes the crux of the algorithm.
However, many studies design the cost function heuristically, which limits the applicability of the algorithm and makes large-scale rollout difficult. Heuristic search algorithms also struggle to make full use of environmental features during the search, a shortcoming shared by most operations-research optimization algorithms.
In response, much research has turned to machine learning methods, which can bring more features into inference.
 Chen et al. propose a Maximum Probability Product algorithm based on the Absorbing Markov Chain to discover the most popular route between two locations [4].
 In [22], the authors adopt probabilistic models incorporating both temporal dynamics and spatial dynamics and address the data sparsity challenge for route recovery.

The development of deep learning sheds light on the RR problem using neural networks, especially the promising effectiveness of RNNs for modeling sequential trajectory data.
 In [1], the authors propose a Space Time Features-based RNN to predict people's next movement by discovering mobility patterns.
 Wu et al. design two RNN-based models to capture variable-length sequences of trajectory data and address the constraints of topological structure on trajectory modeling [21].
The FDRTP task is more complicated than the RR problem because it also involves the order on-time rate and the rider's riding cost.
ETA
There are two schools of ETA prediction: OD-based and route-based.
The OD-based method mainly uses features of the origin and destination and does not consider route-level or segment-level features. Related studies are as follows:
 Jindal et al. propose ST-NN (Spatio-Temporal Neural Network), which predicts the travel distance between the origin and destination and then predicts the travel time by combining it with other time information [9].
 In [14], Wang et al. propose TEMP+R, which estimates the time duration as the weighted average travel time of similar historical trips.
The route-based method focuses on predicting the multiple road segments that make up the route and then aggregates the predictions.
 BusTr infers bus delays by combining real-time road traffic forecasts and contextual information [2]. Using a restricted feature set, BusTr can be generalized to cities without training data.
 Formulating ETA as a pure regression problem, Wang et al. propose a Wide-Deep-Recurrent learning model to capture global spatio-temporal and route segment information [17].
 Hong et al. further employ an attention-based GNN (Graph Neural Network) to embed road network data and model temporal heterogeneous information, going beyond the state-of-the-art methods [8].
Compared with traditional ETA, FDRTP must consider not only the riding time but also the walking time during the pickup and delivery phases.
Problem definition
The goal of the FDRTP task is to accurately predict the order in which the rider visits a set of locations, that is, an ordering of the elements of l_0 (the starting location), P, and D (taken here to be the sets of pickup and delivery locations). A feasible route must satisfy three constraints:
1) The route must start from l_0;
2) The pickup location of an order must come before its delivery location;
3) The pickup time of order o must be later than PT_o (the time at which the merchant finishes preparing the meal).
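The three constraints can be written as a small feasibility check. This is a minimal sketch, not the paper's implementation: the stop encoding, the `START` marker, and the `carried` argument (orders already picked up before l_0) are hypothetical, and a pickup exactly at PT_o is treated as allowed.

```python
from typing import Dict, FrozenSet, List, Tuple

Stop = Tuple[str, str]          # (kind, order_id); kind is "start", "pickup" or "delivery"
START: Stop = ("start", "l_0")  # hypothetical encoding of the starting location l_0


def is_feasible(
    route: List[Stop],
    visit_time: List[float],
    ready_time: Dict[str, float],
    carried: FrozenSet[str] = frozenset(),
) -> bool:
    """Check the three constraints for a route and the time at which each stop is served."""
    # Constraint 1: the route starts from l_0.
    if not route or route[0] != START or len(route) != len(visit_time):
        return False
    picked = set(carried)
    for (kind, order), t in zip(route[1:], visit_time[1:]):
        if kind == "pickup":
            # Constraint 3: the meal cannot be picked up before PT_o.
            if t < ready_time[order]:
                return False
            picked.add(order)
        elif kind == "delivery":
            # Constraint 2: an order is delivered only after it has been picked up.
            if order not in picked:
                return False
        else:
            return False
    return True
```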
Data set
The raw data is business data from the most recent two weeks, covering 430 million orders and 1.6 million riders. Because new orders keep arriving while a rider is delivering, the raw data cannot be used for modeling directly. We therefore split each rider's actual delivery route at the points in time when a new order is assigned to the rider. In the example in the following figure, the moment order 4 is assigned to the rider is the boundary, and the delivery route is divided into two segments.
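The splitting step can be sketched as follows, assuming each rider's history is a time-ordered list of (timestamp, location) visits plus a list of order assignment times. The data layout and function name are hypothetical, and the real sample construction presumably also records which stops are still pending in each segment.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Visit = Tuple[float, str]  # (timestamp, location_id)


def split_by_assignment(visits: List[Visit], assign_times: List[float]) -> List[List[Visit]]:
    """Cut a time-ordered visit sequence at every moment a new order is assigned."""
    cuts = sorted(assign_times)
    segments: Dict[int, List[Visit]] = {}
    for visit in visits:
        # Index of the segment = number of assignments that happened at or before this visit.
        idx = bisect_right(cuts, visit[0])
        segments.setdefault(idx, []).append(visit)
    return [segments[k] for k in sorted(segments)]
```

With visits at times 1, 2, 5, 7, 9 and a new order assigned at time 4, the route is divided into the segment before the assignment (times 1 and 2) and the segment after it (times 5, 7, 9).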
After further preprocessing (e.g., removing abnormal riders with excessively high speeds), 135 million samples remain. The figure below lists the feature information used for each sample.
Note that the delivery time (drop-off duration) and the meal pickup time (earliest pickup time) are produced by other models and are treated as known information here. The RP module and the TP module use different information; see Table 2 for details, which are not repeated here.
Model details
As shown in the figure above, this paper proposes a deep learning method for FDRTP, called FDNET (Food Delivery Route and Time Prediction Deep Network). FDNET predicts the probability that a rider will visit each location next by mining a large volume of historical delivery data. Compared with heuristic search, this greatly reduces the route planning solution space and the number of calls to time prediction: the method generates only a small number of high-probability feasible routes and predicts their delivery times. With the computation reduced, more features can be added to the model, which further improves prediction accuracy.
FDNET consists of two parts: the RP (route prediction) module and the TP (time prediction) module.
The RP module predicts the probability of each node (location) being the rider's next visit, and from that the entire route. After analyzing the factors that affect rider behavior, we design a sequence model based on RNN and attention to describe the rider's decision-making process in more detail.
The TP module predicts the travel time between two adjacent nodes on the route (leaving O, arriving at D) and generates the fulfillment completion time of each order, taking the meal preparation time and delivery time into account. We consider the TP module a variant of the ETA (Estimated Time of Arrival) problem.
The RP module and the TP module are tightly coupled in FDNET. The route features of the TP module come from the RP module, and the output of the TP module is fed into the next round of RP prediction as the rider's future features.
RP module
Formally, the RP module predicts the next location from X, where X represents all the information available at location l_i.
Features
The features are designed around the factors that mainly affect the rider's decision-making. The analysis results are shown in the figure below.
As can be seen from the figure,
1) With punctuality in mind, riders give priority to orders with little remaining time;
2) Riders tend to visit closer locations first;
3) Other things being equal, riders prefer orders whose meals are already prepared, and the earlier the meal was ready, the better;
4) Riders tend to deliver orders with long delivery times first to avoid running overtime.
Model
1. In the preprocessing stage, dense features are discretized and embedded, for two reasons:
 Slight changes in dense features do not affect the rider's behavior, and discretization improves model stability;
 We want the model to learn subtle representations instead of using raw numerical values directly.
2. A DeepFM model makes full use of first-order features, second-order cross features, and high-order features;
3. The result is a set of vectors of length m: the environmental feature v_c, the rider feature v_u, the pickup feature v^p_o of order o, and the delivery feature v^d_o of order o;
4. An LSTM processes the sequential data, and attention computes the rider's propensity for each candidate;
5. The first two constraints in the Problem definition section are enforced.
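Step 1 (discretize, then embed) can be sketched with NumPy. The bucket boundaries, the embedding width of 8, and the random table are all assumptions made for illustration; in a trained model the table rows would be learned parameters.

```python
import numpy as np

# Hypothetical bucket edges for a dense feature such as distance in meters.
BOUNDARIES = np.array([200.0, 500.0, 1000.0, 2000.0])

# One embedding row per bucket; random here, learned during training in practice.
rng = np.random.default_rng(0)
EMBEDDING_TABLE = rng.normal(size=(len(BOUNDARIES) + 1, 8))


def bucketize(values):
    """Map each dense value to the index of the bucket it falls into."""
    return np.digitize(np.asarray(values, dtype=float), BOUNDARIES)


def embed_dense(values):
    """Replace each dense value by the embedding vector of its bucket."""
    return EMBEDDING_TABLE[bucketize(values)]
```

Values of 480 and 499 fall into the same bucket and therefore get the same vector, which is the stability argument in the first bullet: small changes in a dense feature do not change the model input.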
LSTM layer
In the current problem, the LSTM step for round i is computed from the input vector v_{l_{i-1}} together with h_{i-1} and c_{i-1}, the two states calculated at stage i-1. In the calculation:
 σ is the sigmoid function, σ(x) = 1/(1+e^{-x})
 v_{l_{i-1}} is the vector representation of location l_{i-1}, including environmental features, rider features, and pickup or delivery features (depending on whether l_{i-1} is a pickup or a delivery). Note that l_0 is not a specific pickup or delivery point, so its vector contains only environmental features and rider features.
 The environmental features and rider features are updated at every step
 h_i carries the information from all previous steps and serves as the main feature for subsequent predictions
 Dropout is added between steps to avoid overfitting
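For reference, the standard LSTM update is written out here in the notation of this section. This is the textbook formulation, given as an assumption about the form of the layer rather than as the paper's exact equations; the gate names f_i, u_i, o_i and the weight matrices W and biases b are introduced only for illustration.

```latex
\begin{aligned}
f_i &= \sigma\left(W_f\,[h_{i-1};\, v_{l_{i-1}}] + b_f\right) && \text{(forget gate)}\\
u_i &= \sigma\left(W_u\,[h_{i-1};\, v_{l_{i-1}}] + b_u\right) && \text{(input gate)}\\
o_i &= \sigma\left(W_o\,[h_{i-1};\, v_{l_{i-1}}] + b_o\right) && \text{(output gate)}\\
\tilde{c}_i &= \tanh\left(W_c\,[h_{i-1};\, v_{l_{i-1}}] + b_c\right)\\
c_i &= f_i \odot c_{i-1} + u_i \odot \tilde{c}_i\\
h_i &= o_i \odot \tanh(c_i)
\end{aligned}
```

Here [h_{i-1}; v_{l_{i-1}}] denotes concatenation and ⊙ is the element-wise product, so h_i and c_i depend on the previous location's vector and on the two states from stage i-1.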
Attention layer
At each location, the rider selects one of the remaining pickup and delivery tasks (the candidate set, denoted C_i) as the next location to visit. The attention mechanism is used to describe this decision process. In round i, we first obtain the global view of the rider and then calculate the probability P(l) that the rider visits location l at step i+1, where l belongs to C_i.
The global information is represented by a vector g_i, a weighted combination of the candidate location vectors with attention weights a_ij. Here h_i is the output of the LSTM unit in round i, and v_{l_j} is the feature vector of the j-th location. The larger the product of h_i and v_{l_j}, the larger a_ij, and the more g_i focuses on v_{l_j}. In other words, the correlation between the hidden state and the location features decides where attention is placed. Note that no additional parameters are trained here; in my view this can be regarded as a simplified form of attention. We regard g_i as the global view of the rider, and the probability of each candidate being the next visited location is computed from it.
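A minimal NumPy sketch of this parameter-free attention follows. Two things are assumptions, since the exact formulas are not reproduced in the text: the weights a_ij are taken to be a softmax over the dot products of h_i and v_{l_j}, and the final probability is taken to be a softmax over the dot products of g_i and v_{l_j}. The function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def next_location_probs(h_i: np.ndarray, candidates: np.ndarray):
    """h_i: (m,) LSTM output; candidates: (k, m), one row v_{l_j} per location in C_i."""
    a_i = softmax(candidates @ h_i)    # attention weights a_ij grow with h_i . v_{l_j}
    g_i = a_i @ candidates             # global view: weighted sum of the location vectors
    probs = softmax(candidates @ g_i)  # assumed scoring of each candidate against g_i
    return a_i, g_i, probs
```

No trainable matrix appears between h_i and v_{l_j}, which matches the remark that this attention has no additional parameters.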
This completes the core of the RP module.
Two strategies are available for generating the route: the greedy strategy and the beam search strategy. The former greedily selects the single best location in each round and continues until all rounds are finished. The latter keeps the n best results per round; in the next round, the n best are again selected from the n*m expanded candidates (each kept result extended by its m candidate locations), until all rounds are finished.
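Both strategies can be expressed with one routine, since greedy decoding is beam search with a beam width of 1. In this sketch `score_fn` is a hypothetical stand-in for the RP module: it receives the partial route and the candidate set and returns a probability per candidate. The candidate generator enforces the pickup-before-delivery constraint, and `toy_score` is an invented scorer used only to make the example runnable.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Stop = Tuple[str, str]  # ("pickup" | "delivery", order_id)


def candidates(route: List[Stop], orders: Sequence[str]) -> List[Stop]:
    """Next feasible stops: an order's delivery becomes available only after its pickup."""
    visited = set(route)
    result = []
    for order in orders:
        if ("pickup", order) not in visited:
            result.append(("pickup", order))
        elif ("delivery", order) not in visited:
            result.append(("delivery", order))
    return result


def beam_search(
    orders: Sequence[str],
    score_fn: Callable[[List[Stop], List[Stop]], Dict[Stop, float]],
    beam_width: int,
) -> List[Tuple[float, List[Stop]]]:
    """Return up to beam_width complete routes with their log-probabilities, best first."""
    beams: List[Tuple[float, List[Stop]]] = [(0.0, [])]
    for _ in range(2 * len(orders)):
        expanded = []
        for logp, route in beams:
            cands = candidates(route, orders)
            probs = score_fn(route, cands)
            for stop in cands:
                expanded.append((logp + math.log(probs[stop]), route + [stop]))
        # Keep only the best beam_width partial routes for the next round.
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams


def toy_score(route: List[Stop], cands: List[Stop]) -> Dict[Stop, float]:
    """Invented scorer: prefers pickups over deliveries and order "a" over other orders."""
    weight = {
        c: (3.0 if c[0] == "pickup" else 1.0) * (2.0 if c[1] == "a" else 1.0)
        for c in cands
    }
    total = sum(weight.values())
    return {c: w / total for c, w in weight.items()}
```

Calling `beam_search(["a", "b"], toy_score, 1)` gives the greedy route, while a larger `beam_width` keeps several high-probability routes alive so that a locally weaker choice can still win overall.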
In short, the LSTM produces h_i from the information of round i-1, and h_i together with the candidate locations (C_i) gives the probability of each candidate becoming the next visited node.
The role of attention lies in the weighted vector g_i, which is determined by the correlation between h_i and the location vectors v_{l_j} and by the location vectors themselves. P is determined by g_i and v_{l_j}, so what matters most is the size of the product of h_i and v_{l_j}.
TP module
The TP module predicts the time needed to reach each location, that is, the time difference between leaving the previous location and arriving at the current one. Of the three constraints in the problem definition, the TP module handles the third: the pickup time of order o must be later than PT_o (the time at which the merchant finishes preparing the meal). By combining each time difference with the meal preparation time, we obtain the estimated arrival time at each location. If the rider arrives at the restaurant early, the rider has to wait until the meal is ready; otherwise, the rider can pick up the meal immediately.
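The way leg durations and meal preparation times combine into a timeline can be sketched as follows. The leg durations stand in for TP module outputs; the function name and data layout are hypothetical, and the drop-off duration is assumed to be known per order, as stated in the data set section.

```python
from typing import Dict, List, Tuple

Stop = Tuple[str, str]  # ("pickup" | "delivery", order_id)


def build_timeline(
    start_time: float,
    stops: List[Stop],
    leg_durations: List[float],
    ready_time: Dict[str, float],
    dropoff_duration: Dict[str, float],
) -> List[Tuple[str, str, float, float]]:
    """Return (kind, order, arrival, departure) for every stop on the route."""
    clock = start_time
    timeline = []
    for (kind, order), leg in zip(stops, leg_durations):
        arrival = clock + leg
        if kind == "pickup":
            # An early rider waits until the meal is ready (PT_o); otherwise leaves at once.
            departure = max(arrival, ready_time[order])
        else:
            departure = arrival + dropoff_duration[order]
        timeline.append((kind, order, arrival, departure))
        clock = departure
    return timeline
```

For example, a rider who reaches the restaurant at minute 5 for a meal ready at minute 8 leaves at minute 8, and every later arrival time shifts accordingly.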
We treat the TP module as a pure regression problem over X^t_i, the information available in round i.
Features
 Environmental features
 Rider features
 Geographic features: embeddings of the GPS latitude and longitude. These are computed over grids, where dist refers to the spherical distance between grids; this paper uses 300 m grids.
 Geographic features: delivery time
 OD features: navigation distance, and statistical features of navigation time at different time granularities (yesterday, the previous seven days, and the same day last week). Because the data is sparse, these features are computed on geohash codes.
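The spherical distance mentioned for the geographic features is commonly computed with the haversine formula. This sketch assumes that formula and a mean Earth radius of 6,371 km; the text does not say which spherical distance formula the paper uses, and the function name is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed constant)


def spherical_distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))
```

Applied to the centers of two grid cells, this gives the dist value between grids; at this scale one degree of latitude is roughly 111 km, so a 300 m grid spans about 0.0027 degrees of latitude.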
Model
The TP module uses a wide&deep model:
 The deep part is good at extracting latent features related to location
 The wide part, unlike the RP module, uses dense numerical features directly (because the prediction target is very sensitive to changes in such features, e.g., navigation distance)
Depending on the type of OD pair being predicted, the ODs are divided into 6 categories, for two reasons:
 The features differ greatly between types
 The distribution of time consumption varies greatly between types
Model training and prediction
Training
The teacher forcing strategy is adopted: for sequential inputs, real data is used during training, while the predictions from previous rounds are used during prediction. In our problem, both the RP module and the TP module take the rider's real visiting sequence as input; in the RP module, when the time features of l_i need to be calculated, the real time at which the rider left l_{i-1} is used as input.
In the RP module, we use the cross-entropy loss, accumulated over the training set D and over the rounds i contained in each sample.
In the TP module, we use MAE as the loss function.
We train the RP and TP modules with separate Adam optimizers and learning rates, i.e., the two parts are trained independently.
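The two loss functions can be sketched in NumPy. The exact normalization of the paper's formulas is not given in the text, so averaging over rounds is an assumption here, and both function names are hypothetical.

```python
import numpy as np


def rp_cross_entropy(step_probs, true_idx) -> float:
    """Mean negative log-probability assigned to the location actually visited in each round.

    step_probs: one 1-D probability array per round (over that round's candidates).
    true_idx:   index of the really visited candidate in each round.
    """
    log_likelihoods = [np.log(p[t]) for p, t in zip(step_probs, true_idx)]
    return -float(np.mean(log_likelihoods))


def tp_mae(predicted, actual) -> float:
    """Mean absolute error between predicted and real leg durations."""
    return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(actual))))
```

Because the two losses share no terms, each module can be optimized on its own, which is consistent with the independent training described above.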
Prediction
Although trained independently, the RP and TP modules cooperate closely during prediction. In round i, the RP module takes as input the location it generated in round i-1 and the features updated from the TP module's output for round i-1 (earliest meal pickup time, remaining delivery time, navigation distance, etc.). The TP module then takes the location predicted by the RP module as input and generates the time prediction for round i.
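The alternation between the two modules can be sketched as a loop. Here `rp_step` and `tp_step` are hypothetical callables standing in for the trained RP and TP modules, the decoding is greedy, and waiting at the restaurant is left out to keep the example short.

```python
from typing import Callable, List, Sequence, Tuple

Stop = Tuple[str, str]  # ("start" | "pickup" | "delivery", identifier)


def predict_route_and_times(
    orders: Sequence[str],
    rp_step: Callable[[Stop, float, List[Stop]], Stop],
    tp_step: Callable[[Stop, Stop, float], float],
    start_time: float = 0.0,
) -> List[Tuple[Stop, float]]:
    """Alternate RP and TP: choose the next stop, then predict when the rider reaches it."""
    current: Stop = ("start", "l_0")
    clock = start_time
    to_pick = list(orders)
    to_drop: List[str] = []
    plan = []
    while to_pick or to_drop:
        cands = [("pickup", o) for o in to_pick] + [("delivery", o) for o in to_drop]
        # RP uses the previous location and the clock updated by TP in the last round.
        nxt = rp_step(current, clock, cands)
        # TP predicts the leg duration for the location RP has just chosen.
        clock += tp_step(current, nxt, clock)
        plan.append((nxt, clock))
        kind, order = nxt
        if kind == "pickup":
            to_pick.remove(order)
            to_drop.append(order)
        else:
            to_drop.remove(order)
        current = nxt
    return plan
```

The updated `clock` is what lets time-dependent features such as remaining delivery time be recomputed before each RP call.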
Experimental results
Offline evaluation
RP module
Route length has a large influence on the metrics. Based on business characteristics, the data is divided into two sets: short routes (8 locations or fewer) and long routes (more than 8 locations). FDNET is compared with LR, RF, and XGB, with the features kept as consistent as possible. As metrics, following point-wise ranking evaluation, we use HR@k and MRR.
 HR@k: the denominator is the total number of test cases, and the numerator is the number of test cases whose true result appears in the top-K recommendation list
 MRR is a widely used measure for evaluating search algorithms: a match at the first result scores 1, a match at the second scores 0.5, a match at the n-th scores 1/n, and no match scores 0
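Both metrics follow directly from these definitions. A minimal sketch, assuming each test case provides a ranked list of candidates and a single true next location; the function names are hypothetical.

```python
from typing import Hashable, List, Sequence


def hit_rate_at_k(ranked_lists: Sequence[List[Hashable]], targets: Sequence[Hashable], k: int) -> float:
    """Share of test cases whose true item appears among the top k ranked candidates."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)


def mean_reciprocal_rank(ranked_lists: Sequence[List[Hashable]], targets: Sequence[Hashable]) -> float:
    """Average of 1/rank of the true item; a case with no match contributes 0."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)
```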
The results show that:
1) The longer the route, the larger the solution space and the uncertainty, and the worse the metrics;
2) LR performs better than RF and XGB, which is somewhat unexpected. A possible explanation is that ensemble methods such as RF and XGB are mainly good at exploiting statistical features;
3) FDNET outperforms the traditional methods on both data sets, presumably because it uses an LSTM to learn behavior sequences and attention to describe the decision-making process;
4) In the ablation that removes environmental features, the result differs little from the original FDNET, because environmental features have little influence on the rider's decisions. The ablation that removes rider features suggests that rider features have a greater impact on the rider's decisions. Finer-grained rider features are a direction for future optimization.
In addition, XGB is used to measure the importance of different features, defined as the number of times a feature is used as a split node in XGB. The results show that the pickup features and the delivery features are the most important.
TP module
FDNET is compared with linear regression, RF, and XGB, with the features kept as consistent as possible (except for the GPS embedding, which cannot be used in traditional methods such as linear regression, RF, and XGB). The comparison metric is MAPE.
The analysis results are as follows:
 Among the traditional methods, XGB outperforms the other two
 FDNET outperforms the traditional methods
 Unlike in the RP module, rider information is very helpful for time prediction. (However, the feature importance in Table 4 suggests that rider information is currently underused and should be expanded in the future.)
Location-level time accuracy
Combining the RP module and the TP module, we can calculate the time at which each location is reached. This part evaluates the accuracy of time prediction at location granularity. Table 6 shows the proportion by which FDNET reduces MAPE compared with the traditional heuristic search method; the larger the value, the better.
The results show that:
 With either greedy or beam search (keeping the 2 highest-probability results each round), FDNET outperforms the traditional methods;
 For short routes, greedy is better than beam search on pickup samples and equal to beam search on delivery samples; for long routes, beam search is better than greedy. Therefore, in production, short routes use the greedy method and long routes use beam search.
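The MAPE metric and the reduction ratio used in this comparison can be sketched as follows. Treating the reported value as a relative reduction, (baseline - model) / baseline, is an assumption about how Table 6 is computed; the function names are hypothetical.

```python
from typing import Sequence


def mape(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute percentage error; actual values must be non-zero."""
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


def mape_reduction(baseline_mape: float, model_mape: float) -> float:
    """Relative MAPE reduction of the model against the baseline (assumed definition)."""
    return (baseline_mape - model_mape) / baseline_mape
```

Under this definition, a baseline MAPE of 0.20 and a model MAPE of 0.15 give a reduction of 0.25, and a larger value means a larger improvement over heuristic search.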
Online A/B experiment
The A/B experiment shows that FDNET increases the order punctuality rate by 0.08pp, shortens the order life cycle by 20 seconds, and reduces riders' average travel distance by 60 meters. The FDNET approach therefore improves both user satisfaction and rider experience.
Case study
Case #1
The traditional method judges that l1 has too little remaining time and risks running overtime, so it delivers l1 first, then picks up l2 and delivers l2. Balancing time against distance by empirical rules is a weakness of traditional methods.
FDNET takes the rider's habits and historical behavior into account and predicts picking up l2, delivering l2, and then delivering l1, which matches the actual visiting order.
Case #2
Both the traditional method and FDNET predict this case incorrectly, for two reasons:
1) Because l1 and l2 are very close to each other, whether the rider picks up l1 or l2 first is highly random;
2) The meal preparation time, which comes from a separate ETA system, is highly uncertain, and its predictions are biased.
Personal opinion: in business terms, whether the rider picks up 1 or 2 first does not affect the rider's overall distance or time. This is more a problem with the evaluation scheme.
Case #3
Delivery difficulty is an important factor in rider behavior: o1 is closer to the road and easier to deliver, while o2 requires parking and entering the residential community and is harder to deliver. Riders tend to prioritize easy orders, which contradicts the earlier feature analysis. Going forward, more attention should be paid to rider-specific information, e.g., the rider's preferred delivery order and the rider's historical visiting order.
Future optimization directions
 The rider's preferred delivery order and the rider's historical visiting order
 The rider's familiarity with restaurants and delivery locations
 More environmental information, e.g., traffic conditions
 For the TP module, introduce information about the road links along the route