Recommendations as Treatments: Debiasing Learning and Evaluation
Authors: Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, Thorsten Joachims
ICML’16 Cornell University
0. Summary
This paper proposes an IPS-based evaluation estimator and a corresponding model-training method, along with two ways of estimating propensity scores. The authors also collect and publish the Coat dataset. On semi-synthetic and unbiased datasets, they verify the robustness of the evaluation metric to propensity-score estimation errors and the performance advantage of IPS-MF.
1. Research objectives
Remove the adverse effects of selection bias on model evaluation and model training.
2. Problem background
Selection bias in recommender systems comes from two sources: first, users are more likely to interact with items they are interested in, so uninteresting items tend to have no interaction records; second, when the recommender system produces a recommendation list, it also tends to recommend items that match the user's interests.
3. IPS evaluation metric
Figure 1
Consider the toy example in Figure 1. The first row shows the true ratings Y, the exposure probabilities P, and the exposure pattern O, where lower ratings have lower observation probabilities. In the second row, \(\hat{Y}_1\) and \(\hat{Y}_2\) represent two different prediction results, and \(\hat{Y}_3\) indicates whether an interaction has occurred.
3.1 Task 1: Evaluation of Score Prediction Accuracy
Ideally, when all ratings are observed, the evaluation metric is
\[
R(\hat{Y})=\frac{1}{U \cdot I} \sum_{u} \sum_{i} \delta_{u, i}(Y, \hat{Y})
\]
However, in the presence of selection bias, only the observed entries can be averaged, and the metric becomes
\[
\hat{R}_{naive}(\hat{Y})=\frac{1}{\left|\left\{(u, i): O_{u, i}=1\right\}\right|} \sum_{(u, i): O_{u, i}=1} \delta_{u, i}(Y, \hat{Y})
\]
From the standpoint of judging likes and dislikes, \(\hat{Y}_1\) is clearly better than \(\hat{Y}_2\); but under the naive metric, the interactions that \(\hat{Y}_2\) mispredicts are rarely observed, so \(\hat{Y}_2\) scores better than \(\hat{Y}_1\).
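A minimal numerical sketch (toy numbers of my own, not the paper's Figure 1) of why the naive metric misleads: entries with low exposure probability contribute little to the observed average, so errors concentrated on them stay hidden.

```python
import numpy as np

Y = np.array([[5.0, 1.0], [4.0, 2.0]])       # true ratings
Y_hat = np.array([[5.0, 3.0], [4.0, 4.0]])   # predictions: errs only on low ratings
P = np.array([[0.9, 0.1], [0.8, 0.2]])       # exposure propensities (low rating -> rarely seen)

delta = (Y - Y_hat) ** 2                     # per-entry squared error

true_risk = delta.mean()                     # ideal metric: all entries observed
# Expected naive metric: errors weighted by how often each entry is observed.
naive_risk = (P * delta).sum() / P.sum()

print(true_risk, naive_risk)                 # the naive metric understates the error
```

Here the true mean squared error is 2.0, but the naive estimate in expectation is only 0.6, because the mispredicted low-rating entries are rarely observed.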
3.2 Recommendation quality evaluation
Evaluating the quality of recommendation results answers a counterfactual question: how much would the user experience improve if the user interacted with the items in the recommendation list rather than with the actual interaction history?
The evaluation metric can be DCG or similar. Since the observed data are biased, as described in 3.1, the resulting metric is also biased.
3.3 Performance Evaluation Based on Propensity Scores
The key to resolving selection bias is understanding the assignment mechanism of the observed data, which comes in two forms: system-generated (experimental setting) and user-selected (observational setting).
To remove the bias in the evaluation metric, the authors weight each observed entry by its inverse propensity score, yielding an unbiased estimator of the ideal metric, the IPS estimator:
\[
\hat{R}_{I P S}(\hat{Y} \mid P)=\frac{1}{U \cdot I} \sum_{(u, i): O_{u, i}=1} \frac{\delta_{u, i}(Y, \hat{Y})}{P_{u, i}}
\]
Its unbiasedness follows from taking the expectation over the observation pattern:
\[
\mathbb{E}_{O}\left[\hat{R}_{I P S}(\hat{Y} \mid P)\right]=\frac{1}{U \cdot I} \sum_{u} \sum_{i} \mathbb{E}_{O_{u, i}}\left[\frac{\delta_{u, i}(Y, \hat{Y})}{P_{u, i}} O_{u, i}\right]=\frac{1}{U \cdot I} \sum_{u} \sum_{i} \delta_{u, i}(Y, \hat{Y})=R(\hat{Y})
\]
where \(O_{u,i} \sim \mathrm{Bernoulli}(P_{u,i})\) and \(P_{u,i}\) is the propensity score.
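A small simulation sketch (synthetic data, squared-error \(\delta\) assumed) illustrating the unbiasedness property: averaging the IPS estimate over many sampled observation patterns recovers the ideal risk.

```python
import numpy as np

rng = np.random.default_rng(0)
U, I = 50, 40
Y = rng.integers(1, 6, size=(U, I)).astype(float)   # true ratings in {1..5}
Y_hat = Y + rng.normal(0, 1, size=(U, I))           # noisy predictions
P = np.clip(0.05 + 0.15 * (Y - 1), 0.05, 1.0)       # higher rating -> more exposure

delta = (Y - Y_hat) ** 2
true_risk = delta.mean()                            # ideal, full-exposure risk

def ips_risk(O):
    """IPS estimate from one observation pattern O ~ Bernoulli(P)."""
    return (O * delta / P).sum() / (U * I)

# Average over many sampled observation patterns: converges to the true risk.
estimates = [ips_risk(rng.random((U, I)) < P) for _ in range(2000)]
print(true_risk, np.mean(estimates))
```

The average of the IPS estimates matches the full-exposure risk, even though each individual sample only sees a biased subset of entries.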
3.4 Experimental verification
Using a full-exposure dataset simulated with MF, the authors design several rating strategies, each with a different error pattern. They then compute the metrics only on the entries exposed according to the real dataset's observation pattern, showing that the IPS estimator effectively cancels the evaluation error caused by selection bias.
4. IPS recommendation system
For an IPS-based recommendation model, the training objective is:
\[
\hat{Y}=\underset{\hat{Y} \in \mathcal{H}}{\operatorname{argmin}}\left[\sum_{(u, i): O_{u, i}=1} \frac{\delta_{u, i}(Y, \hat{Y})}{P_{u, i}}+\lambda \operatorname{Reg}(\hat{Y})\right]
\]
where \(P_{u,i}\) is the propensity score; this is equivalent to attaching a weight \(1/P_{u,i}\) to the corresponding loss term.
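A hedged sketch of this objective as IPS-weighted matrix factorization trained by plain gradient descent (the paper's actual implementation uses a more elaborate optimizer and bias terms; all constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
U, I, k, lam, lr = 30, 20, 4, 0.1, 0.01

V_true = rng.normal(size=(U, k))
W_true = rng.normal(size=(I, k))
Y = V_true @ W_true.T                          # fully known "true" rating matrix
P = np.full((U, I), 0.3)                       # known exposure propensities
O = rng.random((U, I)) < P                     # sampled observation pattern

V = rng.normal(scale=0.1, size=(U, k))         # factors to learn
W = rng.normal(scale=0.1, size=(I, k))
weights = O / P                                # inverse-propensity weights

def ips_loss(V, W):
    return (weights * (V @ W.T - Y) ** 2).sum() / (U * I)

init_loss = ips_loss(V, W)
for _ in range(500):
    E = weights * (V @ W.T - Y)                # IPS-weighted residuals on observed cells
    gV = E @ W / I + lam * V                   # gradients of weighted loss + L2 penalty
    gW = E.T @ V / U + lam * W
    V -= lr * gV
    W -= lr * gW
final_loss = ips_loss(V, W)
print(init_loss, final_loss)
```

Only observed entries carry gradient signal, but each is up-weighted by \(1/P_{u,i}\), so rarely exposed cells count more per observation.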
5. Estimation of Propensity Scores
The authors propose two estimation methods:

Naive Bayes Estimation
Note that this method assigns the same propensity to all \((u,i)\) pairs that share the same rating value \(r\):
\[P\left(O_{u, i}=1 \mid Y_{u, i}=r\right)=\frac{P(Y=r \mid O=1) P(O=1)}{P(Y=r)}
\] 
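A toy sketch of this Bayes-rule computation (made-up rating counts; note that \(P(Y=r)\) must come from a small unbiased MCAR sample, while the other two terms come from the biased logs):

```python
from collections import Counter

mnar_ratings = [5, 5, 4, 5, 3, 4, 5, 4]        # ratings observed under selection bias
mcar_ratings = [1, 2, 5, 3, 4, 1, 2, 3]        # ratings on randomly exposed items
n_total = 100                                   # total number of (u, i) pairs

p_o = len(mnar_ratings) / n_total               # P(O = 1)
mnar_freq = Counter(mnar_ratings)
mcar_freq = Counter(mcar_ratings)

def propensity(r):
    p_y_given_o = mnar_freq[r] / len(mnar_ratings)   # P(Y = r | O = 1), from biased logs
    p_y = mcar_freq[r] / len(mcar_ratings)           # P(Y = r), from the MCAR sample
    return p_y_given_o * p_o / p_y                   # Bayes' rule

print({r: round(propensity(r), 3) for r in [3, 4, 5]})
```

In this toy setup, high ratings get a larger estimated propensity than low ones, matching the intuition that liked items are observed more often.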
Logistic Regression
Learn a linear model using all available information about a \((u,i)\) pair as features \(X_{u,i}\):
\[P_{u, i}=\sigma\left(w^{T} X_{u, i}+\beta_{i}+\gamma_{u}\right)
\]
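A minimal forward-pass sketch of this model (random placeholder features and parameters; fitting them by maximum likelihood is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_users, n_items, d = 5, 4, 3
X = rng.normal(size=(n_users, n_items, d))   # features of each (u, i) pair
w = rng.normal(size=d)                       # shared feature weights
beta = rng.normal(size=n_items)              # per-item offsets
gamma = rng.normal(size=n_users)             # per-user offsets

# P[u, i] = sigmoid(w^T X[u, i] + beta[i] + gamma[u]), via broadcasting
P = sigmoid(X @ w + beta[None, :] + gamma[:, None])
print(P.shape)
```

Unlike the Naive Bayes estimator, this lets the propensity vary per \((u,i)\) pair through the features, not just per rating value.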
6. Experiment
6.1 Experimental setup
The training set is biased (MNAR) data, hyperparameters are tuned with k-fold cross-validation, and unbiased data or synthetic full-exposure data serves as the test set.
6.2 Impact of selection bias on evaluation metrics
Build a fully exposed semi-synthetic dataset: on ML100K, MF is used to fill in all missing ratings, and the filled-in rating distribution is adjusted to reduce the proportion of high scores.
The experimental results are presented in Section 3.4.
6.3 The impact of sampling bias on model training
For different degrees of selection bias (the smaller \(\alpha\), the greater the selection bias), the experimental results are shown in the figure below.
It can be seen that IPS-MF and SNIPS-MF significantly outperform naive MF.
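For reference, the SNIPS variant used here divides by the realized sum of inverse-propensity weights instead of \(U \cdot I\), which typically lowers variance at the cost of a small bias. A single-sample sketch on synthetic data (squared-error \(\delta\) assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
U, I = 40, 30
Y = rng.integers(1, 6, size=(U, I)).astype(float)
Y_hat = Y + rng.normal(0, 1, size=(U, I))
P = np.clip(0.05 + 0.15 * (Y - 1), 0.05, 1.0)  # higher rating -> more exposure
O = rng.random((U, I)) < P                     # one observation pattern
delta = (Y - Y_hat) ** 2

w = O / P                                      # inverse-propensity weights
r_ips = (w * delta).sum() / (U * I)            # IPS: fixed normalizer U*I
r_snips = (w * delta).sum() / w.sum()          # SNIPS: self-normalized
print(r_ips, r_snips)
```

Because \(\mathbb{E}[\sum w] = U \cdot I\), the two estimates agree in expectation, but the self-normalized ratio is less sensitive to unlucky draws of the observation pattern.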
6.4 Impact of Propensity Score Estimation Accuracy
Estimating propensity scores with different amounts of data, IPS-MF and SNIPS-MF outperform naive MF under all conditions, validating the method's robustness to propensity-score estimation errors.
6.5 Performance on real datasets
Yahoo! R3: 5% of the unbiased (MCAR) data is used to estimate propensity scores, and the remaining 95% serves as the test set.
Coat: the paper collects and releases a new unbiased dataset, Coat (a notable contribution), with 290 users and 300 items; each user rates 24 self-selected items and 16 randomly chosen items on a 1-5 scale.
Experimental results show that the IPS-based models outperform the strongest baselines on both datasets.