[Paper Notes] Recommendations as Treatments: Debiasing Learning and Evaluation



Authors: Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, Thorsten Joachims

ICML'16, Cornell University

0. Summary

This paper proposes an IPS-based evaluation metric and a corresponding model training method, along with two propensity-score estimation approaches. The authors also collect and release the Coat dataset. On semi-synthetic and unbiased datasets, they verify both the robustness of the evaluation metric to propensity-score estimation errors and the performance advantage of IPS-MF.

1. Research objectives

Remove the adverse effects of selection bias on model evaluation and model training.

2. Problem background

Selection bias in recommender systems has two main sources: first, users are more likely to interact with items they are interested in, so items that do not interest them tend to have no interaction records; second, when the recommender system generates a recommendation list, it also tends to recommend items that match the user's interests.

3. IPS evaluation index


Figure 1

Consider the example in Figure 1. The first row shows the true ratings Y, the exposure probabilities P, and the exposure matrix O, where lower ratings have lower observation probabilities. The second row shows \(\hat{Y}_1\) and \(\hat{Y}_2\), two different prediction results, while \(\hat{Y}_3\) indicates whether an interaction occurred.

3.1 Task 1: Evaluation of Score Prediction Accuracy

Ideally, when all ratings are observed, the evaluation metric is

\[R(\hat{Y})=\frac{1}{U \cdot I} \sum_{u=1}^{U} \sum_{i=1}^{I} \delta_{u, i}(Y, \hat{Y})\]

However, in the presence of selection bias, the evaluation index will become

\[\hat{R}_{naive}(\hat{Y})=\frac{1}{\left|\left\{(u, i): O_{u, i}=1\right\}\right|} \sum_{(u, i): O_{u, i}=1} \delta_{u, i}(Y, \hat{Y})\]

From the standpoint of distinguishing likes from dislikes, \(\hat{Y}_1\) is clearly better than \(\hat{Y}_2\); but under the naive metric, because the interactions that \(\hat{Y}_2\) mispredicts are rarely observed, \(\hat{Y}_2\) would score better than \(\hat{Y}_1\).

3.2 Recommendation quality evaluation

Evaluating the quality of a recommendation list answers a counterfactual question: how much would the user experience improve if the user interacted with the items in the recommendation list, rather than with the actual interaction history?

The evaluation metric can be DCG or similar. Since the observed data are biased, the resulting metric is biased as well, analogous to Section 3.1.
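For a concrete sketch of such a per-entry quality term, one common DCG-style choice is \(\delta_{u,i} = Y_{u,i} / \log_2(1 + \mathrm{rank}_u(i))\), where the ranking comes from sorting the predicted scores. This is a generic illustration, not necessarily the paper's exact definition of \(\delta\); all variable names below are my own:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: true ratings Y and noisy predictions Y_hat.
n_users, n_items = 100, 50
Y = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
Y_hat = Y + rng.standard_normal(Y.shape)

# rank_u(i): 1-based position of item i when items are sorted by Y_hat, best first.
order = np.argsort(-Y_hat, axis=1)
ranks = np.empty_like(order)
np.put_along_axis(ranks, order, np.arange(1, n_items + 1)[None, :], axis=1)

# DCG-style per-entry quality: the top-ranked item contributes its full rating,
# lower-ranked items are discounted logarithmically.
delta = Y / np.log2(1 + ranks)
```

The same IPS weighting described in Section 3.3 can then be applied to these \(\delta\) values in place of a rating-error term.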

3.3 Performance Evaluation Based on Propensity Scores

The key to correcting selection bias is understanding the assignment mechanism behind the observed data, which covers two settings: exposure controlled by the system (Experimental Setting) and exposure driven by user self-selection (Observational Setting).

To remove the bias of the evaluation metric, the author weights the observed data by inverse propensity scores, yielding an unbiased estimator of the ideal metric, the IPS estimator:

\[\hat{R}_{IPS}(\hat{Y} | P)=\frac{1}{U \cdot I} \sum_{(u, i): O_{u, i}=1} \frac{\delta_{u, i}(Y, \hat{Y})}{P_{u, i}}\]

\[\mathbb{E}_{O}\left[\hat{R}_{IPS}(\hat{Y} | P)\right]=\frac{1}{U \cdot I} \sum_{u} \sum_{i} \mathbb{E}_{O_{u, i}}\left[\frac{\delta_{u, i}(Y, \hat{Y})}{P_{u, i}} O_{u, i}\right]=\frac{1}{U \cdot I} \sum_{u} \sum_{i} \delta_{u, i}(Y, \hat{Y})=R(\hat{Y})\]

where \(O_{u,i} \sim \operatorname{Bernoulli}(P_{u,i})\) and \(P_{u,i}\) is the propensity score.
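The unbiasedness above can be checked numerically. The following sketch uses toy data (my own construction, not the paper's figures): the naive estimator is dominated by the frequently observed high ratings, while the IPS estimate stays close to the true risk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MNAR setup: half the true ratings are 5, half are 1, and high ratings
# are observed ten times more often than low ones.
n_users, n_items = 200, 300
Y = rng.choice([1.0, 5.0], size=(n_users, n_items))
P = np.where(Y == 5.0, 0.5, 0.05)          # propensities P_{u,i}
O = rng.random((n_users, n_items)) < P     # O_{u,i} ~ Bernoulli(P_{u,i})

Y_hat = np.full_like(Y, 4.0)               # a fixed predictor to evaluate
delta = np.abs(Y - Y_hat)                  # per-entry loss (MAE): 1 on 5s, 3 on 1s

true_risk = delta.mean()                             # ideal metric R(Y_hat), ~2
naive = delta[O].mean()                              # biased: mostly 5-star entries
ips = (delta[O] / P[O]).sum() / (n_users * n_items)  # IPS estimator

print(f"true={true_risk:.3f}  naive={naive:.3f}  ips={ips:.3f}")
```

Here the naive estimate badly underestimates the risk, because the poorly predicted 1-star entries are rarely observed.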

3.4 Experimental verification

Using a fully exposed semi-synthetic dataset generated by MF, the authors design several prediction strategies, each with a different error pattern. Computing the metrics only on the entries exposed according to the real dataset's exposure pattern shows that the IPS metric effectively cancels the evaluation error caused by selection bias.


4. IPS recommendation system

For the IPS-based recommender system, the training objective is:

\[\underset{V, W, A}{\operatorname{argmin}}\left[\sum_{O_{u, i}=1} \frac{\delta_{u, i}\left(Y, V^{T} W+A\right)}{P_{u, i}}+\lambda\left(\|V\|_{F}^{2}+\|W\|_{F}^{2}\right)\right]\]

where \(P_{u,i}\) is the propensity score, which amounts to adding a weight to the corresponding loss term.
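A minimal gradient-descent sketch of this objective with squared-error \(\delta\) (my own illustration, not the authors' implementation; the offset matrix \(A\) is omitted for brevity, and constant factors in the gradient are absorbed into the learning rate):

```python
import numpy as np

rng = np.random.default_rng(1)

n_users, n_items, k = 60, 50, 4
lam, lr = 0.1, 0.002

Y = rng.integers(1, 6, size=(n_users, n_items)).astype(float)  # true ratings
P = np.clip(rng.random((n_users, n_items)), 0.05, 1.0)         # assumed known propensities
O = (rng.random((n_users, n_items)) < P).astype(float)         # MNAR exposure mask

V = 0.1 * rng.standard_normal((k, n_users))   # user factors
W = 0.1 * rng.standard_normal((k, n_items))   # item factors

def ips_objective(V, W):
    # Squared error on observed entries, each divided by its propensity.
    sq_err = ((V.T @ W - Y) ** 2) * O / P
    return sq_err.sum() + lam * ((V ** 2).sum() + (W ** 2).sum())

before = ips_objective(V, W)
for _ in range(500):
    E = (V.T @ W - Y) * O / P                 # propensity-weighted residuals
    V -= lr * (W @ E.T + lam * V)             # full-batch gradient steps
    W -= lr * (V @ E + lam * W)
after = ips_objective(V, W)

print(f"objective: {before:.1f} -> {after:.1f}")
```

Low-propensity observations receive large weights \(1/P_{u,i}\), which is exactly what counteracts their under-representation in the training data.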

5. Estimation of Propensity Scores

The authors propose two estimation methods:

  1. Naive Bayes Estimation

    This method assigns the same propensity to every user-item pair sharing the same rating value \(r\); \(P(Y=r)\) is estimated from a small unbiased (MCAR) sample:

    \[P\left(O_{u, i}=1 \mid Y_{u, i}=r\right)=\frac{P(Y=r \mid O=1) P(O=1)}{P(Y=r)}\]

  2. Logistic regression

    Learn a linear model that uses all available information about the \((u, i)\) pair as features \(X_{u,i}\):

    \[P_{u, i}=\sigma\left(w^{T} X_{u, i}+\beta_{i}+\gamma_{u}\right)\]
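The Naive Bayes estimator can be sketched as follows (my own illustration; function and variable names are assumptions). \(P(Y=r\mid O=1)\) and \(P(O=1)\) come from the MNAR data, while \(P(Y=r)\) needs a small MCAR sample; the logistic-regression variant would instead fit \(w\), \(\beta_i\), \(\gamma_u\) on observation indicators, which requires per-pair features \(X_{u,i}\) and is omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)

def nb_propensities(mnar_ratings, mcar_ratings, n_pairs, values=(1, 2, 3, 4, 5)):
    """Naive Bayes estimate: one propensity per rating value r."""
    p_o = len(mnar_ratings) / n_pairs                    # P(O = 1)
    return {r: np.mean(mnar_ratings == r) * p_o / np.mean(mcar_ratings == r)
            for r in values}

# Simulation: uniform ratings 1..5; the true propensity grows with the rating.
n_users, n_items = 500, 500
Y = rng.integers(1, 6, size=(n_users, n_items))
P_true = 0.05 + 0.1 * (Y - 1)                            # r=1 -> 0.05, r=5 -> 0.45
O = rng.random(Y.shape) < P_true

mnar = Y[O]                                                # self-selected observations
mcar = Y.ravel()[rng.choice(Y.size, 10000, replace=False)] # small random sample

props = nb_propensities(mnar, mcar, n_pairs=Y.size)
print(props)
```

On this simulation the recovered propensities track the true per-rating exposure probabilities, confirming the shared-propensity-per-rating behavior of the estimator.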

6. Experiment

6.1 Experimental setup

The training set is biased (MNAR) data; hyperparameters are tuned with k-fold cross-validation; unbiased data or the synthetic fully exposed data serve as the test set.

6.2 Influence of sampling deviation on evaluation indicators

Building the fully exposed synthetic dataset: on the ML 100K dataset, MF is used to fill in all missing ratings, and the filled-in rating distribution is then adjusted to reduce the proportion of high ratings.

The experimental results were already discussed in Section 3.4.

6.3 The impact of sampling bias on model training

For varying degrees of selection bias (the smaller \(\alpha\), the stronger the selection bias), the experimental results are shown in the figure below.

It can be seen that IPS-MF and SNIPS-MF (the self-normalized variant) both perform significantly better than naive MF.


6.4 Impact of Propensity Score Estimation Accuracy

Using different amounts of data to estimate the propensity scores, IPS-MF and SNIPS-MF outperform naive MF under all conditions, validating the robustness of the method to propensity-score estimation error.


6.5 Performance on real datasets

Yahoo! R3: 5% of the unbiased data is used to estimate propensity scores, and the remaining 95% of the unbiased data serves as the test set.

Coat: the paper collects and releases a new unbiased dataset, Coat (a great contribution), with 290 users and 300 items; each user selects 24 items to rate and additionally rates 16 randomly chosen items (ratings 1-5).

Experimental results show that IPS-MF outperforms the best baselines on both datasets.