Advanced A/B Testing Series 3: What If the A/B Groups Are Not Random? Observational Studies and the Propensity Score



Randomization is often called the core of an A/B experiment. Why is it so important? Because randomization leaves the A and B groups with no systematic difference overall, so the experimental effect (ATE) can be estimated accurately:

\[ATE = E[Y(1) - Y(0)]\]

How is randomness defined? According to the Rubin causal model, two conditions must hold in a randomized experiment for the estimate above to be unbiased:

  1. SUTVA
    • No interference between experimental units
    • The treatment is comparable (the same) across experimental units
  2. Ignorability (unconfoundedness is a stronger assumption)
    Whether a unit receives the treatment is independent of its potential outcomes. In terms of a causal diagram, no other variable affects both treatment and outcome at the same time:
    \[Y(1),Y(0) \perp Z \]

SUTVA is generally assumed to hold. Online experiments usually satisfy it, but many offline experiments struggle to guarantee it. For example, deploying more vehicles in some areas can leave other areas short of capacity, so there is implicit interference between units. That is outside the scope of this post.

In a randomized experiment, ignorability is guaranteed by random assignment. In an observational study or an incompletely randomized experiment, ignorability does not hold. The solution is to account for the variables X that affect both treatment assignment and the outcome, which gives conditional ignorability, that is:

\[Y(1),Y(0) \perp Z | X\]
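
To make the difference between ignorability and conditional ignorability concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the data are synthetic, `x` is a single hypothetical binary confounder, and the true effect is fixed at 1.0. It is not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical setup: x is a binary confounder that raises both the
# probability of treatment and the outcome. The true effect of z is 1.0.
x = rng.binomial(1, 0.5, size=n)
z = rng.binomial(1, np.where(x == 1, 0.8, 0.2))
y = 1.0 * z + 2.0 * x + rng.normal(0.0, 1.0, size=n)

# Naive difference in means ignores x, so it absorbs the effect of x.
naive = y[z == 1].mean() - y[z == 0].mean()

# Conditioning on x: estimate the effect within each level of x,
# then average over the distribution of x.
adjusted = 0.0
for value in (0, 1):
    mask = x == value
    effect = y[mask & (z == 1)].mean() - y[mask & (z == 0)].mean()
    adjusted += effect * mask.mean()

print(f"naive={naive:.3f} adjusted={adjusted:.3f}")
```

With one binary confounder, exact conditioning is easy. The difficulty discussed next comes from X being high-dimensional, where exact cells like these are mostly empty.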

That is the theory, but X is often unknown and high-dimensional, so it is unrealistic to find samples with exactly the same X to estimate the ATE. One solution is propensity score matching, introduced below. The name sounds fancy, the calculation is simple, and it should be used with caution.

Next, I introduce the core methods and then use a medical dataset from Kaggle to briefly compare them.

Core method


The propensity score approach has two steps: estimating the score and using the score. The score is defined as follows:
\[\text{Propensity Score} = P(Z=1 \mid X), \quad X \in R^n\]

One way to understand it is as a model of how \(X \in R^n\) influences Z, distilling all the information in the confounding covariates. Another is to treat \(P(z|x)\) as a measure of similarity (distance between samples). I personally tend to see it as a targeted dimensionality reduction (\(n \to 1\)), or as a clustering of similar samples.

Then, based on the score, we stratify, match, or weight the samples so that the conditional ignorability above holds.

Propensity score estimation

Estimation itself is a classic binary classification problem: based on the features, we predict the probability that each sample enters the treatment group. Several classic papers (before 2011) use logistic regression, but today tree ensemble methods such as XGBoost and LightGBM should handle heterogeneous features better and be more accurate. Moreover, a tree naturally guarantees that samples in the same leaf node have the same score and similar features. [Of course, LR is preferred if your dataset is very small.]
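
As a rough sketch of the estimation step, the snippet below fits a logistic regression for P(Z=1 | X) on synthetic data. To stay self-contained it uses a small hand-written Newton solver in NumPy rather than a library classifier; `fit_logistic`, `propensity_score`, and the simulated coefficients are all hypothetical names and values chosen for illustration.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25, l2=1e-6):
    """Fit logistic regression by Newton's method; weights[0] is the intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        grad = Xb.T @ (p - z) + l2 * w
        hess = (Xb * (p * (1.0 - p))[:, None]).T @ Xb + l2 * np.eye(len(w))
        w -= np.linalg.solve(hess, grad)
    return w

def propensity_score(X, w):
    """Predicted probability of treatment for each row of X."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Synthetic data: two covariates drive treatment assignment (assumed values).
rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 2))
true_logit = 0.5 * X[:, 0] - 1.0 * X[:, 1]
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

w = fit_logistic(X, z)
score = propensity_score(X, w)
print("fitted weights:", np.round(w, 2))
```

In practice a library model (logistic regression or a tree ensemble) would replace the solver; only the resulting `score` vector matters for the steps that follow.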

Here are two points to pay attention to when modeling:

1. Feature selection

The features can be roughly divided into three categories:

  • Features that affect only the treatment
  • Features that affect only the outcome
  • Confounders that affect both treatment and outcome

Confounders must certainly be included; removing confounding bias is the core of an A/B experiment. Opinions differ, however, on whether to include features that affect only the treatment or only the outcome.

Taking the various conclusions together, it is fine to add features that affect the outcome. In fact, looking back at the previous post in this series (Series 2, on more sensitive A/B experiments with CUPED), adding features that affect the outcome is like using CUPED in disguise, and it may reduce the variance of the estimate of the core metric.

Adding features that affect only the treatment may reduce the overlap between the propensity score distributions of the treatment and control groups, so that some treatment samples cannot find matching control samples. This needs careful consideration.

2. Model test

Evaluating the model fit with AUC and cross-entropy alone is not enough. This relates to the balancing property of the propensity score:
\[Z \perp X \mid \text{Propensity Score}\]

In short, samples with similar scores should have similar X. You can check this visually with a boxplot or violin plot, or more rigorously with a statistical test such as a t-test on whether X differs between the groups.
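
One way to sketch this balance check numerically is shown below. Note the substitution: instead of the t-test mentioned above, it uses the standardized mean difference (SMD), which does not grow with sample size. The data are synthetic, the true score stands in for a fitted one, and `smd` is a hypothetical helper.

```python
import numpy as np

def smd(x, z):
    """Standardized mean difference of covariate x between z == 1 and z == 0."""
    x1, x0 = x[z == 1], x[z == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2.0)
    return (x1.mean() - x0.mean()) / pooled_sd

# Synthetic data: both covariates affect treatment assignment (assumed model).
rng = np.random.default_rng(2)
n = 100_000
X = rng.normal(size=(n, 2))
score = 1.0 / (1.0 + np.exp(-(X[:, 0] + X[:, 1])))  # true score, for illustration
z = rng.binomial(1, score)

# Imbalance before any adjustment.
overall = np.array([smd(X[:, j], z) for j in range(2)])

# Imbalance within score deciles: rows are strata, columns are covariates.
edges = np.quantile(score, np.linspace(0, 1, 11)[1:-1])
stratum = np.digitize(score, edges)
within = np.array([[smd(X[stratum == k, j], z[stratum == k]) for j in range(2)]
                   for k in range(10)])
mean_abs_within = np.abs(within).mean(axis=0)

print("overall SMD:", np.round(overall, 3))
print("mean |SMD| within deciles:", np.round(mean_abs_within, 3))
```

If the within-stratum differences stay large for some covariate, that is a sign the score model is missing something about it, regardless of how good the AUC looks.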

Using the score

There are usually four ways to use the propensity score. Let's briefly go through them one by one.


Matching

In a word: treatment and control samples are matched according to their propensity scores.

Each treatment sample is matched to control samples on the score, [1-to-1 / 1-to-N] and [with / without replacement]. Beyond the limits of the actual data volume, the choice of parameters here is also a bias-variance trade-off. One option is to compute the ATE for 1 to N matches in turn, given the sample size and the requirement that the score difference be below a threshold. If the results differ too much, the method itself needs adjusting.

There are also trimming methods that drop samples with extreme scores for which no match can be found (e.g. \(score \to 0\)). In some cases, though, trimming will be questioned. (Xiaoming: if you throw away some high-income samples, the ROI is bound to be skewed. How can you calculate it like that?)

When the data volume allows, I prefer N-to-1 matching with replacement. In most scenarios it is impossible to account for every covariate, which means the propensity score estimate is bound to be biased on some features. In that case, matching several samples can reduce the bias.
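
The matching procedure can be sketched as follows, under stated assumptions: nearest-neighbor matching on the score with replacement, a caliper that trims treatment samples without a close enough match, and synthetic data with a constant true effect of 1.0. Because each treatment sample is matched to controls, this estimates the effect on the treated, which coincides with the ATE only because the simulated effect is constant. `matched_effect` is a hypothetical helper.

```python
import numpy as np

def matched_effect(score, z, y, n_neighbors=1, caliper=0.02):
    """Nearest-neighbor matching on the score, with replacement.

    A treatment sample is trimmed if its n-th nearest control is farther
    than `caliper`. Returns (mean matched difference, number of samples kept).
    """
    control = np.flatnonzero(z == 0)
    control_score = score[control]
    diffs = []
    for i in np.flatnonzero(z == 1):
        dist = np.abs(control_score - score[i])
        nearest = np.argpartition(dist, n_neighbors - 1)[:n_neighbors]
        if dist[nearest].max() <= caliper:
            diffs.append(y[i] - y[control[nearest]].mean())
    return float(np.mean(diffs)), len(diffs)

# Synthetic data: x confounds treatment and outcome; true effect is 1.0.
rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=n)
score = 1.0 / (1.0 + np.exp(-x))  # true score, used in place of a fitted one
z = rng.binomial(1, score)
y = 1.0 * z + 2.0 * x + rng.normal(size=n)

naive = y[z == 1].mean() - y[z == 0].mean()
results = {k: matched_effect(score, z, y, n_neighbors=k) for k in (1, 2, 3, 4)}

print(f"naive={naive:.3f}")
for k, (effect, kept) in results.items():
    print(f"match 1-to-{k}: effect={effect:.3f}, treated kept={kept}")
```

Comparing the estimates across the number of neighbors, as in the loop above, is a cheap stability check on the choice of N and caliper.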


Stratification

In a word: treatment and control samples are grouped into strata by score similarity, and the ATE is calculated within each stratum.

There is no fixed rule for how to form the strata, as long as each stratum has enough samples to calculate the ATE. This too is a bias-variance trade-off: the more strata, the fewer samples in each and the larger the variance. There are usually two ways to bucket by quantile:

  • Compute the quantiles on the whole sample
  • Compute the quantiles on the smaller group (usually the treatment group) and use them as the stratum boundaries

Trimming can also be used here, but consider it carefully in light of the specific business scenario.
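
Here is a minimal sketch of stratification, assuming quantile boundaries computed on the whole sample and synthetic data with a true effect of 1.0. `stratified_ate` is a hypothetical helper; strata lacking either group are skipped, which is a simple form of trimming.

```python
import numpy as np

def stratified_ate(score, z, y, n_strata=5):
    """Size-weighted average of within-stratum differences in means."""
    edges = np.quantile(score, np.linspace(0, 1, n_strata + 1)[1:-1])
    stratum = np.digitize(score, edges)
    total, used = 0.0, 0
    for k in range(n_strata):
        mask = stratum == k
        treated, control = mask & (z == 1), mask & (z == 0)
        if treated.sum() == 0 or control.sum() == 0:
            continue  # a stratum without both groups cannot contribute
        total += (y[treated].mean() - y[control].mean()) * mask.sum()
        used += mask.sum()
    return total / used

# Synthetic data: x confounds treatment and outcome; true effect is 1.0.
rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=n)
score = 1.0 / (1.0 + np.exp(-x))  # true score, used in place of a fitted one
z = rng.binomial(1, score)
y = 1.0 * z + 2.0 * x + rng.normal(size=n)

estimates = {k: stratified_ate(score, z, y, n_strata=k) for k in (2, 5, 20)}
errors = {k: abs(v - 1.0) for k, v in estimates.items()}
for k, v in estimates.items():
    print(f"{k} strata: ATE={v:.3f}")
```

With this much data, more strata mainly reduce the residual confounding inside each stratum. On a small dataset the same change quickly leaves strata with too few samples, which is the variance side of the trade-off.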

Inverse probability of treatment weighting (IPTW)

In a word: samples are weighted by the inverse of the propensity score.

For a completely randomized A/B experiment, the propensity score should be around 0.5 for every sample. For an incompletely randomized experiment, weighting by the propensity score rebalances Z to equal weight when calculating the ATE, as follows:

\[
\begin{aligned}
e &= P(Z=1|x) \\
w &= \frac{z}{e} + \frac{1-z}{1-e} \\
ATE &= \frac{1}{n}\sum_{i=1}^n\frac{z_iY_i}{e_i} - \frac{1}{n}\sum_{i=1}^n\frac{(1-z_i)Y_i}{1-e_i}
\end{aligned}
\]

I have two reservations about this method. First, although matching and stratification also use the score, what they really use is the ordering of sample similarity that the score provides, not the score value itself, so they tolerate some inaccuracy in the score estimate. Second, using the score as a denominator easily causes extreme-value problems when \(score \to 0/1\); these require manual adjustment, and whether the adjustment is reasonable will itself be questioned.
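
The weighting estimator can be sketched in a few lines. The assumptions here are synthetic data with a true effect of 1.0, the true score used in place of a fitted one, and a hypothetical `clip` parameter that bounds the score away from 0 and 1 as one possible manual adjustment for extreme values.

```python
import numpy as np

def iptw_ate(score, z, y, clip=0.01):
    """IPTW estimate of the ATE; scores are clipped to [clip, 1 - clip]."""
    e = np.clip(score, clip, 1.0 - clip)
    treated_term = np.mean(z * y / e)
    control_term = np.mean((1 - z) * y / (1.0 - e))
    return treated_term - control_term

# Synthetic data: x confounds treatment and outcome; true effect is 1.0.
rng = np.random.default_rng(5)
n = 50_000
x = rng.normal(size=n)
score = 1.0 / (1.0 + np.exp(-x))
z = rng.binomial(1, score)
y = 1.0 * z + 2.0 * x + rng.normal(size=n)

naive = y[z == 1].mean() - y[z == 0].mean()
ate = iptw_ate(score, z, y)

# The largest weights show how much a few samples can dominate the estimate.
weights = z / score + (1 - z) / (1.0 - score)
print(f"naive={naive:.3f} iptw={ate:.3f} max weight={weights.max():.1f}")
```

The estimate is only this well behaved because the score is known exactly here. With a misfitted score, both the weights and the choice of `clip` directly change the answer, which is the concern raised above.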

Covariate adjustment

In a word: I have had little exposure to this method, and I am not fond of its dependence on a model.

Application example

The data source is Kaggle's open dataset Heart Disease UCI [data link].
The dataset uses gender, age, history of chest pain, and other medical indicators to predict the probability that a person has heart disease.

The dataset is small, with few rows and few features. What follows is only an exploration of the methods; the confidence of the results is not discussed.

Here we treat the data as a sample from an observational study and ask whether women (sex = 0) or men (sex = 1) are more likely to have heart disease. The data are as follows:

[Figures omitted: data preview]

Taken directly from the data, men are 30% less likely to have heart disease than women! WHAT?!
Since the dataset is very small, we use LR to estimate the propensity score. The distribution of scores for men and women is as follows:

[Figure omitted: propensity score distribution by sex]

Next, I use stratification, matching, and IPTW to estimate the ATE.


Stratification

I tried computing the quantiles on both the treatment group and the whole sample to calculate the ATE. With quantiles from the treatment group, one stratum had too few control samples, so I changed it to two groups. The resulting ATE was -0.15 to -0.16, about half the size of the direct full-sample estimate!

To choose the number of strata, we need to make sure that each stratum has enough treatment and control samples and that the covariate distributions within each stratum are similar.



Matching

The results are as follows: with trim & match 1 ~ 4, plus without trim & match 1 ~ 4. The final ATE estimate is similar to the stratification result above, at -0.15 to -0.16. It is also fairly robust: the number of matches has little influence on the ATE.

We find that the ATE becomes more and more significant as the number of matched samples grows. So is a larger N always better? Not really, because the p-value is a function of sample size: as the sample grows, even a 'small' change becomes significant. So I don't think choosing the best N is very important. Comparing how stable the ATE is across different N may be more meaningful.
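
A small sketch of why significance tracks sample size: for a fixed effect and spread, the two-sided p-value of a z-test on a difference in means shrinks as the groups grow. The effect size, standard deviation, and group sizes below are made-up numbers for illustration, not values from the heart disease data.

```python
import math

def two_sided_p(effect, sd, n_per_group):
    """Two-sided z-test p-value for a difference in means, equal group sizes."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z_stat = abs(effect) / se
    return math.erfc(z_stat / math.sqrt(2.0))

# Same hypothetical effect and spread, increasing sample size.
p_values = {n: two_sided_p(effect=-0.05, sd=0.5, n_per_group=n)
            for n in (100, 1000, 10000)}
for n, p in p_values.items():
    print(f"n per group={n}: p={p:.3g}")
```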



IPTW

As expected, the result is rather strange. On the one hand, there is little data (more than 100 samples); on the other hand, there are few confounder features, so the score certainly does not fit well. So what we get is a positive result…

That is about all for PSM. Feedback and comments of all kinds are welcome. In the next post, we will discuss how to deal with low penetration / dilution in an experiment. If you are interested in this series, see also:

Advanced A/B Testing Series 2: a more sensitive A/B experiment, CUPED!
Advanced A/B Testing Series 1: a GitHub collection of practical HTE (heterogeneous treatment effects) papers

Papers and materials

  1. Austin, P. C. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res. 2011 May; 46(3): 399-424. [paper link]
  2. Austin, P. C., Stuart, E. A. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med. 2015; 34: 3661-3679. [paper link]
  3. King, G., & Nielsen, R. (2019). Why propensity scores should not be used for matching. Political Analysis, 1-20. [paper link]
  4. Morgan, S. & Winship, C. (2015). Counterfactuals and causal inference: methods and principles for social research. Cambridge: Cambridge University Press. Page 142. [link]
  5. Guion, R. (2019). Causal inference in Python
  6. King, G. (2018). Matching methods for causal inference. Published presentation given at Microsoft Research, Cambridge, MA on 1/19/2018. [link]