Unlike the causal tree introduced earlier, which estimates treatment effects directly, meta-learners are indirect estimators. Instead of modeling the treatment effect itself, they model the response (the target) and use the change in the target induced by the treatment as the HTE estimate. There are three main methods: the T-learner, S-learner, and X-learner. The common idea is to approximate a causal quantity on top of standard supervised models.

The advantages of meta-learners are obvious: any supervised ML model can be used as the base learner, and no new estimator needs to be built. So if there is a need for DNN- or LightGBM-based models, a meta-learner makes a good benchmark.

## Core paper

Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10), 4156–4165.

## Model

### T-Learner

T stands for "two". This is the most traditional way of applying ML models to causal inference: fit one model on the control group and one on the treatment group, then take the difference between the two models' predictions for each sample as the HTE estimate.

\[
\begin{align}
\mu_0(x) &= E[Y(0)|X = x]\\
\mu_1(x) &= E[Y(1)|X = x]\\
\hat{\tau}(x) &= \hat{\mu}_1(x) - \hat{\mu}_0(x)
\end{align}
\]
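The T-learner can be implemented with any supervised regressor. Below is a minimal sketch on simulated data, assuming a randomized assignment and using scikit-learn's `GradientBoostingRegressor` as the base learner; the data-generating process is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
W = rng.integers(0, 2, size=n)      # randomized treatment assignment
tau = 1.0 + X[:, 0]                 # assumed true heterogeneous effect
Y = X[:, 1] + W * tau + rng.normal(scale=0.5, size=n)

# Fit one outcome model per group: mu_0 on controls, mu_1 on treated
mu0 = GradientBoostingRegressor().fit(X[W == 0], Y[W == 0])
mu1 = GradientBoostingRegressor().fit(X[W == 1], Y[W == 1])

# HTE estimate: difference between the two models' predictions
tau_hat = mu1.predict(X) - mu0.predict(X)
```

The two models never share any data, which is exactly the source of the problems listed below.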

The T-learner has three obvious problems:

- The control-group model never learns the treatment-group pattern, and the treatment-group model never uses the control-group data. Because the two models are fit in complete isolation, each can carry its own bias, which can lead to large errors in the difference of their predictions.
- The T-learner limits the treatment to discrete values.
- In most cases the treatment effect is very small relative to the response itself, so any estimation bias in the response models has an outsized impact on the effect estimate.

### S-Learner

S stands for "single". The control and treatment groups are pooled into one training set, with the treatment indicator added as a feature. We then use imputation: for each sample, the difference between the model's prediction when it is assigned to treatment and when it is assigned to control serves as the estimate of the treatment effect.

\[
\begin{align}
\mu(x, w) &= E[Y|X = x, W = w]\\
\hat{\tau}(x) &= \hat{\mu}(x, 1) - \hat{\mu}(x, 0)
\end{align}
\]
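The imputation step amounts to predicting twice with the treatment column forced to 1 and then 0. A minimal sketch, again on invented simulated data with `GradientBoostingRegressor` assumed as the base learner:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
W = rng.integers(0, 2, size=n)      # randomized treatment assignment
Y = X[:, 1] + W * (1.0 + X[:, 0]) + rng.normal(scale=0.5, size=n)

# Single model: the treatment indicator is appended as an extra feature
XW = np.column_stack([X, W])
mu = GradientBoostingRegressor().fit(XW, Y)

# Impute both potential outcomes for every sample and take the difference
tau_hat = (mu.predict(np.column_stack([X, np.ones(n)]))
           - mu.predict(np.column_stack([X, np.zeros(n)])))
```

Note that if the model assigns the treatment column little importance, `tau_hat` shrinks toward zero, which is the weakness discussed next.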

The problem with the S-learner is likewise that it essentially fits the response. With a tree as the base learner, the final HTE simply reduces to samples falling into different leaf nodes and the treatment/control difference within each leaf. But because the tree models the outcome rather than the treatment effect, it may well fail to learn an effective population split.

The idea behind the S-learner is very common: it is the same as Individual Conditional Expectation (ICE) in explainable machine learning, and averaging over the whole sample gives the partial dependence plot.

### X-Learner

The X-learner combines the T-learner and S-learner to address the problems above. The steps are as follows:

- Fit separate models \(M_1\) and \(M_2\) on the control and treatment groups, exactly as in the T-learner.
- Impute the counterfactual for each group: predict control-group outcomes with the treatment-group model and treatment-group outcomes with the control-group model, and treat the difference between the imputed and actual values as an approximation of the HTE. This is the same imputation idea used by the S-learner.
- Fit models \(M_3\) and \(M_4\) on these imputed effects for the control and treatment groups respectively, giving two predictions per sample, then combine them with a weight. The propensity score is a common choice; in a randomized experiment the group proportions can be used directly, and with a 50/50 traffic split a constant 0.5 is fine.

\[
\begin{align}
\hat{\mu}_0(x) &= M_1(Y^0 \sim X^0)\\
\hat{\mu}_1(x) &= M_2(Y^1 \sim X^1)\\
\hat{D}_1(x) &= Y_1 - \hat{\mu}_0(x)\\
\hat{D}_0(x) &= \hat{\mu}_1(x) - Y_0\\
\hat{\tau}_0 &= M_3(\hat{D}_0(x) \sim X_0)\\
\hat{\tau}_1 &= M_4(\hat{D}_1(x) \sim X_1)\\
\hat{\tau}(x) &= g(x)\,\hat{\tau}_0(x) + (1-g(x))\,\hat{\tau}_1(x)
\end{align}
\]
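The steps above can be sketched on the same kind of invented simulated data; the base learners are an assumption, and since the assignment is randomized, a constant propensity `g` is used in place of a fitted propensity model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
W = rng.integers(0, 2, size=n)      # randomized treatment assignment
Y = X[:, 1] + W * (1.0 + X[:, 0]) + rng.normal(scale=0.5, size=n)
X0, Y0 = X[W == 0], Y[W == 0]
X1, Y1 = X[W == 1], Y[W == 1]

# Step 1: group-wise outcome models M_1, M_2, as in the T-learner
mu0 = GradientBoostingRegressor().fit(X0, Y0)
mu1 = GradientBoostingRegressor().fit(X1, Y1)

# Step 2: imputed individual treatment effects
D1 = Y1 - mu0.predict(X1)   # treated: actual minus imputed control outcome
D0 = mu1.predict(X0) - Y0   # control: imputed treated outcome minus actual

# Step 3: model the imputed effects (M_3, M_4) and blend by propensity
tau0 = GradientBoostingRegressor().fit(X0, D0)
tau1 = GradientBoostingRegressor().fit(X1, D1)
g = W.mean()                # constant propensity in a randomized experiment
tau_hat = g * tau0.predict(X) + (1 - g) * tau1.predict(X)
```

Each sample gets both \(\hat{\tau}_0\) and \(\hat{\tau}_1\) predictions; the weighting lets the model trained on the larger group dominate when assignment is unbalanced.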

## Comparison of methods

The paper evaluates the S-, T-, and X-learners on several simulated scenarios: unbalanced treatment assignment, complex linear CATE, complex nonlinear CATE, zero HTE with a globally linear response, and zero HTE with a locally linear response.

In short, the X-learner performs best when the treatment effect is large, while the S-learner and X-learner perform similarly when the treatment effect is small.

Interested in other HTE models? See:

- Spring of causal inference: a GitHub collection of practical HTE papers
- HTE models for A/B-test audience targeting, part 1 – causal tree
- HTE models for A/B-test audience targeting, part 2 – causal tree with trigger
- HTE models for A/B-test audience targeting, part 4 – double machine learning

Comments welcome!

### Reference materials & open source code

- Tian L, Alizadeh AA, Gentles AJ, Tibshirani R (2014) A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association 109(508):1517–1532.
- Powers S, et al. (2017) Some methods for heterogeneous treatment effect estimation in high dimensions. arXiv preprint arXiv:1707.00102.
- Microsoft's open-source causal inference library
- https://github.com/JasonBenn/deep-learning-paper-notes/blob/master/papers/meta-learners-for-estimating-heterogeneous-treatment-effects-using-machine-learning.md