The purpose of heterogeneous treatment effect (HTE) estimation is to quantify how an experiment's effect differs across sub-populations, so that follow-up experiments can be targeted, or the intervention adjusted, with audience- or value-based strategies. Double machine learning (DML) treats the treatment as a feature and estimates the experiment's differential effect by estimating that feature's influence on the target.

Machine learning excels at accurate prediction, while economics cares more about unbiased estimation of a feature's effect on the target. DML combines the two: within an econometric framework, it delivers an unbiased estimate of a feature's influence on the target using any ML model.

Other HTE methods can be found in the companion post collecting practical HTE papers and their GitHub implementations.

## Core paper

V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey. Double Machine Learning for Treatment and Causal Parameters. arXiv e-prints.

## Background

The HTE problem can be abstracted with the following notation:

- Y is the core metric the experiment influences
- T is the treatment, usually a 0/1 variable indicating whether a sample is in the experimental or control group; random assignment implies \(T \perp X\)
- X is the confounder set: user features not affected by the experiment, usually a high-dimensional vector
- DML ultimately estimates \(\theta(x)\), i.e. how the experiment's effect on the core metric varies across users

\[
\begin{align}
Y &= \theta(x) T + g(X) + \epsilon &\text{where } E(\epsilon|T,X) = 0 \\
T &= f(X) + \eta &\text{where } E(\eta|X) = 0
\end{align}
\]

The most direct approach is to model Y on X and T jointly and read off \(\theta(x)\) directly. But the \(\theta(x)\) estimated this way is often biased: partly from regularization and overfitting of the ML model, and partly from the bias of the \(\hat{g}(X)\) estimate. Letting \(\theta_0\) denote the true parameter value, the deviation decomposes as

\[
\sqrt{n}(\hat{\theta} - \theta_0) = \left(\frac{1}{n}\sum_i T_i^2\right)^{-1} \frac{1}{\sqrt{n}}\sum_i T_i \epsilon_i + \left(\frac{1}{n}\sum_i T_i^2\right)^{-1} \frac{1}{\sqrt{n}}\sum_i T_i \big(g(X_i) - \hat{g}(X_i)\big)
\]

## DML model

DML proceeds in the following three steps.

#### Step 1: Fit Y and T with arbitrary ML models to obtain the residuals \(\tilde{Y},\tilde{T}\)

\[
\begin{align}
\tilde{Y} &= Y - l(x) &\text{ where } l(x) = E(Y|x)\\
\tilde{T} &= T - m(x) &\text{ where } m(x) = E(T|x)
\end{align}
\]
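As an illustration, here is a minimal Python sketch of Step 1 on simulated data. Everything in it is an assumption for the example: the data-generating process, the true effect of 2.0, and the gradient-boosting models standing in for "any ML model". Out-of-fold predictions are used so the residuals are not contaminated by overfitting, which anticipates Step 3.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))                        # confounders / user features
T = rng.binomial(1, 0.5, size=n).astype(float)     # randomized 0/1 treatment
Y = 2.0 * T + X[:, 0] ** 2 + rng.normal(size=n)    # assumed true effect = 2.0

# l(x) = E(Y|x): out-of-fold regression predictions
l_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)

# m(x) = E(T|x): out-of-fold propensity predictions
m_hat = cross_val_predict(GradientBoostingClassifier(), X, T, cv=5,
                          method="predict_proba")[:, 1]

Y_res = Y - l_hat   # \tilde{Y}
T_res = T - m_hat   # \tilde{T}
```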

#### Step 2: Fit \(\hat{\theta}\) on \(\tilde{Y},\tilde{T}\) with an arbitrary ML model

\(\theta(x)\) can be a parametric or a non-parametric model, and a parametric model can be fitted directly. Since a non-parametric model only accepts an input and an output, the following transformation is needed: the model target becomes \(\frac{\tilde{Y}}{\tilde{T}}\) and the sample weight becomes \(\tilde{T}^2\).

\[
\begin{align}
& \tilde{Y} = \theta(x)\tilde{T} + \epsilon \\
& \arg\min_{\theta} E[(\tilde{Y} - \theta(x) \cdot \tilde{T} )^2]\\
& E[(\tilde{Y} - \theta(x) \cdot \tilde{T} )^2] = E\left[\tilde{T}^2\Big(\frac{\tilde{Y}}{\tilde{T}} - \theta(x)\Big)^2\right]
\end{align}
\]
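A self-contained sketch of this transformation, with simulated residuals from an assumed perfectly randomized experiment (so \(\tilde{T} = \pm 0.5\) and the division is always safe); the heterogeneous effect \(1 + x\) and the GBDT learner are assumptions for the example:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(-1, 1, size=(n, 1))
theta_true = 1.0 + x[:, 0]                    # assumed heterogeneous effect
T_res = rng.binomial(1, 0.5, n) - 0.5         # \tilde{T}: +-0.5 when randomized
Y_res = theta_true * T_res + 0.1 * rng.normal(size=n)   # \tilde{Y}

pseudo_target = Y_res / T_res                 # new target \tilde{Y} / \tilde{T}
weights = T_res ** 2                          # sample weight \tilde{T}^2

model = GradientBoostingRegressor()
model.fit(x, pseudo_target, sample_weight=weights)
theta_hat = model.predict(x)                  # estimated theta(x)
```

Any learner that accepts sample weights works in place of the GBDT here.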

#### Step 3: Cross fitting

Cross fitting is the key step by which DML ensures an unbiased estimate; it reduces the estimation bias caused by overfitting. First split the full sample into two halves, sample 1 and sample 2. Fit the residual models on sample 1 and estimate \(\hat{\theta}^1\) on sample 2; then fit the residual models on sample 2 and estimate \(\hat{\theta}^2\) on sample 1; the final estimate is the average of the two. K-fold splitting can of course be used to further increase the robustness of the estimate.

\[
\begin{align}
\text{sample}_1, \text{sample}_2 &= \text{sample\_split} \\
\theta &= \frac{1}{2}\left(\hat{\theta}^1 + \hat{\theta}^2\right)
\end{align}
\]
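The three steps combine into a short two-fold cross-fitting sketch. The simulated data, the assumed constant \(\theta_0 = 1.5\), the confounded treatment, and the GBDT nuisance models are all illustrative choices:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 3))
T = (X[:, 0] + rng.normal(size=n) > 0).astype(float)   # T depends on X
Y = 1.5 * T + X[:, 0] + rng.normal(size=n)             # assumed theta0 = 1.5

def theta_on(train, test):
    """Fit both nuisance models on `train`, estimate theta on `test`."""
    l = GradientBoostingRegressor().fit(X[train], Y[train])  # l(x) = E(Y|x)
    m = GradientBoostingRegressor().fit(X[train], T[train])  # m(x) = E(T|x)
    y_res = Y[test] - l.predict(X[test])
    t_res = T[test] - m.predict(X[test])
    return np.sum(y_res * t_res) / np.sum(t_res ** 2)

idx = rng.permutation(n)
half1, half2 = idx[: n // 2], idx[n // 2:]

# swap the roles of the two halves, then average
theta_hat = 0.5 * (theta_on(half1, half2) + theta_on(half2, half1))
```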

In his blog post (linked in the references below), Jonas Vetterle compares the estimates obtained without DML, with DML but without cross fitting, and with DML plus cross fitting.

### From the perspective of propensity

Recently I came up with a more intuitive way to understand DML than the GMM view below, which I'd like to share. For ease of exposition, let's make some simplifying assumptions.

If the sample is perfectly randomized even within the high-dimensional feature space, then the first-step prediction of T is 0.5 everywhere, so the residual \(\tilde{T}\) is 0.5 in the experimental group and -0.5 in the control group.

In the first-step prediction of Y (assume a GBDT fit), each leaf node k predicts \(0.5(\mu_{cmp,k} + \mu_{exp,k})\). Assume there is no HTE within a leaf node, i.e. the experiment has the same effect on every experimental-group sample in the leaf. Then the residual is \(0.5(\mu_{exp,k} - \mu_{cmp,k})\) for experimental-group samples and \(0.5(\mu_{cmp,k} - \mu_{exp,k})\) for control-group samples, opposite numbers of each other. So when fitting \(\tilde{Y}\) on \(\tilde{T}\), the signs cancel and what you get is \(\mu_{exp,k} - \mu_{cmp,k}\).

In a randomized A/B experiment the prediction of T hovers around 0.5 but is generally not exactly 0.5: the sample is finite, so slicing it by high-dimensional features produces some imbalance. Suppose a leaf node predicts T as 0.6; then \(\tilde{T} = 0.4\) in the experimental group and \(\tilde{T} = -0.6\) in the control group. This also means that 60% of the samples in this leaf are in the experimental group and 40% in the control group. Under the no-HTE assumption, the prediction of Y becomes \(0.6\mu_{exp,k} + 0.4\mu_{cmp,k}\), the residual of the experimental group is \(0.4(\mu_{exp,k} - \mu_{cmp,k})\), and that of the control group is \(0.6(\mu_{cmp,k} - \mu_{exp,k})\). So the Step 2 target \(\frac{\tilde{Y}}{\tilde{T}}\) makes sense: it recovers \(\mu_{exp,k} - \mu_{cmp,k}\) in both groups. The sample weight is also consistent with the propensity logic: a prediction close to 0.5 means the estimated HTE is close to the true HTE, while a prediction far from 0.5 means higher estimation bias for that sample, hence a lower weight.
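The leaf-node arithmetic can be checked numerically; the group means and the 0.6 propensity below are made-up values for illustration:

```python
# Within one leaf: assumed group means and treated share (all illustrative)
mu_exp, mu_cmp, p = 5.0, 3.0, 0.6      # mu_exp, mu_cmp, P(T=1) in the leaf

y_hat = p * mu_exp + (1 - p) * mu_cmp  # leaf prediction for Y
t_res_exp, t_res_cmp = 1 - p, -p       # \tilde{T} for each group
y_res_exp = mu_exp - y_hat             # = (1-p) * (mu_exp - mu_cmp)
y_res_cmp = mu_cmp - y_hat             # = p * (mu_cmp - mu_exp)

# \tilde{Y} / \tilde{T} recovers mu_exp - mu_cmp = 2 in both groups
print(y_res_exp / t_res_exp, y_res_cmp / t_res_cmp)
```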

### From the perspective of GMM

The generalized method of moments (GMM) is widely used in economics. When I first saw the moment condition in this paper I puzzled over it for quite a while, so here is a brief review of GMM.

What is moment estimation? Simply put, the parameters of the population distribution are estimated from the sample moments \(E((x-a)^k)\): the first moment is the mean, and the second central moment is the variance. A few examples:

For example, suppose the population follows \(N(\mu, \sigma^2)\). There are two parameters to estimate, so two equations are needed to solve two unknowns: the first-moment condition \(\frac{1}{n}\sum{x_i}-\mu=0\) and the second-moment condition \(\frac{1}{n}\sum{x_i^2} - \mu^2 - \sigma^2=0\).

Another example is OLS: \(Y=\beta X\) can be solved by least squares, \(\arg\min (Y-\beta X)^2\), but it can also be solved from the moment condition \(E(X(Y-\beta X))=0\). In fact, least squares is just a special case of GMM.
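A quick numeric check that the two solutions agree, on toy data with an assumed slope of 2.5 and no intercept:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 2.5 * x + rng.normal(size=1000)    # assumed slope 2.5, no intercept

# Moment condition: solve sum x_i (y_i - beta * x_i) = 0 for beta
beta_mom = np.sum(x * y) / np.sum(x * x)
# Least squares for the same no-intercept model
beta_ols = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

assert np.isclose(beta_mom, beta_ols)  # the two solutions coincide
```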

So which moment condition should we choose to estimate \(\theta\) in the HTE problem?

The moment condition for directly estimating \(\theta\) is as follows:

\(E(T(Y-T\theta_0-\hat{g}_0(x)))=0\)

The moment condition of DML, based on the residuals, is as follows:

\(E([(Y-E(Y|X))-(T-E(T|X))\theta_0](T-E(T|X)))=0\)

The authors point out that the DML moment condition satisfies the Neyman orthogonality condition: even if the estimate of \(g(x)\) is biased, an unbiased estimate of \(\theta\) can still be obtained.
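A toy simulation can make this concrete. Everything below is assumed for illustration: \(\theta_0 = 2\), \(g(x) = x^2\), \(f(x) = x\), a noiseless outcome, and a deliberate constant bias \(b\) injected into the nuisance functions. The naive moment's error is first-order in \(b\), while the error of the orthogonal (residual-based) moment is second-order:

```python
import numpy as np

rng = np.random.default_rng(4)
n, theta0, b = 200_000, 2.0, 0.1      # b = deliberate constant nuisance bias

x = rng.uniform(0.5, 1.5, n)
t = x + rng.normal(size=n)            # T = f(X) + eta, with f(x) = x
y = theta0 * t + x ** 2               # Y = theta0*T + g(X), with g(x) = x^2

# Naive moment E[T(Y - T*theta - g_hat)] = 0, with biased g_hat = g + b
theta_naive = np.sum(t * (y - (x ** 2 + b))) / np.sum(t * t)

# DML moment on residuals, with the same bias b in both nuisances:
# l_hat = E(Y|x) + b = theta0*f(x) + g(x) + b, and m_hat = E(T|x) + b
y_res = y - (theta0 * x + x ** 2 + b)
t_res = t - (x + b)
theta_dml = np.sum(y_res * t_res) / np.sum(t_res * t_res)

print(abs(theta_naive - theta0))      # error is O(b): noticeably off
print(abs(theta_dml - theta0))        # error is O(b^2): much closer
```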

### Reference materials & open source code

- V. Chernozhukov, M. Goldman, V. Semenova, and M. Taddy. Orthogonal Machine Learning for Demand Estimation: High Dimensional Causal Inference in Dynamic Panels. ArXiv e-prints, December 2017.
- V. Chernozhukov, D. Nekipelov, V. Semenova, and V. Syrgkanis. Two-Stage Estimation with a High-Dimensional Second Stage. 2018.
- Microsoft's causal inference open-source library: EconML
- Double machine learning open-source code: mlinference
- https://www.linkedin.com/pulse/double-machine-learning-approximately-unbiased-jonas-vetterle/
- https://www.zhihu.com/question/41312883