Simpson’s paradox in Covid-19 case fatality rates: a mediation analysis of age-related causal effects
Authors: Julius von Kügelgen, Luigi Gresele, and Bernhard Schölkopf
IEEE Transactions on Artificial Intelligence, 2021; Max Planck Institute for Intelligent Systems, Germany, and University of Cambridge
Paper link: https://ieeexplore.ieee.org/abstract/document/9404149
Link to this article: https://www.cnblogs.com/zihaojun/p/15737080.html
0. Preface
This paper is the first to bring causal inference methods into COVID-19 research. Once the relevant data are available, its method can be applied to more complex data. The paper also offers a convenient and transparent causal framework for better understanding the mechanisms behind COVID-19 mortality.
I read this paper to see how a quantitative causal analysis is carried out, and how the total effect relates to the direct and indirect effects.
1. Problem background and research objectives
Beginning with the Wuhan outbreak in December 2019, the novel coronavirus spread rapidly around the world, causing hundreds of millions of infections and millions of deaths. Among COVID-19 statistics, the mortality rate of infected people (the case fatality rate) is an important indicator. Because mortality is strongly age-related, it is often studied by age group. However, grouped statistics can produce apparent paradoxes, such as the Simpson's paradox in the Chinese and Italian mortality data analyzed in this paper: China has a higher mortality rate than Italy in every age group, but a lower overall mortality rate.
This paper uses causal inference to study the relationship between country, COVID-19 mortality, and age distribution, in particular the indirect effect on mortality of the age of infected people acting as a mediator. The results can support policy making and lay a foundation for later studies on more complex data.
2. Simpson's paradox in Chinese and Italian mortality data
When comparing mortality data from China and Italy, Italy has a lower mortality rate than China in every age group, but a higher overall mortality rate. This is shown on the left of Fig. 1, where the blue bars represent China and the orange bars represent Italy.
This phenomenon is known as Simpson's paradox: the side that wins every within-group comparison can still lose the overall comparison.
Simpson's paradox arises here because, when we compare mortality rates within age groups, we ignore the difference in the age distribution of infections between the two countries. As shown on the right of Fig. 1, the elderly make up a large proportion of the infected people in Italy, and mortality among the elderly is relatively high; in China, most infected people are young or middle-aged, and mortality in these groups is relatively low. As a result, the overall mortality rate in Italy is much higher.
Similar phenomena include:
 In the 1910 tuberculosis mortality data for New York and Richmond, New York's overall mortality rate was lower, but when the population is divided by race, the tuberculosis mortality rate of each race was higher in New York than in Richmond.
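The reversal described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch with made-up counts (the names `country_a`, `country_b` and all numbers are hypothetical, not the paper's data): country B has the lower mortality rate in every age group, yet the higher overall rate, because most of its cases are old.

```python
# Hypothetical counts chosen only to illustrate the reversal; NOT the paper's data.
# Each entry maps an age group to (confirmed cases, deaths).
country_a = {"young": (900, 9), "old": (100, 20)}    # mostly young cases
country_b = {"young": (200, 1), "old": (800, 120)}   # mostly old cases


def cfr_by_group(data):
    """Mortality rate of infected people within each age group."""
    return {group: deaths / cases for group, (cases, deaths) in data.items()}


def overall_cfr(data):
    """Mortality rate of infected people pooled over all age groups."""
    total_cases = sum(cases for cases, _ in data.values())
    total_deaths = sum(deaths for _, deaths in data.values())
    return total_deaths / total_cases


rates_a, rates_b = cfr_by_group(country_a), cfr_by_group(country_b)
for group in rates_a:
    print(group, rates_a[group], rates_b[group])
print("overall", overall_cfr(country_a), overall_cfr(country_b))
```

The within-group rates only compare like with like; the pooled rate additionally weights each group by its share of cases, which is where the two countries differ.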
3. Causal Modeling of COVID-19 Mortality
Statistics alone can only find correlations between variables, and correlation is not causation. Moreover, statistics lacks a language for expressing and proving causal claims. Put another way, the same data set can be explained by different causal models. Human domain knowledge must therefore be introduced to understand the data, by building a causal model.
3.1 Variables in a causal model
In this article, we introduce the following three variables:
 Country (C)
 Age group (A)
 Fatality (F)
3.2 Data-generating model and causal graph
This paper only models the mortality of infected persons; it does not model the infection process.
The causal graph has the following edges:
 \(C \rightarrow A\): the country affects the age distribution of infected people
   Population age structure and social conditions vary from country to country
   Epidemic prevention policies affect people of different ages differently
 \(A \rightarrow F\): the age of an infected person affects their mortality
 \(C \rightarrow F\): the mortality of infected people varies from country to country
   Medical conditions vary, such as the number and price of beds and ventilators
   Vaccination rates vary
   Acceptance of modern medicine varies
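The three edges listed above can be written down as a small adjacency structure. This is only an illustrative sketch (the names `causal_graph` and `directed_paths` are mine, not the paper's); enumerating the directed paths from C to F makes the two routes explicit: the direct path, and the indirect path mediated by age.

```python
# Causal graph from Section 3.2 as an adjacency dict: node -> list of children.
causal_graph = {
    "C": ["A", "F"],  # country affects the age distribution and fatality
    "A": ["F"],       # age affects fatality
    "F": [],          # fatality has no children
}


def directed_paths(graph, source, target, prefix=()):
    """Return all directed paths from source to target as tuples of nodes."""
    path = prefix + (source,)
    if source == target:
        return [path]
    paths = []
    for child in graph[source]:
        paths.extend(directed_paths(graph, child, target, path))
    return paths


paths = directed_paths(causal_graph, "C", "F")
print(paths)  # [('C', 'A', 'F'), ('C', 'F')]
```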
4. Total, direct, and indirect causal effects on COVID-19 mortality
The theory in this part comes from an article Pearl published in 2001; see [Classic Paper on Causal Inference] Direct and Indirect Effects, Judea Pearl. For the basics of causal inference, see Causal Inference in Statistics: A Primer by Judea Pearl. I might later write up notes on Imbens and Rubin's Causal Inference for Statistics, Social, and Biomedical Sciences.
[Notation]
 T: the treatment; in this paper, the country (C in Section 3).
 X: the mediator; in this paper, the age of the infected person (A in Section 3).
 Y: the outcome; in this paper, whether the infected person dies of COVID-19 (F in Section 3).
4.1 Total Causal Effect (TCE)
A question about the total causal effect:
 \(Q_{TCE}\): If the country is changed from China to Italy, what happens to the COVID-19 mortality rate?
[Definition 1] (TCE) The total causal effect of a binary variable T on Y is defined as:
\[
\operatorname{TCE}_{0 \rightarrow 1} = \mathbb{E}_{Y \mid do(T=1)}[Y \mid do(T=1)] - \mathbb{E}_{Y \mid do(T=0)}[Y \mid do(T=0)] \tag{1}
\]
 The total causal effect of T on Y is defined as the difference between the expected outcomes under the two interventions.
4.2 "Why?" Mediation Analysis of COVID-19 Mortality
We are not satisfied with knowing the total difference between the two countries; we are more interested in the reasons for it. As analyzed above, the age distribution of infected people is an important factor in mortality, but governments have limited means of controlling it. We therefore want to separate the difference in mortality caused by the age distribution of infected people from the difference caused by other factors.
From a causal inference perspective, this is to separate direct causal effects from indirect causal effects.
4.3 Controlled Direct Effect (CDE)
The controlled direct effect is obtained by intervening on the mediator, which blocks the mediated causal path and leaves only the direct effect.
A question about the controlled direct effect:
 \(Q_{CDE(50-59)}\): For 50-59-year-olds, is it safer to contract COVID-19 in China or in Italy?
   This is equivalent to fixing the mediator (age) at 50-59
[Definition 2] (CDE) With the mediator fixed at X=x, the controlled direct effect of the binary variable T on Y is:
\[
\operatorname{CDE}_{0 \rightarrow 1}(x) = \mathbb{E}[Y \mid do(T=1, X=x)] - \mathbb{E}[Y \mid do(T=0, X=x)] \tag{2}
\]
In the controlled direct effect, the value of the mediator is set artificially and does not represent the whole population. We are more interested in the difference between the two countries under the actual age distribution of infected people, that is, in the natural effects.
4.4 Natural Direct Effect (NDE)
The natural direct effect asks how Y changes when the treatment is applied while the mediator is kept at the value it would have had without the treatment.
A question about the natural direct effect:
 \(Q_{NDE}\): If the age distribution of infected people in Italy were the same as in China, would the mortality rate in Italy be higher or lower than in China? (a comparison between the two countries)
[Definition 3] (NDE) With mediator X, the natural direct effect of the binary variable T on Y is:
\[
\operatorname{NDE}_{0 \rightarrow 1} = \mathbb{E}[Y_{X(0)} \mid do(T=1)] - \mathbb{E}[Y \mid do(T=0)] \tag{3}
\]
Here \(X(0)\) denotes the value (distribution) that X takes when T=0.
4.5 Natural Indirect Effect (NIE)
The natural indirect effect asks how Y changes if the mediator takes the value it would have after treatment, while the treatment itself is not applied.
A question about the natural indirect effect:
 \(Q_{NIE}\): If the age distribution of infected people in China became that of Italy, what would happen to the COVID-19 mortality rate in China? (a comparison of China with itself)
[Definition 4] (NIE) With mediator X, the natural indirect effect of the binary variable T on Y is:
\[
\operatorname{NIE}_{0 \rightarrow 1} = \mathbb{E}[Y_{X(1)} \mid do(T=0)] - \mathbb{E}[Y \mid do(T=0)] \tag{4}
\]
4.6 Mediation Formulas
Under the causal graph assumed in this paper, the causal quantities in (1) to (4) can be expressed as the following statistical quantities:
\begin{align}
\operatorname{TCE}_{0\to1}^{\mathrm{obs}} &= \mathrm{E}[Y \mid T=1] - \mathrm{E}[Y \mid T=0] \tag{5} \\
\operatorname{CDE}_{0\to1}^{\mathrm{obs}}(x) &= \mathrm{E}[Y \mid T=1, X=x] - \mathrm{E}[Y \mid T=0, X=x] \tag{6} \\
\operatorname{NDE}_{0\to1}^{\mathrm{obs}} &= \sum_{x} P(X=x \mid T=0) \left( \mathrm{E}[Y \mid T=1, X=x] - \mathrm{E}[Y \mid T=0, X=x] \right) \tag{7} \\
\operatorname{NIE}_{0\to1}^{\mathrm{obs}} &= \sum_{x} \left( P(X=x \mid T=1) - P(X=x \mid T=0) \right) \mathrm{E}[Y \mid T=0, X=x] \tag{8}
\end{align}
The total, direct, and indirect causal effects can then be computed from observed data using the quantities in (5) to (8).
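As a rough illustration of how the mediation formulas of this section turn into arithmetic, here is a minimal sketch. The function name `mediation_effects`, its argument layout, and all numbers are my own assumptions for illustration (the inputs are hypothetical, not the paper's data); it takes the age distribution of cases and the per-age-group mortality rate for each of the two countries.

```python
def mediation_effects(p_x_t0, p_x_t1, ey_t0, ey_t1):
    """Observational TCE, CDE, NDE and NIE for the change T: 0 -> 1.

    p_x_t0, p_x_t1: dict x -> P(X=x | T=t), the age distribution of cases.
    ey_t0, ey_t1:   dict x -> E[Y | T=t, X=x], the mortality rate per age group.
    """
    groups = list(p_x_t0)
    # E[Y | T=t] via the law of total expectation over age groups.
    e_y_t0 = sum(p_x_t0[x] * ey_t0[x] for x in groups)
    e_y_t1 = sum(p_x_t1[x] * ey_t1[x] for x in groups)
    tce = e_y_t1 - e_y_t0
    # CDE is defined per age group x.
    cde = {x: ey_t1[x] - ey_t0[x] for x in groups}
    # NDE: switch the country, keep the age distribution of T=0.
    nde = sum(p_x_t0[x] * (ey_t1[x] - ey_t0[x]) for x in groups)
    # NIE: switch the age distribution, keep the per-age mortality of T=0.
    nie = sum((p_x_t1[x] - p_x_t0[x]) * ey_t0[x] for x in groups)
    return {"TCE": tce, "CDE": cde, "NDE": nde, "NIE": nie}


# Hypothetical inputs (T=0: mostly young cases, T=1: mostly old cases).
effects = mediation_effects(
    p_x_t0={"young": 0.9, "old": 0.1},
    p_x_t1={"young": 0.2, "old": 0.8},
    ey_t0={"young": 0.01, "old": 0.20},
    ey_t1={"young": 0.005, "old": 0.15},
)
print(effects)
```

With these made-up inputs the NDE is negative while the NIE and TCE are positive, and NDE + NIE does not equal TCE.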
4.7 Relationship among the total, natural direct, and natural indirect effects (TCE, NDE and NIE)
Can the total effect be decomposed into the natural direct effect and the natural indirect effect?
 In a linear model, the answer is yes.
 But most models, including the one in this paper, are nonlinear, and the direct and indirect effects are not independent of each other.
   For example, suppose a drug A (the treatment) needs to activate a certain protein (the mediator) in the body: drug A without the protein is ineffective, and the protein without drug A is also ineffective.
   In this case, both the natural direct effect and the natural indirect effect are 0, but the total effect is not 0.
   It is worth mentioning that the controlled direct effect may be nonzero, because it intervenes on the amount of protein directly (in practice there may be no means of doing so).
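The drug and protein example above can be checked with a tiny deterministic model. This is a sketch under my own simplifying assumptions (binary variables, the protein is active exactly when the drug is given, and recovery requires both); the function names are hypothetical.

```python
def protein(drug):
    """X(t): the protein is active (1) only if the drug is given (assumption)."""
    return drug


def recovery(drug, protein_active):
    """Y(t, x): recovery needs both the drug and the active protein (assumption)."""
    return drug * protein_active


# Total effect: give the drug and let the protein respond naturally.
tce = recovery(1, protein(1)) - recovery(0, protein(0))
# Natural direct effect: give the drug, hold the protein at its untreated value.
nde = recovery(1, protein(0)) - recovery(0, protein(0))
# Natural indirect effect: no drug, set the protein to its treated value.
nie = recovery(0, protein(1)) - recovery(0, protein(0))
# Controlled direct effect with the protein forced to be active.
cde_active = recovery(1, 1) - recovery(0, 1)

print(tce, nde, nie, cde_active)  # 1 0 0 1
```

The total effect is entirely due to the interaction between treatment and mediator, so neither natural effect captures it on its own.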
5. Analysis of the mediating effect of age distribution between country and COVID-19 mortality
This part quantifies the total effect, the natural direct effect, and the natural indirect effect.
5.1 Dataset
The paper collects data on people infected with Covid-19 from 11 countries and from the Diamond Princess, including the number of infected people and the mortality rate by age group. The dataset contains 756,044 infected people and 68,508 deaths, an overall mortality rate of 9.06%.
5.2 Changes in causal effects over time
Using the formulas from Section 4, the paper computes how the causal effects on mortality of changing the country from China to Italy evolve over time (in weeks). During the study period, the number of cases and the mortality rates in China were relatively stable, so the changes are mainly due to changes in the situation in Italy.
 The total effect (TCE) gradually increases, indicating that the overall mortality rate in Italy gradually rises relative to China.
 The natural direct effect (NDE), that is, how Italy's mortality rate would compare with China's if its cases had China's age distribution, is negative at the beginning: with the effect of age distribution removed, Italy's mortality rate was initially lower than China's. From mid-March, however, the NDE became positive and gradually increased, at the same time that Italy's medical system was overloaded. The NDE did not stabilize until mid-April.
 The natural indirect effect (NIE), that is, what would happen to China's Covid-19 mortality rate if the age distribution of its cases changed to that of Italy, stayed at a relatively large positive value, between 3% and 3.5%.
In general, the contribution of the NIE to the TCE is relatively stable and consistently large, while the change of the TCE over time is mainly driven by the change of the NDE.
According to the paper, the Simpson's paradox in the Chinese and Italian COVID-19 mortality rates described in Section 2 arises because the NDE and the NIE had opposite signs in early March.
It is worth mentioning that \(NDE + NIE \neq TCE\).
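Although the two natural effects in the same direction do not add up to the total effect, a decomposition does hold if the indirect effect is taken in the reverse direction. The short derivation below is a sketch that assumes the mediation formulas of Section 4.6 apply, and writes \(\operatorname{NIE}_{1 \rightarrow 0}\) for the natural indirect effect of changing T from 1 to 0 (the formula for NIE with the roles of 0 and 1 swapped):

\[
\begin{aligned}
\operatorname{NDE}_{0 \rightarrow 1} - \operatorname{NIE}_{1 \rightarrow 0}
&= \sum_{x} P(X=x \mid T=0) \left( \mathrm{E}[Y \mid T=1, X=x] - \mathrm{E}[Y \mid T=0, X=x] \right) \\
&\quad - \sum_{x} \left( P(X=x \mid T=0) - P(X=x \mid T=1) \right) \mathrm{E}[Y \mid T=1, X=x] \\
&= \sum_{x} P(X=x \mid T=1)\, \mathrm{E}[Y \mid T=1, X=x] - \sum_{x} P(X=x \mid T=0)\, \mathrm{E}[Y \mid T=0, X=x] \\
&= \mathrm{E}[Y \mid T=1] - \mathrm{E}[Y \mid T=0] = \operatorname{TCE}_{0 \rightarrow 1}
\end{aligned}
\]

So \(\operatorname{TCE}_{0 \rightarrow 1} = \operatorname{NDE}_{0 \rightarrow 1} - \operatorname{NIE}_{1 \rightarrow 0}\), which reduces to \(NDE + NIE = TCE\) only in special cases such as linear models.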
5.3 Comparison between multiple countries
Calculating the NDE and NIE for each pair of countries gives the following figure:
Since this is a nonlinear model, \(NDE(t,t^*;Y) \neq NDE(t^*,t;Y)\) in general, and the same holds for the NIE.
 In terms of NDE, the Diamond Princess, China, Portugal, and South Africa performed better.
   The NDE can reflect the effectiveness of medical care and other measures in each country.
 In terms of NIE, South Africa, Colombia, and some other countries performed better, and the Diamond Princess was the worst.
   The NIE mainly reflects the influence of the age distribution of infected persons on mortality.
 A country's rankings on NDE and on NIE are only weakly correlated, indicating that a country's epidemic prevention measures have little to do with the age distribution of its infected persons.
 There is a strong correlation between a country's population age distribution and its NIE, indicating that countries have failed to introduce effective prevention measures targeted at different age groups.
 Of the 132 ordered pairs of countries, the NDE and NIE have opposite signs in 64, which the paper links to Simpson's paradox. This shows that looking only at each country's overall COVID-19 mortality rate is not enough and cannot reflect the effectiveness of national epidemic prevention measures; factors such as the country's population age structure should also be considered.
6. Limitations of this paper and future work

The causal graph used in this paper is still fairly coarse; more mediators could be introduced, such as vaccination coverage.

Different countries have different testing strategies, and different age groups may be tested at different rates because their symptoms differ in severity. Analyzing only confirmed cases may therefore introduce selection bias.

The paper only analyzes countries that have released the relevant data. These tend to be countries that were severely affected by COVID-19 and whose governments are able to collect and publish data, so the sample may not be representative.

The time lag between infection and death also affects the accuracy of the results.
Some thoughts
 I think the explanation of Simpson's paradox at the end of 5.2 should instead be that the signs of the NDE and the TCE differ.
   The negative NDE corresponds to the lower mortality rates in Italy across all age groups.
   The positive TCE corresponds to the higher overall mortality rate in Italy.
   The NIE is positive and relatively large, which is the main reason the TCE is positive.
   But if the NIE were positive yet too small to offset the negative NDE, the TCE could be negative, and there would be no Simpson's paradox. So I think Simpson's paradox should be explained by the NDE and the TCE having different signs.
 The first version of the paper was written in May 2020, so the amount of data collected is not very large.
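The first point above can be illustrated numerically. The sketch below uses made-up inputs (hypothetical numbers and function name, not the paper's data) in which the NDE is negative and the NIE is positive, yet the TCE is still negative, so the country with lower mortality in every age group also has lower overall mortality and no reversal occurs.

```python
def effects(p_x_t0, p_x_t1, ey_t0, ey_t1):
    """Observational TCE, NDE and NIE (mediation formulas) for T: 0 -> 1."""
    groups = list(p_x_t0)
    tce = (sum(p_x_t1[x] * ey_t1[x] for x in groups)
           - sum(p_x_t0[x] * ey_t0[x] for x in groups))
    nde = sum(p_x_t0[x] * (ey_t1[x] - ey_t0[x]) for x in groups)
    nie = sum((p_x_t1[x] - p_x_t0[x]) * ey_t0[x] for x in groups)
    return tce, nde, nie


# Hypothetical inputs: country 1 has lower mortality in each age group and
# only a slightly older case distribution than country 0.
tce, nde, nie = effects(
    p_x_t0={"young": 0.90, "old": 0.10},
    p_x_t1={"young": 0.85, "old": 0.15},
    ey_t0={"young": 0.01, "old": 0.20},
    ey_t1={"young": 0.005, "old": 0.15},
)
print("TCE:", tce, "NDE:", nde, "NIE:", nie)
print("NDE and NIE have opposite signs:", nde * nie < 0)
print("Reversal (NDE and TCE have opposite signs):", nde * tce < 0)
```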
References
[17] D. Mackenzie, “Race, COVID mortality, and Simpson’s paradox,” Retrieved: Jul. 6, 2020. [Online]. Available: http://causality.cs.ucla.edu/blog/index.php/2020/07/06/racecovidmortalityandsimpsonsparadoxbydanamackenzie/
[18] J. Pearl, “Direct and indirect effects,” in Proc. 17th Conf. Uncertainty Artif. Intell., 2001, pp. 411–420.
[55] J. Pearl et al., “External validity: From do-calculus to transportability across populations,” Statist. Sci., vol. 29, no. 4, pp. 579–595, 2014.
[56] E. Bareinboim and J. Pearl, “Causal inference and the data-fusion problem,” Proc. Nat. Acad. Sci. USA, vol. 113, no. 27, pp. 7345–7352, 2016.