Causal Inference Series Preface: Confounding, Collider, and Mediation Bias in Data Mining


Let's start with a preface. It took me more than half a month of intermittent, repeated reading to get through Judea Pearl's *The Book of Why*, and I found the relatively abstract first three chapters are easier to understand if you read the cases in Chapter 4 first. The urgent need for attribution in my work, plus two years of deeper study, make me hope causal inference will take off in the next few years. Its greatest advantage is being able to answer questions of fundamental importance to a real business, such as 'why' and 'what if'. I am still new to this field, so I can only offer some ideas for discussion.

Now it's time to test my salesmanship. If you run into the following problems when working with data, I recommend this book. It won't necessarily answer your questions, but it will at least help you understand the root of the problem:

  • How do you explain inconsistent or contradictory conclusions in data analysis? Why do grouped data and aggregated data give different results?
    E.g. a drug trial shows the drug is ineffective for both hypertensive and hypotensive patients — can it still be effective for patients overall?

  • Given that samples with feature \(X=x_1\) tend to show \(Y=y_1\), or that samples with \(Y=y_1\) tend to have feature \(X=x_1\): how do you calculate the effect of intervening on X on Y?
    E.g. users who watch short videos are more active — should we guide users to post comments in order to make them more active?

  • How do you select modeling features, and how do they ultimately affect y?
    I don't like the throw-everything-in approach to modeling, which not only increases model instability but also makes features harder to interpret. Especially in a business setting, what we really want to know is how different features affect y.

  • How do you approximate causal effects from observational data when an A/B experiment cannot be run?
    E.g. the most common cases are in sociology and medicine, such as the effect of military experience on income. This also reminds us that some expensive A/B experiments may actually have approximate answers hiding in existing data.

Here are a few differences between causal inference and statistics; we will expand on them one by one in the following chapters:

  • Statistics solves for \(P(Y|X)\), which is more a description of observations. Causal inference aims to answer 'what if' questions; expressed in do-calculus, it is \(P(Y|do(X))\) — intervene on X, then measure the effect on Y. A colleague joked that causal inference is like opening the eyes of God.

  • Statistics holds that the data is everything, while causal inference insists that the data-generating process is necessary for interpreting the data. If you want an intuitive feel for the difference, look at this toy example.

  • Statistics is completely objective, while causal inference relies on analysis and computation over a DAG built from experience and other prior knowledge.
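To make the \(P(Y|X)\) vs. \(P(Y|do(X))\) distinction concrete, here is a tiny numeric sketch of my own (all probabilities are made up, not from the book): with one binary confounder Z, "seeing" weights the strata by \(P(z|x)\), while the backdoor adjustment behind do() weights them by \(P(z)\).

```python
# Toy demo: one binary confounder Z affects both treatment X and outcome Y,
# so conditioning ("seeing") and intervening ("doing") give different numbers.
P_z = {0: 0.5, 1: 0.5}              # P(Z)
P_x_given_z = {0: 0.2, 1: 0.8}      # P(X=1 | Z=z): Z pushes people into treatment
P_y_given_xz = {                    # P(Y=1 | X=x, Z=z)
    (0, 0): 0.10, (0, 1): 0.50,
    (1, 0): 0.15, (1, 1): 0.55,
}

# Observational: P(Y=1 | X=1) = sum_z P(Y=1 | X=1, z) * P(z | X=1)
p_x1 = sum(P_x_given_z[z] * P_z[z] for z in (0, 1))
p_y_see = sum(P_y_given_xz[(1, z)] * P_x_given_z[z] * P_z[z] / p_x1 for z in (0, 1))

# Interventional (backdoor adjustment): P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, z) * P(z)
p_y_do = sum(P_y_given_xz[(1, z)] * P_z[z] for z in (0, 1))

print(f"P(Y=1 | X=1)     = {p_y_see:.3f}")   # inflated, because Z is over-represented
print(f"P(Y=1 | do(X=1)) = {p_y_do:.3f}")
```

Here the observed rate (0.47) overstates the interventional rate (0.35), because people with Z = 1 are both more likely to take the treatment and more likely to have the outcome anyway.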

What matters most in a preface? Grabbing attention! So in this chapter, through five classic cases in data analysis, we will see how causal inference transforms into Ultraman to fight monsters where statistics gets stuck!

The following cases are only meant to give an intuitive sense of the practical value of causal inference; statistical significance, small-sample uncertainty, etc. are not considered.

Confounding Bias – Simpson's Paradox


Confounding is very common in data analysis: a variable that affects both treatment and outcome is left uncontrolled. It is one of the root causes for controlling variables in statistical analysis, it is the logic behind why A/B experiments work, and it directly leads to \(P(Y|X) \neq P(Y|do(X))\). However, the existence of confounding is often only suspected when the analysis results are seriously illogical.

Discrete confounder – Case 1. Did you take your medicine today?

Here are the results of an observational medical experiment, giving the probability of heart attack for men and women, with and without taking the drug. Interestingly, the drug does not significantly reduce the incidence among women, nor among men, yet it does reduce the overall incidence. As an analyst, would you say this drug is useful?

The answer is no — this drug doesn't work.
This is the famous Simpson's paradox. Drawing the causal diagram (DAG) makes the result obvious. Here treatment is taking the drug and outcome is the probability of heart attack, and because the data is observational, sex may become a confounder. Note that I said may: to test this possibility, we check whether sex affects both treatment and outcome. First, treatment: women are 20 in the control group and 40 in the experimental group, while men are 40 in the control group and 20 in the experimental group, so sex significantly affects treatment uptake — the proportion taking the drug. Then, outcome: within the control group, the incidence rate is 5% for women and 30% for men, so sex also affects the outcome.

So to measure the impact of treatment on outcome, we need to control for the confounder. The sex-adjusted overall incidence is calculated as follows:
Control group: 0.5 * 5% + 0.5 * 30% = 17.5%
Experimental group: 0.5 * 7.5% + 0.5 * 40% = 23.75%
Now the overall conclusion is consistent with the per-sex conclusions: taking the drug does not reduce the risk of heart attack.
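The adjustment above is trivial to script. A minimal sketch — the stratum rates come from the case, and the 50/50 population sex split is my assumption based on the equal group sizes:

```python
# Per-stratum heart-attack incidence rates from Case 1, by sex.
control = {"female": 0.05,  "male": 0.30}
drug    = {"female": 0.075, "male": 0.40}

# Sex is a confounder, so we weight each stratum by its POPULATION share
# (assumed 50/50 here), not by its share within the drug/control groups.
p_sex = {"female": 0.5, "male": 0.5}

adj_control = sum(control[s] * p_sex[s] for s in p_sex)  # 0.5*5% + 0.5*30%  = 17.5%
adj_drug    = sum(drug[s]    * p_sex[s] for s in p_sex)  # 0.5*7.5% + 0.5*40% = 23.75%

# The drug raises incidence in both strata and, after adjustment, overall too.
print(f"adjusted control: {adj_control:.2%}, adjusted drug: {adj_drug:.2%}")
```

The naive overall comparison goes wrong precisely because it implicitly weights each sex by its share within the drug/control groups, which the confounding distorts.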

Continuous confounder – Case 2. Does exercise cause high cholesterol?

In the example above, the confounder (sex) was discrete. Now for an example with a continuous confounder. The goal of the study is to determine the effect of weekly exercise time on cholesterol level. Most "effects" in statistics can only lean on correlation, so let's draw a scatter plot.
Huh?! The longer you exercise, the higher your cholesterol! The perfect excuse to hate sports and insist that life lies in stillness.

[Figure: scatter plot of weekly exercise time vs. cholesterol level]

Of course, at this point experienced analysts will jump out and say: control your variables! But this does not mean controlling every controllable population difference — only the confounder. The most intuitive confounder here is age: older people have higher cholesterol, and in this data they also exercise more, so age affects both treatment and outcome. After grouping by age, we find that exercise time and cholesterol level are negatively related within every age group.
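A quick simulation reproduces the sign flip. All coefficients below are my own toy assumptions: age pushes up both exercise and cholesterol, while exercise itself lowers cholesterol.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(20, 70, n)
exercise = 2 + 0.1 * age + rng.normal(0, 1, n)                      # older -> more exercise
cholesterol = 150 + 1.5 * age - 5 * exercise + rng.normal(0, 3, n)  # exercise lowers cholesterol

pooled_corr = np.corrcoef(exercise, cholesterol)[0, 1]
print(f"pooled corr: {pooled_corr:.2f}")            # positive: the misleading scatter plot

stratum_corrs = []
for lo in range(20, 70, 10):                        # stratify by decade of age
    mask = (age >= lo) & (age < lo + 10)
    r = np.corrcoef(exercise[mask], cholesterol[mask])[0, 1]
    stratum_corrs.append(r)
    print(f"age {lo}-{lo + 10}: corr = {r:.2f}")    # negative within every stratum
```

Pooled, the age effect dominates and the correlation comes out positive; within each age band, only the true negative exercise effect remains.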

[Figure: exercise time vs. cholesterol level, stratified by age group]

The next time you draw a conclusion from statistics, no matter how well the result matches your expectation (intuition, sixth sense, reasoning, experience), remember to think twice: have you missed a possible confounder?

Mediation Bias


Mediation bias most often arises from controlling a variable that should not be controlled, so that the measured impact is artificially weakened. In traditional statistics, where no causal reasoning is introduced, the principle of controlling every controllable variable often steps right into the mediation pit. Mediation analysis is also a direction of high practical value in the follow-up analysis of A/B experiments — I hope to get a chance to chat about it properly in the advanced A/B testing series.

More control is not always better – Case 3. Did you take your medicine again today?

Remember the heart-disease drug experiment above? There we concluded that the effect should be calculated separately for men and women, because sex confounds the drug's effect. Now change the grouping factor from sex to the patient's blood pressure, and I'll show you that grouped calculation is not always correct.

The data is the same as in Case 1, except that the grouping variable is now the patient's blood pressure.

A new assumption is added here: high blood pressure is known to be one of the causes of heart attack, and the drug in theory also lowers blood pressure. The doctors want to test the drug's effect on preventing and treating heart disease.

Because this is an observational experiment, the traditional view says we should control every controllable variable to keep the populations comparable. But given the assumption, combined with the data, we find that the proportion of high blood pressure among patients taking the drug drops significantly. Lowering blood pressure is thus a mediator through which the drug reduces heart attacks — that is, part of the drug's effect reduces the probability of heart attack by lowering blood pressure. The causal diagram is as follows:

[Figure: DAG — drug → blood pressure → heart attack, plus a direct drug → heart attack path]

In this case, grouping patients by blood pressure amounts to conditioning on the mediator — as if we artificially removed the blood-pressure-lowering pathway and only credited the drug with what remains — so the drug's effect is artificially underestimated. Here we should pool the data, and the pooled calculation shows the drug is effective against heart disease.

When analyzing observational data, not every variable should be controlled: variables lying on the causal path from treatment to outcome should not be controlled. Here, computing the overall effect directly is the right choice.
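To make the mediator trap concrete, here is a toy calculation. All probabilities are my own assumptions, merely chosen to match the story: the drug lowers blood pressure (the mediator) and also has a small direct protective effect.

```python
# Assumed probabilities, not real data.
P_highbp = {0: 0.6, 1: 0.3}            # P(high BP | drug=0/1): the drug lowers BP
P_attack = {                            # P(heart attack | drug, high BP)
    (0, 0): 0.10, (0, 1): 0.30,         # no drug
    (1, 0): 0.08, (1, 1): 0.25,         # drug: modest benefit within each BP stratum
}

def p_attack_total(drug):
    """Total effect: marginalize over the mediator, do NOT condition on it."""
    p_hb = P_highbp[drug]
    return P_attack[(drug, 1)] * p_hb + P_attack[(drug, 0)] * (1 - p_hb)

print(p_attack_total(0))                    # ≈ 0.220 without the drug
print(p_attack_total(1))                    # ≈ 0.131 with the drug
# Stratified ("controlled") comparisons only capture the direct effect:
print(P_attack[(0, 0)] - P_attack[(1, 0)])  # ≈ 0.02 benefit within normal BP
print(P_attack[(0, 1)] - P_attack[(1, 1)])  # ≈ 0.05 benefit within high BP
```

The total benefit (≈ 8.9 percentage points) is far larger than either within-stratum benefit, because most of the drug's effect flows through the mediator that stratification holds fixed.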


Collider Bias

The most intuitive impact of a collider is spurious correlation, which often shows up when analyzing a subset of samples: ignoring how the subset was selected produces some truly wonderful correlations.

Negative correlation – Case 4. Pregnant mothers should smoke?!

Interesting data emerged from a 1959 study of newborns:

  • Studies have shown that smoking by pregnant mothers can cause low birth weight
  • Previous studies have shown that the survival rate of newborns who are underweight (< 5.5 pounds) is significantly lower
  • The data showed that the survival rate of infants whose mothers smoked was significantly higher than that of infants whose mothers did not smoke among infants who were underweight (< 5.5 pounds)

This is starting to sound like a vindication of smoking…

Remember we said collider bias is most likely to occur when analyzing a subset of samples — and here the underweight newborns are obviously such a subset. Draw a simple causal diagram and the answer becomes obvious.

[Figure: DAG — maternal smoking and birth defects both cause low birth weight, which affects survival]

By looking only at the survival rate of underweight newborns, we step into the trap of conditioning on the collider "birth weight too low". Conditioning on a collider induces a negative relationship between its two otherwise unrelated causes. In short: both a birth defect and the mother's smoking can cause low birth weight, and the two explanations trade off against each other — once the mother is known to smoke, the probability that the baby has a defect drops. Since underweight caused by a congenital defect hurts survival far more than underweight caused by smoking, within this subset maternal smoking appears to raise the survival rate.
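The explain-away mechanism is easy to reproduce in a simulation. All probabilities below are my own toy assumptions, not the 1959 data: two independent causes of low birth weight, and conditioning on the collider makes them negatively related.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
smoking = rng.random(n) < 0.3
defect  = rng.random(n) < 0.1                            # independent of smoking
low_weight = smoking | defect | (rng.random(n) < 0.05)   # either cause, or chance

# Unconditionally, smoking tells us nothing about defects:
print(defect[smoking].mean(), defect[~smoking].mean())   # both ≈ 0.10

# Among low-weight babies only, smoking "explains away" the defect:
p_defect_smoke   = defect[low_weight &  smoking].mean()
p_defect_nosmoke = defect[low_weight & ~smoking].mean()
print(p_defect_smoke, p_defect_nosmoke)   # defect rate far lower when mother smokes
```

Among underweight babies of non-smokers, a defect is the main remaining explanation for the low weight, so their defect rate (and hence mortality) shoots up — which is exactly the selection effect that made smoking look protective.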

The DAG above is not complete — for example, maternal smoking may also directly cause neonatal defects, and so on. But at least the existence of the collider is very convincing here.

Positive correlation – Case 5. Is respiratory disease related to orthopedic disease?

Variables made spuriously correlated by a collider are usually negatively correlated, as in the example above; this is also known as the explain-away effect. Intuitively, both A and B lead to the collider, so once you condition on the collider, more A means less B. The following example, however, shows a collider producing a spurious positive relationship.


It's not hard to see that, in the general population, respiratory diseases have nothing to do with orthopedic diseases. But if we look only at inpatients, the probability that a respiratory-disease patient also has an orthopedic disease increases significantly — by more than three times!

[Figure: DAG — respiratory disease and orthopedic disease both lead to hospitalization]

The DAG of this case has the same shape — so why is the effect positive rather than negative? One explanation is that the probability of being hospitalized for respiratory disease alone, or for orthopedic disease alone, is very small; for the collider "hospitalization", the two diseases act as complements rather than substitutes. Patients with both diseases are far more likely to be hospitalized, so looking only at inpatients produces a spurious positive relationship.
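A quick simulation of the complementary-effect story — all hospitalization probabilities below are my own assumptions: each disease alone rarely sends you to hospital, but having both almost always does.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
resp  = rng.random(n) < 0.05     # respiratory disease
ortho = rng.random(n) < 0.05     # orthopedic disease, independent of resp

# Hospitalization: rare for either disease alone, very likely for both.
p_hosp = 0.005 + 0.015 * resp + 0.015 * ortho + 0.5 * (resp & ortho)
hosp = rng.random(n) < p_hosp

# Overall, the two diseases are independent (ratio ≈ 1):
ratio_all = ortho[resp].mean() / ortho[~resp].mean()
# Among inpatients, respiratory disease makes orthopedic disease far more likely:
ratio_hosp = ortho[hosp & resp].mean() / ortho[hosp & ~resp].mean()
print(f"whole population: {ratio_all:.2f}x, inpatients only: {ratio_hosp:.2f}x")
```

Conditioning on "hospitalized" selects heavily for the small both-diseases group, which is exactly the spurious positive association the inpatient data shows.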

The DAG above is not the only possibility either — some other condition might lead to hospitalization while raising the probability of both respiratory and orthopedic disease at the same time. In any case, the data above alone cannot settle it, so please be careful when analyzing subsets of samples.

So many cases shared in a mere preface — are you starting to question everything you thought you knew?!


  1. https://towardsdatascience.com/why-every-data-scientist-shall-read-the-book-of-why-by-judea-pearl-e2dad84b3f9d
  2. Judea Pearl and Dana Mackenzie, The Book of Why: The New Science of Cause and Effect