Statistical Science Series: Two Kinds of Bias in Data Analysis


Today, I’d like to introduce two common biases in data analysis: selection bias and survivorship bias.

1. Selection bias

Selection bias arises when the sample in a study is not selected at random, so that the conclusions drawn from it are skewed. It is a bias in the data introduced by how the subjects are chosen, often through subjective human choices.

Let’s look at an example of selection bias. Suppose a research institute wants to study the question: “Can hospitals make people healthier?” It randomly selects 100,000 people, measures their health, and divides them into two groups according to whether they have been to a hospital in the past year. The result: the group that has not been to a hospital in the past year is healthier than the group that has. Can we conclude that hospitals make people less healthy?

No. This is a typical result of selection bias. People who have not been to a hospital in the past year were probably healthier to begin with than those who have, since people mostly go to a hospital because they are ill. The measured difference reflects that starting gap and does not show that hospitals make people less healthy.

We should do our best to avoid this kind of bias in everyday analysis. A key test for selection bias is whether the two groups being compared are actually comparable, that is, whether they were similar before the factor under study came into play.
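The hospital example can be reproduced with a small simulation. This is only a sketch under made-up assumptions: the function names, the health scale, the visit probabilities, and the +5 benefit of a hospital visit are all hypothetical and chosen for illustration. It builds a population in which hospitals genuinely help, then compares the two groups naively and again within groups of similar starting health.

```python
import random
from statistics import mean


def simulate(n=100_000, treatment_effect=5.0, seed=0):
    """Return (baseline, visited, health) tuples for a hypothetical population."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        baseline = rng.gauss(70, 10)  # health before any hospital visit
        # Assumption: people in poorer health are far more likely to visit.
        p_visit = 0.8 if baseline < 65 else 0.1
        visited = rng.random() < p_visit
        # Assumption: a visit really does improve health by a fixed amount.
        health = baseline + (treatment_effect if visited else 0.0)
        people.append((baseline, visited, health))
    return people


def naive_gap(people):
    """Mean health of visitors minus mean health of non-visitors."""
    visited = [h for _, v, h in people if v]
    not_visited = [h for _, v, h in people if not v]
    return mean(visited) - mean(not_visited)


def matched_gap(people, bin_width=1.0):
    """Same gap, but computed only between people with similar baseline health."""
    bins = {}
    for baseline, visited, health in people:
        key = int(baseline // bin_width)
        bins.setdefault(key, ([], []))[0 if visited else 1].append(health)
    total, weight = 0.0, 0
    for visitors, others in bins.values():
        if visitors and others:  # skip bins that lack one of the groups
            size = len(visitors) + len(others)
            total += (mean(visitors) - mean(others)) * size
            weight += size
    return total / weight


if __name__ == "__main__":
    people = simulate()
    print("naive gap:  ", round(naive_gap(people), 2))
    print("matched gap:", round(matched_gap(people), 2))
```

Under these assumptions the naive gap should come out negative (visitors look less healthy), while the matched gap should land close to the +5 benefit built into the simulation. Comparing only comparable people is what removes the bias here. In real data the starting health is rarely observed this cleanly, so treat this as an illustration rather than a recipe.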

2. Survivorship bias

Survivorship bias means looking only at the results that passed some filter without noticing the filtering itself, and so overlooking the key information that was filtered out.

Let’s take a well-worn example. During World War II, to improve the protection of its combat aircraft, the U.S. military studied the planes that had taken part in combat. Most of the bullet holes on those aircraft were concentrated in the wings and tail, so analysts suggested reinforcing these most heavily damaged parts.

The statistician Abraham Wald came to the opposite conclusion. He pointed out that the planes in the study were the ones that had survived combat, so the hits they had taken were not fatal. Areas such as the cockpit and engines, which looked almost untouched, were in fact the most vulnerable: a plane hit there was likely to crash and never make it back to be counted. The aircraft we see have already been filtered, and the ones that crashed are missing from the data. This is survivorship bias.
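The same effect can be shown with a toy simulation. The section names, the lethality numbers, and the `fly_missions` function below are hypothetical values invented for this sketch and are not historical data. Every plane takes hits spread evenly over its sections, but a hit to the engine or cockpit is assumed to be much more likely to bring the plane down.

```python
import random
from collections import Counter

SECTIONS = ["wings", "tail", "fuselage", "engine", "cockpit"]

# Hypothetical chance that a single hit in each section downs the plane.
LETHALITY = {
    "wings": 0.05,
    "tail": 0.05,
    "fuselage": 0.10,
    "engine": 0.60,
    "cockpit": 0.50,
}


def fly_missions(n_planes=20_000, hits_per_plane=3, seed=1):
    """Return (hits on all planes, hits on returning planes) per section."""
    rng = random.Random(seed)
    all_hits, survivor_hits = Counter(), Counter()
    for _ in range(n_planes):
        # Hits land on every section with equal probability.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every single hit.
        survived = all(rng.random() > LETHALITY[s] for s in hits)
        all_hits.update(hits)
        if survived:
            survivor_hits.update(hits)
    return all_hits, survivor_hits


if __name__ == "__main__":
    all_hits, survivor_hits = fly_missions()
    for section in SECTIONS:
        print(f"{section:9s} all={all_hits[section]:6d} "
              f"returned={survivor_hits[section]:6d}")
```

Across all planes the hits are spread roughly evenly, but among the planes that return, wing and tail hits clearly outnumber engine and cockpit hits. An analyst who sees only the returned column would reinforce exactly the sections that matter least in this toy model.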

Another example shows up all the time on platforms such as Maimai and Zhihu. It can seem that everyone there earns a million a year and that you are the only one falling hopelessly behind. That impression is survivorship bias too: people who earn a million a year are the ones who post about it, while the many who do not are filtered out.

The same thing happens at work. You often hear complaints of all kinds, for example that your product is too expensive. Would simply cutting the price solve the problem? The people who really find your product too expensive may never complain to you at all. Someone who buys a thousand-yuan phone, for example, does not go to Apple’s official website to complain that the iPhone is too expensive.

3. Final thoughts

We often fall into these two traps without noticing, in data analysis and in everyday work. So how can we avoid them? Ask “why” more often. Biased conclusions like the ones above also come out of data analysis, so once an analysis yields a conclusion, keep asking: Why does this happen? Why did these planes come back? Why are these people the ones complaining about the price? If you can find the reason behind the data, you are much less likely to make these mistakes.

The two biases are similar but not the same. The former arises because our own selection of research subjects is flawed, while the latter arises because we see only what has passed a filter, or what others choose to show us. What they have in common is that we end up biased because we do not see the full picture of the data.