This article comes from the WeChat official account Data Kaleidoscope.

Article link: https://mp.weixin.qq.com/s/7u…


This causal inference series uses the DoWhy framework and is divided into two parts. The previous article is available from the original post. The directory structure is as follows.

# Part I

1. DoWhy causal inference framework

2. Data source and preprocessing

3. Data correlation exploration

# Part II

Causal inference implementation

1. Calculate the expected frequency and preliminarily judge the causality

2. Create a cause and effect diagram based on assumptions

3. Identify causal effects

4. Estimate causal effects

5. Refute the results

This article builds on the analysis in the previous article, where the source code for the data preprocessing can be found. This article focuses on implementing causal inference.

# Causal inference implementation

After data preprocessing and correlation analysis, we have preliminary results on how the variables correlate with one another, but we do not yet know whether any of these relationships are causal. That requires causal inference, which with the DoWhy framework proceeds in four steps: modeling, identification, estimation, and refutation.

## 1. Calculate the expected frequency and preliminarily judge the causality

The correlation analysis shows that customer cancellations are highly correlated with three factors: “parking space”, “total days of stay”, and “the reserved room type is different from the allocated room type”. Beyond these three, some factors are weakly correlated with cancellation, such as “reservation change” and “special requirements”.

Correlation does not necessarily imply causation, and figure 9-10 shows that the positive and negative samples in the data set are imbalanced, so a preliminary exploration of causality is needed here. For the variables “Cancel” (“is_canceled”) and “the reserved room type is different from the allocated room type” (“different_room_assigned”), we randomly draw 1,000 observations from the data set and count how many times the two variables take the same value, that is, the customer was assigned a different room type and canceled, or was assigned the reserved room type and did not cancel. We repeat this process 10,000 times and take the average. The implementation code is as follows.

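```
# `data` is the preprocessed DataFrame from Part I.
n_trials = 10000
counts_sum = 0
for _ in range(n_trials):
    # Draw 1,000 random observations (without replacement).
    rdf = data.sample(1000)
    # Count rows where both variables take the same value.
    counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
    counts_sum += counts_i
# Average count per trial.
counts_sum / n_trials
# Output: 517.9752
```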

In theory, this number should be about 50% of the observations, because when the hotel assigns a room type that differs from the reservation, the customer either cancels the reservation or accepts the change. If the number is close to 50% of the total number of observations, this gives a preliminary indication that there may be a causal relationship between the two variables.

The final expected frequency is about 518 out of 1,000, that is, the two variables take the same value in roughly 52% of the sampled observations, which is close to the 50% baseline.
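As a sanity check on the 50% baseline, the following sketch computes how often two binary variables would take the same value if they were statistically independent. The function name and the example rates are illustrative assumptions, not values taken from the hotel data set. Under independence the agreement probability is p·q + (1 − p)·(1 − q), which equals 0.5 only when at least one of the two rates is 0.5, so it may be worth recomputing the baseline from the actual marginal rates.

```python
def expected_agreement(p_cancel: float, p_different: float, sample_size: int = 1000) -> float:
    """Expected number of rows where two independent binary variables are equal."""
    both_one = p_cancel * p_different                # both flags are 1
    both_zero = (1 - p_cancel) * (1 - p_different)   # both flags are 0
    return sample_size * (both_one + both_zero)


# Hypothetical rates, for illustration only.
print(expected_agreement(0.5, 0.1))
print(expected_agreement(0.37, 0.12))
```

When one of the rates is exactly 0.5, the expected count is half the sample size whatever the other rate is; otherwise the baseline shifts away from 500.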

Booking changes, the variable “booking_changes”, are also one of the factors that can cause the assigned room type to differ from the reserved one, so it is important to remove the influence of this variable. To do so, we randomly draw 1,000 customers with zero booking changes, repeat the random test above 10,000 times, and take the average. The implementation code is as follows.

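```
# `data` is the preprocessed DataFrame from Part I.
n_trials = 10000
counts_sum = 0
for _ in range(n_trials):
    # Sample 1,000 customers who made no booking changes.
    rdf = data[data["booking_changes"] == 0].sample(1000)
    # Count rows where both variables take the same value.
    counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
    counts_sum += counts_i
# Average count per trial.
counts_sum / n_trials
# Output: 492.0499
```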

For customers with zero booking changes, the final expected frequency is about 492, roughly 50% of the sample, which is in line with expectations.

For customers who did change their booking, we likewise draw 1,000 customers and run the random test above 10,000 times. The implementation code is as follows.

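```
# `data` is the preprocessed DataFrame from Part I.
n_trials = 10000
counts_sum = 0
for _ in range(n_trials):
    # Sample 1,000 customers who changed their booking at least once.
    rdf = data[data["booking_changes"] > 0].sample(1000)
    # Count rows where both variables take the same value.
    counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
    counts_sum += counts_i
# Average count per trial.
counts_sum / n_trials
# Output: 663.4134
```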

For customers with more than zero booking changes, the final expected frequency is about 663, which differs markedly from the previous two results. This suggests that “booking changes” may be a confounding variable.
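The three experiments above repeat the same loop on different subsets. As a minimal sketch, the loop can be wrapped in one helper. The helper names, the tuple layout, and the synthetic records below are assumptions made for illustration, and the sketch uses only the standard library rather than the article's pandas `data` frame. Because each trial is a simple random sample, the simulated average converges to the sample size times the share of rows in which the two values agree, so the same number can also be computed directly without resampling.

```python
import random


def agreement_count(rows, sample_size=1000, n_trials=200, seed=0):
    """Average number of sampled rows whose two flags are equal."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        # Sampling without replacement, as DataFrame.sample does by default.
        sample = rng.sample(rows, sample_size)
        total += sum(1 for canceled, different in sample if canceled == different)
    return total / n_trials


def exact_agreement_count(rows, sample_size=1000):
    """Closed-form expectation: sample size times the population agreement share."""
    share = sum(1 for canceled, different in rows if canceled == different) / len(rows)
    return sample_size * share


# Synthetic (is_canceled, different_room_assigned) pairs, NOT the hotel data.
rng = random.Random(42)
rows = [(int(rng.random() < 0.4), int(rng.random() < 0.1)) for _ in range(20_000)]

simulated = agreement_count(rows)
exact = exact_agreement_count(rows)
print(f"simulated={simulated:.1f} exact={exact:.1f}")
```

With the article's frame, the pairs could be built with `list(zip(data["is_canceled"], data["different_room_assigned"]))`, after filtering on “booking_changes” for the two stratified runs.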

However, “booking changes” may not be the only confounding variable affecting customer cancellations. In that case, the DoWhy framework treats variables that are not explicitly specified as potential confounding variables.

## 2. Create a cause and effect diagram based on assumptions

Based on the exploration of expected frequencies and the data analyst's experience, we make the following assumptions about the relationships between variables.

- The “market_segment” field has two categories, “individual” and “travel agency”, and indicates the source of the hotel reservation. The reservation method affects the time between the customer's reservation and arrival at the hotel, that is, the “lead_time” field.

- Country, that is, the “country” field, refers to the destination country of the customer's trip. The tourism popularity of the destination country affects whether customers book hotels in advance, and therefore affects “lead_time”. At the same time, different countries have different eating habits, so the destination country is also related to the meal type, that is, the “meal” field.
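The assumptions listed so far can be written down as directed edges. The sketch below uses only the standard library to build a DOT-format graph string, the kind of specification that DoWhy's `CausalModel` can take through its `graph` argument. The constant and helper names are hypothetical, and only the edges stated above are included, so the graph is deliberately incomplete.

```python
# Edges taken from the assumptions above, written as (cause, effect).
ASSUMED_EDGES = [
    ("market_segment", "lead_time"),
    ("country", "lead_time"),
    ("country", "meal"),
]


def to_dot(edges):
    """Render (cause, effect) pairs as a DOT digraph string."""
    lines = [f"    {cause} -> {effect};" for cause, effect in edges]
    return "digraph {\n" + "\n".join(lines) + "\n}"


causal_graph = to_dot(ASSUMED_EDGES)
print(causal_graph)
```

Keeping the edges in a plain list makes each assumption easy to review and extend before the graph is handed to the modeling step.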

For the complete article, see Data Kaleidoscope: https://mp.weixin.qq.com/s/7u…