1 case background
At present, difference-in-differences and matching are common causal inference models in business analysis at companies such as Uber and Netflix, and both companies have built standalone analysis tools for these methods on their platforms. This article shows how the difference-in-differences (DID) method differs from ordinary regression when estimating the size of an effect.
Here is a simple economics case to explain the principle of the difference-in-differences (DID) method. According to U.S. federal regulations, when a state's workers' compensation program pays injured workers, benefits are set at a certain “compensation rate” (usually two-thirds of pre-injury wages), up to a certain maximum amount. For a rational decision-maker, the higher the compensation while disabled, the weaker the incentive to return to work.
In 1980, Kentucky raised the weekly maximum amount of work-related injury compensation, which lets us test whether a change in compensation has a significant effect on the decision to return to work. The main outcome variable we care about is log_duration (ldurat in the data), the logarithm of the length of time (in weeks) that compensation was received. We use the log because the variable is heavily skewed: most people are out of work for a few weeks, while some are out of work for a very long time. Raising the cap does not affect low-income workers, whose benefits were already below the old maximum, but it does affect high-income workers, so we use low-income workers as the control group and high-income workers as the treatment group.
The data set is included in the wooldridge R package. The key variables are:
- durat (renamed duration below): the duration of unemployment benefits, in weeks
- afchnge (renamed after_1980 below): indicator variable, whether the observation was recorded before (0) or after (1) the policy change in 1980; time variable: before / after
- highearn: indicator variable, whether the observation is a low (0) or high (1) earner; grouping variable: treatment / control
2 loading and cleaning data
First, download the dataset and load the related libraries
    library(tidyverse)     # ggplot(), %>%, mutate(), and friends
    library(broom)         # Convert models to data frames
    library(scales)        # Format numbers with functions like comma(), percent(), and dollar()
    library(modelsummary)  # Create side-by-side regression tables

    # Load the data.
    # It'd be a good idea to click on the "injury_raw" object in the Environment
    # panel in RStudio to see what the data looks like after you load it
    injury_raw <- read_csv("data/injury.csv")

    injury <- injury_raw %>%
      filter(ky == 1) %>%
      # The syntax for rename is `new_name = original_name`
      rename(duration = durat, log_duration = ldurat,
             after_1980 = afchnge)
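If data/injury.csv is not at hand, the same data can probably be taken straight from the wooldridge package mentioned above. The following is a minimal sketch, assuming the package is installed and that its bundled `injury` data frame carries the same column names (`durat`, `ldurat`, `afchnge`, `highearn`, `ky`) as the CSV used in this article:

```r
# Assumption: install.packages("wooldridge") has already been run
library(wooldridge)

# Load the bundled data frame into the workspace under the name `injury`
data("injury", package = "wooldridge")

# Store it under the name the rest of the article expects, so the
# filter()/rename() pipeline above can be run unchanged
injury_raw <- injury
```

After this, the `injury <- injury_raw %>% ...` step overwrites `injury` with the Kentucky-only, renamed version.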
3 exploratory data analysis
First, we can look at the distribution of the duration of unemployment benefits in the low- and high-income groups (the control group and the treatment group):
    ggplot(data = injury, aes(x = duration)) +
      # binwidth = 8 makes each column represent 2 months (8 weeks)
      # boundary = 0 makes it so the 0-8 bar starts at 0 and isn't -4 to 4
      geom_histogram(binwidth = 8, color = "white", boundary = 0) +
      facet_wrap(vars(highearn))
Most people in both groups receive 0-8 weeks of benefits (and a few receive more than 180 weeks of benefits! That is about 3.5 years!)
If we use the logarithm of duration instead, we get a less skewed distribution, which is better suited to a regression model:
    ggplot(data = injury, mapping = aes(x = log_duration)) +
      geom_histogram(binwidth = 0.5, color = "white", boundary = 0) +
      # Uncomment this line if you want to exponentiate the logged values on the
      # x-axis. Instead of showing 1, 2, 3, etc., it'll show e^1, e^2, e^3, etc. and
      # make the labels more human readable
      # scale_x_continuous(labels = trans_format("exp", format = round)) +
      facet_wrap(vars(highearn))
We should also examine the unemployment situation before and after the policy change
    ggplot(data = injury, mapping = aes(x = log_duration)) +
      geom_histogram(binwidth = 0.5, color = "white", boundary = 0) +
      facet_wrap(vars(after_1980))
The distributions look roughly normal, but it is hard to see any difference between the control group and the treatment group before and after treatment. We can plot the averages instead. A stat_summary() layer lets ggplot calculate summary statistics such as the mean. Here we only calculate the mean:
    ggplot(injury, aes(x = factor(highearn), y = log_duration)) +
      geom_point(size = 0.5, alpha = 0.2) +
      stat_summary(geom = "point", fun = "mean", size = 5, color = "red") +
      facet_wrap(vars(after_1980))
We can also plot the mean together with its 95% confidence interval:
    ggplot(injury, aes(x = factor(highearn), y = log_duration)) +
      stat_summary(geom = "pointrange", size = 1, color = "red",
                   fun.data = "mean_se", fun.args = list(mult = 1.96)) +
      facet_wrap(vars(after_1980))
You can start to see the classic difference-in-differences plot! It looks like high-income workers spent more time unemployed, on average, after 1980. You can also compute the group means yourself with group_by() and summarize() before sending the data to ggplot. The plot below assumes the result is stored in plot_data, with columns mean_duration, lower, and upper.
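The plotting code that follows uses a summarized data frame called `plot_data` that is not defined anywhere above. Here is a minimal sketch of how such a data frame could be built with group_by() and summarize(). The helper name `summarize_groups()` and the factor labels are assumptions made for illustration, and the 1.96 multiplier mirrors the 95% interval used in the previous plot:

```r
library(dplyr)

# Hypothetical helper: collapse the data to one row per group/period cell,
# with the mean of log_duration and a normal-approximation 95% interval
summarize_groups <- function(data) {
  data %>%
    # Turn the 0/1 indicators into labelled categories so the plot axis
    # and legend are readable (labels are illustrative)
    mutate(
      highearn = factor(highearn, levels = c(0, 1),
                        labels = c("Low earner", "High earner")),
      after_1980 = factor(after_1980, levels = c(0, 1),
                          labels = c("Before 1980", "After 1980"))
    ) %>%
    group_by(highearn, after_1980) %>%
    summarize(
      mean_duration = mean(log_duration),
      se_duration = sd(log_duration) / sqrt(n()),
      lower = mean_duration - 1.96 * se_duration,
      upper = mean_duration + 1.96 * se_duration,
      .groups = "drop"
    )
}
```

With the cleaned data from section 2, `plot_data <- summarize_groups(injury)` would then provide the object the plot expects.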
    ggplot(plot_data, aes(x = after_1980, y = mean_duration, color = highearn)) +
      geom_pointrange(aes(ymin = lower, ymax = upper), size = 1) +
      # The group = highearn here makes it so the lines go across categories
      geom_line(aes(group = highearn))
4 calculation principle of the difference-in-differences (DID) method
We can find the actual difference-in-differences estimate by filling in a 2×2 table of group means (treatment / control by before / after) and subtracting.
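Written as a formula, the 2×2 comparison is the before/after change in the treatment group minus the before/after change in the control group. The notation is introduced here for illustration: $\bar{y}$ stands for the mean of log_duration in the cell named by its subscript.

$$
\widehat{\text{DID}} = \left(\bar{y}_{\text{treatment, after}} - \bar{y}_{\text{treatment, before}}\right) - \left(\bar{y}_{\text{control, after}} - \bar{y}_{\text{control, before}}\right)
$$

The second bracket is the change that would have happened without the policy, under the assumption that both groups would otherwise have followed the same trend. Subtracting it removes that common change from the treatment group's change.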
    diffs <- injury %>%
      group_by(after_1980, highearn) %>%
      summarize(mean_duration = mean(log_duration),
                # Calculate average with regular duration too, just for fun
                mean_duration_for_humans = mean(duration))
    diffs
    ## # A tibble: 4 x 4
    ## # Groups: after_1980
    ##   after_1980 highearn mean_duration mean_duration_for_humans
    ##        <dbl>    <dbl>         <dbl>                    <dbl>
    ## 1          0        0          1.13                     6.27
    ## 2          0        1          1.38                    11.2
    ## 3          1        0          1.13                     7.04
    ## 4          1        1          1.58                    12.9

    before_treatment <- diffs %>%
      filter(after_1980 == 0, highearn == 1) %>%
      pull(mean_duration)

    before_control <- diffs %>%
      filter(after_1980 == 0, highearn == 0) %>%
      pull(mean_duration)

    after_treatment <- diffs %>%
      filter(after_1980 == 1, highearn == 1) %>%
      pull(mean_duration)

    after_control <- diffs %>%
      filter(after_1980 == 1, highearn == 0) %>%
      pull(mean_duration)

    diff_treatment_before_after <- after_treatment - before_treatment
    diff_control_before_after <- after_control - before_control
    diff_diff <- diff_treatment_before_after - diff_control_before_after
    diff_diff
    ## 0.19
The difference-in-differences estimate is 0.19. Because the outcome is measured in logs, this means that the policy increased the time spent unemployed by roughly 19%.
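Reading a log-scale difference directly as a percentage is an approximation that works best for small values. As a rough check, the exact conversion is exp(b) - 1. The helper name `log_points_to_percent()` below is made up for illustration:

```r
# Hypothetical helper: convert a difference in logs into an exact
# percentage change, using exp(b) - 1
log_points_to_percent <- function(b) {
  (exp(b) - 1) * 100
}

# The hand-calculated estimate of 0.19 corresponds to about a 20.9% increase
log_points_to_percent(0.19)
```

The gap between 19% and about 21% is small here, so the shorthand reading is reasonable, but the two drift apart as the coefficient grows.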
5 DID regression
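Doing the calculation by hand as above is tedious, so we can use regression instead! Remember, we need to include indicator variables for treatment / control and for pre- / post-1980, plus the interaction between the two:

$$
\log(\text{duration}) = \alpha + \beta \, \text{highearn} + \gamma \, \text{after\_1980} + \delta \, (\text{highearn} \times \text{after\_1980}) + \varepsilon
$$

The coefficient $\delta$ on the interaction term is the effect that we ultimately care about, namely the DID estimator.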
    model_small <- lm(log_duration ~ highearn + after_1980 + highearn * after_1980,
                      data = injury)
    tidy(model_small)
    ## # A tibble: 4 x 5
    ##   term                estimate std.error statistic   p.value
    ##   <chr>                  <dbl>     <dbl>     <dbl>     <dbl>
    ## 1 (Intercept)          1.13       0.0307    36.6   1.62e-263
    ## 2 highearn             0.256      0.0474     5.41  6.72e-8
    ## 3 after_1980           0.00766    0.0447     0.171 8.64e-1
    ## 4 highearn:after_1980  0.191      0.0685     2.78  5.42e-3
The coefficient on highearn:after_1980 is the same as the value we calculated by hand! It is statistically significant (p ≈ 0.005), so we can be fairly confident that the policy change lengthened the time spent unemployed for high-income workers relative to low-income workers.
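To get a feel for how precise the estimate is, we can turn the reported coefficient and standard error into an approximate 95% confidence interval. This is a sketch using the normal approximation and the rounded numbers printed in the output above, so the last digit may differ slightly from what R would report for the fitted model:

```r
estimate  <- 0.191   # highearn:after_1980 coefficient from the output above
std_error <- 0.0685  # its standard error from the output above

# Normal-approximation 95% interval: estimate plus or minus
# about 1.96 standard errors
ci <- estimate + c(-1, 1) * qnorm(0.975) * std_error
round(ci, 3)  # roughly 0.057 to 0.325
```

The interval excludes zero, which matches the small p-value. With the fitted model in hand, `confint(model_small)` gives the t-based version of the same interval.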
6 DID regression with control variables
One of the advantages of DID regression is that it can include control variables. For example, workers in construction or manufacturing tend to have longer claim periods than workers in other industries. We may also want to control for workers' demographic characteristics, such as gender, marital status, and age.
Let’s estimate the basic regression model with the following additional variables:
- hosp (1 = hospitalized)
- indust (1 = manufacturing, 2 = construction, 3 = other)
- injtype (1-8; categories for different types of injury)
- lprewage (log of the wage before the claim)
Tip: indust and injtype are stored as numbers (1-3 and 1-8) in the dataset, but they are actually categories. They must be treated as categories (factors) in R.
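To see why the conversion matters, here is a small sketch on made-up toy data (the numbers are invented purely for illustration). With a numeric code, lm() fits a single slope across the codes; with a factor, it fits a separate dummy variable for each non-reference category:

```r
# Toy data, invented for illustration: an outcome and a category coded 1-3
toy <- data.frame(
  y      = c(1.0, 1.2, 3.1, 2.9, 2.0, 2.2),
  indust = c(1, 1, 2, 2, 3, 3)
)

# Numeric code: one slope, as if category 2 sat "halfway" between 1 and 3
fit_numeric <- lm(y ~ indust, data = toy)

# Factor: one dummy each for categories 2 and 3, with category 1 as
# the reference level
fit_factor <- lm(y ~ as.factor(indust), data = toy)

names(coef(fit_numeric))  # "(Intercept)" "indust"
names(coef(fit_factor))   # "(Intercept)" "as.factor(indust)2" "as.factor(indust)3"
```

In the toy data the middle category has the highest outcome, which a single slope cannot capture but the factor version can.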
    # Convert industry and injury type to categories/factors
    injury_fixed <- injury %>%
      mutate(indust = as.factor(indust),
             injtype = as.factor(injtype))

    model_big <- lm(log_duration ~ highearn + after_1980 + highearn * after_1980 +
                      male + married + age + hosp + indust + injtype + lprewage,
                    data = injury_fixed)
    tidy(model_big)

    # Extract just the diff-in-diff estimate
    diff_diff_controls <- tidy(model_big) %>%
      filter(term == "highearn:after_1980") %>%
      pull(estimate)

    modelsummary(list("DID" = model_small, "DID+control" = model_big))
| |DID|DID+control|
|---|---|---|
|highearn × after\_1980|0.191***|0.169***|
* p \< 0.1, ** p \< 0.05, *** p \< 0.01
After controlling for a number of demographic and other factors, the difference-in-differences estimate becomes smaller (0.169), indicating that the policy caused roughly a 16.9% increase in the time spent unemployed after a workplace injury. The estimate is smaller because the other independent variables explain part of the variation in log_duration.