# Extension data tecdat: R language uses lme4 multi-level (mixed effect) generalized linear model (GLM) and logistic regression to analyze the data of education repetition survey

Time：2021-8-2

This tutorial provides readers with   The basic introduction of generalized linear model (GLM) of frequency school. Specifically, this tutorial focuses on the use of logistic regression in the case of binary results and count / proportion results, as well as the method of model evaluation. This tutorial uses educational data examples to apply the model. In addition, this tutorial briefly demonstrates the multi-level extension of GLM model with R. Finally, more distribution and link functions in GLM framework are discussed.

This tutorial contains the following structure.
1. Preparation.
2. Introduce GLM.
4. Data preparation.
5. Binary (Bernoulli) logistic regression.
6. Binomial logistic regression.
7. Multilevel logistic regression.
8. Other families and link functions.

This tutorial describes:
-Basic knowledge of hypothesis testing and statistical inference.
-Basic knowledge of regression.
-Basic knowledge of R language coding.
-Basic knowledge of drawing and data processing.

# Introduction to generalized linear model (GLM)

For the case where y is a continuous value, we can handle it in this way, but when y is a discrete value (such as count data, binary data, see wiki statistical data type), it is not appropriate for us to use the ordinary linear model. At this time, we refer to another model – the generalized linear models.

In order to obtain the GLM model, we list three conditions:

1. That is, y|x is an exponential family distribution, and the distribution form of the exponential family is:

2. If we judge Y’s hypothesis is  , then

3. The relationship between natural parameters and input x is linear:

We do not discuss the reasons for these three conditions. We only know that such assumptions are based on the choice of “design”, not necessity.

We take Poisson regression as an example, y obeys Poisson distribution  , in the form of exponential family, we can get。 therefore

Then is the process of maximum likelihood method.

# Education data

The data used in this tutorial is educational data.

The data comes from the national primary education survey. Each row in the data refers to a student. The result variable repetition is a dichotomous variable, which indicates whether a student has stayed in grade during primary education. The school variable represents a student’s school. Predictors at the individual level include.    Gender (0 = female, 1 = male) and pre-school education (pre-school education, 0 = none, 1 = yes). The school level is the school average SES (socio-economic status) score.

The main research questions this tutorial attempts to answer using educational data are.

Ignoring the structure of the data, what are the effects of gender and preschool education on whether students repeat grades?
Ignoring the data structure, what is the impact of school average SES on students’ repetition rate?
Considering the structure of the data, what are the effects of gender, preschool education and school average SES on whether students repeat grades?
These three questions are answered by the following models: binary logistic regression; Binomial logistic regression; Multilevel binary logistic regression.

# Data preparation

``````#   If you haven't already installed these packages, please use install. Packages ("package")_ Name ") command.
library(lme4)  #  For multilevel models
library(tidyverse)  #  For data processing and drawing``````

## Import data

``head(Edu)``

## data processing

``````Mutate (school)  =  Factor,
Gender  =  if_ Else (gender)  ==  0  " girl",  " boy"),
Gender  =  Factor (gender,   levels  =  c("girl",  " boy")),
Pre school education  =  if_ Else (pre-school education)  ==  0  " no",  " yes"),
Pre school education  =  Factor (pre-school education,   levels  =  c("no",  " yes")))``````

## Check for missing data

``  summarise_each((~sum(is.na(.))``

In the data, 1066 observations of economic status variables are missing. The processing of missing data itself is a complex topic. For convenience, we will simply delete cases with missing data in this tutorial.

# Binary logistic regression

## Exploratory data: number of repeats by gender and preschool education

``````group_ By (gender)  %>%
Summarize  =  Sum (have you left a level)``````

It seems that the number of repeat students varies greatly between men and women, and more male students repeat. More students who have not received preschool education repeat grades. This observation suggests that gender and preschool education may have a predictive effect on repetition.

# Constructing Binary Logistic regression model

R installs the basic package by default, including the GLM function running GLM. The parameters of GLM are similar to those of LM: formula and data. However, GLM requires an additional parameter: family, which specifies the hypothetical distribution of the result variables; In the family, we also need to specify the link function. The default value of family is Gaussian (link = “identity”), which leads to a linear model, equivalent to the model specified by LM. In the case of binary logistic regression, GLM requires us to specify a binomial distribution with logit link, that is, family = binomial (link = “logit”).

``````glm(formula ,

# explain

From the above summary output, we can see that gender has a positive and significant prediction of students’ repetition probability, while preschool education has a negative and significant prediction. Specifically, boys are more likely to repeat grades than girls. Students who have attended school before are unlikely to repeat grades.

In order to interpret the parameter estimates, we need to index the estimates.

Note that the interpretation of parameter estimation is related to probability rather than probability. The definition of odds is. P (event occurred) / P (event did not occur). In this analysis, assuming that everything else remains the same, boys have a 54% increased chance of repeating grades compared with girls; Compared with no preschool education, assuming everything else remains the same, having preschool education reduces the repetition rate of (1-0.54)% = 46%.

# Visualization of parametric effects

In order to make the interpretation of parameter effects easier, we can visualize parameter effects.

``plot(Effects)``

Note that in these two figures, the Y scale refers to the probability of repetition, not the probability. Probability is easier to explain than probability. The probability score of each variable is calculated by assuming that other variables in the model are constants and taking their average values. As we can see, assuming that a student has an average preschool education, he has a higher repetition probability (~ 0.16) ~ 0.11) as a boy than as a girl. Similarly, assuming that a student has an average gender, students with preschool education have a lower probability of repeating grades than students without preschool education (~ 0.11) (~ 0.18). Please note that the confidence intervals of the estimated values are also included in these two figures to give us some understanding of the uncertainty of the estimated values.

Note that the concepts of average preschool education and gender may sound strange because they are categorical variables (i.e. factors). If you are surprised by the idea of assuming an average factor, you can specify your expected factor level as a reference point.

``predictors  =  list(   Values = C (sex boy = 0,   Preschool education yes  =  0))``

Setting gender boy = 0 means that the reference level of gender variables is set to 0 in the preschool education effect map; Preschool education yes = 0 causes 0 to become the reference level of preschool education variables in the gender effect map.

Therefore, as shown in the above two figures, assuming that students have not received preschool education, the repetition probability of boys (~ 0.20) is higher than that of girls (~ 0.14); Assuming that the students are female, the repetition probability with preschool education (~ 0.09) is lower than that without preschool education (~ 0.15).

# Model evaluation: fit

There are different methods to evaluate the fitting degree of logistic regression model.

## Likelihood ratio test

If a logistic regression model shows an improvement in fitting degree compared with the model with fewer predictors, the model has a better fitting degree to the data. This is carried out by likelihood ratio test, which compares the likelihood of data under the complete model with that under the model with fewer predictors. Deleting predictive variables from a model almost always reduces the fit of the model (i.e. the log likelihood of the model is low), but it is useful to test whether the observed differences in model fit are statistically significant.

``````#Specify a model with only ` gender 'variables
#Use the \ ` anova() \ ` function to run the likelihood ratio test
anova(ModelTest, Model, test ="Chisq")``````

We can see that the model containing the predictors of gender and preschool education is much better than the model containing only gender variables. Note that this method can also be used to determine whether it is necessary to include a variable or a set of variables.

## AIC

Akaike information criterion (AIC) is another measure of model selection. Different from likelihood ratio test, AIC calculation should consider not only the fitting degree of the model, but also the simplicity of the model. In this way, AIC deals with the trade-off between the fit and complexity of the model, so over fitting is not encouraged. Smaller AIC is preferred.

When the AIC value is small, the model with both gender and preschool education predictors is better than the model with only gender predictors.

## Correct classification rate

The correct classification rate is another useful measure of the suitability of the model to the data.

``````#Use the \ ` predict() \ ` function to calculate the prediction probability of students in the original data from the fitted model
Pred <- if_else(Pred > 0.5, 1, 0)
ConfusionMatrix <- table(Pred, TRUE)
#Correct classification rate``````

We can see that the model correctly classifies 85.8% of all observations. However, careful observation shows that the model predicts that all observations belong to “0”, that is, all students are predicted not to repeat grades. Considering that most categories of repeating variables are 0 (no), the classification performance of the model is not better than simply assigning all observations to most categories 0 (no).

## AUC (area under curve)

An alternative to using the correct classification rate is the area under the curve (AUC) measurement. AUC measures discrimination, that is, the ability to correctly classify people with and without target response. In the current data, the target variable is repetition. We randomly selected a student from the “repeat” group and the “no repeat” group. Students with high prediction probability should be students in the “repeat” group. AUC is the percentage of pairs drawn at random. The rate of change in the program does not depend on the rate of change in the AUC class. The value of 0.50 means that the classification effect of the model is not better than that of random. A good model should have an AUC score much higher than 0.50 (preferably higher than 0.80).

``````#   Calculate the AUC of the category predicted by the model

AUC <- performance(Pred, measure = "auc")
AUC <- [email protected]\[\[1\]\]
AUC``````

The AUC score is 0.60, and the discrimination ability of the model is not strong.

# Binomial logistic regression

As mentioned at the beginning, logistic regression can also be used to model count or proportional data. Binomial logistic regression assumes that the result variable comes from Bernoulli distribution (this is a special case of binomial distribution), in which the number of tests n is 1, so the result variable can only be 1 or 0. In contrast, binomial logistic regression assumes that the number of target events follows a binomial distribution, the number of trials n, and the probability Q. In this way, binomial logistic regression allows the result variable to take any non negative integer value, so it can process count data.

Educational data records the information of individual students concentrated in the school. By summarizing the number of repeat students in each school, we get a new data set, in which each row represents a school and has information about the proportion of repeat students in that school. The average socio-economic status of schools (average SES score) is also at the school level; Therefore, it can be used to predict the proportion or number of students who repeat grades in a school.

## Convert data

In this new data set, repetition refers to the number of students who repeat; Total refers to the total number of students in a school.

## Explore data

``````  ggplot(aes(x , y)) +
geom_smooth(method = "lm")``````

We can see that the proportion of repeat students is negatively correlated with the number of objections to the average socio-economic status of the school. Please note that we model the variable school average socio-economic status as its opposition number, because in the binomial regression model, we assume that there is a linear relationship between the opposition number of linear predictors and the results (i.e. event proportion), rather than between the predictors themselves and the results.

# Fitting binomial logistic regression model

In order to fit the binomial logistic regression model, we also use GLM function. The only difference is the description of the result variable in the formula. We need to specify the number of target events (repeat) and the number of non events (total repeat) and wrap them in cbind ().

``````GLM (cbind),   Total - whether the grade has been reserved)  ~  Average socio-economic status of schools,
family = binomial(logit))``````

# explain

The parameter interpretation of binomial regression model is the same as that of binomial logistic regression model. From the above model summary, we know that the average SES score of a school is negatively correlated with the probability of repetition. In order to improve the interpretability, we use summ () function again to calculate the exponential coefficient estimation of school average socio-economic status. Since the school’s average socio-economic status is a continuous variable, we can standardize the indexed estimate of the school’s average socio-economic status (by multiplying the original estimate by the SD of the variable, and then indexing the resulting number).

``````#Note that in order to use the sum () function for the binomial regression model, we need to take the result variable as the object.
Have you left a grade  <- ( filter(edu,  ! Is.na (average socio-economic status of schools),   Whether it has been reserved (grade)``````

We can see that as the SD of the average socio-economic status of the school increases, the probability of students repeating grades decreases by 1 – 85% = 15%.

We can intuitively see the effect of the average socio-economic status of schools.

``plot(allEffects)``

The chart above shows the expected impact of school average socio-economic status on students’ repetition probability. When other factors remain unchanged, with the increase of the average socio-economic status of the school, the probability of a student repeating a grade will decrease (from 0.19 to 0.10). The blue shaded area represents the 95% confidence interval of the predicted value of the average socio-economic status value of each school.

# Multilevel binary logistic regression

The binary logistic regression model introduced above is limited to modeling the influence of predictors at the student level; Binary logistic regression is limited to modeling the impact of predictors at the school level. In order to incorporate both student level and school level predictors, we can use multi-level models, especially multi-level binary logistic regression.

In addition to the above motives, there are more reasons to use multi-level models. For example, because the data are classified within schools, students from the same school are likely to be more similar than students from other schools. Because of this, in one school, the probability of a student repeating a grade may be very high, while in another school, it is very low. In addition, even the relationship between outcomes (i.e. repetition) and predictive variables (e.g. gender, preschool education, average socio-economic status of schools) may be different in different schools. It should also be noted that there are missing values in the variables of school average socio-economic status. The use of multi-level model can better solve these problems.

Take a look at the figure below as an example. The figure shows the proportion of repeat students in each school. We can see great differences between different schools. Therefore, we may need a multi-level model.

``````group_ By (school)  %>%
summarise(PROP  =  Sum (whether level left) / N ())  %>%
plot()``````

We can also map the relationship between gender and repetition through schools to understand whether the relationship between gender and repetition varies from school to school.

``````Mutation (gender)  =  if_ Else (gender)  == " boy",   1,   0))  %>%
ggplot(aes(x  =  Gender,   y  =  Have you ever left a grade,   color  =  As. Factor (school)  +``````

In the picture above, different colors represent different schools. We can see that the relationship between gender and repetition in different schools seems to be very different.

We can do the same for preschool education and repetition.

``````Mutation (gender)  =  if_ Else (gender)  == " girl",   0   1),
Pre school education  =  if_ Else (pre-school education)  == " yes",   1,   0))  %>%
group_ By (school)  %>%
Mutation (gender)  =  Gender  -  Mean,``````

The relationship between preschool education and repetition is quite different in different schools. However, we can also see that most of the relationships show a downward trend, from 0 (no previous school) to 1 (previous school), indicating that the relationship between preschool education and repetition is negative.

Due to the above observations, we can conclude that a multi-level model needs to be established in the current data, not only with random intercept (school), but also with random slope of gender and preschool education.

# Centralized variable

Before fitting the multi-level model, it is necessary to use an appropriate centralization method (i.e. mean centralization) to centralize the prediction variables, because the centralization method is very important for the interpretation of model estimation. According to Enders and tofighi (2007), we should use centralization for the gender and preschool education of the first level predictor, and use mean centralization for the school average socio-economic status of the second level predictor.

``````Pre school education  =  if_ Else (pre-school education)  == " yes",   1,   0))  %>%
group_ By (school)  %>%
Mutation (gender)  =  Gender  -  Mean,
Pre school education  =  Pre school education  -  Mean (pre-school education)  %>%
ungroup() %>%``````

## Intercept only model

To specify a multi-level model, we use the lme4 package. The random slope term and cluster term should be separated by |. Note that we use an additional parameter to specify a maximum number of iterations greater than the default value (10000). Because a multi-level model may require a lot of iterations to converge.

We first specify a pure intercept model to evaluate the impact of data clustering structure.

``````Glmer (have you left a grade  ~  one  + ( 1 | school),
optCtrl = list(maxfun=2e5))``````

Let’s calculate the ICC (intra class correlation) of the pure intercept model.

An ICC of 0.33 means that 33% of the change in the result variable can be explained by the clustering structure of the data. This provides evidence that the multi-level model may have an impact on the estimation of the model compared with the non multi-level model. Therefore, the use of multi-level model is necessary and guaranteed.

# Complete model

It is a good practice to build a multi-level model step by step. However, since the focus of this paper is not the multi-level model, we go directly from the pure intercept model to the full model we are ultimately interested in. In the complete model, we include not only the fixed effect term and a random intercept term of gender, preschool education and school average socio-economic status, but also the random slope term of gender and preschool education. Note that we specify family = binomial (link = “logit”) because this model is essentially a binary logistic regression model.

``Glmer (have you left a grade  ~  Gender  +  Pre school education  +  Average socio-economic status of schools  + ( one  +  Gender  +  (pre school education)``

The results (related to fixed effects) were similar to those of previous binary logistic regression and binomial logistic regression models. At the student level, gender has a significant positive impact on students’ repetition rate, while preschool education has a significant negative impact. At the school level, school status has a significant negative impact on the outcome variables. Let’s also look at the variance of the random effect term.

Similarly, we can use the summ () function to retrieve the exponential coefficient estimates for easy interpretation.

``sum(Model_Full)``

We can also show the effect of parameter estimation. Note that since the first level categorical variables (gender and preschool education) are centralized, they are treated as continuous variables in the model, as well as in the effect diagram below.

``plot((Model)``

In addition to the fixed effect term, let’s also look at the random effect term. From the previous ICC values, we know that it is necessary to include a random intercept. However, the need for a random slope including gender and preschool education is less clear. In order to clarify this point, we can use likelihood ratio test and AIC to judge whether the addition of random slope can improve the fitting of the model.

``Glmer (have you left a grade  ~  Gender  +  Pre school education  +  Average socio-economic status of schools  + ( one  +  Pre school education (school),``
``````#Fit an incomplete model and eliminate the random slope term of 'preschool education'
Glmer (have you left a grade  ~  Gender  +  Pre school education  +  Average socio-economic status of schools  + ( one  +  Gender (school),``````

## Likelihood ratio test

More complete models and models excluding ` gender ‘

The complete model was compared with the model excluding “pre-school education”

From all the insignificant likelihood ratio test results (PR (> chisq) > 0.05), we can conclude that adding any random slope term does not significantly improve the model fitting.

## AIC

``````AIC  # Full model
AIC # no gender model
AIC  #＃ Model without preschool education
AIC #: model without random slope``````

From the AIC results, we found that including the random slope term either did not significantly improve the AIC (expressed as a lower AIC value) or resulted in a worse AIC (i.e. higher). Therefore, we also conclude that it is not necessary to include random effects.

# Other families (distributions) and link functions

So far, we have introduced binary and binomial logistic regression, both of which come from the logit link of binomial family. However, there are many distribution families and link functions that we can use in GLM analysis. For example, in order to model binary results, we can also use probit links or log log (cloglog) instead of Logit links. In order to model the count data, we can also use Poisson regression, which assumes that the result variable comes from Poisson distribution and uses logarithm as the link function.

## reference

Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). _Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67_(1), 1-48. doi:10.18637/jss.v067.i01

Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models: A new look at an old issue. _Psychological Methods, 12_(2), 121-138. doi:10.1037/1082-989X.12.2.121

Most popular insights

## Less sass SCSS stylus of CSS preprocessor

Let’s ask ourselves a question: why do we need a preprocessor?The answer is self-evident, that is, CSS itself has some disadvantages: The syntax is not strong enough to be written nested, resulting in unclear style logic in the project. There is no variable and logical reuse mechanism. When it is necessary to reuse code, you […]