Environmental stressors often exhibit temporal lag effects, which requires the use of sufficiently flexible statistical models to describe the temporal dimension of the exposure-response relationship. Here, we develop the Distributed Lag Nonlinear Model (DLNM), a modeling framework that can simultaneously represent nonlinear exposure-response dependence and lag effects. This approach is based on the definition of "cross-benchmark", which is a two-dimensional function space that describes the shape of the relationship along both the prediction space and the dimension of the lag in which it occurs.
In this way, the method provides a unified framework for a range of models previously used in this environment. To illustrate this approach, we use the example of DLNMs to represent the relationship between temperature and mortality, using data from the National Morbidity, Mortality, and Air Pollution Study for the period 1987-2000.
Sometimes the effects of a particular exposure event are not limited to the observed time period, but are delayed in time. This brings up the problem of modeling the relationship between an exposure event and a series of future outcomes, specifying the distribution of effects at different times after the event (defined lag periods). Ultimately, this step entails defining additional lag dimensions of the exposure-response relationship, describing the temporal structure of effects.
This happens frequently when assessing the short-term effects of environmental stressors: Several time-series studies have reported that exposure to high levels of air pollution or extreme temperatures can affect health in the days following the onset. Furthermore, such a phenomenon occurs when a stressor primarily affects a group of vulnerable individuals whose events are only briefly advanced by the effects of exposure.
Among the various approaches that have been proposed to deal with aftereffects, distributed lag models (DLMs) have played a major role and have recently been used to quantify health effects in air pollution and temperature studies. The main advantage of this approach is that it allows the model to include a detailed representation of the time course of the exposure-response relationship, which in turn provides an estimate of the overall effect where there is a lagged contribution or harvest.
Although conventional DLMs are suitable for describing the hysteresis structure of linear effects, they show some limitations when used to represent nonlinear relationships. We propose a solution that further relaxes assumptions about relationships and extends this approach to distributed lag nonlinear models (DLNMs), a family of models that can describe in a flexible way along the predictor space and its occurrence The effect of simultaneous changes in the lag dimension of . In this way, the DLNM classes also provide a unified framework for existing simpler methods.
DLNMs have only been briefly described in epidemiological contexts before: the aim of this paper is to rigorously develop this approach and to describe the implementation in the statistical software R, a specially written package dlnm, to provide an application using real datasets instance. We briefly describe the basic models used in time series analysis and introduce the underlying concepts as a general way to describe nonlinear relationships between variables and dependent variables. We outline the complexity of lag effects in time and provide a simple general representation of DLMs. The application of this approach to modeling the effect of temperature on mortality is then illustrated. Finally we provide some discussion and suggest possible further developments.
The general model representation of the time series (t=1,…,n) describing the result Yt is
where ≡E(Y ), g is a monotonic function and Y is assumed to come from a distribution belonging to the exponential family. The function sj represents the smooth relationship between the variable xj and the linear predictor, defined by the parameter vector bj. The variable uk includes other predictors whose linear effects are specified by the correlation coefficient k. The function sj can also be specified by nonparametric methods based on generalized additive models. However, in the current development we rely on a fully parametric approach.
In a time series analysis of environmental factors, the outcome Yt is usually daily counts, assumed to come from a so-called overdispersed Poisson distribution. The studies take advantage of significant improvements in statistical methods over the past few years to quantify the short-term effects of air pollution. Typically, these methods include a smoothed function of time to identify the effects of confounding factors that vary slowly over time, manifesting as seasonal or long-term trends. The nonlinear effects of meteorological factors such as temperature and humidity are also included. Categorical variables such as days of the week or age groups were modeled as factors. Although air pollution is usually described by a linear relationship, this assumption can be relaxed in order to assess nonlinear effects.
Here, we focus on a general function s that specifies the potential nonlinearity and hysteresis effects of the predictor x, usually referring to air pollution or temperature, but without loss of generality.
The relationship between x and g() is represented by s(x), which is included in the linear predictor of generalized linear models as a sum of linear terms. This can be achieved by choosing a base, which is a function space in which we consider s to be an element. The relevant basis functions consist of the transformation of a set of fully known original variables x, resulting in a new set of variables, called basis variables. The complexity of the estimated relationship depends on the type of cardinality and its dimensions. Several different basis functions are used to describe the potential nonlinear effects of environmental factors on health, the choice of which depends on assumptions about the shape of the relationship, the degree of approximation required for the specific purpose of the investigation, and interpretation issues. In fully parameterized methods, the main choice usually relies on a function describing a smooth curve, such as a polynomial or spline function, or using a linear threshold parameterization, represented by a truncated linear function (x-)+, when x > Equal to (x-), otherwise equal to 0. The general representation of the above simple model is
Results at a given time t can be explained by past exposure xt- in the presence of a lag effect, which represents the time elapsed between exposure and response. A relatively simple method is to transform the original vector x of the ordered exposure, resulting in an n × (L+1) matrix Q, such as
This step specifies an additional lag dimension of the exposure-response relationship. Ultimately, the purpose of the modeling framework presented here is to simultaneously describe the dependencies in two dimensions: the usual predictor space and the new lag dimension.
Distributed lag model
When a linear relationship is assumed, the lag effect can naturally be described by a distributed lag model (DLM). This approach allows to distribute the effects of a single exposure event over a specific time period, with several parameters explaining the contributions of different lag periods. These models have been widely used to assess the lag effects of environmental factors. The simplest formulation is an unconstrained DLM, specified by adding a parameter for each lag period. Unfortunately, due to the high correlation between exposures on adjacent days and the concatenation in the resulting models, the precision of estimates of effects for specific lag periods tends to be very poor.
To make the estimation of distributed lag curves more accurate, some constraints can be imposed, such as assuming constant effects within the lag interval, or using continuous functions such as polynomials or splines to describe smooth curves. A simple model with a moving average of previous L days of exposure as a predictor can be considered a special case of DLM: such models have been widely used in the field of air pollution epidemiology and sometimes to quantify the effect of temperature. Such models have previously only been given to polynomial DLMs. It is possible to formulate a simpler and general definition of DLM in which the shape of the distributional effect along the lag period is specified by an appropriate basis. in matrix notation
we can define
By constructing the implicit linear effect b for each lag period, one can help explain the estimated parameter gˆ as follows.
Distributed Lag Nonlinear Model
There are well-established methods to describe flexible exposure-response relationships for simple lag models, or flexible DLMs for simple linear effects, but it is rare to model both components simultaneously. Extended methods for describing nonlinear effects have been proposed, by applying a constraint matrix C to each term of a threshold or piecewise function or to linear and quadratic terms separately, a DLM can be constructed. Nonetheless, these methods are still somewhat limited in their ability to describe such complex dependencies. A useful generalization can be achieved by generating a new model framework that can describe the nonlinear relationship between predictor space and lag period, which is the DLNM family.
While the algebraic notation for DLNM can be quite complex, involving three-dimensional arrays, the basic concepts are built on the definition of cross-cardinality and are simple. The intersection point can be depicted as a two-dimensional function space, describing both the shape of the relationship along x and the hysteresis effect of its distribution. Choosing a cross basis point is equivalent to choosing two sets of basis functions, which will be combined to produce a cross basis function.
In order to model the relational shape of the two spaces we consider, we need to apply both transformations described simultaneously. First, as described in (2), we choose a basis for x to define dependencies in the predictor space, specifying Z. Then, as described in (3), we create additional lag dimensions for each derived base variable of x stored in Z. The structure is symmetric, i.e. the order of the two transformations can be reversed, applying the basis functions directly to each column of the matrix Q.
Despite the complexity of parameterization, the estimation and inference of DLNM parameters pose no more problems than any other generalized linear model, and can be performed with ordinary statistical software after specifying cross-basis variables. However, while the interpretation of the simpler DLM in (4) is straightforward, including reporting the estimated linear effect bˆ for each lag in (6), the results of the more complex DLNM are difficult to summarize with smooth nonlinear dependencies. One solution is to build a forecast grid for each lag period and appropriate exposure value, using a three-dimensional plot to provide an overall picture of the impact of changes along both dimensions.
The prediction grid, represented by the m × (L+1) matrix of predicted effects E and the associated standard error Esd matrix, can be computed from the fitted model including the cross-basis function matrix W using a vector of estimated coefficients gˆ.
And, given V(gˆ) is the variance matrix of the estimated coefficients
This grid is useful for calculating an estimate of the exposure effect at lag p or the exposure effect at lag xp, just take ep and ex p-, respectively. Finally, an estimate of the overall effect can be calculated by adding up all the contributions from different lags. The vector etot and the associated standard error esd tot, obtained by summing the contributions of each lag period, account for the effect of exposure across the lag period.
Data and Model Selection
We applied DLNMs to study the effect of temperature on overall mortality over the period 1987-2000. The dataset is from the National Morbidity, Mortality and Air Pollution Study.
It includes 5,114 daily observations of overall and cause-specific mortality, weather, and pollution data.
The analysis is based on the model in (1), fitted by a generalized linear model of the quasi-Poisson family, with the following options in controlling for confounders: Temporal natural solid splines with 7 degrees of freedom (df) per year to describe the long-term Trend and seasonality; day-of-week indicator variables; natural stereo splines with 3 degrees of freedom for dew point temperature mean at lags 0-1; linear terms for mean values for ozone and CO at lags 0-1.
glm(death ~ ns.basis + ns(dp01,df=3) + dow + o301 + co01 + ns(date,df=14*7),family=quasipoisson(), data)
These selections are based on several methodological and substantive papers on time series analysis. The effect of mean temperature is investigated by choosing two base points to describe the relationship between temperature and lag-period space; we illustrate a flexible model using natural solid splines to describe the relationship in each dimension. Nodes are placed equidistantly across the temperature range for sufficient flexibility in the tails, and equidistantly on the logarithmic scale of the lags to have more in the first part of the distributed lag curve. Flexibility as more changes are expected there. The maximum lag period L is set to 30 days. For comparison, we fit a simpler model with a moving average of the previous days' temperatures.
We choose the number of knots according to the modified Akaike and Bayesian information criteria, which define the df in each dimension, for models with overdispersed responses fitted by the quasi-likelihood method, as follows.
All analyses were performed with R software.
# 3-D graph plot(ns.pred,label="Temperature")
When used to compare different modeling choices, QAIC resulted in a relatively complex model with 11df in the predictor space and 5df in the lag dimension, with a total of 55 parameters used to define the relationship. In contrast, QBIC shows a 5×5df model with 25df describing the overall effect. Without any knowledge of the performance of these criteria within the DLNM framework, we chose the latter as our final model.
Figure 1 provides an overall picture of the effect of temperature on mortality, showing a three-dimensional plot of relative risk (RR) along temperature and lag compared to a reference value of 21◦C (the overall lowest mortality point). The graph shows a very strong and direct effect of heat, and shows that there is a more hysteretic effect on extremely hot temperatures. The maximum effect of cold temperatures is reached around a 2-3 year lag.
Although a 3-D plot is a useful tool for summarizing the overall relationship of two dimensions, it cannot include the uncertainty of the estimate. For a more specific assessment of this relationship, we can plot the effect of a specific temperature or lag period. Figure 2 shows the RRs for temperature at specific lag periods (0, 5, 15, and 28) and lag periods at specific temperatures (-10.8, -2.4, 26.5, and 31.3 ◦C), approximately corresponding to the 0.1, 5th temperature distribution. , 95th and 99.9th percentiles (known as moderate and extreme cold and heat). The overall effect of temperature, summing the contributions of the 30-day lag period considered in the analysis, is included below. The temperature-mortality relationship appears to vary with lag, with a different minimum mortality point for lags 0 and 5 (top two graphs in the upper left). The graph confirms that if the effects of extreme heat are more delayed compared to moderate heat, the significant risk persists for 10 and 3 days, respectively (third and fourth graphs on the top right). Still, only the extreme heat indicated a possible harvest effect, which started after a 15-day lag. The overall estimated RRs relative to 21◦C were 1.24 (95% CI: 1.13-1.36) and 1.07 (95% CI: 1.03-1.11) for extreme and moderate heat. Cold temperatures show a completely different pattern, with the effects of moderate cold lasting until a lag of 25 days (top two graphs in the upper right). Furthermore, the effect of cold appears to be leveling off, with a slightly higher overall RR of 1.30 (95% CI: 1.20-1.40) for moderate cold and 1.20 (95% CI: 1.04-1.39) for extreme cold (see graph below) ).
To compare this DLNM with simpler alternatives, models were fitted to the moving average and the same spline function of the temperature space for lags 0-1 and 0-30. The former effect of high temperature provided similar estimates, but showed a weaker effect of low temperature, with an estimated RR of 1.06 (95% CI: 1.03–1.09) for moderate cold. This difference may be due to underestimation, as the effects of low temperature lasted longer than 2 days. Conversely, moving average models with lags 0-30 had similar effects on cold but lower estimates on heat, with RRs of 1.01 (95%CI: 0.97-1.04) and 1.06 (95%CI) for moderate and extreme heat, respectively : 0.97-1.17). Considering that each prior exposure in the lag period is assumed to have contributed equally to the effect on each day, it is plausible that the average 31-day estimate may introduce some bias. The above criteria show that the DLNM fits better, if compared to the lag 0-1 and 0-30 moving average models, the difference is 571 and 517 for QAIC and 468 and 445 for QBIC.
Sensitivity analyses have been performed to assess the impact of model selection. In particular, we evaluated the change in the estimated overall effect associated with changing the df used to specify the cross-basis function (along two dimensions) and the seasonal and long-term trend sections. Increasing the number of knots in the temperature space produced much less smooth curves, probably due to overfitting, while choosing a different spline in the lag dimension did not change significantly. Using more df to control for seasonal and long-term trends did not affect the estimates, except for a less pronounced drop in the temperature-mortality curve at very low temperatures.
Furthermore, examination of lags and temperature-specific curves revealed that the negative effects at long lags completely disappeared when seasonal control was added. Because the effects of models with longer lags are more sensitive to seasonal components.
In this paper, we describe a class of DLNMs that can be used to model the effects of factors that exhibit both nonlinear dependencies and hysteresis effects. DLNMs are conceptually simple, yet flexible enough to allow a wide range of models, including previously used simple models and more complex new variants.
Conceptual simplicity allows building an R package to fit such a wide range of models. One difficulty highlighted by this rich selection (base type, number and location of knots, maximum lag) is what criteria can be used to select alternatives.
In the above example, we used information criteria to guide the choice of the number of nodes, but we used a priori arguments in choosing the base type and maximum lag. The choice of DLNM has been discussed previously from an epidemiological perspective, and since there is no consensus on what is the "best" model, sensitivity analyses are particularly important to assess the dependence of key conclusions on model choice.
DLMN is broad in scope, which helps to achieve this. Regression diagnostics, such as residuals and partial autocorrelation plots, may also be helpful. Furthermore, we have discussed the choice of DLNM, assuming it focuses on the variable of interest (in our case temperature). There is also a model selection issue of covariates, some of which may also be DLNMs.
This problem is sometimes referred to as adjustment uncertainty. Likewise, there is no consensus on what method is best, and sensitivity analysis for this part of model selection is also important.
1. Zanobetti A, Schwartz J, Samoli E, Gryparis A, Touloumi G, Atkinson R, Le Tertre A, Bobros J, Celko M, Goren A, Forsberg B, Michelozzi P, Rabczenko D, Aranguez Ruiz E, Katsouyanni K. The temporal pattern of mortality responses to air pollution: a multicity assessment of mortality displacement. Epidemiology 2002; 13(1):87–93.
2. Braga AL, Zanobetti A, Schwartz J. The time course of weather-related deaths. Epidemiology 2001; 12(6):662–667.
3. Schwartz J. Is there harvesting in the association of airborne particles with daily deaths and hospital admissions? Epidemiology 2001; 12(1):55–61.
Most Popular Insights