Fundamentals of survival analysis
survival analysis It can be used in many fields, such as:
- For patient survival time analysiscancer research ，
- Of “event history analysis”Sociology，
- stayIn EngineeringUsed for “time to failure analysis”.
In cancer research, typical research questions are as follows:
- How do some clinical features affect patient survival?
- What is the probability of a person surviving for three years?
- Is there a difference in survival between the two groups?
The purpose of this chapter is to describe the basic concepts of survival analysis. In cancer research, most survival analyses use the following methods:
- Kaplan Meier diagram(Kaplan Meier plots) visual survival curve
- Log-Rank Test (log rank test) to compare the survival curves of two or more groups
- Cox proportional hazards regression(Cox proportional hazards region) describes the impact of variables on survival. The next chapter discusses the Cox model:Cox proportional hazards model。
Here, we will start by explaining the basic concepts of survival analysis, including:
- How to generate and interpret survival curves,
- And how to quantify and test survival differences between two or more groups of patients.
Then we will continue to useCox proportional hazards modelDescribe multivariate analysis.
Here, we first define the basic terms of survival analysis, including:
- Survival time and event
- Survival function and hazard function
Survival time and event type in cancer research
There are different types of events, including:
From “response to treatment” (complete remission) to the occurrence of events of concerntimeCommonly known assurvival time (or time of event).
The two most important scales in cancer research include: I)Time of death； II) noneRecurrence survival time, corresponding to the time between response to treatment and disease recurrence. also known asDisease free survival time(disease free survival time) and noneEvent lifetime（event-free survival time）。
As mentioned above, survival analysis looks at the expected duration until the event of interest (recurrence or death) occurs. However, some people may not have observed the event during the study period, resulting in the so-called censoring phenomenon.
The review may occur in the following ways:
- The patient has not (has not) experienced an event of interest during the study period, such as recurrence or death;
- Patients lost follow-up during the study;
- The patient experienced another event and therefore could not be followed up further.
This phenomenon will be encountered in survival analysis, which is calledRight deletion。
Here’s a supplementRight deletionandLeft deletionMeaning:
Taking the right side as an example, when the above situation occurs to the patient, the data point on the right side of the time point of the time axis (i.e. after the time point) is missing, so it is called right deletion. Right deletion is often encountered in clinical bed.
In this way, the left deletion is also easy to understand. Since the followed-up person failed to participate in the follow-up before a certain point in the time axis for some reasons, the data before the time point (i.e. the left side of the time axis) is missing, so it is called left deletion.
For example, researchers want to track the visual changes of teenagers from 12 to 18 years old. If a respondent starts follow-up at the age of 14, there will be left loss. If a respondent cannot continue to participate in follow-up due to excessive playing games at the age of 14, there will be right loss.
Survival and risk function
Two related probabilities are used to describe survival data:Survival probabilityandHazard probability。
Survival function, also known as the survivor function s (T), is the probability of survival from the time origin (e.g. cancer diagnosis) to the specified future time t.
The risk function, recorded as H (T), is the probability of an event occurring to the individual observed at time t.
Note that the survival function focuses on the absence of events and the hazard function focuses on the occurrence of events.
Kaplan Meier survival estimate
The Kaplan Meier (km) method is a nonparametric method used to estimate the probability of survival based on the observed survival time (Kaplan and Meier, 1958).
The survival probability at time point t (I) is calculated as follows:
The estimated probability (s (T)) is a step function that changes the value only when each event occurs. The confidence interval of survival probability can also be calculated.
Km survival curve is the relationship between KM survival probability and time. It provides a useful data summary, which can be used to estimate indicators such as median survival time.
Survival analysis in R
Install and load the required R package
We will use two R packages:
survivalUsed to calculate survival analysis
survminerUsed to summarize and visualize survival analysis results
- Load package
We will use the lung cancer data provided in the survival package.
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss 1 3 306 2 74 1 1 90 100 1175 NA 2 3 455 2 68 1 0 90 90 1225 15 3 3 1010 1 56 1 0 90 90 NA 15 4 5 210 2 57 1 1 90 60 1150 11 5 1 883 2 60 1 0 100 90 NA 0 6 12 1022 1 74 1 1 50 80 513 0
- Organization: organization code
- Time: survival time in days
- Status: review status 1 = review, 2 = failure
- Age: years old
- Gender: male = 1 female = 2
- Ph.ecog: ECoG score (0 = good, 5 = dead)
- Ph.karno: physician’s score of Karnofsky (poor = 0 – good = 100)
- pat. Karno: Karnofsky performance score, by patient
- meal. Cal: calories consumed during meals
- Wt.loss: weight loss in the last six months
Calculate survival curve: survfit ()
We want to calculate survival by sex.
existenceFunctions in software packagesurvfit() can be used to calculate Kaplan Meier survival estimates. Its main points include:
- useSurvSurvival object created by () function
- And a dataset containing variables.
To calculate the survival curve, type:
fit <- survfit(Surv(time, status) ~ sex, data = lung) print(fit)
Call: survfit(formula = Surv(time, status) ~ sex, data = lung) n events median 0.95LCL 0.95UCL sex=1 138 112 270 212 310 sex=2 90 53 426 348 550
By default, the function print () displays a short summary of the survival curve. It prints observations, number of events, median survival, and confidence limits for median values.
If you want to display a more complete summary of the survival curve, type the following:
# Summary of survival curves summary(fit) # Access to the sort summary table summary(fit)$table
Access the value returned by survfit()
functionsurvfit() returns a list of variables, including the following components:
- n: The total number of objects in each curve.
- Time: the point in time on the curve.
- n。 Risk: number of subjects at risk at time t
- n. Event: the number of events that occurred at time t.
- n。 Reviewer: the number of reviewers who exit the event without risk at time t.
- Lower, upper: upper and lower confidence limits of the curve.
- Layering: indicates the layering of curve estimation. If the stratum is not null, there are multiple curves in the result. The level of the hierarchy (a factor) is the label of the curve.
You can access components in the following ways:
d <- data.frame(time = fit$time, n.risk = fit$n.risk, n.event = fit$n.event, n.censor = fit$n.censor, surv = fit$surv, upper = fit$upper, lower = fit$lower ) head(d)
time n.risk n.event n.censor surv upper lower 1 11 138 3 0 0.9782609 1.0000000 0.9542301 2 12 135 1 0 0.9710145 0.9994124 0.9434235 3 13 134 2 0 0.9565217 0.9911586 0.9230952 4 15 132 1 0 0.9492754 0.9866017 0.9133612 5 26 131 1 0 0.9420290 0.9818365 0.9038355 6 30 130 1 0 0.9347826 0.9768989 0.8944820
Visual survival curve
We will use this featureggsurvplot() [atSurvminerThe survival curves of the two groups of subjects were] generated in the R package.
You can also display:
- Use parametersconf.int = TRUE95% of survival functionConfidence limit。
- Using optionrisk.tableBy timedivideAt riskofindividualNumber and / or percentage of.risk. TableAllowable values include:
- True or false specifies whether to display the risk table. The default value is false.
- “Absolute” or “percentage”: displays the number of subjects at risk by time, respectivelyAbsolute numberandpercentage。 Use abs_pct to display absolute numbers and percentages.
- yesnumberRank testP value, usepval = TRUEComparison group.
- Use parameterssurv.median.linestayMedian survivalHorizontal / vertical lines. Allowed values include one of C (“None”, “HV”, “H”, “V”). v: Vertical, H: horizontal.
# Change color, linetype by strata, risk.table color by strata ggsurvplot(fit, pval = TRUE, conf.int = TRUE, risk.table = TRUE, # Add risk table risk.table.col = "strata", # Change risk table color by groups linetype = "strata", # Change line type by groups surv.median.line = "hv", # Specify median survival ggtheme = theme_bw(), # Change ggplot2 theme palette = c("#E7B800", "#2E9FDF"))
The diagram can be further customized with the following parameters:
- Conf.int.style = “step”To change the style of the confidence interval band.
- xlabChange the x-axis label.
- break.time.by = 200Disconnect the x-axis 200 at time intervals.
- risk.table =“ abs_pct”，To show the absolute number and percentage of individuals at risk.
- risk.table.y.text.col = TRUE，risk.table.y.text = FALSEProvide a bar instead of a name in the text note of the risk table legend.
- ncensor.plot = TRUE，To plot the number of inspected objects at time t. just asMarcin KosinskiAs suggested, this is a good additional feedback to the survival curve, so people can realize: how does the survival curve look, what is the number of risk sets, and what is the reason why the risk sets become smaller? Was it caused by an event or an event under review?
- legend.labsChange the legend label.
ggsurvplot( fit, # survfit object with calculated statistics. pval = TRUE, # show p-value of log-rank test. conf.int = TRUE, # show confidence intervals for # point estimaes of survival curves. conf.int.style = "step", # customize style of confidence intervals xlab = "Time in days", # customize X axis label. break.time.by = 200, # break X axis in time intervals by 200. ggtheme = theme_light(), # customize plot and risk table with a theme. risk.table = "abs_pct", # absolute number and percentage at risk. risk.table.y.text.col = T,# colour risk table text annotations. risk.table.y.text = FALSE,# show bars instead of names in text annotations # in legend of risk table. ncensor.plot = TRUE, # plot the number of censored subjects at time t surv.median.line = "hv", # add the median survival pointer. legend.labs = c("Male", "Female"), # change legend labels. palette = c("#E7B800", "#2E9FDF") # custom color palettes. )
Kaplan Meier diagram can be explained as follows:
The horizontal axis (x axis) represents the time in days, and the vertical axis (Y axis) represents the possibility of survival or the proportion of the population living. The line represents the survival curve of the two groups. A vertical drop in the curve indicates an event. The vertical tick mark on the curve indicates that the patient has been examined at this time.
- At zero, the probability of survival was 1.0 (or 100% of participants were still alive).
- At time 250, the survival probability of sex = 1 was about 0.55 (or 55%), and the survival probability of sex = 2 was about 0.75 (or 75%).
- The median survival of sex = 2 was about 270 days and that of sex = 2 was about 426 days, indicating that the survival of sex = 2 was higher than that of sex = 1
The median survival time for each group can be obtained using the following code:
records n.max n.start events *rmean *se(rmean) median 0.95LCL 0.95UCL sex=1 138 138 138 112 325.0663 22.59845 270 212 310 sex=2 90 90 90 53 458.2757 33.78530 426 348 550
The median survival time of each group represents the time when the survival probability s (T) is 0.5.
The median survival time was 270 days for gender = 1 (male) and 426 days for gender = 2 (female). Compared with men, women with lung cancer seem to have a survival advantage. However, to assess whether this difference is statistically significant, a formal statistical test is required, which will be discussed in the next section.
Note that the confidence limit is very wide at the tail of the curve, so it is difficult to explain it meaningfully. This can be explained by the fact that in practice, some patients usually lose follow-up or survive at the end of follow-up. Therefore, it may be wise to shorten the map on the x-axis before the end of follow-up (Pocock et al., 2002).
Parameters can be usedxlimShorten the survival curve as follows:
ggsurvplot(fit, conf.int = TRUE, risk.table.col = "strata", # Change risk table color by groups ggtheme = theme_bw(), # Change ggplot2 theme palette = c("#E7B800", "#2E9FDF"), xlim = c(0, 600))
Note that you can use parametersfunSpecify three frequently used transformations:
- “Log”: logarithmic conversion of survivor function,
- “Events”: plot cumulative events (f (y) = 1-y). Also known as cumulative incidence
- “Cumhaz” plots the cumulative hazard function (f (y) = – log (y))
For example, to plot cumulative events, type:
ggsurvplot(fit, conf.int = TRUE, risk.table.col = "strata", # Change risk table color by groups ggtheme = theme_bw(), # Change ggplot2 theme palette = c("#E7B800", "#2E9FDF"), fun = "event")
Cumulative hazardIt is commonly used to estimate the probability of danger. Defined as:
The cumulative risk (H (T)) can be interpreted as the cumulative force of death. In other words, if the event is a repeatable process, it corresponds to the number of events expected by each individual at time t.
To plot cumulative hazards, type the following:
ggsurvplot(fit, conf.int = TRUE, risk.table.col = "strata", # Change risk table color by groups ggtheme = theme_bw(), # Change ggplot2 theme palette = c("#E7B800", "#2E9FDF"), fun = "cumhaz")
Kaplan Meier life table: summary of survival curve
As mentioned above, you can use functionssummary() to obtain a complete summary of the survival curve:
Functions can also be usedsurv_summary() [atsurvminerPackage] get a summary of the survival curve. Compared with the default summary () function, surv_ Summary () creates a data frame that contains a good summary from the survfit results.
res.sum <- surv_summary(fit) head(res.sum)
time n.risk n.event n.censor surv std.err upper lower strata sex 1 11 138 3 0 0.9782609 0.01268978 1.0000000 0.9542301 sex=1 1 2 12 135 1 0 0.9710145 0.01470747 0.9994124 0.9434235 sex=1 1 3 13 134 2 0 0.9565217 0.01814885 0.9911586 0.9230952 sex=1 1 4 15 132 1 0 0.9492754 0.01967768 0.9866017 0.9133612 sex=1 1 5 26 131 1 0 0.9420290 0.02111708 0.9818365 0.9038355 sex=1 1 6 30 130 1 0 0.9347826 0.02248469 0.9768989 0.8944820 sex=1 1
functionsurv_summary() returns a data frame containing the following columns:
- Time: the time point at which the curve has a step.
- n。 Risk: number of subjects at t risk.
- n. Event: the number of events that occurred at time t.
- n. Censor: number of audit events.
- Surv: estimated survival probability.
- Std.err: survival standard error.
- Upper: upper limit of confidence interval
- Lower: lower limit of confidence interval
- Layering: indicates the layering of curve estimation. The level of the hierarchy (a factor) is the label of the curve.
When the survival curve has been fitted to one or more variables, surv_ The summary object contains additional columns that represent variables. This allows the output of ggsurvplot to be considered hierarchically or in combination with some factors.
surv_summaryThe object also has an attribute called “table”, which contains information about the survival curve, including the median survival with confidence interval, the total number of subjects and the number of events in each curve. To access the property sheet, enter the following command:
Log rank test for comparing survival curves: survdiff ()
log-rank testIt is the most widely used method to compare two or more survival curves. The null hypothesis was that there was no difference in survival between the two groups. Log rank test is a nonparametric test, which makes no assumptions about survival distribution. In essence, the log rank test compares the number of events observed in each group with the expected number of events if the original hypothesis is true (i.e., the survival curve is the same). Log rank statistics are approximately distributed as chi square test statistics.
functionsurvdiff() [atexistencePackage] can be used to compare pairs of two or more survival curvesNumber rank test。
survdiff() can be used as follows:
surv_diff <- survdiff(Surv(time, status) ~ sex, data = lung) surv_diff
Call: survdiff(formula = Surv(time, status) ~ sex, data = lung) N Observed Expected (O-E)^2/E (O-E)^2/V sex=1 138 112 91.6 4.55 10.3 sex=2 90 53 73.4 5.68 10.3 Chisq= 10.3 on 1 degrees of freedom, p= 0.00131
This function returns a list of components, including:
- n: Number of subjects per group.
- OBS: weighted number of observations of events in each group.
- Exp: the number of weighted expected events in each group.
- Chisq: Chi square statistic used to test equality.
- Hierarchy: (optional) the number of topics contained in each hierarchy.
The log rank test of survival difference gives a p value of P = 0.0013, indicating that the survival difference between gender groups is significant.
Fitting complex survival curves
In this section, we will use a combination of multiple factors to calculate the survival curve. Next, we will combine several factors to study the output of ggsurvplot()
- Fitting (complex) survival curves using colon data sets
require("survival") fit2 <- survfit( Surv(time, status) ~ sex + rx + adhere, data = colon )
- Use survminer to visualize the output. The following figure shows the survival curve by sex according to RX & adhesion value.
# Plot survival curves by sex and facet by rx and adhere ggsurv <- ggsurvplot(fit2, fun = "event", conf.int = TRUE, ggtheme = theme_bw()) ggsurv$plot +theme_bw() + theme (legend.position = "right")+ facet_grid(rx ~ adhere)
Survival analysis is a set of statistical methods for data analysis, in which the outcome variable of interest is the time before the event.
Survival data is usually described and modeled according to two related functions:
The survivor function represents the probability of a person’s survival from the time of origin to a certain time beyond time t. It is usually estimated by Kaplan Meier method. The log rank test can be used to test the difference between the survival curves of the test group (e.g. the treatment group).
The hazard function gives the instantaneous probability of an event and the survival rate at that time. It is mainly used as a diagnostic tool or a mathematical model for specifying survival analysis.
In this article, we demonstrated how to use two R packages（survival(for analysis) andsurvminer(for visualization)) to perform and visualize survival analysis.