Introduction to Survival Analysis

Learn the key concepts of survival analysis, including censoring, Kaplan‑Meier estimation, and Cox proportional‑hazards modeling.

Summary

Fundamentals of Survival Analysis

Introduction

Survival analysis is a specialized branch of statistics designed to answer a specific type of question: how long until something happens? Unlike many statistical methods that focus on whether an event occurs or what value a variable takes, survival analysis emphasizes the timing of events. The event of interest might be death in a medical study, disease recurrence, mechanical failure in engineering, or customer churn in business. What unifies these diverse applications is that we care about the duration from a defined starting point until the event occurs.

Understanding Time-to-Event Data

In survival analysis, each observation includes a follow-up duration: the time elapsed from a defined starting point (such as treatment initiation or diagnosis) until either the event occurs or the observation ends.

A crucial reality of real-world data is that not all subjects are followed for the same length of time. Some may experience the event quickly, others may not experience it before the study ends, and some may leave the study before completion. This variability in follow-up duration is central to survival analysis, and the methods are specifically designed to handle it.

Key Terminology: The Survival Function and Hazard Function

Two mathematical functions form the foundation of survival analysis.

The survival function, denoted $S(t)$, represents the probability that the event has not occurred by time $t$. In other words, it answers the question: "What is the probability that an individual 'survives' (meaning does not experience the event) beyond time $t$?" The survival function always starts at $S(0) = 1$ (no one has experienced the event at the initial moment) and never increases; it typically declines toward zero as time passes and more events occur. For example, if time is measured in months and $S(60) = 0.75$, then 75% of individuals are still event-free at 60 months.

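As a compact restatement, suppose $T$ denotes the (random) time at which the event occurs; the symbol $T$ is notation assumed here for illustration and is not used elsewhere in this summary. The properties of the survival function can then be written as:

$$S(t) = P(T > t), \qquad S(0) = 1, \qquad S(t_1) \ge S(t_2) \ \text{whenever } t_1 \le t_2$$

Under this notation, $S(60) = 0.75$ corresponds to $P(T \le 60) = 1 - 0.75 = 0.25$: a quarter of individuals are expected to have experienced the event by month 60.
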
The hazard function, denoted $h(t)$, describes the instantaneous risk of the event at time $t$, given that the event has not yet occurred. Think of it as the "force of mortality" or "instantaneous failure rate" at any given moment. Unlike the survival function, which is a cumulative probability, the hazard function represents the risk at an infinitesimal instant. A high hazard at a particular time means the event is very likely to occur soon; a low hazard means the event is unlikely in the immediate future.

These two functions are mathematically related: high hazards lead to faster declines in the survival function, and low hazards mean survival probabilities remain high.

Censoring: The Defining Challenge of Survival Analysis

The most distinctive feature of survival data is censoring. Censoring occurs when we know only that a subject's event time is at least as long as the follow-up duration: we do not know exactly when the event occurs, or even whether it will occur.

Right-censoring is the most common type. It happens when:

- A subject is still alive (or has not experienced the event) at the end of the study
- A subject drops out or leaves the study before the event occurs
- The event occurs after the study ends

The key insight is that censored observations contain genuine information: they tell us the event did not happen at least until the censoring time. Standard statistical methods that discard censored observations, or that treat censoring times as if they were event times, produce biased estimates. Survival analysis methods are specifically built to incorporate censored observations correctly, using all available information.

For example, if a patient enters a cancer study but leaves the hospital after 2 years without a recurrence, that patient is censored at 2 years. We know they remained recurrence-free for at least that long, even if we never learn whether they eventually experienced recurrence.

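One common way to store right-censored data is as a pair per subject: the observed duration and a flag saying whether the event was actually seen. The sketch below illustrates that encoding under simple assumptions; the function name `observe` and the example subjects are hypothetical and chosen only for illustration.

```python
def observe(event_time, censor_time):
    """Return (observed_duration, event_observed) for one subject.

    event_time:  when the event truly happens (unknown in practice if censored)
    censor_time: when follow-up stops (study end or dropout)
    """
    if event_time <= censor_time:
        # The event happened while the subject was still being followed.
        return event_time, True
    # Follow-up stopped first: we only know the event time exceeds censor_time.
    return censor_time, False


# Hypothetical subjects: (true event time, end of follow-up), in years.
subjects = [
    (1.5, 5.0),  # event observed at 1.5 years
    (7.0, 5.0),  # still event-free when the study ends at 5 years
    (3.0, 2.0),  # leaves after 2 years, like the patient in the example
]
records = [observe(t, c) for t, c in subjects]
print(records)
```

The second and third subjects are right-censored: their records keep the follow-up time together with `False`, so later analysis can still use the fact that they were event-free up to that point.
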
Summarizing Survival Data: The Kaplan-Meier Estimator

The Kaplan-Meier Estimator

When you have survival data with censoring, how do you estimate the survival function $S(t)$? The Kaplan-Meier estimator is the standard nonparametric approach. It provides a step-function estimate of $S(t)$ that accounts for censoring.

The logic is intuitive: at each time point where an event occurs, the current survival estimate is multiplied by 1 − (number experiencing the event) / (number still at risk), so it drops by that proportion. The "number at risk" includes everyone who has neither experienced the event nor been censored before that time point. Importantly, censored observations reduce the number at risk at later times but do not themselves cause the survival curve to drop.

The Step-Function Shape

The Kaplan-Meier curve has a distinctive appearance: it remains flat (horizontal) between observed event times and drops at each observed event time. This stepwise shape reflects the discrete nature of the observed event times in your data. There is no assumption that events are smoothly distributed across time; the estimate changes only at times when events actually occur in the data.

In the acute myeloid leukemia (AML) example, the Kaplan-Meier curves for the two treatment groups show this step-function pattern. The steeper drops in the "Not Maintained" group indicate that events (presumably relapses or deaths) occur more frequently in that group.

Handling Censoring in Kaplan-Meier Estimation

The Kaplan-Meier method handles censored observations in a straightforward way: when a subject is censored at time $t$, they are removed from the "number at risk" for all times after $t$, but they do not cause the survival curve to drop.

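The product-limit logic described for the Kaplan-Meier estimator can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `kaplan_meier` and the toy data are assumptions, and subjects censored at exactly an event time are counted as still at risk at that time (a common convention).

```python
def kaplan_meier(durations, events):
    """Return [(time, estimated S(time))] at each distinct observed event time.

    durations: follow-up time for each subject
    events:    True if the event was observed, False if right-censored
    """
    # The estimate only changes at times where at least one event was observed.
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # At risk: subjects whose follow-up has not ended before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events observed exactly at time t.
        n_events = sum(1 for d, e in zip(durations, events) if e and d == t)
        # Multiply by the conditional probability of surviving past t.
        survival *= 1.0 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical): five subjects, two of them censored.
durations = [2, 3, 3, 5, 8]
events = [True, False, True, True, False]
for time, s_hat in kaplan_meier(durations, events):
    print(f"t = {time}: S(t) is about {s_hat:.2f}")
```

In this toy data the subject censored at time 3 contributes to the at-risk count at times 2 and 3 but is gone by time 5, which makes the drop at time 5 larger without ever producing a drop of its own.
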
Removing censored subjects from the at-risk set, without lowering the curve, is appropriate because being censored at time $t$ tells us the subject was event-free at least until $t$; their removal simply reflects that we can no longer observe them after that point.

This design ensures that censoring does not bias the survival estimates downward. If we ignored censoring and treated censored observations as if they had experienced the event, we would underestimate survival, often dramatically.

Comparing Survival Between Groups

Plotting and Visually Comparing Survival Curves

One of the most common questions in applied survival analysis is: does survival differ between groups? For example, does a new treatment improve survival compared to a standard treatment?

The simplest approach is to draw Kaplan-Meier curves separately for each group on the same plot. When you overlay multiple Kaplan-Meier curves, visual differences become apparent:

- Steeper decline: A group whose curve drops more steeply experiences events more rapidly; this group has worse survival.
- Shallower decline or curve remaining high: A group with a flatter curve experiences events more slowly; this group has better survival.
- Overlapping curves: If curves largely overlap, survival experiences are similar between groups.

In the AML example, the "Not Maintained" group (dashed line) shows a steeper decline than the "Maintained" group (solid line), suggesting visually that patients in the "Not Maintained" group had substantially worse survival.

Formal Testing: The Log-Rank Test

While visual comparison is informative, it is subjective. The log-rank test provides a formal statistical test of the null hypothesis that survival curves are equal between groups.

The log-rank test compares the observed number of events in each group to the expected number of events under the null hypothesis of no group difference. If the groups truly had identical survival, we would expect the events at each event time to be split between the groups in proportion to the number of subjects still at risk in each group.

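The observed-versus-expected comparison can be made concrete with a small two-group sketch. This is an illustrative implementation under stated assumptions, not a substitute for a statistics library: the names `logrank_test` and `chi2_sf_1df` are hypothetical, the variance uses the usual hypergeometric form, and the p-value relies on the chi-square distribution with 1 degree of freedom (computed via `math.erfc`).

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi_square_statistic, p_value)."""
    pooled = list(zip(durations_a, events_a)) + list(zip(durations_b, events_b))
    event_times = sorted({t for t, e in pooled if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Subjects still at risk in each group just before time t.
        n_a = sum(1 for d in durations_a if d >= t)
        n_b = sum(1 for d in durations_b if d >= t)
        # Events observed at time t in each group.
        d_a = sum(1 for d, e in zip(durations_a, events_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        # Under the null, events split in proportion to the numbers at risk.
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("not enough information to compare the groups")
    statistic = (observed_a - expected_a) ** 2 / variance
    return statistic, chi2_sf_1df(statistic)


# Toy data (hypothetical): group A tends to fail earlier than group B.
stat, p = logrank_test([2, 4, 6], [True, True, True],
                       [5, 8, 10], [True, False, True])
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```

With only three subjects per group the toy comparison is far from conclusive; the point is the bookkeeping of observed and expected events at each event time, not the numerical result.
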
Departures of the observed counts from these expected counts provide evidence against the null hypothesis. The test produces a test statistic and a p-value. A small p-value (typically $p < 0.05$) indicates that a difference as large as the one observed would be unlikely if the survival curves were truly equal.

In the example Cox regression output, the "Score (logrank) test" line at the bottom reports the log-rank test result. The test statistic was 6.47 with p-value = 0.0110, providing evidence of a significant difference in survival between groups.

Modeling Survival with Covariates: The Cox Proportional-Hazards Model

Moving Beyond Group Comparisons

Kaplan-Meier curves and log-rank tests excel at comparing two or a few groups. But real data often contain multiple covariates (explanatory variables): age, treatment type, disease stage, gender, and more. How do you assess how multiple covariates jointly affect survival while properly accounting for censoring?

The Cox proportional-hazards model (also called Cox regression) extends survival analysis to the multiple-covariate setting. It relates the hazard function to a set of explanatory variables without requiring assumptions about the underlying distribution of survival times.

The Cox Model Structure

The Cox model specifies that the hazard for an individual with covariate vector $\mathbf{X}$ is:

$$h(t|\mathbf{X}) = h_0(t) \, e^{\beta^\top \mathbf{X}}$$

Breaking this down:

- $h_0(t)$ is the baseline hazard function: the instantaneous risk at time $t$ when all covariates equal zero (or their reference levels). The Cox model does not require you to specify what $h_0(t)$ is; you do not assume it follows a particular distribution. This makes the Cox model semi-parametric (partially specified).
- $e^{\beta^\top \mathbf{X}}$ is the relative risk multiplier: the exponential of the linear combination of coefficients and covariates. This term captures how covariates adjust the baseline hazard.

The model assumes covariates act multiplicatively on the baseline hazard.

Interpreting Coefficients and Hazard Ratios

Each coefficient $\beta_i$ in the model quantifies the change in log-hazard associated with a one-unit increase in covariate $i$, holding other covariates constant. More intuitively, the hazard ratio for covariate $i$ is:

$$HR_i = e^{\beta_i}$$

The hazard ratio is the factor by which the hazard is multiplied for each one-unit increase in the covariate.

- HR = 1.5: A one-unit increase in the covariate multiplies the hazard by 1.5, meaning the risk of the event increases by 50%.
- HR = 0.8: A one-unit increase multiplies the hazard by 0.8, meaning the risk decreases by 20%.
- HR = 1: No effect; a one-unit increase doesn't change the hazard.

For example, suppose you fit a Cox model to survival after cancer diagnosis with age as a covariate, and you estimate $\beta_{\text{age}} = 0.05$. Then $HR_{\text{age}} = e^{0.05} \approx 1.05$. This means each additional year of age multiplies the death hazard by 1.05, roughly a 5% increase in risk per year.

In the example output, the "sex" variable has $\text{coef} = 0.662$, so $HR = e^{0.662} \approx 1.94$. The 95% confidence interval for the hazard ratio is (1.15, 3.26). This suggests that one group (encoded as the higher category for sex) has roughly 1.94 times the hazard of the reference group, a substantial increase in risk.

The Proportional-Hazards Assumption

A crucial assumption underlying the Cox model is the proportional-hazards assumption: the hazard ratio between any two individuals remains constant over time. In other words, if one person's hazard is twice another's at time $t$, it is twice the other's at every other time $t'$ as well.

This assumption is plausible in many applications but not universal. For example, if a treatment has side effects early but long-term benefits, the hazard ratio might not be constant: the treatment might appear harmful initially but beneficial later.

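To see why the Cox model implies a constant hazard ratio, compare two individuals with covariate vectors $\mathbf{X}_1$ and $\mathbf{X}_2$ (labels introduced here only for illustration). Dividing their hazards under the model, the baseline hazard cancels:

$$\frac{h(t|\mathbf{X}_1)}{h(t|\mathbf{X}_2)} = \frac{h_0(t)\, e^{\beta^\top \mathbf{X}_1}}{h_0(t)\, e^{\beta^\top \mathbf{X}_2}} = e^{\beta^\top (\mathbf{X}_1 - \mathbf{X}_2)}$$

The right-hand side does not depend on $t$, so the ratio is the same at every time point. When the two individuals differ by one unit in a single covariate $i$ and are otherwise identical, the expression reduces to $e^{\beta_i}$, the hazard ratio for that covariate.
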
Violation of the proportional-hazards assumption can be assessed using:

- Graphical diagnostics: Plotting $\log(-\log S(t))$ against $\log(t)$ for each group. Under proportional hazards, these "log-log survival curves" should be roughly parallel.
- Formal statistical tests: Various tests, such as those based on Schoenfeld residuals, check whether covariate effects change over time and so detect nonproportional hazards.

If the assumption is severely violated, alternative approaches (like stratified Cox models or time-varying coefficients) may be more appropriate.

<extrainfo>
Applications of Survival Analysis

Survival analysis is applied across numerous disciplines whenever researchers need to understand the timing of events in the presence of censoring.

- Medicine and Clinical Trials: Survival analysis evaluates time to outcomes such as death, disease recurrence, or treatment failure in clinical studies. This is the original and still most common application.
- Engineering and Reliability: In reliability engineering, survival methods assess the time until component failure or system breakdown. These methods are also called "failure-time analysis" in this context.
- Economics and Labor Studies: Economists use survival analysis to study durations such as time until unemployment ends, time to job turnover, or time to loan default.
- Social Sciences and Demography: Social scientists apply survival techniques to investigate durations like length of marriage, time to residential moves, or adoption of new technologies or behaviors.
</extrainfo>

Flashcards

What is the primary focus of survival analysis in terms of data type?

Time-to-event data.

How is the duration in time-to-event data defined?

As the duration from a defined starting point to the occurrence of the event of interest.

What does the survival function $S(t)$ represent?

The probability of surviving beyond time $t$.

What does the hazard function $h(t)$ describe?

The instantaneous risk of the event occurring at time $t$.

When does censoring occur in survival analysis?

When the exact event time is unknown but exceeds a certain follow-up time.

What is the most common form of censoring, occurring when a subject is still alive at the study's end?

Right-censoring.

What kind of function does the Kaplan–Meier estimator use to estimate the survival function $S(t)$?

A step function.

How does the Kaplan–Meier curve behave between observed event times?

It remains flat.

How do censored observations affect the Kaplan–Meier survival estimate?

They reduce the number at risk but do not cause a drop in the estimate.

What is the purpose of the log-rank test?

To provide a formal statistical test for equality of survival curves between groups.

What does the log-rank test compare to determine differences between groups?

The observed number of events versus the expected number of events under the null hypothesis.

How do covariates affect the baseline hazard function in a Cox model?

Multiplicatively.

Why is the Cox model considered a semi-parametric approach?

Because it does not require explicit specification of the baseline hazard function $h_0(t)$.

What is the mathematical formula for the hazard in a Cox model with covariate vector $\mathbf{X}$?

$h(t|\mathbf{X}) = h_0(t)\,e^{\beta^\top \mathbf{X}}$ (where $h_0(t)$ is the baseline hazard and $\beta$ is the coefficient vector).

What does the baseline hazard $h_0(t)$ describe?

The instantaneous risk when all covariates are set to zero.

How is the hazard ratio ($HR$) calculated from the Cox model coefficient $\beta$?

$HR = e^{\beta}$.

What does a hazard ratio ($HR$) greater than 1 signify?

Increased risk of the event.

What does a hazard ratio ($HR$) less than 1 signify?

Reduced risk of the event.

What is required by the proportional-hazards assumption in survival modeling?

That hazard ratios remain constant over time.

Key Concepts

Survival Analysis Concepts

Survival analysis

Time-to-event data

Censoring

Survival function

Hazard function

Statistical Methods

Kaplan–Meier estimator

Log‑rank test

Cox proportional‑hazards model

Hazard ratio

Proportional hazards assumption

Definitions

Survival analysis

A branch of statistics that models the time until an event of interest occurs.

Time-to-event data

Observations that record the duration from a defined start point to the occurrence of a specific event.

Censoring

A condition where the exact event time is unknown but is known to exceed a certain follow‑up time.

Survival function

The probability that an individual survives beyond a given time point.

Hazard function

The instantaneous risk of the event occurring at a particular time, conditional on survival up to that time.

Kaplan–Meier estimator

A non‑parametric method that estimates the survival function using observed event times and censored observations.

Log‑rank test

A statistical test of the null hypothesis that survival curves are equal across two or more groups.

Cox proportional‑hazards model

A semi‑parametric regression model that relates covariates to the hazard function under the proportional‑hazards assumption.

Hazard ratio

The exponentiated coefficient from a Cox model, giving the factor by which the hazard is multiplied per one-unit increase in a covariate.

Proportional hazards assumption

The requirement that hazard ratios between groups remain constant over time in a Cox model.