Foundations of Survival Analysis

Understand survival analysis basics, the core survival and hazard functions, and how to handle censored data.

Summary

Introduction to Survival Analysis

What is Survival Analysis?

Survival analysis is a collection of statistical methods for analyzing data where the outcome of interest is the time until an event occurs. The "event" doesn't have to be death: it could be disease occurrence, machine failure, contract termination, or any other well-defined outcome you want to study.

The field goes by different names depending on the discipline. Engineers call it reliability analysis, economists call it duration modeling, and sociologists call it event history analysis. Despite the different terminology, they all use the same mathematical and statistical tools.

The key assumption in standard survival analysis is that each subject experiences at most one event of interest. If you want to study recurring events (like repeated hospital admissions), you would use specialized recurrent-event models, but those are beyond the scope here.

What Can Survival Analysis Do?

Survival analysis has three main applications:

Describe survival patterns in a group using tools like life tables, Kaplan–Meier curves, the survival function, and the hazard function.
Compare survival between two or more groups using tests like the log-rank test.
Model how variables affect survival using methods like Cox proportional hazards regression, parametric survival models, survival trees, or survival random forests. These allow you to evaluate the effect of both categorical variables (like treatment group) and quantitative variables (like age or disease severity).

Essential Terminology

Before diving into the mathematics, you need to understand four key concepts:

Event: The outcome of interest you're studying. Common examples include death, disease occurrence, disease recurrence, recovery, or system failure. Whatever you choose as your event must be well-defined so there's no ambiguity about whether it has occurred.
Time: The follow-up period from when observation begins until the event occurs, the study ends, the subject is lost to follow-up, or the subject withdraws. In survival analysis, time is always measured from a well-defined starting point.
Censoring: The crucial concept that sets survival analysis apart from ordinary statistics. Censoring occurs when a subject's event time is only partially known. In the most common case, we know the event had not happened by the last time the subject was observed (the subject was still "at risk"), but we don't know when it happens afterwards. The key insight: a censored subject still gives us useful information, because we know they survived at least until the censoring time. After the censoring time we have no data, because the subject is no longer being observed.
Survival function S(t): The probability that a subject survives (i.e., the event has not yet occurred) beyond time $t$. It's the core object of interest in survival analysis.

The Mathematics of Survival Analysis

The Survival Function S(t)

The survival function is defined as:

$$S(t) = \Pr(T > t)$$

where $T$ is the random time at which the event occurs. In other words, $S(t)$ answers the question: "What's the probability that the event hasn't happened by time $t$?"

The survival function has two important properties:

S(0) = 1: At the start of observation, no one has yet experienced the event, so the probability of surviving beyond time 0 is 1.
Non-increasing: As time increases, the survival probability cannot increase. Mathematically, if $u \geq t$, then $S(u) \leq S(t)$. This makes intuitive sense: anyone who survives to time $u$ must also have survived to the earlier time $t$.

The survival function typically decreases from 1 toward 0 as time increases. In practice, an estimated curve may level off at a positive value if some subjects never experience the event during the study period.

The Lifetime Distribution Function F(t)

While $S(t)$ gives the probability of surviving beyond time $t$, we also need a function for the probability that the event has occurred by time $t$.
This is the cumulative distribution function of event times, defined as:

$$F(t) = 1 - S(t)$$

So $F(t) = \Pr(T \leq t)$ is the probability that the event occurs by time $t$. The relationship $F(t) = 1 - S(t)$ simply says that either the event has happened by time $t$ or it hasn't, so the two probabilities must sum to 1.

The Event Density f(t)

If we assume $F(t)$ is smooth (differentiable), we can define the event density as:

$$f(t) = \frac{dF(t)}{dt}$$

This is the probability density function of event times. It tells us the relative likelihood of the event occurring at different times. You'll rarely work directly with $f(t)$ in practice, though; it's more useful as a mathematical building block.

The Hazard Function h(t)

The hazard function is arguably the most important quantity in survival analysis:

$$h(t) = \frac{f(t)}{S(t)}$$

This is the instantaneous event rate given that the subject has survived to time $t$. It is a rate rather than a probability: for a small interval $\Delta t$, the product $h(t)\,\Delta t$ approximates the conditional probability of experiencing the event within the next $\Delta t$ units of time, given survival to time $t$.

The hazard function is particularly useful because:

It directly measures risk over time: an increasing hazard means risk is going up.
Different diseases or failure modes have characteristic hazard shapes. For example, infant mortality in humans shows a decreasing hazard (risk falls as the infant gets older), while mortality from aging in adults shows an increasing hazard.
It's the foundation for the Cox proportional hazards model, one of the most popular methods for analyzing survival data.

The hazard function is always non-negative and has no upper bound. Unlike the survival function (which must lie between 0 and 1), a hazard can exceed 1 because it is a rate per unit time.

The Cumulative Hazard H(t)

Just as the event density accumulates into the distribution function, the hazard function accumulates over time into the cumulative hazard:

$$H(t) = \int_0^t h(u) \, du$$

The cumulative hazard $H(t)$ measures the total accumulated risk from time 0 to time $t$.
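
The definitions of $h(t)$ and $H(t)$ can be checked numerically. The sketch below is only an illustration under an assumed model: it picks a Weibull distribution (not something the text prescribes) with arbitrary `shape` and `scale` values, computes the hazard as $f(t)/S(t)$, and integrates it with a simple midpoint rule. All function names are hypothetical.

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape) for an assumed Weibull model."""
    return math.exp(-((t / scale) ** shape))

def weibull_density(t, shape, scale):
    """f(t) = dF/dt = -dS/dt for the same Weibull model."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survival(t, shape, scale)

def hazard(t, shape, scale):
    """h(t) = f(t) / S(t): event rate among subjects still at risk at time t."""
    return weibull_density(t, shape, scale) / weibull_survival(t, shape, scale)

def cumulative_hazard(t, shape, scale, steps=10_000):
    """H(t) = integral of h(u) du from 0 to t, approximated by the midpoint rule."""
    width = t / steps
    return sum(hazard((k + 0.5) * width, shape, scale) for k in range(steps)) * width

# Illustrative parameters: shape > 1 gives a hazard that increases with time.
shape, scale = 1.5, 10.0

hazards = [hazard(t, shape, scale) for t in (1.0, 5.0, 10.0)]
H_numeric = cumulative_hazard(5.0, shape, scale)
H_closed = (5.0 / scale) ** shape  # known closed form of the Weibull cumulative hazard

print("hazard at t = 1, 5, 10:", hazards)
print("H(5) by numerical integration:", H_numeric)
print("H(5) from the closed form:    ", H_closed)
```

Choosing `shape` below 1 instead would give a decreasing hazard, the pattern described above for infant mortality.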

A crucial mathematical relationship connects the cumulative hazard back to the survival function:

$$S(t) = \exp(-H(t))$$

It follows because $h(t) = f(t)/S(t) = -\frac{d}{dt}\log S(t)$, so integrating from 0 to $t$ and using $S(0) = 1$ gives $H(t) = -\log S(t)$. In words, the survival probability decays exponentially with accumulated risk. This identity appears throughout survival analysis formulas and models.

Working With Survival Distributions

Quantiles: Summarizing Survival with a Single Number

Just as you might describe a distribution with its median, you can summarize a survival distribution using quantiles. The $q$-th quantile lifetime $t_q$ is defined such that:

$$S(t_q) = 1 - q$$

In other words, $t_q$ is the time by which a fraction $q$ of the population has experienced the event.

The most commonly used quantiles are:

The median survival time ($q = 0.5$): the time by which half the population has experienced the event.
High quantiles such as $q = 0.90$ or $q = 0.99$: the time by which 90% or 99% of the population has experienced the event, or equivalently, the time beyond which only 10% or 1% remain event-free.

Quantiles are useful because they give a single interpretable number (for example, "median survival is 5 years", meaning half of patients survive beyond 5 years) rather than requiring you to look at a full curve.

<extrainfo>

Expected Future Lifetime

Another summary measure is the expected future lifetime at age $t$, also called the mean residual lifetime. This is the expected amount of additional time a subject will live, given that they've already survived to time $t$:

$$E[T - t \mid T > t] = \frac{\int_t^{\infty} S(u) \, du}{S(t)}$$

This is less commonly used than quantiles but appears in some applications. The key insight is that it depends on the entire survival function beyond time $t$, not just a single point.

</extrainfo>

Understanding Censoring and Missing Data

This is where survival analysis becomes special. In most statistical analyses, incomplete data is a problem. In survival analysis, incomplete data is expected and is handled explicitly through censoring.
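
To see what such incomplete data looks like, here is a minimal simulation sketch. It assumes exponentially distributed event times and a study that stops observing everyone at a fixed time; `TRUE_MEAN`, `STUDY_LENGTH`, and the record layout `(observed_time, event_observed)` are illustrative choices, not part of the source. It also shows why censored subjects cannot simply be dropped: averaging only the observed event times understates the true mean.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_MEAN = 10.0     # assumed mean event time (arbitrary time units)
STUDY_LENGTH = 8.0   # assumed follow-up: observation stops at this time

records = []
for _ in range(5000):
    event_time = random.expovariate(1.0 / TRUE_MEAN)  # latent event time T
    observed_time = min(event_time, STUDY_LENGTH)     # what the study actually records
    event_observed = event_time <= STUDY_LENGTH       # False means the subject is censored
    records.append((observed_time, event_observed))

n_events = sum(1 for _, seen in records if seen)
n_censored = len(records) - n_events

# Naive summary that throws the censored subjects away.
naive_mean = sum(t for t, seen in records if seen) / n_events

print("events observed:", n_events, "censored:", n_censored)
print("mean of observed event times only:", round(naive_mean, 2))
print("true mean event time:", TRUE_MEAN)
```

Every censored record still carries the information "the event time exceeds `STUDY_LENGTH`", which is exactly what a survival-analysis method is designed to use.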

Right Censoring: The Most Common Type

Right censoring is the most common censoring mechanism. It occurs when you know the event time $T$ is greater than some observed time $C$, but you don't know its exact value.

The typical scenario:

A study ends on a fixed date.
Some subjects haven't experienced the event yet.
You know they survived at least until the study end date, but not when they'll experience the event.

For example, in a cancer survival study, a patient who enrolled 5 years before the study closed is still alive at study end. You know their survival time is at least 5 years, but not whether they'll live to 6 or 10 years.

Right censoring is called "right" because the unknown part of the timeline lies to the right: we know a lower bound on the survival time but no upper bound.

Left Censoring: Events Before Observation Begins

Left censoring occurs when the event happened before observation began, but you don't know exactly when. This is less common but appears in specific contexts.

A dental example: you're studying when children's permanent teeth erupt. You examine a child and find a tooth has already erupted, but you don't know when the eruption occurred, only that it was sometime before the examination.

Left censoring means the unknown part lies to the left of the observation timeline: we know an upper bound on the event time but no lower bound.

Interval Censoring: Events Between Observations

Interval censoring occurs when you know the event occurred between two observation times, but you don't know exactly when.

A classic example is HIV seroconversion studies: you test subjects at regular intervals. If a subject is HIV-negative on one test and positive on the next, you know seroconversion occurred between these tests, but not exactly when.

Truncation: A Different Kind of Incomplete Data

Be careful not to confuse truncation with censoring. Truncation occurs when subjects whose event times fall outside a specified range are never observed at all.

The most important example is left truncation (also called delayed entry). This happens when subjects can only enter the study if they haven't yet experienced the event. For example:

A study of retirement age can only include people who are currently employed.
People who retired before the study started were never included in the data.
The study only observes people whose event times fall after they entered the study.

The key difference from censoring: a censored subject is observed, and we know, for example, that they survived at least until the censoring time. With left truncation, subjects who would have had early events were never included in the dataset, so there's no way to know about them.

How We Use Censored Data: Likelihood Construction

Here's the strength of survival analysis: we can write down a likelihood function that incorporates all types of censoring. The key insight is that different types of observations contribute different factors to the likelihood.

The Principle

Assuming independent observations, the overall likelihood is:

$$L = \prod_{i=1}^{n} L_i$$

where $L_i$ is the likelihood contribution from observation $i$. Different types of observations contribute differently.

For uncensored data (we observed the exact event time $t_i$):

$$L_i = f(t_i)$$

The contribution is the probability density evaluated at the observed event time.

For right-censored data (the event had not happened by time $r_i$):

$$L_i = S(r_i)$$

The contribution is the survival probability at the censoring time, that is, the probability of surviving beyond $r_i$.

For left-censored data (the event happened before time $l_i$):

$$L_i = F(l_i)$$

The contribution is the cumulative distribution function, the probability that the event occurred by the censoring time.

For interval-censored data (the event occurred between times $l_i$ and $u_i$):

$$L_i = F(u_i) - F(l_i)$$

The contribution is the probability of the event occurring in the interval, $\Pr(l_i < T \leq u_i)$, which is the difference between the distribution function at the two endpoints.

This likelihood framework is powerful because it naturally handles mixed censoring: some subjects might have uncensored event times, others might be right-censored, and others might be interval-censored, all in the same analysis.

Summary of Key Concepts

You now understand the foundational concepts of survival analysis:

What it is: Statistical analysis of time-to-event data.
The survival function S(t): The probability of not experiencing the event by time $t$.
The hazard function h(t): The instantaneous rate of the event at time $t$, given survival to that point.
The relationship: $S(t) = \exp(-H(t))$ connects cumulative hazard to survival probability.
Censoring types: Right, left, and interval censoring, each requiring a different likelihood contribution.
Why censoring matters: Incomplete data is expected and still carries usable information, so it is not discarded.
Truncation: A fundamentally different concept, where some subjects are never observed at all.

These concepts provide the foundation for all the methods you'll learn for estimating survival curves, comparing groups, and building models to understand how variables affect survival.
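
The four likelihood contributions described in the likelihood construction section can be sketched in code. The example below assumes an exponential model with a single `rate` parameter, so that $S(t) = e^{-\text{rate}\cdot t}$. This model is chosen only for illustration, and the record format `(kind, a, b)`, the function names, and the toy data are all hypothetical. The rate is estimated with a plain ternary search, which works here because this log-likelihood is concave in the rate.

```python
import math

def log_likelihood(rate, data):
    """Log-likelihood of an exponential model under mixed censoring.

    Each record is (kind, a, b):
      "exact"    event observed at time a      -> contributes f(a)
      "right"    event not yet seen by time a  -> contributes S(a)
      "left"     event happened before time a  -> contributes F(a)
      "interval" event between a and b         -> contributes F(b) - F(a)
    """
    def S(t):
        return math.exp(-rate * t)

    def F(t):
        return 1.0 - S(t)

    def f(t):
        return rate * S(t)

    total = 0.0
    for kind, a, b in data:
        if kind == "exact":
            contribution = f(a)
        elif kind == "right":
            contribution = S(a)
        elif kind == "left":
            contribution = F(a)
        elif kind == "interval":
            contribution = F(b) - F(a)
        else:
            raise ValueError(f"unknown observation kind: {kind}")
        total += math.log(contribution)  # log of a product = sum of logs
    return total

def maximize_rate(data, low=1e-6, high=5.0, iterations=200):
    """Ternary search for the rate that maximizes the concave log-likelihood."""
    for _ in range(iterations):
        third = (high - low) / 3.0
        left, right = low + third, high - third
        if log_likelihood(left, data) < log_likelihood(right, data):
            low = left
        else:
            high = right
    return (low + high) / 2.0

# Toy data (made up for illustration) mixing all four observation types.
observations = [
    ("exact", 2.0, None),
    ("exact", 5.0, None),
    ("right", 8.0, None),
    ("left", 1.0, None),
    ("interval", 3.0, 6.0),
]

rate_hat = maximize_rate(observations)
print("estimated rate:", rate_hat)
```

As a sanity check on the sketch: when the data contain only exact and right-censored records, the exponential maximum-likelihood estimate has the closed form (number of observed events) / (total follow-up time), and `maximize_rate` should reproduce it.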

Flashcards

What is the primary focus of study in survival analysis?

The time until a defined event (such as death or system failure) occurs.

What are three alternative names for survival analysis used in engineering, economics, and sociology?

Reliability analysis (engineering), duration modeling (economics), and event history analysis (sociology).

How many events per subject does survival analysis typically assume?

At most one well-defined event of interest per subject.

Which four tools are commonly used to describe the survival times of a group?

Life tables, Kaplan–Meier curves, the survival function, and the hazard function.

Which statistical test is used to compare the survival times of two or more groups?

Log-rank test.

In the context of survival analysis, what is an 'event'?

An outcome of interest such as death, disease occurrence, recovery, or failure.

How is 'time' defined in a survival study?

The interval from the start of observation to an event, study end, loss to follow‑up, or withdrawal.

What does 'censoring' represent in a survival dataset?

A situation in which a subject's exact event time is unknown but partial information about it is available (for example, that it exceeds the last observed time).

When does right censoring occur?

When the event time is known only to exceed a lower bound (e.g., subject is alive at study end).

When does left censoring occur?

When the event happened before study entry but the exact time is unknown.

When does interval censoring occur?

When the event is known to lie between two specific observation times.

What is the formal definition of the survival function $S(t)$?

$S(t) = \Pr(T > t)$ (the probability that a subject survives longer than time $t$).

What are the two mathematical properties that the survival function $S(t)$ must satisfy?

$S(0) = 1$, and $S$ is non-increasing ($S(u) \le S(t)$ for $u \ge t$).

How is the lifetime distribution function $F(t)$ related to the survival function $S(t)$?

$F(t) = 1 - S(t)$.

How is the event density $f(t)$ calculated from the lifetime distribution function $F(t)$?

$f(t) = \frac{dF(t)}{dt}$ (the derivative of the distribution function).

What is the definition of the hazard function $h(t)$ in terms of $f(t)$ and $S(t)$?

$h(t) = \frac{f(t)}{S(t)}$.

What does the hazard function $h(t)$ represent conceptually?

The instantaneous event rate conditional on survival to time $t$.

How is the cumulative hazard $H(t)$ calculated from the hazard function $h(u)$?

$H(t) = \int_{0}^{t} h(u) \, du$.

How is the survival function $S(t)$ expressed in terms of the cumulative hazard $H(t)$?

$S(t) = \exp(-H(t))$.

What is the formula for the expected future lifetime (mean residual life) at age $t$?

$E[T - t \mid T > t] = \frac{\int_{t}^{\infty} S(u) \, du}{S(t)}$.

What equation defines the $q$-th quantile lifetime $t_q$ in survival analysis?

$S(t_q) = 1 - q$.

What is the definition of 'truncation' in survival data?

When subjects whose event times fall outside a specified range are never observed at all.

What is left truncation (delayed entry)?

The exclusion of individuals who experience the event (e.g., die) before they can enter the study.

What is the likelihood contribution for an uncensored observation at time $t_i$?

$f(t_i)$ (the event density).

What is the likelihood contribution for a right-censored observation at time $r_i$?

$S(r_i)$ (the survival function).

What is the likelihood contribution for a left-censored observation at time $l_i$?

$F(l_i)$ (the distribution function).

What is the likelihood contribution for an interval-censored observation between $l_i$ and $u_i$?

$F(u_i) - F(l_i)$.

Quiz

Foundations of Survival Analysis Quiz Question 1: What does survival analysis study?

Time until a defined event such as death or failure occurs (correct)
Relationships among variables in a cross‑sectional dataset
Distribution of static measurements taken at a single time point
Frequency of categorical outcomes without a time component

Question 2: What characterizes right censoring in survival data?

The event time is known only to exceed a lower bound (correct)
The event time is known only to be less than an upper bound
The event is known to lie between two observation times
The subject was never observed in the study

Question 3: In the likelihood for censored data, what term represents a right‑censored observation with censoring time rᵢ?

S(rᵢ) (correct)
f(rᵢ)
F(rᵢ)
F(uᵢ) − F(lᵢ)

Question 4: Which scenario best illustrates left censoring?

The event occurred before study entry and the exact time is unknown (correct)
The exact event time is known to lie between two observation times
A subject is lost to follow‑up after a known event time
A subject enters the study after the event has already been recorded

Question 5: In survival analysis, what is meant by the future lifetime at age t?

The remaining time T – t given survival to time t (correct)
The total lifetime T regardless of survival status
The time until study end irrespective of the event
The probability of experiencing the event after time t

Question 6: Which assumption enables the overall likelihood for censored survival data to be expressed as a product of individual contributions?

Observations are independent (correct)
Observations are identically distributed
The hazard function is constant over time
Censoring is informative

Question 7: Which test is routinely employed to assess whether two or more groups have different survival experiences?

Log‑rank test (correct)
Wilcoxon signed‑rank test
Chi‑square test for independence
Student’s t‑test

Question 8: How is the cumulative distribution function F(t) related to the survival function S(t)?

F(t) = 1 – S(t) (correct)
F(t) = S(t) – 1
F(t) = S(t) × 1
F(t) = log (S(t))

Question 9: How can the survival function S(t) be expressed in terms of the cumulative hazard H(t)?

S(t) = exp(−H(t)) (correct)
S(t) = 1 − H(t)
S(t) = H(t) / (1 + H(t))
S(t) = ln(1 + H(t))

Key Concepts

Survival Analysis Concepts

Survival analysis

Survival function

Hazard function

Cumulative hazard function

Mean residual life

Statistical Methods and Models

Censoring

Kaplan–Meier estimator

Cox proportional hazards model

Log‑rank test

Truncation (survival analysis)

Definitions

Survival analysis

Statistical methods for analyzing the time until an event of interest occurs, such as death or system failure.

Censoring

A condition in which the exact event time is unknown but partial information is available, common in survival data.

Survival function

The probability that a subject survives longer than a specified time t, denoted S(t).

Hazard function

The instantaneous event rate at time t conditional on survival up to t, denoted h(t).

Kaplan–Meier estimator

A non‑parametric method for estimating the survival function from censored data.

Cox proportional hazards model

A semi‑parametric regression model that relates covariates to the hazard function assuming proportional hazards.

Log‑rank test

A statistical test for comparing the survival distributions of two or more groups.

Cumulative hazard function

The integral of the hazard function over time, denoted H(t), related to the survival function by S(t)=exp(–H(t)).

Mean residual life

The expected remaining lifetime given survival to a certain time, calculated as E[T–t | T>t].

Truncation (survival analysis)

A sampling scheme where subjects with event times outside a specified range are never observed, e.g., left truncation (delayed entry).