Survival Analysis Study Guide
📖 Core Concepts
Survival function $S(t)=\Pr(T>t)$ – probability a subject lives past time $t$. Starts at 1, never increases.
Hazard function $h(t)=\dfrac{f(t)}{S(t)}$ – instantaneous event rate conditional on having survived to $t$. Not a probability.
Cumulative hazard $H(t)=\int_0^t h(u)\,du$ and the link $S(t)=\exp[-H(t)]$.
Censoring – exact event time unknown, but we know it lies after (right), before (left), or between (interval) a certain point. Censored subjects contribute information up to the censoring time.
Kaplan–Meier (KM) estimator – step‑wise product of survival probabilities using both censored and uncensored observations.
Log‑rank test – compares two (or more) KM curves; test statistic $\chi^2_1$ under $H_0$: identical survival functions.
Cox proportional hazards (PH) model
$$h_i(t)=h_0(t)\exp(\beta_1 x_{i1}+\dots+\beta_p x_{ip})$$
– assumes hazard ratios are constant over time.
Proportional‑hazards assumption can be checked with scaled Schoenfeld residuals (cox.zph()); $p<0.05$ ⇒ violation.
Parametric survival models (exponential, Weibull, log‑logistic, generalized gamma…) impose a functional form on $h(t)$.
Stratified Cox – separate baseline hazards $h_{0k}(t)$ for each stratum, common $\beta$’s.
Time‑varying covariates – covariate values that change during follow‑up are entered as $x_i(t)$.
Survival trees / random forests – non‑linear, interaction‑rich alternatives that aggregate many trees for stable predictions.
Cure models – mixture of a cured fraction (plateau of $S(t)$) and a susceptible component with its own hazard.
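The link between $h$, $H$, and $S$ can be checked numerically. A minimal Python sketch, using a Weibull hazard with shape 2 and scale 1 (an arbitrary illustrative choice, so $h(t)=2t$, $H(t)=t^2$, and $S(t)=e^{-t^2}$):

```python
import math

def hazard(t):
    return 2.0 * t                      # h(t) for Weibull(shape=2, scale=1)

def cum_hazard(t, steps=10_000):
    """H(t) = integral of h(u) du on [0, t], via the trapezoidal rule."""
    dt = t / steps
    grid = [i * dt for i in range(steps + 1)]
    return dt * (sum(hazard(u) for u in grid) - 0.5 * (hazard(0) + hazard(t)))

t = 1.3
S_numeric = math.exp(-cum_hazard(t))    # S(t) = exp(-H(t))
S_closed = math.exp(-t**2)              # closed-form Weibull survival
print(S_numeric, S_closed)              # the two agree up to float error
```

The agreement is a direct consequence of $S(t)=\exp[-H(t)]$; any valid hazard would give the same match.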
---
📌 Must Remember
KM estimate: $\hat S(t)=\prod_{j:\,t_j\le t}\dfrac{n_j-d_j}{n_j}$ where $n_j$ = at‑risk, $d_j$ = events at $t_j$.
Greenwood’s SE: $\hat{\operatorname{Var}}[\hat S(t)]=\hat S(t)^2\sum_{j:\,t_j\le t}\dfrac{d_j}{n_j(n_j-d_j)}$.
Log‑rank statistic: $\displaystyle \chi^2=\frac{\big(\sum(O_i-E_i)\big)^2}{\sum V_i}$, $df=1$ for two groups.
Cox partial likelihood: only the order of events matters; the baseline hazard $h_0(t)$ cancels out.
Hazard ratio (HR) interpretation: $HR=e^{\beta}$ = factor by which the hazard multiplies for a one‑unit increase in the predictor (assuming PH).
Right censoring is the default; left‑truncation = delayed entry, subjects counted only after they become “at risk”.
Exponential ⇢ constant $h(t)$; Weibull shape $>1$ = increasing hazard, $<1$ = decreasing.
AIC: $\text{AIC}= -2\log L + 2k$; lower AIC ⇒ better fit (but only among comparable models).
Cure fraction $p_c=\lim_{t\to\infty}S(t)$; the survival curve plateaus at $p_c$.
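The KM product and the Greenwood running sum are simple enough to compute by hand. A minimal Python sketch on a small right-censored sample (the `(time, event)` pairs are hypothetical, with `event=1` for an observed death and `0` for censoring):

```python
# Toy data: (time, event). Censored subjects stay in the risk set
# until their censoring time, then drop out.
data = [(2, 1), (3, 1), (3, 1), (5, 0), (7, 1), (8, 0)]

times = sorted({t for t, e in data if e == 1})      # distinct event times
S, var_sum = 1.0, 0.0
for tj in times:
    n_j = sum(1 for t, e in data if t >= tj)        # at risk just before t_j
    d_j = sum(1 for t, e in data if t == tj and e)  # deaths at t_j
    S *= (n_j - d_j) / n_j                          # KM product term
    var_sum += d_j / (n_j * (n_j - d_j))            # Greenwood running sum
    print(f"t={tj}: n={n_j}, d={d_j}, S={S:.4f}, SE={S * var_sum**0.5:.4f}")
```

Note how the subject censored at $t=5$ contributes to $n_j$ at $t_j=3$ but not at $t_j=7$: censored observations carry information up to, and only up to, their censoring time.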
---
🔄 Key Processes
Build a KM curve
Sort times, compute $n_j$, $d_j$.
Multiply successive survival factors.
Plot steps; add a Greenwood SE‑based 95% CI.
Perform a log‑rank test
At each event time compute expected events $E_i$ under $H_0$.
Sum $(O_i-E_i)$ and the variance terms $V_i$.
Compute $\chi^2$ and compare to $\chi^2_{1,\alpha}$.
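The steps above can be sketched directly in Python; the `(time, event, group)` triples are hypothetical:

```python
# Two-sample log-rank statistic on toy data: (time, event, group).
data = [(1, 1, 0), (2, 1, 0), (4, 0, 0), (5, 1, 0),
        (2, 1, 1), (3, 1, 1), (6, 0, 1), (7, 1, 1)]

event_times = sorted({t for t, e, g in data if e})
U, V = 0.0, 0.0                         # running sum of (O - E) and its variance
for tj in event_times:
    n  = sum(1 for t, e, g in data if t >= tj)             # at risk, both groups
    n1 = sum(1 for t, e, g in data if t >= tj and g == 0)  # at risk, group 0
    d  = sum(1 for t, e, g in data if t == tj and e)       # events at t_j
    o1 = sum(1 for t, e, g in data if t == tj and e and g == 0)
    U += o1 - n1 * d / n                                   # O_1j - E_1j
    if n > 1:                                              # hypergeometric variance
        V += n1 * (n - n1) * d * (n - d) / (n**2 * (n - 1))
chi2 = U**2 / V
print(f"chi-square (df=1): {chi2:.3f}")
```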
Fit a Cox PH model
Specify covariates (continuous, dummy for categorical).
Use partial likelihood to estimate $\beta$.
Check PH assumption: cox.zph() → plot Schoenfeld residuals; $p<0.05$ ⇒ violation.
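For a single binary covariate with no tied event times, the partial log-likelihood is simple enough to maximize by brute force. A sketch on made-up data (a real analysis would use `coxph`, which applies Newton–Raphson and tie corrections):

```python
import math

# Hypothetical (time, event, x) rows; x is a binary covariate.
data = [(1, 1, 1), (2, 1, 0), (4, 0, 1), (5, 1, 1), (6, 1, 0), (9, 0, 0)]

def partial_loglik(beta):
    """Cox partial log-likelihood: only event ordering matters, h_0(t) cancels."""
    ll = 0.0
    for ti, ei, xi in data:
        if not ei:
            continue                                   # censored rows add no term
        risk = [x for t, e, x in data if t >= ti]      # covariates in risk set at t_i
        ll += beta * xi - math.log(sum(math.exp(beta * x) for x in risk))
    return ll

# Crude maximization by grid search, good enough for a sketch.
betas = [b / 100 for b in range(-300, 301)]
beta_hat = max(betas, key=partial_loglik)
print(f"beta_hat = {beta_hat:.2f}, HR = {math.exp(beta_hat):.2f}")
```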
Add stratification / time‑varying covariates
Stratify: coxph(Surv(time, status) ~ x + strata(strataVar), data).
Time‑varying: reshape data into start–stop (counting‑process) intervals and include $x_i(t)$.
Construct likelihood for censored data
Uncensored: contribute $f(t_i)$.
Right‑censored: contribute $S(r_i)$.
Left‑censored: contribute $F(l_i)$.
Interval‑censored: contribute $F(u_i)-F(l_i)$.
Multiply over all subjects (or sum log‑likelihood) for MLE.
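For the exponential model the censored likelihood has a closed-form maximizer: events contribute $f(t_i)=\lambda e^{-\lambda t_i}$ and censored subjects $S(t_i)=e^{-\lambda t_i}$, so $\log L(\lambda)=d\log\lambda-\lambda\sum t_i$ and $\hat\lambda=d/\sum t_i$. A sketch on hypothetical `(time, event)` data, also computing the AIC from the Must Remember section:

```python
import math

data = [(2, 1), (3, 1), (3, 1), (5, 0), (7, 1), (8, 0)]   # (time, event)

d = sum(e for _, e in data)                # number of observed events
total_time = sum(t for t, _ in data)       # total follow-up, events + censored
lam_hat = d / total_time                   # closed-form MLE

def loglik(lam):
    return d * math.log(lam) - lam * total_time

# Confirm lam_hat beats nearby values, then compute AIC = -2 logL + 2k (k = 1).
assert loglik(lam_hat) >= max(loglik(lam_hat * 0.9), loglik(lam_hat * 1.1))
aic = -2 * loglik(lam_hat) + 2 * 1
print(f"lam_hat = {lam_hat:.4f}, AIC = {aic:.2f}")
```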
Fit a parametric model
Choose distribution (e.g., Weibull).
Use MLE (or built‑in survreg).
Diagnose with hazard plots or AIC.
Build a survival random forest
Draw many bootstrap samples, grow survival trees, average cumulative hazard or survival predictions.
---
🔍 Key Comparisons
KM vs. Nelson–Aalen – KM estimates $S(t)$ directly; Nelson–Aalen estimates $H(t)$ then $S(t)=e^{-H(t)}$.
Cox PH vs. Parametric – Cox needs no baseline hazard form (flexible), parametric gives explicit $h(t)$ and can extrapolate.
Right vs. Left vs. Interval censoring – direction of unknown time relative to observation window.
Proportional‑hazards vs. Accelerated Failure Time – PH scales hazards, AFT stretches the time axis.
Cure model vs. Standard survival – cure model allows $S(\infty)>0$; standard forces $S(\infty)=0$.
Survival tree vs. Random forest – single tree is interpretable but high variance; forest reduces variance by averaging.
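The KM vs. Nelson–Aalen comparison is easy to see on toy data (hypothetical values). Since $e^{-d_j/n_j}\ge 1-d_j/n_j$, the NA-based estimate $e^{-\hat H(t)}$ always sits at or above the KM step at the same time:

```python
import math

# Same toy (time, event) data for both estimators.
data = [(2, 1), (3, 1), (3, 1), (5, 0), (7, 1), (8, 0)]

H, S_km = 0.0, 1.0
for tj in sorted({t for t, e in data if e}):
    n_j = sum(1 for t, e in data if t >= tj)
    d_j = sum(1 for t, e in data if t == tj and e)
    H += d_j / n_j                       # Nelson-Aalen increment
    S_km *= (n_j - d_j) / n_j            # Kaplan-Meier factor
print(f"KM: {S_km:.4f}   exp(-H): {math.exp(-H):.4f}")
```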
---
⚠️ Common Misunderstandings
“Hazard = probability of event” – hazard is a rate per unit time and can exceed 1.
“Censored subjects provide no information” – they contribute survival information up to the censoring time.
“Log‑rank works even when curves cross” – crossing suggests non‑proportional hazards; the log‑rank test loses power and can be misleading.
“HR = risk ratio” – the HR compares instantaneous risks, not cumulative incidence.
“AIC alone decides the model” – also examine residuals and the plausibility of the hazard shape.
“The cure fraction is always 0 in medical studies” – many cancers exhibit a long‑term survival plateau.
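The first point deserves a two-line check: a hazard rate can exceed 1 while every probability derived from it stays in $[0,1]$. Using an exponential with rate 2 (an arbitrary choice):

```python
import math

lam = 2.0
h = lam                                  # constant hazard h(t) = 2, a rate > 1
p_event_by_1 = 1 - math.exp(-lam * 1.0)  # P(T <= 1) = 1 - S(1), a probability
print(h, round(p_event_by_1, 4))
```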
---
🧠 Mental Models / Intuition
Survival curve = “staircase of survivors” – each step drops exactly where an event occurs; censored points are just “ticks” on the stairs.
Hazard as a “speedometer” – tells how fast the clock is ticking at a given moment, given you’re still alive.
Cox partial likelihood = “who dies next?” – only the ordering of deaths matters, not the actual times.
Cure model plateau = “immune crowd” – imagine two groups: one that will eventually die, another that never does; the plateau height tells the size of the latter.
---
🚩 Exceptions & Edge Cases
Non‑proportional hazards – use stratified Cox, add interaction with time, or switch to AFT/parametric models.
Left truncation (delayed entry) – subjects enter risk set only after their entry time; must adjust risk set counts.
Interval censoring – standard KM fails; use Turnbull estimator or parametric likelihood.
Sparse events – chi‑square approximation for log‑rank may be inaccurate; consider exact tests or permutation.
High‑dimensional covariates – Cox may overfit; prefer regularized Cox (lasso) or survival random forests.
---
📍 When to Use Which
| Situation | Preferred Method |
|-----------|------------------|
| Estimate survival curve (no covariates) | Kaplan–Meier (with Greenwood CI) |
| Compare two groups, no covariates | Log‑rank test (or Wilcoxon if early differences matter) |
| Assess effect of several predictors (continuous & categorical) | Cox PH (check PH, add strata if needed) |
| Baseline hazard shape known / extrapolation needed | Parametric model (choose based on hazard plot, AIC) |
| Baseline hazards differ across a factor but covariate effects are common | Stratified Cox |
| Covariates change over follow‑up | Cox with time‑varying covariates |
| Many predictors, possible interactions, non‑linear effects | Survival random forest or survival tree |
| Evidence of cured subpopulation (plateau) | Cure model (mixture of logistic + hazard component) |
| Data are left‑truncated or interval‑censored | Turnbull estimator or likelihood‑based parametric approach |
---
👀 Patterns to Recognize
Step‑wise drops in KM only at event times – censored observations add tick marks on the curve, not steps.
Crossing KM curves – suspect PH violation; consider time‑dependent effects.
Straight line on log–negative–log plot – Weibull (slope equals the shape parameter).
Hazard curve rising then falling – log‑logistic or generalized gamma likely.
Flat tail of survival curve – possible cure fraction.
Large χ² but small number of events – check for sparse‑data bias.
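The log–negative–log pattern follows from algebra: for Weibull $S(t)=\exp[-(t/\lambda)^k]$, $\log[-\log S(t)]=k\log t-k\log\lambda$, a line in $\log t$ with slope $k$. A quick numeric check (the shape and scale values are arbitrary):

```python
import math

k, lam = 1.5, 2.0
S = lambda t: math.exp(-((t / lam) ** k))   # Weibull survival function

# Points on the log-negative-log plot and the slope between neighbors.
pts = [(math.log(t), math.log(-math.log(S(t)))) for t in (0.5, 1, 2, 4, 8)]
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
print([round(s, 6) for s in slopes])        # every pairwise slope equals k
```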
---
🗂️ Exam Traps
Choosing log‑rank when curves cross – the test may give a non‑significant p‑value even if groups differ early/late.
Interpreting HR > 1 as “higher survival” – it actually means higher hazard (worse survival).
Treating censored observations as “alive at end” – they only survive up to censoring time; they are removed thereafter.
Confusing $f(t)$ (density) with $h(t)$ (hazard) – remember $h(t)=f(t)/S(t)$.
Assuming exponential fit because “simple” – check constant‑hazard assumption; otherwise bias.
Using AIC to compare models with different censoring structures – AIC is only comparable when the likelihood is built on the same censoring scheme.
Forgetting to center/scale continuous covariates in Cox – can cause numerical instability and misleading Wald tests.
---