Survival Analysis Study Guide
📖 Core Concepts
Survival function $S(t)=\Pr(T>t)$ – probability a subject lives past time $t$. Starts at 1, never increases.
Hazard function $h(t)=\dfrac{f(t)}{S(t)}$ – instantaneous event rate conditional on having survived to $t$. Not a probability.
Cumulative hazard $H(t)=\int_0^t h(u)\,du$ and the link $S(t)=\exp[-H(t)]$.
Censoring – exact event time unknown, but we know it lies after (right), before (left), or between (interval) a certain point. Censored subjects contribute information up to the censoring time.
Kaplan–Meier (KM) estimator – step‑wise product of survival probabilities using both censored and uncensored observations.
Log‑rank test – compares two (or more) KM curves; test statistic $\chi^2_1$ under $H_0$: identical survival functions.
Cox proportional hazards (PH) model
$$h_i(t)=h_0(t)\exp(\beta_1 x_{i1}+\dots+\beta_p x_{ip})$$
– assumes hazard ratios are constant over time.
Proportional‑hazards assumption can be checked with scaled Schoenfeld residuals (cox.zph()); $p<0.05$ ⇒ violation.
Parametric survival models (exponential, Weibull, log‑logistic, generalized gamma…) impose a functional form on $h(t)$.
Stratified Cox – separate baseline hazards $h_{0k}(t)$ for each stratum, common $\beta$’s.
Time‑varying covariates – covariate values that change during follow‑up are entered as $x_i(t)$.
Survival trees / random forests – non‑linear, interaction‑rich alternatives that aggregate many trees for stable predictions.
Cure models – mixture of a cured fraction (plateau of $S(t)$) and a susceptible component with its own hazard.
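The link between $h$, $H$, and $S$ can be checked numerically. A minimal Python sketch, using a Weibull hazard with shape 2 and scale 1 (an arbitrary illustrative choice, so $h(t)=2t$, $H(t)=t^2$, and $S(t)=e^{-t^2}$):

```python
import math

def hazard(t):
    return 2.0 * t                      # h(t) for Weibull(shape=2, scale=1)

def cum_hazard(t, steps=10_000):
    """H(t) = integral of h(u) du on [0, t], via the trapezoidal rule."""
    dt = t / steps
    grid = [i * dt for i in range(steps + 1)]
    return dt * (sum(hazard(u) for u in grid) - 0.5 * (hazard(0) + hazard(t)))

t = 1.3
S_numeric = math.exp(-cum_hazard(t))    # S(t) = exp(-H(t))
S_closed = math.exp(-t**2)              # closed-form Weibull survival
print(S_numeric, S_closed)              # the two agree up to float error
```

The agreement is a direct consequence of $S(t)=\exp[-H(t)]$; any valid hazard would give the same match.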
---
📌 Must Remember
KM estimate: $\hat S(t)=\prod_{j:\,t_j\le t}\dfrac{n_j-d_j}{n_j}$ where $n_j$ = at‑risk, $d_j$ = events at $t_j$.
Greenwood’s SE: $\hat{\operatorname{Var}}[\hat S(t)]=\hat S(t)^2\sum_{j:\,t_j\le t}\dfrac{d_j}{n_j(n_j-d_j)}$.
Log‑rank statistic: $\displaystyle \chi^2=\frac{\big(\sum(O_i-E_i)\big)^2}{\sum V_i}$, $df=1$ for two groups.
Cox partial likelihood: only the order of events matters; the baseline hazard $h_0(t)$ cancels out.
Hazard ratio (HR) interpretation: $HR=e^{\beta}$ = factor by which the hazard multiplies for a one‑unit increase in the predictor (assuming PH).
Right censoring is the default; left‑truncation = delayed entry, subjects counted only after they become “at risk”.
Exponential ⇢ constant $h(t)$; Weibull shape $>1$ = increasing hazard, $<1$ = decreasing.
AIC: $\text{AIC}= -2\log L + 2k$; lower AIC ⇒ better fit (but only among comparable models).
Cure fraction $p_c=\lim_{t\to\infty}S(t)$; the survival curve plateaus at $p_c$.
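The KM product and the Greenwood running sum are simple enough to compute by hand. A minimal Python sketch on a small right-censored sample (the `(time, event)` pairs are hypothetical, with `event=1` for an observed death and `0` for censoring):

```python
# Toy data: (time, event). Censored subjects stay in the risk set
# until their censoring time, then drop out.
data = [(2, 1), (3, 1), (3, 1), (5, 0), (7, 1), (8, 0)]

times = sorted({t for t, e in data if e == 1})      # distinct event times
S, var_sum = 1.0, 0.0
for tj in times:
    n_j = sum(1 for t, e in data if t >= tj)        # at risk just before t_j
    d_j = sum(1 for t, e in data if t == tj and e)  # deaths at t_j
    S *= (n_j - d_j) / n_j                          # KM product term
    var_sum += d_j / (n_j * (n_j - d_j))            # Greenwood running sum
    print(f"t={tj}: n={n_j}, d={d_j}, S={S:.4f}, SE={S * var_sum**0.5:.4f}")
```

Note how the subject censored at $t=5$ contributes to $n_j$ at $t_j=3$ but not at $t_j=7$: censored observations carry information up to, and only up to, their censoring time.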
---
🔄 Key Processes
Build a KM curve
Sort times, compute $n_j$, $d_j$.
Multiply successive survival factors.
Plot steps; add a Greenwood SE‑based 95% CI.
Perform a log‑rank test
At each event time compute expected events $E_i$ under $H_0$.
Sum $(O_i-E_i)$ and the variance terms $V_i$.
Compute $\chi^2$ and compare to $\chi^2_{1,\alpha}$.
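The steps above can be sketched directly in Python; the `(time, event, group)` triples are hypothetical:

```python
# Two-sample log-rank statistic on toy data: (time, event, group).
data = [(1, 1, 0), (2, 1, 0), (4, 0, 0), (5, 1, 0),
        (2, 1, 1), (3, 1, 1), (6, 0, 1), (7, 1, 1)]

event_times = sorted({t for t, e, g in data if e})
U, V = 0.0, 0.0                         # running sum of (O - E) and its variance
for tj in event_times:
    n  = sum(1 for t, e, g in data if t >= tj)             # at risk, both groups
    n1 = sum(1 for t, e, g in data if t >= tj and g == 0)  # at risk, group 0
    d  = sum(1 for t, e, g in data if t == tj and e)       # events at t_j
    o1 = sum(1 for t, e, g in data if t == tj and e and g == 0)
    U += o1 - n1 * d / n                                   # O_1j - E_1j
    if n > 1:                                              # hypergeometric variance
        V += n1 * (n - n1) * d * (n - d) / (n**2 * (n - 1))
chi2 = U**2 / V
print(f"chi-square (df=1): {chi2:.3f}")
```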
Fit a Cox PH model
Specify covariates (continuous, dummy for categorical).
Use partial likelihood to estimate $\beta$.
Check PH assumption: cox.zph() → plot Schoenfeld residuals; $p<0.05$ ⇒ violation.
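For a single binary covariate with no tied event times, the partial log-likelihood is simple enough to maximize by brute force. A sketch on made-up data (a real analysis would use `coxph`, which applies Newton–Raphson and tie corrections):

```python
import math

# Hypothetical (time, event, x) rows; x is a binary covariate.
data = [(1, 1, 1), (2, 1, 0), (4, 0, 1), (5, 1, 1), (6, 1, 0), (9, 0, 0)]

def partial_loglik(beta):
    """Cox partial log-likelihood: only event ordering matters, h_0(t) cancels."""
    ll = 0.0
    for ti, ei, xi in data:
        if not ei:
            continue                                   # censored rows add no term
        risk = [x for t, e, x in data if t >= ti]      # covariates in risk set at t_i
        ll += beta * xi - math.log(sum(math.exp(beta * x) for x in risk))
    return ll

# Crude maximization by grid search, good enough for a sketch.
betas = [b / 100 for b in range(-300, 301)]
beta_hat = max(betas, key=partial_loglik)
print(f"beta_hat = {beta_hat:.2f}, HR = {math.exp(beta_hat):.2f}")
```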
Add stratification / time‑varying covariates
Stratify: coxph(Surv(time, status) ~ x + strata(strataVar), data).
Time‑varying: reshape data into start–stop (counting‑process) intervals and include $x_i(t)$.
Construct likelihood for censored data
Uncensored: contribute $f(t_i)$.
Right‑censored: contribute $S(r_i)$.
Left‑censored: contribute $F(l_i)$.
Interval‑censored: contribute $F(u_i)-F(l_i)$.
Multiply over all subjects (or sum log‑likelihood) for MLE.
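For the exponential model the censored likelihood has a closed-form maximizer: events contribute $f(t_i)=\lambda e^{-\lambda t_i}$ and censored subjects $S(t_i)=e^{-\lambda t_i}$, so $\log L(\lambda)=d\log\lambda-\lambda\sum t_i$ and $\hat\lambda=d/\sum t_i$. A sketch on hypothetical `(time, event)` data, also computing the AIC from the Must Remember section:

```python
import math

data = [(2, 1), (3, 1), (3, 1), (5, 0), (7, 1), (8, 0)]   # (time, event)

d = sum(e for _, e in data)                # number of observed events
total_time = sum(t for t, _ in data)       # total follow-up, events + censored
lam_hat = d / total_time                   # closed-form MLE

def loglik(lam):
    return d * math.log(lam) - lam * total_time

# Confirm lam_hat beats nearby values, then compute AIC = -2 logL + 2k (k = 1).
assert loglik(lam_hat) >= max(loglik(lam_hat * 0.9), loglik(lam_hat * 1.1))
aic = -2 * loglik(lam_hat) + 2 * 1
print(f"lam_hat = {lam_hat:.4f}, AIC = {aic:.2f}")
```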
Fit a parametric model
Choose distribution (e.g., Weibull).
Use MLE (or built‑in survreg).
Diagnose with hazard plots or AIC.
Build a survival random forest
Draw many bootstrap samples, grow survival trees, average cumulative hazard or survival predictions.
---
🔍 Key Comparisons
KM vs. Nelson–Aalen – KM estimates $S(t)$ directly; Nelson–Aalen estimates $H(t)$ then $S(t)=e^{-H(t)}$.
Cox PH vs. Parametric – Cox needs no baseline hazard form (flexible), parametric gives explicit $h(t)$ and can extrapolate.
Right vs. Left vs. Interval censoring – direction of unknown time relative to observation window.
Proportional‑hazards vs. Accelerated Failure Time – PH scales hazards, AFT stretches the time axis.
Cure model vs. Standard survival – cure model allows $S(\infty)>0$; standard forces $S(\infty)=0$.
Survival tree vs. Random forest – single tree is interpretable but high variance; forest reduces variance by averaging.
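The KM vs. Nelson–Aalen comparison is easy to see on toy data (hypothetical values). Since $e^{-d_j/n_j}\ge 1-d_j/n_j$, the NA-based estimate $e^{-\hat H(t)}$ always sits at or above the KM step at the same time:

```python
import math

# Same toy (time, event) data for both estimators.
data = [(2, 1), (3, 1), (3, 1), (5, 0), (7, 1), (8, 0)]

H, S_km = 0.0, 1.0
for tj in sorted({t for t, e in data if e}):
    n_j = sum(1 for t, e in data if t >= tj)
    d_j = sum(1 for t, e in data if t == tj and e)
    H += d_j / n_j                       # Nelson-Aalen increment
    S_km *= (n_j - d_j) / n_j            # Kaplan-Meier factor
print(f"KM: {S_km:.4f}   exp(-H): {math.exp(-H):.4f}")
```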
---
⚠️ Common Misunderstandings
“Hazard = probability of event” – hazard is a rate per unit time and can exceed 1.
“Censored subjects provide no information” – they contribute survival information up to the censoring time.
“Log‑rank works even when curves cross” – crossing suggests non‑proportional hazards; the log‑rank test loses power and can be misleading.
“HR = risk ratio” – the HR compares instantaneous risks, not cumulative incidence.
“AIC alone decides the model” – also examine residuals and the plausibility of the hazard shape.
“The cure fraction is always 0 in medical studies” – many cancers exhibit a long‑term survival plateau.
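The first point deserves a two-line check: a hazard rate can exceed 1 while every probability derived from it stays in $[0,1]$. Using an exponential with rate 2 (an arbitrary choice):

```python
import math

lam = 2.0
h = lam                                  # constant hazard h(t) = 2, a rate > 1
p_event_by_1 = 1 - math.exp(-lam * 1.0)  # P(T <= 1) = 1 - S(1), a probability
print(h, round(p_event_by_1, 4))
```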
---
🧠 Mental Models / Intuition
Survival curve = “staircase of survivors” – each step drops exactly where an event occurs; censored points are just “ticks” on the stairs.
Hazard as a “speedometer” – tells how fast the clock is ticking at a given moment, given you’re still alive.
Cox partial likelihood = “who dies next?” – only the ordering of deaths matters, not the actual times.
Cure model plateau = “immune crowd” – imagine two groups: one that will eventually die, another that never does; the plateau height tells the size of the latter.
---
🚩 Exceptions & Edge Cases
Non‑proportional hazards – use stratified Cox, add interaction with time, or switch to AFT/parametric models.
Left truncation (delayed entry) – subjects enter risk set only after their entry time; must adjust risk set counts.
Interval censoring – standard KM fails; use Turnbull estimator or parametric likelihood.
Sparse events – chi‑square approximation for log‑rank may be inaccurate; consider exact tests or permutation.
High‑dimensional covariates – Cox may overfit; prefer regularized Cox (lasso) or survival random forests.
---
📍 When to Use Which
| Situation | Preferred Method |
|-----------|------------------|
| Estimate survival curve (no covariates) | Kaplan–Meier (with Greenwood CI) |
| Compare two groups, no covariates | Log‑rank test (or Wilcoxon if early differences matter) |
| Assess effect of several predictors (continuous & categorical) | Cox PH (check PH, add strata if needed) |
| Baseline hazard shape known / extrapolation needed | Parametric model (choose based on hazard plot, AIC) |
| Baseline hazards differ across a factor but covariate effects are common | Stratified Cox |
| Covariates change over follow‑up | Cox with time‑varying covariates |
| Many predictors, possible interactions, non‑linear effects | Survival random forest or survival tree |
| Evidence of cured subpopulation (plateau) | Cure model (mixture of logistic + hazard component) |
| Data are left‑truncated or interval‑censored | Turnbull estimator or likelihood‑based parametric approach |
---
👀 Patterns to Recognize
Step‑wise drops in KM only at event times – censored observations add tick marks on the curve, not steps.
Crossing KM curves – suspect PH violation; consider time‑dependent effects.
Straight line on log–negative–log plot – Weibull (slope equals the shape parameter).
Hazard curve rising then falling – log‑logistic or generalized gamma likely.
Flat tail of survival curve – possible cure fraction.
Large χ² but small number of events – check for sparse‑data bias.
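The log–negative–log pattern follows from algebra: for Weibull $S(t)=\exp[-(t/\lambda)^k]$, $\log[-\log S(t)]=k\log t-k\log\lambda$, a line in $\log t$ with slope $k$. A quick numeric check (the shape and scale values are arbitrary):

```python
import math

k, lam = 1.5, 2.0
S = lambda t: math.exp(-((t / lam) ** k))   # Weibull survival function

# Points on the log-negative-log plot and the slope between neighbors.
pts = [(math.log(t), math.log(-math.log(S(t)))) for t in (0.5, 1, 2, 4, 8)]
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
print([round(s, 6) for s in slopes])        # every pairwise slope equals k
```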
---
🗂️ Exam Traps
Choosing log‑rank when curves cross – the test may give a non‑significant p‑value even if groups differ early/late.
Interpreting HR > 1 as “higher survival” – it actually means higher hazard (worse survival).
Treating censored observations as “alive at end” – they only survive up to censoring time; they are removed thereafter.
Confusing $f(t)$ (density) with $h(t)$ (hazard) – remember $h(t)=f(t)/S(t)$.
Assuming exponential fit because “simple” – check constant‑hazard assumption; otherwise bias.
Using AIC to compare models with different censoring structures – AIC is only comparable when the likelihood is built on the same censoring scheme.
Forgetting to center/scale continuous covariates in Cox – can cause numerical instability and misleading Wald tests.
---