RemNote Community

Survival analysis - Parametric and Semi‑Parametric Modeling

Learn how to use Cox regression and its extensions, select and fit parametric (including cure) survival models, and address censoring and advanced survival methods.


Summary

Cox Proportional Hazards Regression and Survival Analysis

Introduction

Cox proportional hazards regression is a powerful statistical method for analyzing survival data: the time until an event occurs. Unlike standard regression, survival analysis must account for censoring, where we know some subjects haven't experienced the event by the end of the study. Cox regression allows you to examine how quantitative predictors (like age, gene expression, or blood counts) and categorical variables affect the probability of an event occurring over time. This is one of the most widely used approaches in medical research, epidemiology, and reliability studies.

Fundamentals of Survival Analysis and Censoring

What is Censoring?

Censoring occurs when we don't observe the exact time of the event for some subjects. The most common type is right-censoring, where a subject leaves the study or the study ends before the event occurs. For example, in a cancer study, a patient might still be alive when the research ends: we know they survived at least until that point, but not how much longer they'll live.

Right-censoring is crucial to understand because it means we have incomplete information. Standard statistical methods can't simply ignore these subjects (that would bias results), nor can they treat them like subjects who experienced the event (that would also be wrong). Survival analysis methods like Cox regression properly handle this uncertainty.

Another form is left truncation (delayed entry), where subjects don't enter the study until some time after the origin. For instance, you might only enroll patients after their disease diagnosis. Left truncation doesn't bias estimates, but it does affect how you structure the analysis.

The Hazard and Survival Functions

The core concept in survival analysis is the hazard function, which represents the instantaneous risk of the event occurring at time $t$, given that the subject has survived to time $t$.
Think of it as the "force" pushing toward the event. The survival function, denoted $S(t)$, is the probability of surviving beyond time $t$. As time increases, the survival function decreases (or stays flat), eventually approaching either zero or some positive value if there's a cure fraction.

The Kaplan-Meier estimator is the standard non-parametric method to estimate the survival function from censored data. It directly estimates $S(t)$ without assuming any particular distribution for survival times. This makes it flexible and easy to interpret: at each event time, you update the survival probability based on the proportion of subjects still at risk who experience the event.

<extrainfo> The Kaplan-Meier estimator is calculated as: $$S(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$ where $d_i$ is the number of events at time $t_i$ and $n_i$ is the number at risk just before time $t_i$. </extrainfo>

Cox Proportional Hazards Regression

When to Use Cox Regression

Cox regression is used when you want to assess the effect of one or more predictors on survival time. Your predictors can be:

Continuous variables: gene expression level, age, blood pressure, or any measured quantity
Categorical variables: represented as indicator (dummy) variables (e.g., treatment group, gender)
Mixed: a combination of both types

The key requirement is that you have survival data with event times and censoring indicators. Cox regression answers questions like: "Does higher gene expression predict longer survival?" or "Does treatment A improve survival compared to treatment B?"
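The Kaplan-Meier product-limit formula shown earlier can be computed by hand. Here is a minimal sketch in pure Python, using made-up event times (event = 0 marks right-censoring); the function name and data layout are illustrative, not a library API:

```python
# Minimal sketch of the Kaplan-Meier product-limit estimator.
# times: follow-up times; events: 1 = event observed, 0 = right-censored.
def kaplan_meier(times, events):
    """Return [(event_time, S(t))], stepping down at each event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(e for tt, e in data if tt == t)   # events at time t
        m = sum(1 for tt, _ in data if tt == t)   # subjects leaving at t
        if d > 0:
            s *= 1 - d / n_at_risk                # product-limit update
            curve.append((t, s))
        n_at_risk -= m                            # shrink the risk set
        i += m
    return curve

# Deaths at t = 1, 3, 4 and one censoring at t = 2 among 4 subjects:
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# -> [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Note that the censored subject at $t = 2$ produces no step in the curve, but it does reduce the number at risk for the later event times, which is exactly how censoring enters the estimator.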
The Cox Model and Proportional Hazards Assumption

The Cox model expresses the hazard function as: $$h(t \mid X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$$ where:

$h_0(t)$ is the baseline hazard (the hazard when all predictors are zero)
$X_1, X_2, \ldots, X_p$ are your predictors
$\beta_1, \beta_2, \ldots, \beta_p$ are the regression coefficients (what you estimate)
$\exp(\beta_j)$ is the hazard ratio for predictor $j$

The crucial feature of this model is that it doesn't require you to specify a particular form for $h_0(t)$; this is why the Cox model is called semi-parametric. However, the model assumes something very important: the hazard ratio between any two subjects remains constant over time. This is the proportional hazards assumption.

Understanding the Proportional Hazards Assumption

What does this assumption really mean? Suppose you're comparing two treatment groups. The proportional hazards assumption says that if Treatment A reduces the hazard by 30% compared to Treatment B at time 1, then it reduces the hazard by 30% at time 5, time 10, and all other times. The relative difference stays proportional.

This assumption can be violated. For instance, if a treatment is very effective early but wears off, the assumption breaks down. You should test this assumption using R's cox.zph() function, which tests each predictor separately. A p-value less than 0.05 suggests violation of the assumption. If the assumption is violated, you have options: use stratification (discussed below) for problematic predictors, use time-varying coefficients, or consider alternative models.

Interpreting Cox Regression Coefficients

The coefficient $\beta_j$ itself is on a log scale. The meaningful interpretation comes from the hazard ratio: $\exp(\beta_j)$.
If $\exp(\beta_j) = 1.5$, then a one-unit increase in $X_j$ multiplies the hazard by 1.5 (a 50% increase in risk)
If $\exp(\beta_j) = 0.8$, then a one-unit increase multiplies the hazard by 0.8 (a 20% decrease in risk)
If $\exp(\beta_j) = 1$, then $X_j$ has no effect

You'll typically see this reported with a 95% confidence interval around the hazard ratio to show the range of plausible effects. In the example, for the variable "sex," the hazard ratio is 1.94 with a 95% CI from 1.15 to 3.26. This means males have roughly double the hazard of females, and we're 95% confident the true ratio is between 1.15 and 3.26.

Overall Model Tests

After fitting a Cox model, you should assess whether the model as a whole is significant. Three tests are available, and they're asymptotically equivalent (they give similar results with large samples):

Likelihood-ratio test: Compares the likelihood of the full model to the null model (no predictors). This is often the most reliable.
Wald test: Tests whether the estimated coefficients differ significantly from zero. This is computationally simple but can be less reliable in small samples.
Score test (also called the log-rank test): Tests the slope of the likelihood at the null value. With a single binary predictor it is equivalent to the log-rank test for comparing groups.

All three test the null hypothesis that no predictors have an effect on survival. A p-value below 0.05 indicates the model is significant overall. In the example, all three tests give p-values around 0.011-0.013, indicating strong evidence that sex predicts survival.

Connection to the Log-Rank Test

The log-rank test is a special case of Cox regression. When you have a single binary predictor (two groups), the log-rank test and the Cox score test give identical results. The log-rank test directly compares survival curves of two groups; Cox regression with a binary predictor estimates the hazard ratio between groups. A typical illustration is a pair of Kaplan-Meier survival curves for two treatment groups in an AML study.
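The coefficient-to-hazard-ratio conversion above is simple arithmetic: exponentiate the coefficient, and exponentiate the Wald interval endpoints $\beta \pm 1.96 \cdot \text{SE}$. A short sketch, using a hypothetical coefficient and standard error chosen to reproduce the sex example's numbers (HR 1.94, CI 1.15 to 3.26):

```python
import math

# Convert a Cox coefficient (log scale) to a hazard ratio with 95% CI.
# beta and se are hypothetical values chosen so the result matches the
# sex example quoted in the text; they are not from a real model fit.
def hazard_ratio_ci(beta, se, z=1.96):
    hr = math.exp(beta)             # hazard ratio
    lo = math.exp(beta - z * se)    # lower CI bound
    hi = math.exp(beta + z * se)    # upper CI bound
    return hr, lo, hi

hr, lo, hi = hazard_ratio_ci(0.663, 0.265)
print(round(hr, 2), round(lo, 2), round(hi, 2))  # 1.94 1.15 3.26
```

Because the interval is built on the log scale and then exponentiated, it is asymmetric around the hazard ratio, which is why you see bounds like 1.15 and 3.26 around 1.94.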
The log-rank test or Cox regression with treatment as the predictor would compare these curves quantitatively. This connection is conceptually important: Cox regression is a generalization of the log-rank test to multiple predictors and continuous variables.

Extensions to the Basic Cox Model

Stratification

Stratification addresses violations of the proportional hazards assumption for specific predictors. Instead of including a problematic variable as a regular predictor, you divide subjects into homogeneous strata based on that variable. Each stratum gets its own baseline hazard function $h_{0,k}(t)$, but the regression coefficients remain the same across strata.

For example, if the proportional hazards assumption fails for gender (perhaps because men and women have different baseline risks at different ages), you could stratify by gender. This allows gender-specific baseline hazards while still estimating a common effect of, say, a drug treatment across genders. The trade-off: you lose the ability to estimate a direct effect of the stratification variable on the hazard.

<extrainfo> Stratification is useful when you want to account for a confounding variable that violates proportional hazards but isn't the focus of your analysis. </extrainfo>

<extrainfo> Time-Varying Covariates

Some variables change during follow-up: serum protein levels, medication dose, or disease status might evolve over time. Time-varying covariates allow you to include these changing values in a Cox model. Instead of a single value per subject, you provide the covariate value as it changes over time. The Cox model then estimates how the current value of the covariate affects the instantaneous hazard. Time-varying covariates require a more complex data structure (one row per time interval per subject rather than one row per subject) and special handling in software, but they're powerful for reflecting how changing conditions affect survival.
</extrainfo>

Discrete-Time Survival Models

In some settings, events are recorded in discrete time intervals (e.g., "did the event occur in year 1, year 2, year 3?") rather than at exact event times. Discrete-time survival models handle this naturally. Time is divided into intervals, and for each interval you record a binary indicator: did the event occur (1) or not (0)? You then fit a logistic regression model where the probability of the event depends on time and your predictors.

The advantages: straightforward interpretation as logistic regression, easy handling of tied event times, and natural accommodation of censoring within an interval. The disadvantage: you lose information if you have exact event times but artificially discretize them.

Cure Models in Survival Analysis

The Problem with Standard Survival Models

In standard Cox regression or parametric survival models, the survival probability converges to zero as time approaches infinity: $\lim_{t \to \infty} S(t) = 0$. This assumes everyone will eventually experience the event. But in many real applications, such as cancer remission, disease elimination, or product reliability, some subjects may never experience the event. They're "cured" or immune. The Kaplan-Meier survival curve might plateau at some positive level (say, 0.15) rather than reaching zero. A standard Cox model forced onto such data produces biased estimates and poor predictions at long follow-up times.
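The person-period data structure that discrete-time models (and time-varying covariates) rely on can be built in a few lines. A sketch on made-up subject records; the function name and tuple layout are illustrative, not a library API:

```python
# Sketch: expand subject-level survival data into "person-period" format
# (one row per interval at risk). Each subject contributes one row per
# interval up to their last observed interval; the final row carries
# event = 1 only if the event actually occurred (otherwise censored).
def person_period(subjects):
    """subjects: list of (id, last_interval, event_indicator)."""
    rows = []
    for sid, last, event in subjects:
        for interval in range(1, last + 1):
            rows.append((sid, interval, int(event and interval == last)))
    return rows

# Subject 1 has the event in year 2; subject 2 is censored in year 3:
print(person_period([(1, 2, 1), (2, 3, 0)]))
# -> [(1, 1, 0), (1, 2, 1), (2, 1, 0), (2, 2, 0), (2, 3, 0)]
```

A logistic regression of the event column on interval indicators plus predictors then estimates the discrete-time hazard; the censored subject simply contributes all-zero rows for the intervals it was observed.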
The Two-Component Cure Model

Cure models address this by splitting the population into two latent groups:

Cured individuals: a proportion $p$ of the population who will never experience the event, surviving indefinitely
Susceptible individuals: a proportion $(1-p)$ who are at risk and follow a standard survival model

The model has two linked components:

Logistic regression estimates the probability $p$ that an individual belongs to the cured group, based on predictors
Hazard model (often discrete-time logistic regression) estimates the conditional event probability at each time, given that the individual is susceptible

The combined survival function reflects both groups: $$S(t) = p + (1-p) \cdot S_{\text{susceptible}}(t)$$ where $S_{\text{susceptible}}(t)$ is the survival function for susceptible individuals.

Interpreting the Cure Fraction

The cure fraction is the estimated proportion of the population that is cured. If $\hat{p} = 0.15$, you estimate that 15% of subjects will never experience the event, while 85% are susceptible. A larger cure fraction means a higher plateau in the survival curve at long follow-up times. This is the key difference from standard models: the curve plateaus rather than approaching zero.

<extrainfo> Cure fractions make most sense when you have long follow-up times where the plateau becomes apparent. With short follow-up, it's hard to distinguish between a low cure fraction and simply not enough time for events to accumulate. </extrainfo>

Choosing a Parametric Distribution

In some analyses, you assume survival times follow a specific probability distribution. Common choices include:

Exponential Distribution

The exponential distribution assumes a constant hazard over time: $h(t) = \lambda$ for all $t$. This is the simplest assumption but often unrealistic. It's useful primarily as a baseline comparison.
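The plateau behavior of the combined survival function above is easy to verify numerically. A sketch assuming, purely for illustration, exponential survival among the susceptibles, $S_{\text{susceptible}}(t) = e^{-\lambda t}$, with a made-up rate:

```python
import math

# Cure-model survival: S(t) = p + (1 - p) * S_susceptible(t).
# For illustration only, susceptibles are given exponential survival
# with a made-up rate lam = 0.5; p = 0.15 is the cure fraction.
def cure_survival(t, p=0.15, lam=0.5):
    return p + (1 - p) * math.exp(-lam * t)

# The curve starts at 1 and plateaus at the cure fraction p,
# rather than falling to zero as in a standard survival model:
for t in (0, 5, 20, 100):
    print(t, round(cure_survival(t), 4))
```

For large $t$ the susceptible term vanishes and $S(t) \to p$, which is exactly the plateau a Kaplan-Meier curve with a cure fraction exhibits.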
Weibull and Gamma Distributions

The Weibull distribution allows the hazard to increase monotonically, decrease monotonically, or stay constant (the constant case reduces to the exponential). Its shape parameter controls the direction of change. The gamma distribution and the more flexible generalized gamma distribution also model monotonic hazard changes. The generalized gamma is particularly useful because it includes the exponential, Weibull, and log-normal distributions as special cases, offering flexibility to fit various hazard shapes.

Other Distributions

The log-logistic distribution models a hazard that rises to a peak and then declines: useful when risk increases early (e.g., post-surgery) and then decreases as recovery occurs.

A common preliminary step is to examine the data on a transformed scale: for example, raw melanoma tumor thickness is right-skewed, while log-transformed thickness is more symmetric. This kind of transformation assessment helps you choose an appropriate parametric distribution.

Choosing Among Distributions

How do you decide which distribution fits best?

Visual inspection: Plot the empirical hazard or $\log(-\log(S(t)))$ vs. $\log(t)$. Different distributions produce different patterns: the exponential gives a flat empirical hazard; the Weibull gives a straight line on the log-log plot; the log-logistic gives a curved log-log plot.

Akaike Information Criterion (AIC): Fit competing models and compare AIC values. Lower AIC indicates better fit while penalizing model complexity. This is objective and routine in practice.

Goodness-of-fit tests: The Cox-Snell residual plot checks whether the residuals follow a unit exponential distribution. Systematic deviation from the 45-degree line suggests poor fit.

Don't over-interpret small differences in AIC; practical significance matters too. A slightly worse fit might be acceptable if the distribution is simpler to interpret.

Summary of Key Relationships Among Methods

Cox regression is the most flexible of these approaches.
It makes fewer assumptions (no assumed distribution for survival times) but requires the proportional hazards assumption. Parametric models assume a specific distribution but are more efficient when that assumption holds. The log-rank test is Cox regression with a single binary predictor. Discrete-time models are an alternative when time is naturally discretized. Cure models address populations in which a fraction never experiences the event.

<extrainfo> Advanced Model Types

Accelerated failure time (AFT) models are parametric alternatives to Cox regression. Instead of modeling hazards directly, they model how predictors stretch or compress the time axis: a predictor might "accelerate" disease progression, making the event occur sooner. AFT models have a different interpretation than hazard ratios but can be useful when the time-acceleration view is more natural.

Bayesian survival analysis incorporates prior information (from expert knowledge or previous studies) to improve inference with limited data. This is especially valuable in rare diseases or early-stage research.

Random survival forests extend ensemble methods to survival data, using decision trees to handle high-dimensional predictors (e.g., gene expression data with thousands of variables). They're computationally intensive but don't require distributional or proportional hazards assumptions. </extrainfo>
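The AIC comparison described under "Choosing Among Distributions" has a closed form in the exponential case: with $d$ events and total time at risk $\sum t_i$, the MLE is $\hat\lambda = d / \sum t_i$ and the log-likelihood is $d \log \lambda - \lambda \sum t_i$. A sketch on made-up censored data (a competing Weibull fit would be compared the same way, with the lower AIC preferred):

```python
import math

# Sketch: fit a constant-hazard (exponential) model to right-censored
# data by maximum likelihood and compute its AIC. Data are made up:
# times are follow-up times, events marks event (1) vs censoring (0).
def exponential_aic(times, events):
    d = sum(events)                    # number of observed events
    total_time = sum(times)            # total time at risk
    lam = d / total_time               # MLE of the constant hazard
    loglik = d * math.log(lam) - lam * total_time
    return lam, 2 * 1 - 2 * loglik    # AIC = 2k - 2*loglik, k = 1

times = [2.0, 3.5, 1.0, 4.0, 6.0]
events = [1, 1, 0, 1, 0]
lam, aic = exponential_aic(times, events)
print(round(lam, 3), round(aic, 2))  # 0.182 18.23
```

Note how censored subjects still contribute their time at risk to the denominator even though they add no event to the numerator; ignoring them would inflate the estimated hazard.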
Flashcards
What is the primary purpose of Cox proportional hazards regression?
To analyze how quantitative predictors (like gene expression or age) and categorical predictors affect survival.
How are categorical predictors incorporated into a Cox regression model?
As indicator (dummy) variables.
The log‑rank test is a special case of Cox regression with what specific type of predictor?
A single binary predictor.
Which three tests are asymptotically equivalent and used to assess overall model significance in Cox regression?
Likelihood‑ratio test, Wald test, and score (log‑rank) test.
Which R function is used to test the proportional hazards assumption?
cox.zph()
In the cox.zph() test, what does a p-value less than $0.05$ ($p < 0.05$) indicate?
A violation of the proportional hazards assumption.
What is the purpose of stratification in Cox models?
To divide subjects into homogeneous strata while allowing a common set of regression coefficients.
How does the baseline hazard function behave across different strata in a stratified Cox model?
Each stratum may have its own unique baseline hazard function.
How are variables that change over the follow-up period (e.g., medication dose) incorporated into a Cox model?
As time-varying covariates.
In discrete-time survival models, what does the binary indicator for each interval record?
Whether the event occurs during that specific interval.
In a cure model, what is the role of the logistic regression component?
To estimate the probability that an individual will never experience the event (the "cured" fraction).
In a cure model, what does the hazard model (discrete-time logistic regression) estimate?
The conditional event probability at each time point for susceptible individuals.
According to a cure model, the survival function is the sum of which two groups?
Cured individuals (who survive indefinitely) and susceptible individuals (whose survival declines according to their hazard).
What happens to the survival probability as time approaches infinity in a standard survival model without a cure fraction?
The survival probability is forced to zero.
In a cure model, what value does the survival probability converge to as time becomes large?
The proportion of cured individuals (the cure fraction).
Visually, how does a larger cure fraction affect a survival curve at long follow-up times?
It results in a higher plateau in the curve.
What are the two latent groups that a population is divided into under the assumptions of a cure model?
Cured (immune) and susceptible (at risk).
What does the exponential distribution assume regarding the hazard over time?
It assumes a constant hazard.
Which three distributions are included as special cases of the generalized gamma distribution?
The exponential, Weibull, and log-normal distributions.
What shape of hazard does the log-logistic distribution model?
A hazard that rises to a peak and then declines.
Which parameter determines whether a Weibull distribution represents an increasing or decreasing monotonic hazard?
The shape parameter.
What are two common visual methods used to assess the shape of a survival distribution?
Plotting the empirical hazard, and plotting the log-negative-log survival curve ($\log(-\log(S(t)))$ vs. $\log(t)$).
Which numerical criterion is used to compare the relative quality of competing parametric fits?
Akaike information criterion (AIC).
Which type of residual plot is used as a goodness-of-fit test for parametric survival models?
Cox–Snell residual plot.
What is an accelerated failure time model?
A parametric alternative to Cox regression in which predictors stretch or compress the time axis.
What is the core assumption of proportional hazards models?
That hazard ratios between groups remain constant over time.
What is the most common form of censoring in survival analysis?
Right-censoring.
What is left truncation (also known as delayed entry)?
When subjects enter a study after the initial origin time.
What is the purpose of the Kaplan–Meier estimator?
To non-parametrically estimate the survival function from censored data.
What statistical method is used to find parameter values that maximize the likelihood of observed data in parametric models?
Maximum likelihood estimation.
How is the mortality rate typically expressed in population studies?
Deaths per 1,000 individuals per year.
What is a key benefit of using Bayesian survival analysis when dealing with limited data?
It incorporates prior information to improve inference.
Which advanced method uses ensemble decision trees to handle high-dimensional censored data?
Random survival forests.

Key Concepts
Survival Analysis Models
Cox proportional hazards regression
Stratified Cox model
Time‑varying covariates
Discrete‑time survival model
Cure model
Accelerated failure time model
Weibull distribution
Survival Estimation Techniques
Log‑rank test
Kaplan–Meier estimator
Random survival forest
Bayesian survival analysis