RemNote Community

Survival analysis - Parametric and Semi‑Parametric Modeling

Learn how to use Cox regression and its extensions, select and fit parametric (including cure) survival models, and address censoring and advanced survival methods.


Summary

Cox Proportional Hazards Regression and Survival Analysis

Introduction

Cox proportional hazards regression is a powerful statistical method for analyzing survival data: the time until an event occurs. Unlike standard regression, survival analysis must account for censoring, where we know some subjects haven't experienced the event by the end of the study. Cox regression allows you to examine how quantitative predictors (like age, gene expression, or blood counts) and categorical variables affect the probability of an event occurring over time. This is one of the most widely used approaches in medical research, epidemiology, and reliability studies.

Fundamentals of Survival Analysis and Censoring

What is Censoring?

Censoring occurs when we don't observe the exact time of the event for some subjects. The most common type is right-censoring, where a subject leaves the study or the study ends before the event occurs. For example, in a cancer study, a patient might still be alive when the research ends: we know they survived at least until that point, but not how much longer they'll live.

Right-censoring is crucial to understand because it means we have incomplete information. Standard statistical methods can't simply ignore these subjects (that would bias results), nor can they treat them like subjects who experienced the event (that would also be wrong). Survival analysis methods like Cox regression properly handle this uncertainty.

Another form is left truncation (delayed entry), where subjects don't enter the study until some time after the origin. For instance, you might only enroll patients after their disease diagnosis. Left truncation doesn't bias estimates, but it does affect how you structure the analysis.

The Hazard and Survival Functions

The core concept in survival analysis is the hazard function, which represents the instantaneous risk of the event occurring at time $t$, given that the subject has survived to time $t$.
Think of it as the "force" pushing toward the event. The survival function, denoted $S(t)$, is the probability of surviving beyond time $t$. As time increases, the survival function decreases (or stays flat), eventually approaching either zero or some positive value if there's a cure fraction.

The Kaplan-Meier estimator is the standard non-parametric method to estimate the survival function from censored data. It directly estimates $S(t)$ without assuming any particular distribution for survival times. This makes it flexible and easy to interpret: at each event time, you update the survival probability based on the proportion of subjects still at risk who experience the event.

<extrainfo> The Kaplan-Meier estimator is calculated as: $$S(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$ where $d_i$ is the number of events at time $t_i$ and $n_i$ is the number at risk just before time $t_i$. </extrainfo>

Cox Proportional Hazards Regression

When to Use Cox Regression

Cox regression is used when you want to assess the effect of one or more predictors on survival time. Your predictors can be:

Continuous variables: gene expression level, age, blood pressure, or any measured quantity
Categorical variables: represented as indicator (dummy) variables (e.g., treatment group, gender)
Mixed: a combination of both types

The key requirement is that you have survival data with event times and censoring indicators. Cox regression answers questions like: "Does higher gene expression predict longer survival?" or "Does treatment A improve survival compared to treatment B?"
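The Kaplan-Meier product-limit formula shown earlier can be computed by hand. Here is a minimal sketch in pure Python, using made-up event times (event = 0 marks right-censoring); the function name and data layout are illustrative, not a library API:

```python
# Minimal sketch of the Kaplan-Meier product-limit estimator.
# times: follow-up times; events: 1 = event observed, 0 = right-censored.
def kaplan_meier(times, events):
    """Return [(event_time, S(t))], stepping down at each event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(e for tt, e in data if tt == t)   # events at time t
        m = sum(1 for tt, _ in data if tt == t)   # subjects leaving at t
        if d > 0:
            s *= 1 - d / n_at_risk                # product-limit update
            curve.append((t, s))
        n_at_risk -= m                            # shrink the risk set
        i += m
    return curve

# Deaths at t = 1, 3, 4 and one censoring at t = 2 among 4 subjects:
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# -> [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Note that the censored subject at $t = 2$ produces no step in the curve, but it does reduce the number at risk for the later event times, which is exactly how censoring enters the estimator.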
The Cox Model and Proportional Hazards Assumption

The Cox model expresses the hazard function as: $$h(t \mid X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$$ where:

$h_0(t)$ is the baseline hazard (the hazard when all predictors are zero)
$X_1, X_2, \ldots, X_p$ are your predictors
$\beta_1, \beta_2, \ldots, \beta_p$ are the regression coefficients (what you estimate)
$\exp(\beta_j)$ is the hazard ratio for predictor $j$

The crucial feature of this model is that it doesn't require you to specify a particular form for $h_0(t)$; this is why the Cox model is called semi-parametric. However, the model assumes something very important: the hazard ratio between any two subjects remains constant over time. This is the proportional hazards assumption.

Understanding the Proportional Hazards Assumption

What does this assumption really mean? Suppose you're comparing two treatment groups. The proportional hazards assumption says that if Treatment A reduces the hazard by 30% compared to Treatment B at time 1, then it reduces the hazard by 30% at time 5, time 10, and all other times. The relative difference stays proportional.

This assumption can be violated. For instance, if a treatment is very effective early but wears off, the assumption breaks down. You should test this assumption using R's cox.zph() function, which tests each predictor separately. A p-value less than 0.05 suggests violation of the assumption. If the assumption is violated, you have options: use stratification (discussed below) for problematic predictors, use time-varying coefficients, or consider alternative models.

Interpreting Cox Regression Coefficients

The coefficient $\beta_j$ itself is on a log scale. The meaningful interpretation comes from the hazard ratio: $\exp(\beta_j)$.
If $\exp(\beta_j) = 1.5$, then a one-unit increase in $X_j$ multiplies the hazard by 1.5 (a 50% increase in risk)
If $\exp(\beta_j) = 0.8$, then a one-unit increase multiplies the hazard by 0.8 (a 20% decrease in risk)
If $\exp(\beta_j) = 1$, then $X_j$ has no effect

You'll typically see this reported with a 95% confidence interval around the hazard ratio to show the range of plausible effects. In the example, for the variable "sex," the hazard ratio is 1.94 with a 95% CI from 1.15 to 3.26. This means males have roughly double the hazard of females, and we're 95% confident the true ratio is between 1.15 and 3.26.

Overall Model Tests

After fitting a Cox model, you should assess whether the model as a whole is significant. Three tests are available, and they're asymptotically equivalent (they give similar results with large samples):

Likelihood-ratio test: Compares the likelihood of the full model to the null model (no predictors). This is often the most reliable.
Wald test: Tests whether the estimated coefficients differ significantly from zero. This is computationally simple but can be less reliable in small samples.
Score test (also called the log-rank test): Tests the slope of the likelihood at the null value. With a single binary predictor it is equivalent to the log-rank test for comparing groups.

All three test the null hypothesis that no predictors have an effect on survival. A p-value below 0.05 indicates the model is significant overall. In the example, all three tests give p-values around 0.011-0.013, indicating strong evidence that sex predicts survival.

Connection to the Log-Rank Test

The log-rank test is a special case of Cox regression. When you have a single binary predictor (two groups), the log-rank test and the Cox score test give identical results. The log-rank test directly compares survival curves of two groups; Cox regression with a binary predictor estimates the hazard ratio between groups. A typical illustration is a pair of Kaplan-Meier survival curves for two treatment groups in an AML study.
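The coefficient-to-hazard-ratio conversion above is simple arithmetic: exponentiate the coefficient, and exponentiate the Wald interval endpoints $\beta \pm 1.96 \cdot \text{SE}$. A short sketch, using a hypothetical coefficient and standard error chosen to reproduce the sex example's numbers (HR 1.94, CI 1.15 to 3.26):

```python
import math

# Convert a Cox coefficient (log scale) to a hazard ratio with 95% CI.
# beta and se are hypothetical values chosen so the result matches the
# sex example quoted in the text; they are not from a real model fit.
def hazard_ratio_ci(beta, se, z=1.96):
    hr = math.exp(beta)             # hazard ratio
    lo = math.exp(beta - z * se)    # lower CI bound
    hi = math.exp(beta + z * se)    # upper CI bound
    return hr, lo, hi

hr, lo, hi = hazard_ratio_ci(0.663, 0.265)
print(round(hr, 2), round(lo, 2), round(hi, 2))  # 1.94 1.15 3.26
```

Because the interval is built on the log scale and then exponentiated, it is asymmetric around the hazard ratio, which is why you see bounds like 1.15 and 3.26 around 1.94.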
The log-rank test or Cox regression with treatment as the predictor would compare these curves quantitatively. This connection is conceptually important: Cox regression is a generalization of the log-rank test to multiple predictors and continuous variables.

Extensions to the Basic Cox Model

Stratification

Stratification addresses violations of the proportional hazards assumption for specific predictors. Instead of including a problematic variable as a regular predictor, you divide subjects into homogeneous strata based on that variable. Each stratum gets its own baseline hazard function $h_{0,k}(t)$, but the regression coefficients remain the same across strata.

For example, if the proportional hazards assumption fails for gender (perhaps because men and women have different baseline risks at different ages), you could stratify by gender. This allows gender-specific baseline hazards while still estimating a common effect of, say, a drug treatment across genders. The trade-off: you lose the ability to estimate a direct effect of the stratification variable on the hazard.

<extrainfo> Stratification is useful when you want to account for a confounding variable that violates proportional hazards but isn't the focus of your analysis. </extrainfo>

<extrainfo> Time-Varying Covariates

Some variables change during follow-up: serum protein levels, medication dose, or disease status might evolve over time. Time-varying covariates allow you to include these changing values in a Cox model. Instead of a single value per subject, you provide the covariate value as it changes over time. The Cox model then estimates how the current value of the covariate affects the instantaneous hazard. Time-varying covariates require a more complex data structure (one row per time interval per subject rather than one row per subject) and special handling in software, but they're powerful for reflecting how changing conditions affect survival.
</extrainfo>

Discrete-Time Survival Models

In some settings, events are recorded in discrete time intervals (e.g., "did the event occur in year 1, year 2, year 3?") rather than at exact event times. Discrete-time survival models handle this naturally. Time is divided into intervals, and for each interval you record a binary indicator: did the event occur (1) or not (0)? You then fit a logistic regression model where the probability of the event depends on time and your predictors.

The advantages: straightforward interpretation as logistic regression, easy handling of tied event times, and natural accommodation of censoring within an interval. The disadvantage: you lose information if you have exact event times but artificially discretize them.

Cure Models in Survival Analysis

The Problem with Standard Survival Models

In standard Cox regression or parametric survival models, the survival probability converges to zero as time approaches infinity: $\lim_{t \to \infty} S(t) = 0$. This assumes everyone will eventually experience the event. But in many real applications, such as cancer remission, disease elimination, or product reliability, some subjects may never experience the event. They're "cured" or immune. The Kaplan-Meier survival curve might plateau at some positive level (say, 0.15) rather than reaching zero. A standard Cox model forced onto such data produces biased estimates and poor predictions at long follow-up times.
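The person-period data structure that discrete-time models (and time-varying covariates) rely on can be built in a few lines. A sketch on made-up subject records; the function name and tuple layout are illustrative, not a library API:

```python
# Sketch: expand subject-level survival data into "person-period" format
# (one row per interval at risk). Each subject contributes one row per
# interval up to their last observed interval; the final row carries
# event = 1 only if the event actually occurred (otherwise censored).
def person_period(subjects):
    """subjects: list of (id, last_interval, event_indicator)."""
    rows = []
    for sid, last, event in subjects:
        for interval in range(1, last + 1):
            rows.append((sid, interval, int(event and interval == last)))
    return rows

# Subject 1 has the event in year 2; subject 2 is censored in year 3:
print(person_period([(1, 2, 1), (2, 3, 0)]))
# -> [(1, 1, 0), (1, 2, 1), (2, 1, 0), (2, 2, 0), (2, 3, 0)]
```

A logistic regression of the event column on interval indicators plus predictors then estimates the discrete-time hazard; the censored subject simply contributes all-zero rows for the intervals it was observed.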
The Two-Component Cure Model

Cure models address this by splitting the population into two latent groups:

Cured individuals: a proportion $p$ of the population who will never experience the event, surviving indefinitely
Susceptible individuals: a proportion $(1-p)$ who are at risk and follow a standard survival model

The model has two linked components:

Logistic regression estimates the probability $p$ that an individual belongs to the cured group, based on predictors
Hazard model (often discrete-time logistic regression) estimates the conditional event probability at each time, given that the individual is susceptible

The combined survival function reflects both groups: $$S(t) = p + (1-p) \cdot S_{\text{susceptible}}(t)$$ where $S_{\text{susceptible}}(t)$ is the survival function for susceptible individuals.

Interpreting the Cure Fraction

The cure fraction is the estimated proportion of the population that is cured. If $\hat{p} = 0.15$, you estimate that 15% of subjects will never experience the event, while 85% are susceptible. A larger cure fraction means a higher plateau in the survival curve at long follow-up times. This is the key difference from standard models: the curve plateaus rather than approaching zero.

<extrainfo> Cure fractions make most sense when you have long follow-up times where the plateau becomes apparent. With short follow-up, it's hard to distinguish between a low cure fraction and simply not enough time for events to accumulate. </extrainfo>

Choosing a Parametric Distribution

In some analyses, you assume survival times follow a specific probability distribution. Common choices include:

Exponential Distribution

The exponential distribution assumes a constant hazard over time: $h(t) = \lambda$ for all $t$. This is the simplest assumption but often unrealistic. It's useful primarily as a baseline comparison.
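The plateau behavior of the combined survival function above is easy to verify numerically. A sketch assuming, purely for illustration, exponential survival among the susceptibles, $S_{\text{susceptible}}(t) = e^{-\lambda t}$, with a made-up rate:

```python
import math

# Cure-model survival: S(t) = p + (1 - p) * S_susceptible(t).
# For illustration only, susceptibles are given exponential survival
# with a made-up rate lam = 0.5; p = 0.15 is the cure fraction.
def cure_survival(t, p=0.15, lam=0.5):
    return p + (1 - p) * math.exp(-lam * t)

# The curve starts at 1 and plateaus at the cure fraction p,
# rather than falling to zero as in a standard survival model:
for t in (0, 5, 20, 100):
    print(t, round(cure_survival(t), 4))
```

For large $t$ the susceptible term vanishes and $S(t) \to p$, which is exactly the plateau a Kaplan-Meier curve with a cure fraction exhibits.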
Weibull and Gamma Distributions

The Weibull distribution allows the hazard to increase monotonically, decrease monotonically, or stay constant (the constant case reduces to the exponential). Its shape parameter controls the direction of change. The gamma distribution and the more flexible generalized gamma distribution also model monotonic hazard changes. The generalized gamma is particularly useful because it includes the exponential, Weibull, and log-normal distributions as special cases, offering flexibility to fit various hazard shapes.

Other Distributions

The log-logistic distribution models a hazard that rises to a peak and then declines: useful when risk increases early (e.g., post-surgery) and then decreases as recovery occurs.

A common preliminary step is to examine the data on a transformed scale: for example, raw melanoma tumor thickness is right-skewed, while log-transformed thickness is more symmetric. This kind of transformation assessment helps you choose an appropriate parametric distribution.

Choosing Among Distributions

How do you decide which distribution fits best?

Visual inspection: Plot the empirical hazard or $\log(-\log(S(t)))$ vs. $\log(t)$. Different distributions produce different patterns: the exponential gives a flat empirical hazard; the Weibull gives a straight line on the log-log plot; the log-logistic gives a curved log-log plot.

Akaike Information Criterion (AIC): Fit competing models and compare AIC values. Lower AIC indicates better fit while penalizing model complexity. This is objective and routine in practice.

Goodness-of-fit tests: The Cox-Snell residual plot checks whether the residuals follow a unit exponential distribution. Systematic deviation from the 45-degree line suggests poor fit.

Don't over-interpret small differences in AIC; practical significance matters too. A slightly worse fit might be acceptable if the distribution is simpler to interpret.

Summary of Key Relationships Among Methods

Cox regression is the most flexible of these approaches.
It makes fewer assumptions (no assumed distribution for survival times) but requires the proportional hazards assumption. Parametric models assume a specific distribution but are more efficient when that assumption holds. The log-rank test is Cox regression with a single binary predictor. Discrete-time models are an alternative when time is naturally discretized. Cure models address populations in which a fraction never experiences the event.

<extrainfo> Advanced Model Types

Accelerated failure time (AFT) models are parametric alternatives to Cox regression. Instead of modeling hazards directly, they model how predictors stretch or compress the time axis: a predictor might "accelerate" disease progression, making the event occur sooner. AFT models have a different interpretation than hazard ratios but can be useful when the time-acceleration view is more natural.

Bayesian survival analysis incorporates prior information (from expert knowledge or previous studies) to improve inference with limited data. This is especially valuable in rare diseases or early-stage research.

Random survival forests extend ensemble methods to survival data, using decision trees to handle high-dimensional predictors (e.g., gene expression data with thousands of variables). They're computationally intensive but don't require distributional or proportional hazards assumptions. </extrainfo>
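The AIC comparison described under "Choosing Among Distributions" has a closed form in the exponential case: with $d$ events and total time at risk $\sum t_i$, the MLE is $\hat\lambda = d / \sum t_i$ and the log-likelihood is $d \log \lambda - \lambda \sum t_i$. A sketch on made-up censored data (a competing Weibull fit would be compared the same way, with the lower AIC preferred):

```python
import math

# Sketch: fit a constant-hazard (exponential) model to right-censored
# data by maximum likelihood and compute its AIC. Data are made up:
# times are follow-up times, events marks event (1) vs censoring (0).
def exponential_aic(times, events):
    d = sum(events)                    # number of observed events
    total_time = sum(times)            # total time at risk
    lam = d / total_time               # MLE of the constant hazard
    loglik = d * math.log(lam) - lam * total_time
    return lam, 2 * 1 - 2 * loglik    # AIC = 2k - 2*loglik, k = 1

times = [2.0, 3.5, 1.0, 4.0, 6.0]
events = [1, 1, 0, 1, 0]
lam, aic = exponential_aic(times, events)
print(round(lam, 3), round(aic, 2))  # 0.182 18.23
```

Note how censored subjects still contribute their time at risk to the denominator even though they add no event to the numerator; ignoring them would inflate the estimated hazard.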
Flashcards
What is the primary purpose of Cox proportional hazards regression?
To analyze how quantitative predictors (like gene expression or age) and categorical predictors affect survival.
How are categorical predictors incorporated into a Cox regression model?
As indicator (dummy) variables.
The log‑rank test is a special case of Cox regression with what specific type of predictor?
A single binary predictor.
Which three tests are asymptotically equivalent and used to assess overall model significance in Cox regression?
Likelihood‑ratio test, Wald test, and score (log‑rank) test.
Which R function is used to test the proportional hazards assumption?
cox.zph()
In the cox.zph() test, what does a p-value less than $0.05$ ($p < 0.05$) indicate?
A violation of the proportional hazards assumption.
What is the purpose of stratification in Cox models?
To divide subjects into homogeneous strata while allowing a common set of regression coefficients.
How does the baseline hazard function behave across different strata in a stratified Cox model?
Each stratum may have its own unique baseline hazard function.
How are variables that change over the follow-up period (e.g., medication dose) incorporated into a Cox model?
As time-varying covariates.
In discrete-time survival models, what does the binary indicator for each interval record?
Whether the event occurs during that specific interval.
In a cure model, what is the role of the logistic regression component?
To estimate the probability that an individual will never experience the event (the "cured" fraction).
In a cure model, what does the hazard model (discrete-time logistic regression) estimate?
The conditional event probability at each time point for susceptible individuals.
According to a cure model, the survival function is the sum of which two groups?
Cured individuals (who survive indefinitely) and susceptible individuals (whose survival declines according to their hazard).
What happens to the survival probability as time approaches infinity in a standard survival model without a cure fraction?
The survival probability is forced to zero.
In a cure model, what value does the survival probability converge to as time becomes large?
The proportion of cured individuals (the cure fraction).
Visually, how does a larger cure fraction affect a survival curve at long follow-up times?
It results in a higher plateau in the curve.
What are the two latent groups that a population is divided into under the assumptions of a cure model?
Cured (immune) and susceptible (at risk).
What does the exponential distribution assume regarding the hazard over time?
It assumes a constant hazard.
Which three distributions are included as special cases of the generalized gamma distribution?
The exponential, Weibull, and log-normal distributions.
What shape of hazard does the log-logistic distribution model?
A hazard that rises to a peak and then declines.
Which parameter determines whether a Weibull distribution represents an increasing or decreasing monotonic hazard?
The shape parameter.
What are two common visual methods used to assess the shape of a survival distribution?
Plotting the empirical hazard, and plotting the log-negative-log survival curve ($\log(-\log(S(t)))$ vs. $\log(t)$).
Which numerical criterion is used to compare the relative quality of competing parametric fits?
Akaike information criterion (AIC).
Which type of residual plot is used as a goodness-of-fit test for parametric survival models?
Cox–Snell residual plot.
What is an accelerated failure time model?
A parametric alternative to Cox regression in which predictors stretch or compress the time axis.
What is the core assumption of proportional hazards models?
That hazard ratios between groups remain constant over time.
What is the most common form of censoring in survival analysis?
Right-censoring.
What is left truncation (also known as delayed entry)?
When subjects enter a study after the initial origin time.
What is the purpose of the Kaplan–Meier estimator?
To non-parametrically estimate the survival function from censored data.
What statistical method is used to find parameter values that maximize the likelihood of observed data in parametric models?
Maximum likelihood estimation.
How is the mortality rate typically expressed in population studies?
Deaths per 1,000 individuals per year.
What is a key benefit of using Bayesian survival analysis when dealing with limited data?
It incorporates prior information to improve inference.
Which advanced method uses ensemble decision trees to handle high-dimensional censored data?
Random survival forests.

Key Concepts
Survival Analysis Models
Cox proportional hazards regression
Stratified Cox model
Time‑varying covariates
Discrete‑time survival model
Cure model
Accelerated failure time model
Weibull distribution
Survival Estimation Techniques
Log‑rank test
Kaplan–Meier estimator
Random survival forest
Bayesian survival analysis