The R packages used here are as follows

library(corrplot)
library(jtools)
library(margins)
library(patchwork)
library(prediction)
library(ROCR)
library(survival)
library(survminer)
library(stargazer)
library(tidyverse)

1. What is Survival Analysis?

Also called time-to-event analysis
A branch of statistics that analyzes the expected duration until an event occurs, such as biological death or machine system failure.
Different disciplines use different terminology:

Engineering	Reliability Analysis
Economics	Duration Analysis
Sociology	Event History Analysis

1.1 What Survival Analysis Can Tell Us

What proportion of people survive beyond a certain point in time
Among survivors, what proportion eventually die (or experience system failure)
Whether multiple causes of death or failure can be accounted for
How specific conditions or characteristics increase (or decrease) the probability of survival

1.2 Survival Analysis Methods by Purpose

Example: Medical Studies

Research Question	Appropriate Method	Overview
Overall trends in treatment duration	Kaplan–Meier (KM) Survival Curve	Visualizes the number of patients surviving until the end of treatment using survival curves.
Differences in mortality rates between men and women	Log-rank Test	Tests whether survival curves differ statistically across two or more groups.
Impact of factors such as age on mortality	Cox Proportional Hazards Model	Allows simultaneous analysis of factors affecting the hazard (risk of death).
Significance of explanatory variables in the Cox model	Likelihood Ratio Test	Evaluates the significance of Cox regression coefficients. The likelihood ratio test assesses the overall model, while the Wald test examines the effect of each variable individually.

1.3 Common Terminology in Survival Analysis

Event	Death, onset of disease, relapse, recovery, or any other outcome of interest
Time	The duration from the start of observation (e.g., surgery or treatment) until (i) the occurrence of the event, (ii) the end of the study, or (iii) loss to follow-up/withdrawal from the study
Censoring	Occurs when some information on survival time is available, but the exact survival time is unknown
Survival Function: S(t)	The probability that a subject survives longer than time t

1.4 Data

Here, we use the Acute Myelogenous Leukemia (AML) survival dataset included in the survival package in R.

library(survival)

The survival package includes aml.csv
This is the Acute Myelogenous Leukemia survival dataset
Load the aml.csv data

aml <- read_csv("data/aml.csv")

Check the variables included in the aml dataset

DT::datatable(aml)

time	Survival time or censoring time (weeks)
status	Cancer relapse: 0 = no event (censored), 1 = event occurred (relapse)
x	Treatment group: Nonmaintained (no maintenance chemotherapy), Maintained (with maintenance chemotherapy)

Try sorting by time in ascending order

From patient 12 through patient 3, organize the situation of these 9 patients by time and status.

Among the 23 patients, 8 are confirmed dead … so is the mortality rate \(\frac{8}{23}\)?
Patient 3 was censored on day 13.
After censoring, it is unknown whether patient 3 eventually died or survived.
If we calculate the mortality rate without confirming this fact … it is \(\frac{8}{23}\).
However, if patient 3 actually died after censoring, then the true mortality rate would be \(\frac{9}{23}\).
In that case, reporting a mortality rate of \(\frac{8}{23}\) would underestimate the true mortality of \(\frac{9}{23}\).
To avoid underestimating mortality due to censored cases being ignored,
→ the vertical axis of the Kaplan–Meier curve is calculated as the cumulative survival probability.

What is “cumulative survival probability”?

In the figure below, the horizontal axis is survival time (weeks), and the vertical axis is the cumulative survival probability.
By multiplying survival probabilities at each event time, we can compute the cumulative probability of surviving up to the end of that time period.

Example:

On day 1, survival probability = \(\frac{23}{23} = 1\)
On day 5, survival probability = \(\frac{21}{23} = 0.91\)
Cumulative survival probability through day 5 = \(1 \times 0.91 = 0.91\)

# of risk = n_risk
→ The number of patients who have not yet died and have not yet been censored.

At day 0 (Time = 0) nobody has died yet

Probability of death = 0
Survival probability = \(\frac{23}{23} = 1\)

At day 5, two patients (patient 12 and 13) die

Survivors = 21 out of 23
Survival probability = \(\frac{21}{23} = 0.91\)

\[Cumulative.survival.probability (day\_5) \\= s.prob (day\_0)× s.prob (day\_5)」\\ = 1×0.91 = 0.91\]

At day 8, two more patients (patient 14 and 15) die

Survivors = 19 out of 21
Survival probability = \(\frac{19}{21} = 0.9047\) (note that the denominator is now 21)
Which is approximately \(1 - 0.095 = 0.905\)

\[Cumulative.survival.probability (Day\_8)\\= s.prob (day\_0)× s.prob (day\_5) × s.prob (day\_8)」\\ = 1×0.91 × 0.905= 0.823\]

Let’s check the actual “cumulative survival probabilities”:

# KM estimate (overall)
fit_all <- survfit(Surv(time, status) ~ 1, data = aml)

# Show summary (survival probability at each time)
summary(fit_all)

Call: survfit(formula = Surv(time, status) ~ 1, data = aml)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    5     23       2   0.9130  0.0588       0.8049        1.000
    8     21       2   0.8261  0.0790       0.6848        0.996
    9     19       1   0.7826  0.0860       0.6310        0.971
   12     18       1   0.7391  0.0916       0.5798        0.942
   13     17       1   0.6957  0.0959       0.5309        0.912
   18     14       1   0.6460  0.1011       0.4753        0.878
   23     13       2   0.5466  0.1073       0.3721        0.803
   27     11       1   0.4969  0.1084       0.3240        0.762
   30      9       1   0.4417  0.1095       0.2717        0.718
   31      8       1   0.3865  0.1089       0.2225        0.671
   33      7       1   0.3313  0.1064       0.1765        0.622
   34      6       1   0.2761  0.1020       0.1338        0.569
   43      5       1   0.2208  0.0954       0.0947        0.515
   45      4       1   0.1656  0.0860       0.0598        0.458
   48      2       1   0.0828  0.0727       0.0148        0.462

Key points of the Kaplan–Meier curve ・When calculating survival probability, individuals who are censored are excluded from the denominator.
→ This allows us to estimate how many events would have occurred if those censored cases had remained under observation.
→ This approach avoids underestimating the risk.

1.5 Kaplan–Meier Survival Curve

The survival function S(t) represents the probability that a subject survives longer than time t.
Although S(t) is theoretically a smooth curve, in practice it is usually estimated using the Kaplan–Meier (KM) curve.

Research question ・Whether patients who received maintenance therapy experienced delayed mortality compared to those who did not receive maintenance therapy.

library(survival)
library(survminer)

# Load data
aml <- read_csv("data/aml.csv")  # Load the uploaded aml.csv file

# Create survival object
# Time to relapse: time, Event occurrence: status, Treatment: x (Maintained or Nonmaintained)
fit <- survfit(Surv(time, status) ~ x, data = aml)

# Plot KM curve
km_1 <- ggsurvplot(fit, data = aml,
           pval = TRUE,              # p-value (log-rank test)
           conf.int = TRUE,          # Show confidence interval
           risk.table = TRUE,        # Show risk table
           xlab = "Time to Relapse",
           ylab = "Survival Probability",
           legend.title = "Therapy",
           legend.labs = c("Maintained", "Non-maintained"),
           palette = c("blue", "red"),
           ggtheme = theme_minimal())
km_1

Key Points for Interpreting the Results

The higher a KM curve is, the greater the proportion remaining relapse-free.
If the blue line (Maintained) lies above the red line (Non-maintained), maintenance therapy may be delaying relapse.
If the log-rank test yields p-value < 0.05, the difference between treatment groups is statistically significant.

Interpreting the Plot

The x-axis shows time from 0 (start of observation) to the last observed time.
The y-axis shows the proportion of subjects who have not experienced the event (cumrative survival probability).
At time zero, 100% of subjects are event-free (no relapses yet).
The stepwise solid lines (blue and red) track the occurrence of events over time.
Blue represents “maintenance chemotherapy = yes (Maintained).”
Red represents “maintenance chemotherapy = no (Non-maintained).”
Vertical drops indicate an event occurred.
Two events at week 5 ⇒ first drop ↓
Two events at week 8 ⇒ second drop ↓
One event at week 9 ⇒ third drop ↓
A “+” where it intersects a horizontal segment indicates a patient was censored at that time.
+ at week 13 (blue) … a censored patient in the Maintained group
+ at week 16 (red) … a censored patient in the Non-maintained group
As time passes, the curves trend downward to the right
⇒ survival probability decreases
⇒ the Non-maintained group declines more steeply
⇒ the Maintained group maintains a higher survival probability.
In the aml dataset, five subjects were censored at weeks 13, 16, 28, 45, and 161, respectively.
In the KM survival curve, there are five tick marks corresponding to these censored observations.

1.6 Log-rank Test

The log-rank test is a “test of differences in survival rates.”
It compares survival times across two or more groups.
Here, we use the log-rank test to compare survival between the Maintained and Non-maintained groups.
・Null hypothesis of the log-rank test:
the survival rates of the two treatment groups are the same.
the survival rates of the two treatment groups are the same.]{style=“color:red”}**
At each event time, the expected number of survivors is adjusted to reflect the number of subjects at risk within each treatment group.
The test evaluates whether the number of observed events (Observed) differs significantly from the expected number (Expected) in each group.
The test is based on the chi-squared distribution.
A larger log-rank test statistic provides stronger evidence that survival differs between groups.
The log-rank test statistic is approximately chi-squared with 1 degree of freedom, and the p-value is calculated from this distribution.
In the Kaplan–Meier survival curve shown above, the result of the log-rank test is p-value = 0.065.
⇒ The null hypothesis cannot be rejected.
⇒ There is no statistically significant difference in survival between the two treatment groups.
For confirmation, we can use the survdiff() function to examine the log-rank test results in more detail.

survdiff(Surv(time, status) ~ x, data = aml)

Call:
survdiff(formula = Surv(time, status) ~ x, data = aml)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11        7    10.69      1.27       3.4
x=Nonmaintained 12       11     7.31      1.86       3.4

 Chisq= 3.4  on 1 degrees of freedom, p= 0.07

Results:

N	Sample size
Observed	Number of relapses
(O–E)² / E	Expected value (Expected)
(O–E)² / V	Chi-squared test statistic

x = Maintained Group (with maintenance therapy)

The expected number of relapses was 10.69, but in reality only 7 relapses occurred.
→ Fewer relapses (7) than expected were observed.
⇒ This suggests a possible treatment effect.

x = Non-maintained Group (without maintenance therapy)

The expected number of relapses was 7.31, but in reality 11 relapses occurred.
More relapses (11) than expected were observed.

Now, the result is…

Log-rank Test Result ・p-value = 0.07 is slightly above the 0.05 significance level, and therefore not statistically significant.
⇒ There is some indication that patients receiving maintenance therapy experience delayed relapse, but this cannot be stated with statistical certainty.

Note:

The sample size of 23 subjects is too small.
⇒ The test has very little power to detect differences between treatment groups.
⇒ The chi-squared test is based on asymptotic approximation.
⇒ With small sample sizes, the p-value must be interpreted with caution.

1.7 Comparison of survival probabilities at a specified time

In long-term follow-up data, the number of events decreases, and the estimates in the later period become unstable.
→ By restricting the analysis to a clinically important time point (for example, up to 40 weeks), a more stable comparison becomes possible.
The interpretation in this case is: “If we limit the follow-up to 40 weeks, do the survival curves differ significantly between groups?” Survival probability up to 40 weeks

Survival probability up to 40 weeks

library(survival)
library(survminer)
library(tidyverse)

# --- KM estimation (by group) ---
fit <- survfit(Surv(time, status) ~ x, data = aml)

# --- Survival probability at 40 weeks ---
s40 <- summary(fit, times = 40)
df40 <- data.frame(
  group = sub("^x=", "", s40$strata),  # "x=Maintained" → "Maintained"
  surv  = s40$surv
)

# Labels (percent format) and colors (to match ggsurvplot palette)
df40$label <- paste0(df40$group, ": ", sprintf("%.1f%%", df40$surv * 100))
df40$col   <- ifelse(df40$group == "Maintained", "blue", "red")

# --- Plot ---
p <- ggsurvplot(
  fit, data = aml,
  conf.int = TRUE,
  risk.table = TRUE,
  xlab = "Time to Relapse (weeks)",
  ylab = "Survival Probability",
  legend.title = "Therapy",
  legend.labs = c("Maintained", "Non-maintained"),
  palette = c("blue", "red")
)

# Add vertical line, points, and labels at 40 weeks
p$plot <- p$plot +
  geom_vline(xintercept = 40, linetype = 2) +
  geom_point(
    data = df40,
    aes(x = 40, y = surv),
    color = df40$col, size = 3, inherit.aes = FALSE
  ) +
  geom_text(
    data = df40,
    aes(x = 40, y = surv, label = label),
    vjust = -0.8, color = df40$col, fontface = "bold",
    inherit.aes = FALSE
  )

p

The figure above consists of two parts.

Upper curve (Survival Probability & Time to Relapse: weeks)

Estimated probability of surviving without relapse up to 40 weeks
・Patients who received maintenance therapy (Maintained): 0.36 (36.8%)
・Patients who did not receive maintenance therapy (Non-maintained): 0.194 (19.4%)

Lower risk table (Therapy & Number at risk)

The factor Therapy has two levels: Maintained and Non-maintained
Number at risk: the number of patients who have not yet experienced the event and remain alive at each time point
At t = 0, patients with therapy (11) + patients without therapy (12) = 23 in total
At t = 40, patients with therapy (3) + patients without therapy (2) = 5 in total
Using the survdiff() function, we can check the results of the log-rank test up to 40 weeks

(Note: survdiff() does not take a data frame as the first argument, so the formula and data = … must be explicitly specified.)

aml_40 <- aml |>      
  filter(time <= 40) 

  survdiff(Surv(time, status) ~ x,
    data = aml_40)

Call:
survdiff(formula = Surv(time, status) ~ x, data = aml_40)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained     8        6      8.3     0.639       1.6
x=Nonmaintained 10        9      6.7     0.793       1.6

 Chisq= 1.6  on 1 degrees of freedom, p= 0.2

分析結果の表示：

N	Number of samples	→ Total sample size up to 40 weeks: 18
Observed	Number of deaths	→ Total number of deaths up to 40 weeks: 15
(O-E)^2/E	Expected value (Expected)
(O-E)^2/V	Chi-square test statistic

x = Maintained group (with maintenance therapy)

Expected number of deaths: 8.3, but only 6 actually occurred
→ Fewer deaths (6) than expected were observed
⇒ This suggests a possible treatment effect

x = Non-maintained group (without maintenance therapy)

Expected number of deaths: 6.7, but as many as 9 actually occurred
More deaths (9) than expected were observed
Now, the results are・・・

Log-rank test results up to 40 weeks ・The p-value = 0.2 is slightly above the significance level of 0.05, and thus cannot be considered statistically significant.
⇒ This indicates some tendency that patients receiving maintenance therapy experience delayed mortality, but the result is not statistically conclusive

1.8 Time until half of the patients survive

Median Survival Time

The time point at which the survival probability reaches 50%
・Patients who received maintenance therapy (Maintained): 31 weeks
・Patients who did not receive maintenance therapy (Non-maintained): 23 weeks
→ Half of the patients in the non-maintained group reached this point 7 weeks earlier than those in the maintained group.

library(survival)
library(survminer)
library(ggplot2)

# --- Kaplan–Meier estimation ---
fit <- survfit(Surv(time, status) ~ x, data = aml)

# --- Extract median survival times ---
tb <- summary(fit)$table
df_med <- data.frame(
  group  = gsub("^x=", "", rownames(tb)),   # → "Maintained" / "Nonmaintained"
  median = as.numeric(tb[, "median"]),
  stringsAsFactors = FALSE
)

# Exclude groups whose median is not reached (NA)
df_med2 <- subset(df_med, !is.na(median))

# Split by group
df_m <- subset(df_med2, group == "Maintained")
df_n <- subset(df_med2, group == "Nonmaintained")   # ← Note: no hyphen

# --- Draw KM curves ---
p <- ggsurvplot(
  fit, data = aml,
  conf.int = TRUE, risk.table = TRUE,
  xlab = "Time to Relapse (weeks)", ylab = "Survival Probability",
  legend.title = "M.S.Time",
  legend.labs = c("Maintained", "Non-maintained"),
  palette = c("blue", "red"),
  xlim = c(0, 50)
)

# --- Horizontal line at y=0.5, intersection dots & numeric labels ---
p$plot <- p$plot +
  geom_hline(yintercept = 0.5, linetype = 2) +
  # Maintained: dot & number (blue)
  geom_point(data = df_m, aes(x = median, y = 0.5),
             inherit.aes = FALSE, color = "blue", size = 3) +
  geom_text(data = df_m, aes(x = median, y = 0.55, label = sprintf("%.0f", median)),
            inherit.aes = FALSE, color = "blue", fontface = "bold", hjust = -0.3) +
  # Nonmaintained: dot & number (red)
  geom_point(data = df_n, aes(x = median, y = 0.5),
             inherit.aes = FALSE, color = "red", size = 3) +
  geom_text(data = df_n, aes(x = median, y = 0.55, label = sprintf("%.0f", median)),
            inherit.aes = FALSE, color = "red", fontface = "bold", hjust = -0.3)

p

When examining whether the median survival times differ significantly between groups
→ There is no direct test such as a t-test applicable to medians themselves.
→ Instead, in addition to the difference in medians (= 8), one statistically tests whether the survival curves differ using the log-rank test.

survdiff(Surv(time, status) ~ x, data = aml)

Call:
survdiff(formula = Surv(time, status) ~ x, data = aml)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11        7    10.69      1.27       3.4
x=Nonmaintained 12       11     7.31      1.86       3.4

 Chisq= 3.4  on 1 degrees of freedom, p= 0.07

Now, the results are・・・

Test for the difference in median survival times ・The p-value = 0.07 is slightly above the 0.05 significance level and therefore cannot be regarded as statistically significant.
⇒ This means that there is not sufficient evidence to conclude that the time until half of the patients survive differs between the maintenance and non-maintenance groups

2. Analysis Using the Cox Proportional Hazards Model

For example, suppose we want to know:
Even after accounting for age and gender, do patients receiving maintenance therapy tend to relapse later?
However, the KM method can only handle one variable (grouping factor).
⇒ Multivariable analysis is not possible with KM.

Solution ⇒ Use the Cox Proportional Hazards Model

Differences Between the Kaplan–Meier Method and the Cox Proportional Hazards Model

What you want	Why KM Is Insufficient	Why Use the Cox Model
🔸 Analyze multiple factors (covariates) simultaneously	Can only handle one variable	Allows multivariable analysis
🔸 Include quantitative variables	Can only handle categorical variables	Can handle continuous variables
🔸 Control for confounders (e.g., gender, age)	Cannot adjust for them	The Can adjust for confounders
🔸 Know the hazard ratio (risk magnitude)	Provides only survival probabilities	Outputs hazard ratios
🔸 Build a predictive model	Not well-suited for prediction	Can be applied to prediction

2.1 Preparing the Data

Here, we use the Lung Cancer dataset included in the survival package in R.

library(survival)

Check the datasets available in the survival package.
Use one of the following commands:

data(package = "survival") 
help(package = "survival")

From these, we will use the lung dataset (NCCTG Lung Cancer Data).
Load the lung dataset.

names(lung)

 [1] "inst"      "time"      "status"    "age"       "sex"       "ph.ecog"  
 [7] "ph.karno"  "pat.karno" "meal.cal"  "wt.loss"

inst	: Medical institution code
time	: Survival time (days)
status	: Censoring status (1 = censored, 2 = dead)
age	: Age (years)
sex	: Gender (1 = male, 2 = female)
ph.ecog	: Physician-rated ECOG performance score (0 = asymptomatic, 1 = symptomatic but fully ambulatory, 2 = in bed less than 50% of the day, 3 = in bed more than 50% of the day but not bedridden, 4 = completely bedridden)
ph.karno	: Physician-rated Karnofsky performance score (0 = worst, 100 = best)
pat.karno	: Patient-rated Karnofsky performance score
meal.cal	: Calories consumed from meals
wt.loss	: Weight loss in the past 6 months (pounds)

Variable Transformation (status → death)

status: Censoring status (1 = censored, 2 = dead)
→ In survival analysis, event occurrence is coded as 1 (death).
→ If status = 2 (death), then assign 1; otherwise, assign 0.
→ Create a new variable: death = 1 (if dead), death = 0 (if alive/censored).

Variable Transformation (sex → female)

sex: Gender (1 = male, 2 = female)
→ If sex = female, assign 1; if male, assign 0.
→ Create a new variable: female = 1 (if female), female = 0 (if male).

lung <- lung |> 
  mutate(death = if_else(status == 2, 1, 0)) |> 
  mutate(female = if_else(sex == 2, 1, 0))

◎Note: By using the following command, you can check and filter multiple variables (e.g., time, status, x) simultaneously.

reactable::reactable(lung,
  filterable = TRUE,      # enable search/filter
  defaultPageSize = 10)   # set maximum number of displayed rows

2.2 Kaplan–Meier Survival Curves (by Gender)

# Load packages
library(survival)
library(survminer)

# Load data
aml <- read_csv("data/lung.csv")  # Load the uploaded lung.csv file

# Create survival object
# Time to relapse: time, Event occurrence: death, Treatment: x (Maintained or Nonmaintained)
fit <- survfit(Surv(time, death) ~ sex, 
  data = lung)

# Plot KM curve
km_1 <- ggsurvplot(fit, data = lung,
           pval = TRUE,              # p-value (log-rank test)
           conf.int = TRUE,          # show confidence interval
           risk.table = TRUE,        # show risk table
           xlab = "Time to Relapse",
           ylab = "Survival Probability",
           legend.title = "Therapy",
           legend.labs = c("Male Patient", "Female Patient"),
           palette = c("blue", "red"),
           ggtheme = theme_minimal())
km_1

Findings from the Kaplan–Meier Survival Curve

・The female curve lies above the male curve.
⇒ Compared to the male group, the female group tends to survive longer.
・Log-rank test (p = 0.0013)
⇒ The difference between men and women in survival time and survival probability is statistically significant.

2.4 Cox Proportional Hazards Model (by Gender)

Research Question

How does gender affect survival time in lung cancer patients?
At the same time, we want to control for age, physician-rated performance score, caloric intake, and weight loss over the past six months.
⇒ Therefore, instead of the Kaplan–Meier survival curve, we use the Cox Proportional Hazards Model.

cox_model <- coxph(Surv(time, death) ~ age + 
    female + 
    ph.ecog + 
    ph.karno + 
    pat.karno + 
    meal.cal + 
    wt.loss,
  data = lung)

summary(cox_model)

Call:
coxph(formula = Surv(time, death) ~ age + female + ph.ecog + 
    ph.karno + pat.karno + meal.cal + wt.loss, data = lung)

  n= 168, number of events= 121 
   (60 observations deleted due to missingness)

                coef  exp(coef)   se(coef)      z Pr(>|z|)   
age        1.065e-02  1.011e+00  1.161e-02  0.917  0.35906   
female    -5.509e-01  5.765e-01  2.008e-01 -2.743  0.00609 **
ph.ecog    7.342e-01  2.084e+00  2.233e-01  3.288  0.00101 **
ph.karno   2.246e-02  1.023e+00  1.124e-02  1.998  0.04574 * 
pat.karno -1.242e-02  9.877e-01  8.054e-03 -1.542  0.12316   
meal.cal   3.329e-05  1.000e+00  2.595e-04  0.128  0.89791   
wt.loss   -1.433e-02  9.858e-01  7.771e-03 -1.844  0.06518 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          exp(coef) exp(-coef) lower .95 upper .95
age          1.0107     0.9894    0.9880    1.0340
female       0.5765     1.7347    0.3889    0.8545
ph.ecog      2.0838     0.4799    1.3452    3.2277
ph.karno     1.0227     0.9778    1.0004    1.0455
pat.karno    0.9877     1.0125    0.9722    1.0034
meal.cal     1.0000     1.0000    0.9995    1.0005
wt.loss      0.9858     1.0144    0.9709    1.0009

Concordance= 0.651  (se = 0.029 )
Likelihood ratio test= 28.33  on 7 df,   p=2e-04
Wald test            = 27.58  on 7 df,   p=3e-04
Score (logrank) test = 28.41  on 7 df,   p=2e-04

Upper part of the output・・・log hazard ratio ⇒ cannot be directly interpreted.
Lower part of the output・・・hazard ratio ⇒ interpretable Results for female (Hazard Ratio)

Indicator	Value	Interpretation
female coef	-5.509e-01	log hazard ratio (effect of being female)
female exp(coef)	0.5765	Hazard ratio (interpretable)
female p-value	0.00609	Since it is below the 5% significance level, the difference is statistically significant

Findings from the Cox Proportional Hazards Model (female)

Hazard Ratio Results

Effect of being female (female = 1) on the risk of death

・exp(coef) = 0.5765 → The mortality risk for women is about 57.6% of that for men
・Compared to men, women have about a 42.4% lower risk of death
・A hazard ratio below 1 indicates lower risk
・p-value = 0.00609 → Statistically significant

Results for age (Hazard Ratio)

Indicator	Value	Interpretation
age coef (coef)	1.065e-02	log hazard ratio (effect of age)
age exp(coef)	1.0107	Hazard ratio (interpretable)
age p-value	0.35906	Since it is above the 5% significance level, it is not statistically significant

Findings from the Cox Proportional Hazards Model (age)

Hazard Ratio Results

Effect of age on the risk of death

・exp(coef) = 1.0107 → Each additional year of age is estimated to increase the risk of death by about 1.07%.
・However, this difference is not statistically significant (p-value = 0.359).
・Therefore, in this model controlling for other variables, we cannot conclude that “higher age leads to a higher risk of death.”
・A hazard ratio below 1 indicates reduced risk.
・p-value = 0.35906 → Not statistically significant

3. Differences from Logistic Regression Analysis

Logistic Regression Analysis (Gender and Lung Cancer Mortality)

Load the lung data

library(tidyverse)

stargazer::stargazer(lung, 
  type = "html")


Statistic	N	Mean	St. Dev.	Min	Max

inst	227	11.088	8.303	1	33
time	228	305.232	210.646	5	1,022
status	228	1.724	0.448	1	2
age	228	62.447	9.073	39	82
sex	228	1.395	0.490	1	2
ph.ecog	227	0.952	0.718	0	3
ph.karno	227	81.938	12.328	50	100
pat.karno	225	79.956	14.623	30	100
meal.cal	181	928.779	402.175	96	2,600
wt.loss	214	9.832	13.140	-24	68
death	228	0.724	0.448	0	1
female	228	0.395	0.490	0	1

Draw scatter plots of death vs. age and death vs. female (using the log of the odds).
To avoid character encoding issues, Mac users should run the following two lines:

theme_set(theme_gray(base_size = 10, 
                     base_family = "HiraginoSans-W3"))

p_age <- ggplot(lung, aes(x = age, y = death)) + 
  geom_jitter(size = 1,
              alpha = 1/3,
              width = 0,
              height = 0.05) +
  geom_smooth(method = "glm", 
    color = "red",
    method.args = list(family = binomial(link = "logit"))) +
  labs(x = "Age",
       y = "Probability of Death")

p_female <- ggplot(lung, aes(x = female, y = death)) + 
  geom_jitter(size = 1,
              alpha = 1/3,
              width = 0.05,
              height = 0.05) +
  geom_smooth(method = "glm", 
    color = "red",
    method.args = list(family = binomial(link = "logit"))) +
  labs(x = "Gender",
       y = "Probability of Death")

p_age + p_female

model_lung <- glm(death ~ age + female + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, 
            data = lung, 
            family = binomial(link = "logit")) # 係数を「オッズの対数」に指定

Use tidy() to check the estimation results.

broom::tidy(model_lung, conf.int = TRUE)

# A tibble: 8 × 7
  term         estimate std.error statistic p.value  conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>     <dbl>     <dbl>
1 (Intercept) -2.28      3.30        -0.692  0.489  -8.80       4.24   
2 age          0.0210    0.0219       0.958  0.338  -0.0221     0.0643 
3 female      -0.940     0.392       -2.40   0.0164 -1.72      -0.177  
4 ph.ecog      1.05      0.480        2.19   0.0287  0.120      2.02   
5 ph.karno     0.0293    0.0271       1.08   0.279  -0.0245     0.0827 
6 pat.karno   -0.0148    0.0161      -0.914  0.361  -0.0471     0.0165 
7 meal.cal     0.000250  0.000502     0.498  0.618  -0.000700   0.00129
8 wt.loss     -0.00534   0.0147      -0.364  0.716  -0.0336     0.0247

→ The regression coefficient estimate displayed is the log of the odds, i.e., log(odds) = also called the log-odds.
- For example, the estimated coefficient for female is -0.940.
- This represents the log-odds = log(odds) = logit
- If this were the result of a multiple linear regression, it would mean:

“For every one-unit (1 million yen) increase in campaign expenditure, the probability of winning decreases by 0.940 points.”

However, coefficients in logistic regression cannot be interpreted in the same way as those in linear regression.

log-odds are difficult to interpret → Convert them into odds ratios

Display the results of model_lung in terms of odds ratios.

tidy(model_lung, 
  conf.int = TRUE, 
  exponentiate = TRUE)

# A tibble: 8 × 7
  term        estimate std.error statistic p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)    0.102  3.30        -0.692  0.489  0.000151    69.4  
2 age            1.02   0.0219       0.958  0.338  0.978        1.07 
3 female         0.390  0.392       -2.40   0.0164 0.179        0.837
4 ph.ecog        2.86   0.480        2.19   0.0287 1.13         7.51 
5 ph.karno       1.03   0.0271       1.08   0.279  0.976        1.09 
6 pat.karno      0.985  0.0161      -0.914  0.361  0.954        1.02 
7 meal.cal       1.00   0.000502     0.498  0.618  0.999        1.00 
8 wt.loss        0.995  0.0147      -0.364  0.716  0.967        1.02

Displaying Odds Ratios with a Forest Plot

# Format results (convert to odds ratios)
results <- tidy(model_lung, conf.int = TRUE, exponentiate = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term = recode(term,
                  "female" = "Female dummy",
                  "age" = "Age",
                  "ph.ecog" = "ECOG Performance Score",
                  "ph.karno" = "Karnofsky Score (Physician)",
                  "pat.karno" = "Karnofsky Score (Patient)",
                  "meal.cal" = "Calories from Meals",
                  "wt.loss" = "Weight Loss (pounds)"),
    OR_label = sprintf("%.2f", estimate)   # Convert OR to string
  )

# Forest plot + numeric labels
ggplot(results, aes(x = estimate, y = term)) +
  geom_point(size = 3, color = "blue") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "red") +
  geom_text(aes(label = OR_label), # Display OR
    hjust = 0.5, 
    vjust = -2, 
    size = 4) +  
  labs(
    title = "Odds Ratios from Logistic Regression",
    x = "Odds Ratio (95% CI)",
    y = ""
  ) +
  theme_minimal(base_size = 14) + # Add margin so labels are not cut off
  xlim(0, max(results$conf.high) * 1.2) +
  theme_bw(base_family = "HiraKakuProN-W3")

female coefficient: 0.39 (`p-value = 0.0164`)

Odds for females・・・\(odds_{female = 1}\)　　
Odds for males・・・\(odds_{female = 0}\)

\[OddRatios = \frac{odds_{female=1}}{odds_{female = 0}} = 0.39\]

→ The odds for \(odds_{female = 1}\) are 0.39 times the odds for \(odds_{female = 0}\)

→ For females, the probability of death decreases

Statistical Significance

The p-value for female is 0.0164.
→ At a significance level of 0.05, we can reject the null hypothesis that “gender has no effect on the probability of death.”
→ female has a negative effect on the probability of death, and this effect is judged to be statistically significant

log-odds are difficult to interpret → Convert to probabilities

3.1 Predicted Probabilities by Gender: `predict()`.

→ Useful when we want to illustrate “gender differences for a representative patient.”

Fix age, ph.ecog, ph.karno, pat.karno, meal.cal, and wt.loss at their mean values.

library(ggplot2)
library(dplyr)

# Fix average values and set female to 0 / 1
newdata <- data.frame(
  female = c(0, 1),
  age = mean(lung$age, na.rm = TRUE),
  ph.ecog = mean(lung$ph.ecog, na.rm = TRUE),
  ph.karno = mean(lung$ph.karno, na.rm = TRUE),
  pat.karno = mean(lung$pat.karno, na.rm = TRUE),
  meal.cal = mean(lung$meal.cal, na.rm = TRUE),
  wt.loss = mean(lung$wt.loss, na.rm = TRUE)
)

# Predicted values and standard errors
pred <- predict(model_lung, 
  newdata = newdata, 
  type = "response", 
  se.fit = TRUE)

# Data for plotting
plot_df <- newdata %>%
  mutate(
    fit = pred$fit,
    se = pred$se.fit,
    lower = fit - 1.96 * se,
    upper = fit + 1.96 * se,
    sex = factor(female, labels = c("male", "female"))
  )

# Plot: add p-value as text inside the figure
plot_prediction <- ggplot(plot_df, aes(x = sex, y = fit)) +
  geom_point(size = 5, color = "black") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) +
  geom_text(aes(label = paste0(round(fit * 100, 1), "%")), 
            hjust = -0.5, size = 5) +
  geom_text(
    data = data.frame(x = 1.5, y = max(plot_df$upper) + 0.05),
    aes(x = x, y = y, label = "p = 0.0164"),
    inherit.aes = FALSE,
    size = 5
  ) +
  labs(
    title = "Predicted Probability of Death by Gender",
    x = "Patient Gender",
    y = "Predicted Probability of Death"
  ) +
   theme_bw(base_family = "HiraKakuProN-W3")

plot_prediction

Results (Comparison of Predicted Probability of Death)
・The figure shows the difference in predicted probability of death by gender.
・Six control variables are fixed at their mean values.
・Predicted probability of death for males: 80.7%
・Predicted probability of death for females: 61.9%
・Since p-value = 0.024, at the 5% significance level we conclude that the difference between men and women is statistically significant.
・The difference in predicted probabilities is about 18.8 percentage points (80.7 − 61.9 = 18.8).
Under the same conditions, women have a significantly lower probability of death compared to men.

When the independent variable is categorical, if the main interest of the analysis is “the effect of gender on the probability of death,” then presenting the results in this way is sufficient.
However, if the main interest is “how the effect of being female changes depending on conditions,” it is necessary to show the conditional marginal effects for female = 0 and female = 1.

3.2 Predicted Probability of Death by Gender: `margins()`

Calculate the Average Marginal Effect (AME) to show the “effect of gender (difference)”

→ Useful when you want to show the “overall average gender difference.”
For the categorical variable (female), compute for each observation how much the probability of death changes when switching from male → female.
→ Take the average of these changes → this is the AME: Average Marginal Effect.
→ This allows you to state, for example:
“On average, women have a death probability that is X percentage points higher (or lower) than men.”
→ The interpretation of the coefficient thus becomes much clearer.

library(margins)
library(dplyr)
library(ggplot2)

# --- Function to format p-values ---
format_p <- function(x) {
  ifelse(x < 0.001,
         "p < 0.001",
         formatC(x, format = "e", digits = 3))
}
# --- Calculation of marginal effects ---
mfx <- margins(model_lung, variables = "female")

summary(mfx)

 factor     AME     SE       z      p   lower   upper
 female -0.1660 0.0650 -2.5543 0.0106 -0.2933 -0.0386

What this AME (-0.1660) indicates:
⇒ “When changing from male → female, the probability of death decreases on average by 16.6 percentage points (-0.1660).”
p-value ≈ 0.0106 → statistically significant

Results (Comparison of Marginal Effects)
・Understanding the “effect of gender (difference)”
= Clarifies the “average impact of gender change on the probability of death.
・AME = -0.1660
→ On average, women have a death probability about 16.6 percentage points lower
・The gender effect is statistically significant (p-value = 0.0106).

Summary of Results

・Result 3.1: `predict()`

・Compares predicted probabilities for men and women by assuming an “average patient.”
→ “Women have a probability of death that is 18.8 percentage points lower than men.”

・Result 3.2: `margins()`

・Respects the conditions of each sample observation and shows the average change in probability for women (AME, with 95% CI).
→ “On average, women have a death probability about 16.6 percentage points lower.”
- Since logistic regression is nonlinear, it is normal for the numerical values not to match exactly.

3.3 Examination of Validity

Cox Proportional Hazards Model

Objective	Estimate the risk of death (hazard ratio) based on “survival time.”
Result	Women have about a 42% lower risk of death compared to men; ph.ecog and ph.karno are significant.
Strength	Provides the probability of surviving up to a certain point in time (survival curve), rather than just the probability of death itself.
Validity	If survival data (time + censoring) are available, this is generally the most valid approach.

Logistic Regression Predicted Values Plot

Objective	Treat death as a binary outcome (0/1) and estimate “probability of death” from explanatory variables.
Result	Predicted probability of death: Women = 61.9%, Men = 80.7%.
Weakness	Treats those censored during the observation period (status = 1) as “not dead (death = 0).”
Validity	Inferior to the Cox proportional hazards model because it ignores survival time data.

Enter the data with status = 1 and death = 0.
→ This shows that there are 63 individuals censored during the survival period (status = 1) who are treated as “not dead (death = 0).”

reactable::reactable(lung,
  filterable = TRUE,  # Enable search
  defaultPageSize = 10) # Set the maximum number of rows displayed

Which is more valid?

If survival time data (time, censoring) are available:
→ The Cox proportional hazards model is more valid.
Logistic regression only considers “whether death occurred by a certain time or not.”
→ It discards time information.
However, for papers or presentations where the goal is to produce a “visually intuitive figure showing gender differences,” logistic regression predicted probability graphs are also straightforward and useful.

参考文献

Kieran Healy, DATA VISUALIZATION, Princeton, 2019
宋財泫 (Jaehyun Song)・矢内勇生 (Yuki Yanai)「私たちのR: ベストプラクティスの探究」
浅野正彦, 矢内勇生.『Rによる計量政治学』オーム社、2018年
浅野正彦, 中村公亮.『初めてのRStudio』オーム社、2018年
Winston Chang, R Graphics Cookbook, O’Reilly Media, 2012.
Kosuke Imai, Quantitative Social Science: An Introduction, Princeton University Press, 2017

34. Survival Analysis

Masahiko Asano

2025-08-25

1. What is Survival Analysis?

1.1 What Survival Analysis Can Tell Us

1.2 Survival Analysis Methods by Purpose

Example: Medical Studies

1.3 Common Terminology in Survival Analysis

1.4 Data

What is “cumulative survival probability”?

Example:

At day 0 (Time = 0) nobody has died yet

At day 5, two patients (patient 12 and 13) die

At day 8, two more patients (patient 14 and 15) die

1.5 Kaplan–Meier Survival Curve

Key Points for Interpreting the Results

Interpreting the Plot

1.6 Log-rank Test

x = Maintained Group (with maintenance therapy)

x = Non-maintained Group (without maintenance therapy)

Now, the result is…

1.7 Comparison of survival probabilities at a specified time

Survival probability up to 40 weeks

Upper curve (Survival Probability & Time to Relapse: weeks)

Lower risk table (Therapy & Number at risk)

x = Maintained group (with maintenance therapy)

x = Non-maintained group (without maintenance therapy)

1.8 Time until half of the patients survive

Median Survival Time

Now, the results are・・・

2. Analysis Using the Cox Proportional Hazards Model

Solution ⇒ Use the Cox Proportional Hazards Model

Differences Between the Kaplan–Meier Method and the Cox Proportional Hazards Model

2.1 Preparing the Data

Variable Transformation (status → death)

Variable Transformation (sex → female)

2.2 Kaplan–Meier Survival Curves (by Gender)

2.4 Cox Proportional Hazards Model (by Gender)

Research Question

Hazard Ratio Results

Effect of being female (female = 1) on the risk of death

Results for age (Hazard Ratio)

Hazard Ratio Results

Effect of age on the risk of death

3. Differences from Logistic Regression Analysis

Logistic Regression Analysis (Gender and Lung Cancer Mortality)

Displaying Odds Ratios with a Forest Plot

female coefficient: 0.39 (p-value = 0.0164)

Statistical Significance

3.1 Predicted Probabilities by Gender: predict().

3.2 Predicted Probability of Death by Gender: margins()

Calculate the Average Marginal Effect (AME) to show the “effect of gender (difference)”

・Result 3.1: predict()

・Result 3.2: margins()

3.3 Examination of Validity

Cox Proportional Hazards Model

Logistic Regression Predicted Values Plot

Which is more valid?

female coefficient: 0.39 (`p-value = 0.0164`)

3.1 Predicted Probabilities by Gender: `predict()`.

3.2 Predicted Probability of Death by Gender: `margins()`

・Result 3.1: `predict()`

・Result 3.2: `margins()`