• The R packages used here are as follows
library(broom)
library(corrplot)
library(jtools)
library(margins)
library(ROCR)
library(patchwork)
library(prediction)
library(stargazer)
library(tidyverse)

1. Linear Regression and Nonlinear Regression

1.1 Difference between linear and nonlinear regression

  • Suppose we want to examine the relationship between a candidate’s performance in an election and the amount of money spent on the campaign.
    → In other words, the research question is:
Do candidates who spend more money achieve better electoral outcomes?
  • Two types of outcome variables (response variables) can be considered:
  1. Vote share (or number of votes) — a continuous variable → Regression Analysis.

  2. Electoral outcome (win/loss) — coded as 1 if elected, 0 if not → Logistic Regression Analysis

                     Regression Analysis                 Logistic Regression Analysis
Response Variable    Vote share (%)                      Electoral outcome (0 or 1)
Explanatory Variable Campaign expenditure (million yen)  Campaign expenditure (million yen)
  • The distinction between regression analysis and logistic regression analysis can be illustrated visually as follows:

Key Differences

  • In regression analysis, the response variable is vote share = a continuous variable.
  • In logistic regression analysis, the response variable is electoral outcome = a binary variable (1 or 0).
  • In each figure, a regression line can be drawn.

Regression analysis:
  • The more money a candidate spends, the higher the vote share → no major issue.
Logistic regression analysis:
  • The more money a candidate spends, the higher the probability of winning.
    → However, there are three problems:
  1. The predicted probability of winning (0–1) may exceed 1 (100%).
  2. The predicted probability of winning (0–1) may take on negative values.
  3. The linear model fits the data poorly (the straight line looks forced onto the binary outcomes).

1.2 Logit Transformation

  • Logistic regression analysis is essentially regression analysis applied to a binary variable after a logit transformation.

What is a Logit Transformation?

  • When a scatterplot of binary outcomes is converted into rates (probabilities), the curve takes an S-shape (left figure → middle figure).
  • When these rates are logit-transformed, the curve becomes a straight line (middle figure → right figure).
  • Logistic regression analysis makes use of this mathematical property.

  • Instead of looking at electoral outcomes (win/loss) (left figure), consider the winning rate (middle figure).
  • Represent the proportion of candidates who win at each level of campaign expenditure using a bar chart.
  • Looking at the bar chart of winning rates (middle figure)…

  • Candidates who spend less than 2 million yen on their campaigns rarely win.
  • Candidates who spend around 10 million yen have a winning rate of about 50%.
  • Candidates who spend 16 million yen have a winning rate of more than 75%.
  • Taking the log-odds of the winning rate at each level of campaign expenditure (i.e., applying the logit transformation)
    → The S-shaped curve becomes approximately linear.
    → Although actual data do not form a perfect straight line, the pattern is close to linear.
    → This curve is then used to estimate logistic regression.
    In short, logistic regression is regression analysis performed after transforming a binary variable into its logit form.
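This rate-to-line behavior is easy to check with a small simulation; the sketch below uses made-up data (the coefficients -2.5 and 0.25 and the 4-million-yen bins are arbitrary choices, and plogis() is base R's inverse-logit function):

# Minimal sketch with simulated data
set.seed(1)
expm <- runif(1000, 0, 20)                          # campaign expenditure (million yen)
win  <- rbinom(1000, 1, plogis(-2.5 + 0.25 * expm)) # simulated win/loss outcomes

tibble(expm, win) %>%
  mutate(bin = cut(expm, breaks = seq(0, 20, by = 4))) %>%
  group_by(bin) %>%
  summarise(win_rate = mean(win), n = n()) %>%
  mutate(logit = log(win_rate / (1 - win_rate)))    # approximately linear in expenditure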

  • In logistic regression analysis, results are expressed not in terms of “proportions” but in terms of “odds.”

Proportion, Odds, Odds Ratio, Logit Transformation

The Difference Between Proportion and Odds

Proportion

  • Suppose there are 100 candidates who spend 8 million yen on their campaigns.
  • Among them, 60 win and 40 lose.
  • In this case, the winning “proportion” and the winning “odds” are:

Proportion = 60 / 100 = 0.6
Odds = 60 / 40 = 1.5

  • A proportion can never exceed 1.
  • However, odds can be greater than 1.
  • Both proportion and odds are indices that represent the likelihood of an event.

What are odds?

  • Odds are defined as the ratio of the probability that an event occurs, \(\mathrm{p}\), to the probability that it does not occur, \(\mathrm{1 - p}\).

p: probability that an event occurs
1 - p: probability that an event does not occur

\[Odds = \frac{p}{1-p}\]

If the probability of winning is 75% (p = 0.75), the odds are:

\[Odds = \frac{p}{1-p} = \frac{0.75}{0.25} = 3\]

  • “Odds = 3” means the odds of winning are 3 to 1 (three wins for every loss).

  • For example, let us calculate the odds when p = 0.01 (the probability of occurrence is 1%).
  • Substituting p = 0.01 and 1 - p = 1 - 0.01 = 0.99 into the formula for odds:

\[Odds = \frac{p}{1-p} = \frac{0.01}{0.99} = \frac{1}{99} \approx 0.01\]

  • If the probability of an event is 1% (p = 0.01), the odds are about 0.01.
  • This means the odds of winning are roughly 0.01 to 1.
  • Equivalently, the odds are 1 to 99 (roughly 1 to 100).
  • In other words, the event occurs about once in every 100 trials.

Characteristics of odds

  • The larger the odds, the more likely the event is to occur (larger p).
  • The minimum value of odds is 0.
  • Odds = 1 means that the probability (p) of the event occurring is 50%.
  • Odds have no upper bound: as p approaches 1, the odds grow toward infinity (∞).

What is an Odds Ratio?

  • It is the ratio of the odds between two groups.

\[Odds Ratio = \frac{odds_1}{odds_2}\]

Example:

  • Odds of winning for male candidates = 3
  • Odds of winning for female candidates = 2

\[Odds Ratio = \frac{odds_1}{odds_2}= \frac{odds_{male}}{odds_{female}}=\frac{3}{2} = 1.5\]

  • This is interpreted as:
    → The odds of winning for male candidates are 1.5 times the odds for female candidates.
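As a quick check, the numbers in these examples can be reproduced in base R (a minimal sketch; the variable names are ours):

# Proportion, odds, and odds ratio for the examples above
p <- 0.75
odds <- p / (1 - p)       # 3: the odds of winning are 3 to 1
odds_male   <- 3
odds_female <- 2
odds_male / odds_female   # odds ratio = 1.5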

Basic Knowledge of the Logit Model: Logit Transformation

What is a Logit Transformation?

  • A logit transformation means taking the logarithm of the odds of a binary variable (0 or 1).
  • Since odds are bounded below by 0, they are awkward to model directly with a linear equation.
    → Applying a logarithmic transformation to the odds gives the log odds.
    → This is what we call the logit: logit = log(odds) = log-odds.
  • logit is the name of a function commonly used in statistical modeling.

\[\log(odds) = \log\frac{p}{1-p}\]

  • A logit can be understood as the logarithm of the odds (log odds).

Characteristics of log-odds (= logit)

  • The minimum value is negative infinity (−∞) and the maximum value is positive infinity (+∞).
  • log(odds) = 0 means that the probability (p) of the event occurring is 50%.
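Base R provides the logit and its inverse directly as qlogis() and plogis(); a few calls illustrate the characteristics listed above:

qlogis(0.75)             # logit(0.75) = log(0.75 / 0.25) = log(3) ≈ 1.10
plogis(1.0986123)        # the inverse logit maps back to the probability 0.75
qlogis(0.5)              # 0: a probability of 50% corresponds to log-odds of 0
qlogis(c(0.001, 0.999))  # log-odds diverge toward -Inf / +Inf near the boundaries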

2. Differences in Estimation Methods

Multiple Regression    Ordinary Least Squares (OLS)
Logistic Regression    Maximum Likelihood Estimation (MLE)

2.1 What is Maximum Likelihood Estimation (MLE)?

  • Maximum Likelihood Estimation (MLE) is a method for estimating the underlying parameters that most plausibly produced the events that have already been observed.
  • In this section, MLE is explained by comparing the difference between probability and likelihood.

2.1.1 The Difference Between Probability and Likelihood

Probability    The magnitude of the possibility of what will occur in the future
Likelihood     The magnitude of the possibility underlying what has already occurred

Probability

  • Suppose there is a bag containing 5 balls: 3 red and 2 white.
  • One ball is drawn from the bag, its color is checked, and then it is returned.
  • This process is repeated three times.

Result:

  • 1st draw … Red
  • 2nd draw … White
  • 3rd draw … Red

Question:

  • What is the probability that the result will be “Red, White, Red”?
  • It can be calculated using the multiplication rule:

\[\frac{3}{5} × \frac{2}{5} × \frac{3}{5} = \frac{18}{125}\]

  • However, in order to use the multiplication rule
    → it is necessary to assume that “each draw is independent and does not influence the others,” i.e., the assumption of local independence.

Likelihood

  • Suppose there is a bag containing a total of 5 balls, some red and some white, but the exact numbers are unknown.
  • One ball is drawn, its color is checked, and then it is returned to the bag.
  • This process is repeated three times.
Result:
  • The outcomes are: 1st draw → Red, 2nd draw → White, 3rd draw → Red.
Question:
  • Given this result, what is the most likely combination of red and white balls that were originally in the bag?

  • Unlike probability, which predicts what will happen in the future, here the balls have already been drawn.
  • From the observed outcome (Red, White, Red), we want to estimate the numbers of red and white balls in the bag.
  • Likelihood is the method of estimating how plausible the underlying parameters are, given what has already occurred.
  • How many red and white balls would be reasonable?
  • Let us calculate the likelihood for each possible combination of red and white balls.
Number of balls    Likelihood
1 red, 4 white     \(\frac{1}{5}×\frac{4}{5}×\frac{1}{5} = \frac{4}{125}\)
2 red, 3 white     \(\frac{2}{5}×\frac{3}{5}×\frac{2}{5} = \frac{12}{125}\)
3 red, 2 white     \(\frac{3}{5}×\frac{2}{5}×\frac{3}{5} = \frac{18}{125}\)
4 red, 1 white     \(\frac{4}{5}×\frac{1}{5}×\frac{4}{5} = \frac{16}{125}\)
  • Given the observed outcome “Red, White, Red,” the most plausible case is 3 red balls and 2 white balls.
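The same enumeration can be done in a few lines of R (a minimal sketch of the calculation above):

# Likelihood of observing (Red, White, Red) for each possible composition
red <- 1:4
likelihood <- (red / 5) * ((5 - red) / 5) * (red / 5)
data.frame(red, white = 5 - red, likelihood)
red[which.max(likelihood)]  # 3: three red balls maximize the likelihood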

Key Point

Logistic regression employs an estimation method called Maximum Likelihood Estimation (MLE).

What We Want to Estimate in Logistic Regression

  • For example, suppose we want to analyze the relationship Campaign Expenditure (X) → Electoral Outcome (Y).
  • Since campaign expenditure (X) is an observed (fixed) variable, it is not something to be estimated.

What we want to estimate

  • The regression coefficient of campaign expenditure (i.e., the effect of spending on the electoral outcome).
  • From the coefficients, the model calculates each candidate’s probability of winning.
  • The combination of coefficients that maximizes the likelihood is taken as the “estimate.”
  • In logistic regression analysis, the likelihood is the probability that the model as a whole assigns to the observed election outcome data.
    → In other words, the probability that all the observed win/loss outcomes across candidates would actually occur.
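Stated in symbols (a standard formulation added here for completeness): writing \(p_i\) for the model’s predicted winning probability of candidate \(i\) and \(y_i\) for the observed outcome (1 = win, 0 = loss), the likelihood that MLE maximizes is

\[L(\alpha, \beta) = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}\]

and in practice its logarithm, \(\sum_{i=1}^{n}\left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]\), is maximized instead.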

3. Logistic Regression Analysis (1)

When x is a Continuous Variable

  • The procedure for logistic regression analysis is as follows:
  1. Formulate the alternative hypothesis and the null hypothesis
  2. Display a scatterplot of the explanatory variable and the response variable
  3. Estimate the logistic regression equation
  4. Test the statistical significance of the regression coefficient(s)
  5. Interpret the meaning of the estimated results
  6. Assess the goodness-of-fit of the logistic regression model

3.1 Formulating the Alternative and Null Hypotheses

  • The hypothesis to be tested (alternative hypothesis) is as follows:

Hypothesis 1 The more money a candidate spends on the campaign, the higher their probability of winning in a single-member district.

  • Accordingly, the null hypothesis, which is set to be rejected, is as follows:

Null Hypothesis The amount of campaign expenditure has no relationship with the probability of winning in a single-member district.

  • In hypothesis testing within logistic regression, just as in multiple regression analysis, if the obtained p-value is smaller than the significance level, we reject the null hypothesis and accept the alternative hypothesis.

  • To test Hypothesis 1, we consider the following model.

Model 1

  • Configure the ggplot2 theme and font settings.
if (.Platform$OS.type == "windows") { 
  if (require(fontregisterer)) {
    my_font <- "Yu Gothic"
  } else {
    my_font <- "Japan1"
  }
} else if (capabilities("aqua")) {
  my_font <- "HiraginoSans-W3"
} else {
  my_font <- "IPAexGothic"
}

theme_set(theme_gray(base_size = 9,
                     base_family = my_font))

Data Preparation (hr96-24.csv)

  • Download the vote share data from the House of Representatives general elections: hr96-24.csv.

  • Using the downloaded 2021 House of Representatives election data, display a scatterplot with electoral outcome in single-member districts (wlsmd) on the vertical axis and campaign expenditure (expm) on the horizontal axis.

  • Once the dataset has been downloaded, load it and assign it to the object hr:

hr <- read_csv("data/hr96-24.csv", na = ".")
  • Display the contents of the data frame hr:
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • Use the select() function to keep only the four variables: year, wl, previous, and exp.
  • Use the filter() function to retain only the data from the 2021 House of Representatives election.
  • Check the contents of the variable wl:
unique(hr$wl)
[1] 1 0 2

0 = Lost in single-member district
1 = Won in single-member district
2 = Elected via proportional representation (“revival win”)

  • Using wl, create a new variable wlsmd to indicate electoral outcome in single-member districts (1 = win in SMD, 0 = otherwise).
  • Using exp, create a new variable expm to indicate campaign expenditure (measured in units of one million yen).
hr21 <- hr %>%
  select(year, wl, previous, exp) %>%
  filter(year == 2021) %>%
  mutate(wlsmd = ifelse(wl == 1, 1, 0)) %>%  # 1 = won the SMD, 0 = otherwise
  mutate(expm = exp / 1000000) %>%           # expenditure in million yen
  select(year, wlsmd, expm, previous)
  • Display the contents of the data frame hr21.
hr21
# A tibble: 857 × 4
    year wlsmd  expm previous
   <dbl> <dbl> <dbl>    <dbl>
 1  2021     1 13.4         3
 2  2021     0  9.62        2
 3  2021     0 NA           0
 4  2021     1  8.82        8
 5  2021     0 11.4         0
 6  2021     1  7.42        8
 7  2021     0 11.9         3
 8  2021     1 13.0         3
 9  2021     0  7.48        6
10  2021     0  4.61        0
# ℹ 847 more rows

Data:

Variable Description
year Year in which the House of Representatives election was held
wlsmd Dummy variable for electoral outcome in single-member districts (1 = win, 0 = loss)
previous Number of times previously elected
expm Candidate’s campaign expenditure (in million yen)
  • Display the descriptive statistics of the data frame hr21:
summary(hr21)
      year          wlsmd             expm              previous     
 Min.   :2021   Min.   :0.0000   Min.   : 0.009319   Min.   : 0.000  
 1st Qu.:2021   1st Qu.:0.0000   1st Qu.: 3.435080   1st Qu.: 0.000  
 Median :2021   Median :0.0000   Median : 5.899882   Median : 1.000  
 Mean   :2021   Mean   :0.3372   Mean   : 6.434585   Mean   : 2.162  
 3rd Qu.:2021   3rd Qu.:1.0000   3rd Qu.: 8.692808   3rd Qu.: 3.000  
 Max.   :2021   Max.   :1.0000   Max.   :27.443685   Max.   :17.000  
                                 NA's   :22                          
  • We find that there are 22 missing values (NA’s) in the variable expm.
  • Use na.omit() to keep only observations without missing values:
hr21 <- na.omit(hr21)
  • Display the descriptive statistics of the data frame hr21 again to confirm:
summary(hr21)
      year          wlsmd             expm              previous     
 Min.   :2021   Min.   :0.0000   Min.   : 0.009319   Min.   : 0.000  
 1st Qu.:2021   1st Qu.:0.0000   1st Qu.: 3.435080   1st Qu.: 0.000  
 Median :2021   Median :0.0000   Median : 5.899882   Median : 1.000  
 Mean   :2021   Mean   :0.3461   Mean   : 6.434585   Mean   : 2.211  
 3rd Qu.:2021   3rd Qu.:1.0000   3rd Qu.: 8.692808   3rd Qu.: 3.000  
 Max.   :2021   Max.   :1.0000   Max.   :27.443685   Max.   :17.000  
  • The 22 missing values (NA's) have been removed.  

3.2 Displaying a Scatterplot of the Explanatory and Response Variables

  • Display the descriptive statistics of the dataset (hr21) used in the analysis.
stargazer::stargazer(as.data.frame(hr21), 
  type = "html")
Statistic   N    Mean       St. Dev.  Min    Max
year        835  2,021.000  0.000     2,021  2,021
wlsmd       835  0.346      0.476     0      1
expm        835  6.435      4.287     0.009  27.444
previous    835  2.211      2.921     0      17
  • Draw a scatterplot of campaign expenditure and the electoral outcome to visualize the relationship between the two variables.
  • To avoid character corruption, Mac users should run the following two lines.
theme_set(theme_gray(base_size = 10, 
                     base_family = "HiraginoSans-W3"))
  • Since wlsmd is a binary variable (0 or 1), use geom_jitter() to spread out the data points for display.
p <- ggplot(hr21, aes(x = expm, y = wlsmd)) + 
  geom_jitter(size = 1,        # Display the data with jittering
              alpha = 1/3,
              width = 0,
              height = 0.05) +
  labs(x = "Campaign Expenditure (measured in units of one million yen)",
       y = "Win/loss in single-member districts")

plot(p)

  • If we forcibly fit a standard regression line to this figure, it looks like this.
p + geom_smooth(method = "lm", se = FALSE) +
    annotate("label", 
           label = "wlsmd (0 or 1) = a + bexpm", 
           x = 6, y = 0.75,
           size = 5, 
           colour = "blue", 
           family = "HiraginoSans-W3")

  • When the response variable is binary (0 or 1), forcing a standard regression line onto the data leads to problems:
    → The response variable (probability of winning) can take on negative values or exceed 100%.

Solution

  • Apply a logit transformation to the probability of winning.

  • The probability of winning, \(P\), takes values between 0 and 1.

  • \(\frac{P}{1-P}\) represents the odds (see the column “Basic Knowledge of the Logit Model: Odds”).

  • Odds are defined as the ratio of the probability of winning, \(\mathrm{p}\), to the probability of losing, \(\mathrm{1 - p}\).

  • The relationship between the “probability of winning \(P\)” and “campaign expenditure” can be expressed as follows:

\[\log\frac{P}{1-P} = a + b(CampaignExpenditure)\]

Log-odds curve

p3 <- ggplot(hr21, aes(x = expm, y = wlsmd)) + 
  geom_jitter(size = 1,
              alpha = 1/3,
              width = 0,
              height = 0.05) +
  geom_smooth(method = "glm", 
    color = "red",
    method.args = list(family = binomial(link = "logit"))) +
  labs(x = "Campaign Expenditure (measured in units of one million yen)",
       y = "Probability of winning in single-member districts")

print(p3)

3.3 Estimating the Logistic Regression Equation

Logistic Regression Equation (Model 1)

  • Here, using the 2021 House of Representatives election data, we consider a model in which campaign expenditure (expm, in million yen) and the number of previous wins (previous) explain the electoral outcome in single-member districts (wlsmd).

Model 1

  • This model can be expressed as follows:

\[Pr(Winning)=logit^{-1}(\alpha+\beta_1Expenditure + \beta_2PreviousWins)\]

  • In R, logistic regression analysis is conducted using the glm() function, which is designed to fit Generalized Linear Models (GLMs).
  • The glm() function is widely used not only for logistic regression but also for other types of models.
  • To apply this function to logistic regression, specify the argument family = binomial(link = "logit").
  • Here, logit refers to the logarithm of the odds = Log(Odds) = Log Odds.
model_1 <- glm(wlsmd ~ expm + previous, 
            data = hr21, 
            family = binomial(link = "logit")) # Specify the coefficients in terms of log odds
  • Use tidy() to check the estimation results:
tidy(model_1, conf.int = TRUE)
# A tibble: 3 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   -2.65     0.194     -13.7  9.90e-43   -3.04     -2.28 
2 expm           0.170    0.0235      7.23 4.86e-13    0.125     0.217
3 previous       0.342    0.0355      9.62 6.38e-22    0.274     0.413

→ The regression coefficients displayed under estimate represent the logarithm of the odds = Log(odds) = also called Log-Odds.
⇒ Since Log-Odds are difficult to interpret directly, they are usually converted into odds ratios or probabilities.

  • The logistic regression equation for predicting the probability of winning is as follows:

\[Pr(Winning) = logit^{-1}(\alpha + \beta_1Expenditure + \beta_2PreviousWins)\]

\[= \frac{1}{1+exp(-[\alpha + \beta_1Expenditure+\beta_2PreviousWins])}\]
\[= \frac{1}{1+exp(-[-2.65 + 0.170[expm] + 0.342[previous]])}\]

  • For example, the estimated coefficient for campaign expenditure (expm) is about 0.170.
  • This represents the log-odds = log(odds) = logit.
  • If this were the result of a linear regression, it would mean that “for each one-unit increase in campaign expenditure (1 million yen), the probability of winning increases by 0.170 points.”
  • However, the coefficient in logistic regression cannot be interpreted in the same way as a coefficient in linear regression.
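As a quick numerical check, the inverse-logit equation can be evaluated by hand with plogis() and compared against predict(); the values expm = 10 and previous = 3 below are arbitrary illustration choices:

b <- coef(model_1)  # (Intercept), expm, previous
# Predicted probability for expm = 10, previous = 3, computed two ways;
# the two results agree
plogis(b["(Intercept)"] + b["expm"] * 10 + b["previous"] * 3)
predict(model_1, type = "response",
        newdata = tibble(expm = 10, previous = 3))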

Note!

Log-odds are difficult to interpret

→ Convert them into odds ratios or probabilities.
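A one-line check of this conversion (a sketch; exponentiating the log-odds coefficients of model_1 yields the odds ratios reported in the next section):

# Log-odds -> odds ratios
exp(coef(model_1))  # exp(0.170) ≈ 1.19 for expm, exp(0.342) ≈ 1.41 for previous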

3.4 Testing the Significance of the Regression Coefficients

  • Using tidy(), let’s display the results of model_1 in terms of odds ratios:
tidy(model_1, 
  conf.int = TRUE, 
  exponentiate = TRUE)
# A tibble: 3 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   0.0705    0.194     -13.7  9.90e-43   0.0476     0.102
2 expm          1.19      0.0235      7.23 4.86e-13   1.13       1.24 
3 previous      1.41      0.0355      9.62 6.38e-22   1.32       1.51 

Visualization of Odds Ratios using a Forest Plot

# Format results (convert to odds ratios)
results <- tidy(model_1, conf.int = TRUE, exponentiate = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term = recode(term,
                  "expm" = "Campaign Expenditure (million yen)",
                  "previous" = "Previous Wins"),
    OR_label = sprintf("%.2f", estimate)   # Convert OR to string
  )

# Forest plot + numeric labels
ggplot(results, aes(x = estimate, y = term)) +
  geom_point(size = 3, color = "blue") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "red") +
  geom_text(aes(label = OR_label), # Display OR values
    hjust = 0.5, 
    vjust = -2, 
    size = 4) +  
  labs(
    title = "Odds Ratios from Logistic Regression",
    x = "Odds Ratio (95% CI)",
    y = ""
  ) +
  xlim(0, max(results$conf.high) * 1.2) + # Add margin to avoid label cutoff
  theme_bw(base_size = 14, base_family = "HiraKakuProN-W3")

Odds Ratio for Campaign Expenditure: 1.19 (p-value = 4.86e-13)

  • “Odds after increasing campaign expenditure by 1 million yen” … \(odds_{expm+1}\)
  • “Odds before increasing campaign expenditure by 1 million yen” … \(odds_{expm}\)

\[Odds\ Ratio = \frac{odds_{expm + 1}}{odds_{expm}} = 1.19\]

  • The odds \(odds_{expm+1}\) are 1.19 times the odds \(odds_{expm}\).

Odds Ratio for Previous Wins: 1.41 (p-value = 6.38e-22)

  • “Odds after increasing the number of previous wins by one” … \(odds_{previous+1}\)
  • “Odds before increasing the number of previous wins by one” … \(odds_{previous}\)

\[Odds\ Ratio = \frac{odds_{previous + 1}}{odds_{previous}} = 1.41\]

  • The odds \(odds_{previous+1}\) are 1.41 times the odds \(odds_{previous}\).

Statistical Significance

  • The p.value for expm is 4.86e-13.
  • The p.value for previous is 6.38e-22.
    → At the 0.05 significance level, the null hypothesis that “these variables have no effect” can be rejected.
    → Both expm and previous have a positive effect on the probability of winning, and these effects are statistically significant.

3.5 Preferred Way of Presenting Results

Predicted Winning Probabilities and Marginal Effects

  • For each value of campaign expenditure, we can estimate the candidate’s probability of winning, expressed as Predicted Winning Probability: 0 – 1.

Marginal Effects

  • The coefficients from logistic regression cannot be interpreted in the same way as those from linear regression.
  • The logistic curve is not a straight line → it is a curve.
    → The impact of a one-unit change in an explanatory variable on the response variable (i.e., the slope) depends on the value of the explanatory variable.
  • To assess how an explanatory variable influences the response variable:
    → Fix the value of the explanatory variable (e.g.,expm), and calculate the effect at each value.
    → These are called Marginal Effects.
    In the example shown above, the slope of the line at each point (x = 1, 5, 10, 15, 20) represents the marginal effect.

How should we interpret the coefficient of expm?

  • Here, the estimated coefficient (= log-odds) for expm is 0.170 (p-value = 4.86e-13).
  • In multiple regression analysis, we analyze a linear model (a straight line).
    → The slope and its statistical significance remain the same regardless of the value of the explanatory variable.
  • However, in logistic regression analysis, we analyze a nonlinear model (a curve).
    → Depending on the size of campaign expenditure, both the magnitude of the effect (the marginal effect) and its statistical significance change.
    → If campaign expenditure (expm) is statistically significant, then “overall” campaign expenditure has a statistically significant positive effect on electoral outcomes.
    → However, this represents the average relationship between campaign expenditure and electoral outcome.
    → It does not necessarily mean that the effect is statistically significant at every level of expenditure.
    → Therefore, statistical significance should be checked at each level of campaign expenditure.
  • The same applies to the other explanatory variable, previous.

3.5.1 Predicted Winning Probability

  • Let us consider two cases for a candidate with 3 previous wins (previous = 3):
    Example 1: Changing campaign expenditure from 0 yen to 1 million yen
    Example 2: Changing campaign expenditure from 1 million yen to 2 million yen

Example 1: Campaign Expenditure (0 yen → 1 million yen)

  • Here, we examine how the response variable changes when a candidate with 3 previous wins increases campaign expenditure from 0 yen to 1 million yen.
    → We calculate the predicted winning probabilities for each case and then take the difference.
  • The predicted winning probability for a candidate with 3 previous wins and 0 yen in campaign expenditure can be obtained by substituting previous = 3 and expm = 0 into the regression equation:

\[Pr(Winning) = \frac{1}{1+exp(-(-2.652 + 0.170expm + 0.342previous))}\]

  • Let us compute this value in R:
p_0 <- predict(model_1, type = "response", # "response" returns the predicted winning probability
               newdata = tibble(previous = 3, expm = 0))
p_0
        1 
0.1642294 
  • The result is 0.1642294 (≈ 16.4%).
  • Next, for a candidate with 3 previous wins and campaign expenditure of 1 million yen:
p_1 <- predict(model_1, type = "response",
               newdata = tibble(previous = 3, expm = 1))
p_1
        1 
0.1889165 
  • The result is 0.1889165 (≈ 18.9%).
  • Therefore, the change in predicted winning probability when a candidate with 3 previous wins increases campaign expenditure from 0 yen to 1 million yen is the difference (p_1 - p_0):
p_1 - p_0
         1 
0.02468706 
  • The result shows that when campaign expenditure increases from 0 yen to 1 million yen, the predicted winning probability increases by about 2.47 percentage points for a candidate with 3 previous wins.

Example 2: Campaign Expenditure (1 million yen → 2 million yen)

  • Here, we examine how the response variable changes when a candidate with 3 previous wins increases campaign expenditure from 1 million yen to 2 million yen.
  • The predicted winning probability for a candidate with 3 previous wins and campaign expenditure of 2 million yen is 0.2163539 (≈ 21.6%):
p_2 <- predict(model_1, type = "response",
               newdata = tibble(previous = 3, expm = 2))
p_2
        1 
0.2163539 
  • The change in predicted winning probability when a candidate with 3 previous wins increases campaign expenditure from 1 million yen to 2 million yen is the difference (p_2 - p_1):
p_2 - p_1
         1 
0.02743738 
  • The result shows that when campaign expenditure increases from 1 million yen to 2 million yen, the predicted winning probability increases by 0.02743738 (≈ 2.7 percentage points) for a candidate with 3 previous wins.

Summary ・Increasing campaign expenditure from 0 yen to 1 million yen raises the predicted probability of winning by 2.47 percentage points.
・Increasing campaign expenditure from 1 million yen to 2 million yen raises the predicted probability of winning by 2.7 percentage points.
・Both changes represent the same increment in expenditure (an additional 1 million yen).
・However, their effects on the response variable (i.e., the probability of winning) differ.
The 1-million-yen increase from 1 million to 2 million has a greater effect on the response variable than the increase from 0 to 1 million.

Example 3: Campaign Expenditure (2 million yen → 28 million yen)

  • We use R to calculate the winning probability at each point as campaign expenditure increases from 2 million yen to 28 million yen, in increments of 1 million yen.
predict(model_1, type = "response",
        newdata = tibble(previous = 3, expm = 2:28))
        1         2         3         4         5         6         7         8 
0.2163539 0.2465648 0.2794896 0.3149731 0.3527559 0.3924743 0.4336684 0.4757999 
        9        10        11        12        13        14        15        16 
0.5182788 0.5604951 0.6018538 0.6418077 0.6798848 0.7157072 0.7490008 0.7795960 
       17        18        19        20        21        22        23        24 
0.8074208 0.8324876 0.8548781 0.8747262 0.8922019 0.9074976 0.9208156 0.9323590 
       25        26        27 
0.9423251 0.9509001 0.9582566 
  • It can be seen that the predicted probability of winning increases as campaign expenditures rise.

Summary

  • Looking only at the estimated coefficient (the log-odds for expm, 0.170), it is difficult to tell exactly how many percentage points the predicted winning probability will rise; the probabilities must be calculated explicitly.

                     Dependent variable:
                     wlsmd
expm                 0.170***
                     (0.024)
previous             0.342***
                     (0.036)
Constant             -2.652***
                     (0.194)
Observations         835
Log Likelihood       -397.969
Akaike Inf. Crit.    801.939
Note: *p<0.1; **p<0.05; ***p<0.01

  • This estimated value represents the logarithm of the odds = Log Odds = logit.
    → Therefore, it is necessary to compute predicted probabilities at specific values of expm and compare the differences.

3.5.2 Marginal Effects

Calculating Marginal Effects with margins

  • As seen in the previous example, repeatedly calculating predicted probabilities and taking their differences can be cumbersome.
    → To directly calculate marginal effects at specific values, it is convenient to use the margins package.
  • The margins() function allows us to compute the marginal effect of campaign expenditure at specific levels of expenditure.
  • Marginal Effect of Campaign Expenditure (previous = 3, expm = 4 million yen)
  • For example, when the number of previous wins is 3 and campaign expenditure is 4 million yen, the marginal effect of a one-unit (1 million yen) increase in campaign expenditure on the probability of winning is:
margins(model_1, variables = "expm",
        at = list(previous = 3, expm = 4))
 at(previous) at(expm)    expm
            3        4 0.03424
  • The coefficient for expm (0.03424) indicates: → “When campaign expenditures increase by 1 million yen, how much the probability of winning changes,” evaluated at expm = 4 and previous = 3.
  • The unit is change in probability (percentage points)
  • The marginal effect of campaign expenditures on the probability of winning is 0.03424 (= 3.4 percentage points)
    → When campaign expenditures increase by 1 million yen, the probability of winning rises by 3.4 percentage points.
  • This is not an Average Marginal Effect (AME), but a marginal effect evaluated at specified values of the explanatory variables (often called a Marginal Effect at Representative values, MER).
  • Since the predicted probability changes with the expenditure level, the size of the marginal effect also changes.
    → This makes it possible to examine at which expenditure level the effect is largest.
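For a logit model, this number can also be reproduced analytically: the marginal effect of a variable at a given point equals \(\beta \times p(1-p)\), where \(p\) is the predicted probability at that point. A minimal sketch:

b <- coef(model_1)                       # log-odds coefficients
p <- plogis(b["(Intercept)"] + b["expm"] * 4 + b["previous"] * 3)
unname(b["expm"] * p * (1 - p))          # ≈ 0.0342, matching the margins() output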

Marginal Effect of Campaign Expenditures (previous = 3, expm: 0 to 28 million yen)

  • The values at which to evaluate the marginal effect are specified by passing a list to the at argument.
  • For example, we can simultaneously display the marginal effects of campaign expenditures ranging from 0 yen (expm = 0) to 28 million yen (expm = 28).
margins_1 <- margins(model_1, variables = "expm",
                     at = list(previous = 3, expm = 0:28)) # expenditures from 0 yen to 28 million yen

margins_1
 at(previous) at(expm)     expm
            3        0 0.023337
            3        1 0.026052
            3        2 0.028827
            3        3 0.031585
            3        4 0.034239
            3        5 0.036685
            3        6 0.038820
            3        7 0.040540
            3        8 0.041758
            3        9 0.042406
            3       10 0.042449
            3       11 0.041884
            3       12 0.040742
            3       13 0.039087
            3       14 0.037004
            3       15 0.034595
            3       16 0.031964
            3       17 0.029215
            3       18 0.026437
            3       19 0.023710
            3       20 0.021093
            3       21 0.018631
            3       22 0.016352
            3       23 0.014273
            3       24 0.012397
            3       25 0.010723
            3       26 0.009241
            3       27 0.007938
            3       28 0.006801
  • In logistic regression analysis, the marginal effects of coefficients vary depending on the values of the explanatory variables.
    → Therefore, it is difficult to fully understand the results simply by looking at the coefficients.
    → It is necessary to visualize how explanatory variables influence the response variable under different conditions.
    → In other words, we need to illustrate the marginal effects (ME) corresponding to different values of the explanatory variables.

3.6 Interpreting the Meaning of the Estimation Results

How to Present the Results of Hypothesis 1 (Model_1)

model_1 <- glm(wlsmd ~ previous + expm, 
               data = hr21,
               family = binomial(link = "logit"))
Predicted Winning Probability differs from Marginal Effects
  • The predicted winning probability refers to the probability that a candidate with a given level of campaign expenditure will win.
  • The marginal effect refers to the impact of campaign expenditure on the predicted winning probability (i.e., the slope)
  • Both predicted winning probabilities and marginal effects can be visualized using margins::cplot()
  • To display predicted winning probability on the Y-axis, specify what = "prediction"
  • To display marginal effects on the Y-axis, specify what = "effect".
  • The value of previous is automatically fixed at previous = 2.211 (the mean value)
“Campaign Expenditure” and “Predicted Winning Probability” (hr21_pred)
hr21_pred <- cplot(model_1, 
                  x = "expm", 
                  what = "prediction") %>% # setting to display predicted winning probability on the Y-axis
  as_tibble() %>%
  ggplot(aes(x = xvals, 
             y = yvals, 
             ymin = lower, 
             ymax = upper)) +
  geom_ribbon(fill = "gray") +
  geom_line() +
  labs(x = "Campaign Expenditure (units: 1 million yen)", 
       y = "Predicted Winning Probability", 
       title = "Predicted Winning Probability: 2021 House of Representatives Election")

“Campaign Expenditure” and “Marginal Effect of Campaign Expenditure” (hr21_me)

hr21_me <- cplot(model_1,  
                 x = "expm",    # variable on the x-axis  
                 dx = "expm",   # explanatory variable  
                 what = "effect") %>%   # setting to display marginal effect on the y-axis  
  as_tibble() %>%
  ggplot(aes(x = xvals, 
             y = yvals, 
             ymin = lower, 
             ymax = upper)) +
  geom_ribbon(fill = "gray") +
  geom_line() +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed", color = "red") + 
  ylim(-0.0001, 0.06) +
  labs(x = "Campaign Expenditure", 
       y = "Marginal Effect of Campaign Expenditure", 
       title = "Marginal Effect of Campaign Expenditure: 2021 House of Representatives Election")

Summary of Hypothesis Testing (Hypothesis 1):

hr21_pred + hr21_me

Conclusion of Hypothesis Testing (Hypothesis 1)  

What we can see from the left-hand figure

  • Vertical axis: probability (0–1)
  • As campaign expenditures increase, the probability of a candidate winning also increases.
  • When expenditures are low, the predicted probability of winning rises gradually.
  • A candidate spending 10 million yen has a predicted winning probability of 0.4933899 (≈ 49%).
  • A candidate spending 20 million yen has a predicted winning probability of 0.8420812 (≈ 84%).
  • As expenditures reach moderate levels, the slope of the curve becomes gradually steeper.
  • When expenditures become large, the slope decreases again.

What we can see from the right-hand figure

  • Vertical axis: marginal effect of campaign expenditure (change in winning probability per additional 1 million yen)
  • The effect of campaign expenditures on the probability of winning first increases and then diminishes.
  • Up to 11 million yen, the marginal effect rises, but beyond 11 million yen the additional effect gradually decreases.
  • The marginal effect of campaign expenditures is greatest at 11 million yen: 0.04250 (≈ 4.25 percentage points).
  • The effect is smaller when expenditures are either higher or lower than this amount.
  • The marginal effect is statistically significant across the entire range of campaign expenditures.
Campaign Expenditures and Winning Probability
  • Using R, we calculate the winning probability at each point as campaign expenditures increase from 2 million yen to 28 million yen in increments of 1 million yen (with previous fixed at 2.211).
predict(model_1, type = "response",
        newdata = tibble(previous = 2.211,
                         expm = 2:28))
        1         2         3         4         5         6         7         8 
0.1741271 0.1999460 0.2285339 0.2598815 0.2938910 0.3303645 0.3689990 0.4093895 
        9        10        11        12        13        14        15        16 
0.4510409 0.4933899 0.5358339 0.5777650 0.6186045 0.6578331 0.6950162 0.7298182 
       17        18        19        20        21        22        23        24 
0.7620087 0.7914599 0.8181363 0.8420812 0.8633997 0.8822428 0.8987913 0.9132429 
       25        26        27 
0.9258014 0.9366680 0.9460359 

Marginal Effect of Campaign Expenditures

  • We display the marginal effects of campaign expenditures ranging from 0 yen (expm = 0) to 28 million yen (expm = 28), with previous fixed at 2.211.
margins(model_1, variables = "expm",
        at = list(previous = 2.211, expm = 0:28))
 at(previous) at(expm)    expm
        2.211        0 0.01929
        2.211        1 0.02180
        2.211        2 0.02445
        2.211        3 0.02720
        2.211        4 0.02998
        2.211        5 0.03270
        2.211        6 0.03528
        2.211        7 0.03761
        2.211        8 0.03959
        2.211        9 0.04111
        2.211       10 0.04210
        2.211       11 0.04250
        2.211       12 0.04229
        2.211       13 0.04148
        2.211       14 0.04011
        2.211       15 0.03827
        2.211       16 0.03604
        2.211       17 0.03353
        2.211       18 0.03083
        2.211       19 0.02806
        2.211       20 0.02530
        2.211       21 0.02261
        2.211       22 0.02005
        2.211       23 0.01766
        2.211       24 0.01547
        2.211       25 0.01347
        2.211       26 0.01168
        2.211       27 0.01009
        2.211       28 0.00868

3.7 Predicted Probability of Winning by Number of Prior Wins

3.7.1 Predicted Probability

  • We have established that campaign spending exerts a positive effect on the predicted probability of winning across the observed range of expenditure.
  • But does this also hold true when candidates’ number of prior wins differs?
  • To address this, we examine how campaign expenditure affects the predicted probability of winning for candidates with different numbers of prior wins.
  • First, we check the descriptive statistics of candidates’ prior wins (previous):
table(hr21$previous)

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  16  17 
390  79  55 106  51  31  33  30  24  16   7   3   5   2   1   1   1 
hist(hr21$previous)

  • We observe that the majority of candidates have zero prior wins.
  • The 17 observed values for prior wins are grouped into intervals of two: 0, 2, 4, …, 16.
    → previous = seq(0, 16, by = 2)
  • Next, we examine the descriptive statistics of candidates’ campaign expenditure (expm):
summary(hr21$expm)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.009319  3.435080  5.899882  6.434585  8.692808 27.443685 
hist(hr21$expm)

  • Campaign expenditure ranges from 0 to just over 27 million yen.
    → We divide the scale into 5-million-yen intervals: expm = seq(0, 20, by = 5)
  • Finally, we visualize the predicted probability of winning by number of prior wins.
df_pre <- expand.grid(previous = seq(0, 17, by = 2),
                      expm = seq(0, 20, by = 5)) %>%
  as_tibble()

pred <- predict(model_1,
                type = "response",
                newdata = df_pre,
                se.fit = TRUE)

df_pre$fit   <- pred$fit
df_pre$lower <- with(pred, fit - 2 * se.fit)
df_pre$upper <- with(pred, fit + 2 * se.fit)

# Clip the confidence band to the [0, 1] probability scale
df_pre <- df_pre %>%
  mutate(lower = ifelse(lower < 0, 0, lower),
         upper = ifelse(upper > 1, 1, upper))

plt_prob <- ggplot(df_pre, aes(x = expm, y = fit)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "gray") +
  geom_line() +
  facet_wrap(. ~ previous) +
  labs(x = "Campaign Expenditure (units: 1 million yen)",
       y = "Predicted Winning Probability")

print(plt_prob)

  • Candidates with fewer prior wins experience a strong positive effect of campaign expenditure on their predicted probability of winning.
  • As the number of prior wins increases, the effect of campaign expenditure on the predicted probability of winning diminishes.
    → Candidates with as many as 16 prior wins are almost certain to win regardless of their campaign expenditure.

3.7.2 Marginal Effects

  • Next, we check whether the effect of campaign expenditure on the predicted probability of winning remains statistically significant even when the number of prior wins varies.

  • We specify how the effect of campaign expenditure (expm) changes as the number of prior wins (previous) increases by setting x = "previous" and dx = "expm" in cplot(), as sketched below.
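The code that produces this marginal-effect plot is not shown above; the following is a minimal sketch in the same cplot() style as Section 3.6 (the object name plt_previous is assumed from the patchwork call below):

plt_previous <- cplot(model_1,
                      x = "previous",  # variable on the x-axis
                      dx = "expm",     # variable whose marginal effect is plotted
                      what = "effect") %>%
  as_tibble() %>%
  ggplot(aes(x = xvals, y = yvals, ymin = lower, ymax = upper)) +
  geom_ribbon(fill = "gray") +
  geom_line() +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed", color = "red") +
  labs(x = "Number of Prior Wins",
       y = "Marginal Effect of Campaign Expenditure")

print(plt_previous)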

  • Up to previous = 15, the 95% confidence interval does not touch the red dashed line at y = 0.
    → This means that, even as the number of prior wins increases, the effect of campaign expenditure on the predicted probability of winning remains statistically significant up to previous = 15.

Predicted Probability by Number of Prior Wins: Summary

plt_prob + plt_previous

Insights from the Left Figure

  • For candidates with fewer prior wins, campaign expenditure has a strong positive effect on the predicted probability of winning.
  • As the number of prior wins increases, the effect of campaign expenditure on the predicted probability of winning becomes smaller.
    → For candidates with 16 prior wins, the probability of winning is almost guaranteed, regardless of campaign expenditure.

Insights from the Right Figure

  • This finding holds true for candidates with 0 to 15 prior wins, and the effect is statistically significant.
  • The impact of campaign expenditure on the predicted probability of winning is greatest for candidates with 5 prior wins.

Statistical Significance of Marginal Effects

Predicted probabilities and marginal effects are distinct concepts.

  • The marginal effect indicates how much the dependent variable increases when the explanatory variable increases by one unit.
  • Here, the explanatory variable is election expenditure.
Note: The marginal effect and the predicted probability of winning are different.
  • The marginal effect represents the impact of election expenditure on the predicted probability of winning (i.e., the slope).
  • The predicted probability is the probability that a candidate will win given a specific amount of election expenditure.

→ The marginal effect varies depending on the value of the explanatory variable.
→ It is therefore necessary to present the marginal effects corresponding to the values of the explanatory variable.

  • The width of the 95% confidence interval varies with the value of the explanatory variable (campaign expenditure), because the standard error of the marginal effect changes along with it.
  • This is how we assess whether the main explanatory variable has a statistically significant effect on the outcome.
  • In the example above, the entire 95% confidence interval lies above 0 within the observed range of campaign expenditure.
    → This indicates that campaign expenditure has a statistically significant positive effect on electoral outcomes, regardless of the amount of expenditure.
  • Conversely, when the entire 95% confidence interval lies below 0, the effect is also statistically significant (but negative).
  • As the figure illustrates, in logistic regression analysis the effect of a key explanatory variable can vary not only in magnitude but also in statistical significance.
  • There may be ranges where the effect is statistically significant and ranges where it is not.
  • For example, the effect might be statistically significant when the explanatory variable is less than 1 or greater than 2, but not when it lies between 1 and 2.
  • It is extremely difficult to tell from the estimated regression coefficients alone in which ranges of the explanatory variable the marginal effects are statistically significant.
    → Only by visualizing the marginal effects can we see
    (a) where they are positive or negative, and
    (b) in which ranges they are statistically significant.

★Only through the graphical depiction of marginal effects can it be determined across which ranges they take particular signs and whether they are statistically significant.

3.8 Evaluating Model Fit

  • There are three main methods to evaluate the fit of a logistic regression model:
  1. Prediction Accuracy
  2. ROC Curve and AUC
  3. Akaike Information Criterion: AIC
Prediction Accuracy
  • The model we aim to evaluate here is as follows:
  • The response variable is binary, indicating whether each candidate “won” or “lost” (wlsmd = 1 or 0).
model_1 <- glm(wlsmd ~ previous + expm, 
               data = hr21,
               family = binomial(link = "logit"))
  • We examine how accurately the above estimated logistic regression equation predicts electoral outcomes.
  • Specifically, we create a cross-tabulation of “predicted outcomes” versus “actual outcomes.”
  • A predicted winning probability of 0.5 or greater is interpreted as a prediction of “win.”
  • Using the fitted() function, we extract the predicted values from model_1 and construct the cross-tabulation of “predicted outcomes” and “actual outcomes.”
Pred <- (fitted(model_1) >= 0.5) %>%
  factor(levels = c(FALSE, TRUE), 
         labels = c("Predicted Loss", "Predicted Win"))

table(Pred, hr21$wlsmd) %>% addmargins()
                
Pred               0   1 Sum
  Predicted Loss 476 120 596
  Predicted Win   70 169 239
  Sum            546 289 835

Predicted Outcomes Based on Estimated Values

  • Among the 546 candidates who actually lost, 476 were correctly predicted as losses.
  • For the remaining 70, the model incorrectly predicted wins.
  • Among the 289 candidates who actually won, 169 were correctly predicted as wins.
  • For the other 120 of the 289 winners, the model incorrectly predicted losses.
  • Overall, out of 835 candidates, 645 (476 + 169) were correctly predicted.
  • The remaining 190 were misclassified.
    → Therefore, the accuracy of this model is 645/835 (about 77%).
  • How should we evaluate this 77% accuracy?
  • It is not that logistic regression raised the prediction accuracy from 0% to 77%; we need a baseline for comparison.

Actual Outcomes = Baseline Accuracy

  • Here, out of 835 candidates, 289 actually won
    → If we make a naive prediction of “everyone wins” without including any explanatory variables….
    → The prediction accuracy would be 289/835 (about 34%)

Prediction Accuracy

model_1 improved prediction accuracy from 34% to 77%, an increase of 43 percentage points.
  • Whether the prediction accuracy of the logistic regression model can be considered high should be evaluated by comparing it to the baseline accuracy that could be achieved without using any explanatory variables.
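Both accuracy figures can be computed directly (a sketch using the objects defined above):

# Model accuracy: share of candidates whose predicted outcome (cutoff 0.5)
# matches the actual outcome
mean((fitted(model_1) >= 0.5) == (hr21$wlsmd == 1))  # 645/835 ≈ 0.77
# Baseline accuracy of the naive "everyone wins" prediction
mean(hr21$wlsmd == 1)                                # 289/835 ≈ 0.346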

ROC Curve and AUC

(1) A visual method for evaluating model fit

  • Receiver Operating Characteristic (ROC) curve
How to draw an ROC curve
  • On the horizontal axis of the ROC curve, we plot the false positive rate (FPR).
  • A false positive means “an observation that is actually negative but is incorrectly classified as positive.”
  • In this context, “predicting a win when the candidate actually lost” = false positive.
  • If we set the cutoff for predicted winning probability at 0.5, the false positive rate is 70/546 (see the table below).
Pred <- (fitted(model_1) >= 0.5) %>%
  factor(levels = c(FALSE, TRUE), labels = c("Predicted Loss", "Predicted Win"))

table(Pred, hr21$wlsmd) %>% addmargins()
                
Pred               0   1 Sum
  Predicted Loss 476 120 596
  Predicted Win   70 169 239
  Sum            546 289 835
  • Because false positives are misclassifications, it is desirable for the false positive rate to be small.
  • On the vertical axis, we plot the true positive rate (TPR).
  • The true positive rate is also called sensitivity.
  • A true positive means “an observation that is actually positive and is correctly classified as positive.”
  • In this context, “predicting a win for a candidate who actually won” = true positive.
  • If the cutoff probability is set at 0.5, the sensitivity is 169/289 (see the table above).
  • Because sensitivity represents the probability of a correct classification, larger values are desirable.
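Both rates can be read off the cross-tabulation programmatically (a sketch reusing the Pred table from above):

tab <- table(Pred, hr21$wlsmd)
c(FPR = tab["Predicted Win", "0"] / sum(tab[, "0"]),  # 70 / 546 ≈ 0.13
  TPR = tab["Predicted Win", "1"] / sum(tab[, "1"]))  # 169 / 289 ≈ 0.58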
ROC curve of a well-fitting model:

→ The curve rises steeply from point (0, 0) to near point (0, 1), then proceeds toward point (1, 1).

ROC curve of a poorly fitting model:

→ The ROC curve lies close to the 45-degree diagonal line.
  • The two models estimated here are as follows (note that model_1 is re-estimated here with expm as the only explanatory variable):

model_1 <- glm(wlsmd ~ expm, data = hr21,
               family = binomial(link = "logit"))
model_2 <- glm(wlsmd ~ expm + previous, data = hr21,
               family = binomial(link = "logit"))
  • To determine which model, model_1 or model_2, has a better fit, we draw the ROC curve.
  • To draw the ROC curve, we use the ROCR package.
  • Note that the function named prediction() exists not only in the ROCR package but also in the margins package.
    → Therefore, we must explicitly specify which function to use.
pi1 <- predict(model_1, type = "response")
pi2 <- predict(model_2, type = "response")
pr1 <- ROCR::prediction(pi1, labels = hr21$wlsmd == "1")
pr2 <- ROCR::prediction(pi2, labels = hr21$wlsmd == "1")
roc1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
roc2 <- performance(pr2, measure = "tpr", x.measure = "fpr")
df_roc <- tibble(fpr = c(roc1@x.values[[1]], roc2@x.values[[1]]),
                 tpr = c(roc1@y.values[[1]], roc2@y.values[[1]])) %>% 
  mutate(model = rep(c("Model 1", "Model 2"), c(n()/2, n()/2)))
roc <- ggplot(df_roc, aes(x = fpr, y = tpr,
                          color = model, linetype = model)) + 
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  scale_linetype_discrete(name = "") +
  scale_color_discrete(name = "") +
  coord_fixed() +
  labs(x = "False Positive Rate (1 - Specificity)", 
       y = "True Positive Rate (Sensitivity)")
print(roc)

Model Fit Based on the ROC Curve

The blue ROC curve (model_2) deviates from the 45-degree line (dashed line) toward the point (0, 1), indicating a good model fit.

If you want to plot only the ROC curve for Model 2:

# Calculate predicted values
pi2 <- predict(model_2, type = "response")

# Create an object for ROCR
pr2 <- ROCR::prediction(pi2, labels = hr21$wlsmd == "1")
roc2 <- performance(pr2, measure = "tpr", x.measure = "fpr")

# Convert to a data frame
df_roc2 <- data.frame(
  fpr = roc2@x.values[[1]],
  tpr = roc2@y.values[[1]]
)

# Plot the ROC curve with ggplot
roc_plot2 <- ggplot(df_roc2, aes(x = fpr, y = tpr)) +
  geom_line(color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  coord_fixed() +
  labs(
    x = "False Positive Rate (1 - Specificity)",
    y = "True Positive Rate (Sensitivity)",
    title = "ROC Curve: Model 2"
  )

print(roc_plot2)

(2) Evaluation of Model Fit Using Numerical Indicators

AUC

  • There are limitations to assessing model fit solely by visual inspection.
    → Instead, we use AUC (area under the curve: the area under the ROC curve).
  • Specifically, we calculate the area under the ROC curve within the range 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
  • All ROC curves pass through the two points (0, 0) and (1, 1).
  • A better-fitting model’s curve passes closer to (1, 1).
  • If the ROC curve passes through the three points (0, 0), (0, 1), and (1, 1), then → the area under the curve equals 1.
  • Good model fit (AUC) → closer to 1.
  • Poor model fit (AUC) → closer to 0.5 (since the ROC curve approaches the 45-degree diagonal).
  • From the ROC curves, there does not appear to be a large difference in model fit between the two models.
  • For confirmation, we calculate the AUC:
auc1 <- performance(pr1, measure = "auc")
auc1@y.values[[1]]   # AUC of model_1
[1] 0.7825171
auc2 <- performance(pr2, measure = "auc")
auc2@y.values[[1]]   # AUC of model_2
[1] 0.853502
Model Fit Based on AUC

Judging by the AUC, the fit of model_2 is superior.

4. Logistic Regression Analysis (2)

When x is a categorical variable

  • The steps for conducting logistic regression analysis are as follows:
  1. Formulate the alternative hypothesis and the null hypothesis
  2. Display a scatterplot of the explanatory and response variables
  3. Estimate the logistic regression equation
  4. Test the significance of the regression coefficients
  5. Interpret the meaning of the estimated results
  6. Evaluate the model fit of the logistic regression model

4.1 Formulating the Alternative and Null Hypotheses

  • The hypothesis to be tested (i.e., the alternative hypothesis) is as follows:

Hypothesis 2 Female candidates have a higher probability of winning in single-member districts than male candidates.

  • Accordingly, the null hypothesis to be set up for potential rejection is as follows:

Null Hypothesis A candidate’s gender is unrelated to the probability of winning in single-member districts.

  • In hypothesis testing for logistic regression, just as in multiple regression analysis, if the obtained p-value is smaller than the significance level, we reject the null hypothesis and accept the alternative hypothesis.
  • To test Hypothesis 2, we consider the following model.

Model 2

  • Using the downloaded 2021 House of Representatives election data, display a scatterplot with the candidate’s electoral outcome in single-member districts (wlsmd) on the vertical axis and the candidate’s gender (female) on the horizontal axis.
Data Preparation (hr96-24.csv)
  • Configure the ggplot2 theme and font settings.
if (.Platform$OS.type == "windows") { 
  if (require(fontregisterer)) {
    my_font <- "Yu Gothic"
  } else {
    my_font <- "Japan1"
  }
} else if (capabilities("aqua")) {
  my_font <- "HiraginoSans-W3"
} else {
  my_font <- "IPAexGothic"
}

theme_set(theme_gray(base_size = 9,
                     base_family = my_font))
  • Download the vote share data from the House of Representatives general elections hr96-24.csv.
  • After downloading, load the data and assign it the name hr.
hr <- read_csv("data/hr96-24.csv", na = ".")
  • Display the contents of the data frame hr.
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • Use the select() function to extract only the four variables: year, wl, previous, and gender.
  • Use the filter() function to keep only the data from the 2021 House of Representatives election.
  • Check the contents of wl.
unique(hr$wl)
[1] 1 0 2

0・・・Lost in the single-member district
1・・・Won in the single-member district
2・・・Elected through proportional representation (after losing in the single-member district)

  • Check the contents of gender.
unique(hr$gender)
[1] "male"   "female"
  • Using wl, create a variable wlsmd to indicate election outcomes (1 = elected in the single-member district, 0 = otherwise).
  • Using gender, create a dummy variable female (1 = female, 0 = male).
hr21a <- hr %>%
  select(year, wl, previous, gender) %>%
  filter(year == 2021) %>%
  mutate(wlsmd = ifelse(wl == 1, 1, 0)) %>%   # 1 = won in the SMD, 0 = otherwise
  mutate(female = ifelse(gender == "female", 1, 0)) %>% # 1 if female, 0 if male
  select(year, wlsmd, female, previous)       # Keep only the necessary variables
  • Display the contents of the data frame hr21a.
hr21a
# A tibble: 857 × 4
    year wlsmd female previous
   <dbl> <dbl>  <dbl>    <dbl>
 1  2021     1      0        3
 2  2021     0      0        2
 3  2021     0      1        0
 4  2021     1      0        8
 5  2021     0      0        0
 6  2021     1      0        8
 7  2021     0      0        3
 8  2021     1      0        3
 9  2021     0      0        6
10  2021     0      1        0
# ℹ 847 more rows
Data:

| Variable | Description |
|----------|-------------|
| year | Year of the House of Representatives election |
| wlsmd | Dummy for election outcome in the single-member district (1 = won, 0 = lost) |
| female | Gender dummy (1 = female, 0 = male) |
| previous | Number of times previously elected |
  • Display the descriptive statistics of the data frame hr21a.
summary(hr21a)
      year          wlsmd            female          previous     
 Min.   :2021   Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
 1st Qu.:2021   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.000  
 Median :2021   Median :0.0000   Median :0.0000   Median : 1.000  
 Mean   :2021   Mean   :0.3372   Mean   :0.1634   Mean   : 2.162  
 3rd Qu.:2021   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.: 3.000  
 Max.   :2021   Max.   :1.0000   Max.   :1.0000   Max.   :17.000  
  • The summary output confirms that there are no missing values.
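  • As an explicit check (a minimal sketch, not in the original), the missing values can also be counted directly:
colSums(is.na(hr21a))  # all zeros: no missing values in any column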

4.2 Displaying a Scatter Plot of the Explanatory and Response Variables

  • Display the descriptive statistics of the dataset used for analysis (hr21a).
stargazer::stargazer(as.data.frame(hr21a), 
  type = "html")
| Statistic | N | Mean | St. Dev. | Min | Max |
|-----------|---|------|----------|-----|-----|
| year | 857 | 2,021.000 | 0.000 | 2,021 | 2,021 |
| wlsmd | 857 | 0.337 | 0.473 | 0 | 1 |
| female | 857 | 0.163 | 0.370 | 0 | 1 |
| previous | 857 | 2.162 | 2.908 | 0 | 17 |
  • Draw a scatter plot of the candidate’s gender and the election outcome.
  • Visualize the relationship between the two variables.
  • To avoid character corruption, Mac users should run the following two lines:
theme_set(theme_gray(base_size = 10, 
                     base_family = "HiraginoSans-W3"))
  • Since both wlsmd (0 or 1) and female (0 or 1) are binary variables, use geom_jitter() to spread out the points for display.
p1 <- ggplot(hr21a, aes(x = female, y = wlsmd)) + 
  geom_jitter(size = 1,        # Spread out the points
              alpha = 1/3,
              width = 0.05,
              height = 0.05) +
  labs(x = "Candidate's Gender",
       y = "Election Outcome in Single-Member District (0: Lost, 1: Won)")

plot(p1)

  • If we fit a simple regression line to this plot, it looks like this.
p1 + geom_smooth(method = "lm", se = FALSE) +
  annotate("label",
           label = "Winning (0 or 1) = a + b*female",
           x = 0.5, y = 0.5,
           size = 5,
           colour = "blue",
           family = "HiraginoSans-W3")

Curve Obtained by Taking the Log of the Odds

p4 <- ggplot(hr21a, aes(x = female, y = wlsmd)) + 
  geom_jitter(size = 1,
              alpha = 1/3,
              width = 0.05,
              height = 0.05) +
  geom_smooth(method = "glm", 
    color = "red",
    method.args = list(family = binomial(link = "logit"))) +
   labs(x = "Candidate's Gender",
       y = "Election Outcome in Single-Member District (0: Lost, 1: Won)")

print(p4)

4.3 Estimating the Logistic Regression Equation

Logistic Regression Equation (Model 2)

  • Here, we use data from the 2021 House of Representatives election and consider a model in which the candidate’s gender explains the electoral outcome in single-member districts (wlsmd).

Model 2

  • This model can be expressed in the following equation:

\[\Pr(wlsmd = 1) = \mathrm{logit}^{-1}(\alpha + \beta_1 female + \beta_2 previous)\]

  • In R, logistic regression analysis is conducted using the glm() function, which fits Generalized Linear Models (GLMs).
  • glm() is a function that is frequently used not only for logistic regression but also for many other types of models.
  • To apply this function for logistic regression, you specify the argument family = binomial(link = "logit").
  • Here, logit refers to the logarithm of the odds = log-odds = log(odds); a small sketch of this transformation follows below.
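  • As an illustration (a sketch, not part of the original code), the logit and its inverse, which is the link function implied by family = binomial(link = "logit"):
logit     <- function(p) log(p / (1 - p))   # probability -> log-odds
inv_logit <- function(x) 1 / (1 + exp(-x))  # log-odds -> probability (same as plogis)
logit(0.5)    # 0: a 50% winning rate corresponds to log-odds of 0
inv_logit(0)  # 0.5: the inverse maps it back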
model_2 <- glm(wlsmd ~ female + previous, 
            data = hr21a, 
            family = binomial(link = "logit")) 
# Specify the coefficients in terms of the logarithm of the odds (“log-odds”)
  • Using tidy() from the broom package, display the results of model_2 in terms of odds ratios (exponentiate = TRUE).
library(broom)  # provides tidy()
tidy(model_2,
  conf.int = TRUE,
  exponentiate = TRUE)
# A tibble: 3 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    0.203    0.122     -13.1  4.79e-39    0.159     0.256
2 female         0.551    0.265      -2.25 2.44e- 2    0.322     0.912
3 previous       1.53     0.0350     12.2  1.89e-34    1.44      1.65 
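  • As a cross-check (a sketch, not in the original), exponentiating the raw log-odds coefficients reproduces the odds ratios shown above:
exp(coef(model_2))     # odds ratios
exp(confint(model_2))  # profile-likelihood 95% CIs, matching conf.low / conf.high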

Visualization of Odds Ratios using a Forest Plot

# Format results (convert to odds ratios)
results <- tidy(model_2, conf.int = TRUE, exponentiate = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term = recode(term,
                  "female" = "Gender (Female = 1, Male = 0)",
                  "previous" = "Previous Wins"),
    OR_label = sprintf("%.2f", estimate)   # Convert OR to string
  )

# Forest plot + numeric labels
ggplot(results, aes(x = estimate, y = term)) +
  geom_point(size = 3, color = "blue") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "red") +
  geom_text(aes(label = OR_label), # Display OR values
    hjust = 0.5, 
    vjust = -2, 
    size = 4) +  
  labs(
    title = "Odds Ratios from Logistic Regression",
    x = "Odds Ratio (95% CI)",
    y = ""
  ) +
  xlim(0, max(results$conf.high) * 1.2) +  # Add margin to avoid label cutoff
  theme_bw(base_size = 14, base_family = "HiraKakuProN-W3")

Odds Ratio for Gender (female): 0.551 (p-value = 2.44e-2)

  • Odds when the candidate is female: \(odds_{female=1}\)
  • Odds when the candidate is male: \(odds_{female=0}\)

\[\text{Odds Ratio} = \frac{odds_{female=1}}{odds_{female=0}} = 0.551\]

→ The odds of winning for a female candidate are 0.551 times those for a male candidate, holding previous wins constant. → This indicates that female candidates are less likely to be elected.

Odds Ratio for Previous Wins (previous): 1.53 (p-value = 1.89e-34)

\[\text{Odds Ratio} = \frac{odds_{previous+1}}{odds_{previous}} = 1.53\]

  • Odds after increasing the number of previous wins by one: \(odds_{previous+1}\)
  • Odds before the increase: \(odds_{previous}\)
    → The odds after one additional previous win are 1.53 times those before.
    → Candidates with more prior wins are more likely to be elected.
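  • As a worked illustration (a sketch for a hypothetical male candidate, not from the original), converting the model’s odds into probabilities shows what a 1.53-fold increase in the odds means:
b <- coef(model_2)
odds0 <- exp(b["(Intercept)"])       # odds with previous = 0 (about 0.203)
odds1 <- odds0 * exp(b["previous"])  # one additional win multiplies the odds by ~1.53
c(p0 = odds0 / (1 + odds0),          # odds -> probability: p = odds / (1 + odds)
  p1 = odds1 / (1 + odds1))          # roughly 0.17 vs. 0.24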

Statistical Significance

  • p-value for female = 2.44e-2
  • p-value for previous = 1.89e-34
    → At the 0.05 significance level, the null hypothesis of “no effect” can be rejected for both variables.
    → Both female and previous have statistically significant effects on electoral outcomes; the effect of previous is positive, while the effect of female is negative (its odds ratio is below 1).

4.5 Presenting Results in a Desirable Way

  • Calculate the predicted probability of election for each gender
  • Fix previous experience (previous) at its mean.
library(ggplot2)
library(dplyr)

# Fix previous at its mean, set female = 0 / 1
newdata <- data.frame(
  female = c(0, 1),
  previous = mean(hr21a$previous, na.rm = TRUE)
)

# Predicted values and standard errors
pred <- predict(model_2, 
  newdata = newdata, 
  type = "response", 
  se.fit = TRUE)

# Data for plotting
plot_df <- newdata %>%
  mutate(
    fit = pred$fit,
    se = pred$se.fit,
    lower = fit - 1.96 * se,
    upper = fit + 1.96 * se,
    sex = factor(female, labels = c("male", "female"))
  )

# Plot: add p-value as text in the figure
plot_prediction <- ggplot(plot_df, aes(x = sex, y = fit)) +
  geom_point(size = 5, color = "black") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) +
  geom_text(aes(label = paste0(round(fit * 100, 1), "%")), 
            hjust = -0.5, size = 5) +
  geom_text(
    data = data.frame(x = 1.5, y = max(plot_df$upper) + 0.05),
    aes(x = x, y = y, label = "p = 0.024"),
    inherit.aes = FALSE,
    size = 5
  ) +
  labs(
    title = "Predicted Probability of Election by Gender",
    x = "Candidate Gender",
    y = "Predicted Probability of Election"
  ) +
   theme_bw(base_family = "HiraKakuProN-W3")

plot_prediction

Results ・The figure shows the difference in predicted probabilities of election by gender.
・Prior experience (previous) is fixed at the mean (2.162 times).
・The predicted probability of election for male candidates is 33.8%.
・The predicted probability of election for female candidates is 22%.
・There is a statistically significant difference between the two (p = 0.024).
・The difference in predicted probability of election is about 11.8 percentage points (33.8 − 22 = 11.8).
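・The gap can also be read directly from the prediction data frame (a quick check, not in the original):
diff(plot_df$fit)  # female minus male: about -0.118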

  • When the independent variable is categorical, if the main analytical interest is the effect of gender differences on the probability of election, presenting the results in this way is sufficient.
  • However, if the main interest is how the effect of female candidates changes depending on conditions, then it is necessary to present the conditional marginal effects for female = 0 and female = 1.

4.6 A More Desirable Way of Presenting Results

Calculate the Average Marginal Effect (AME) and present the results for male and female candidates.
  • For a categorical variable (female), compute for each observation how much the predicted probability of election changes when shifting from male → female.
  • Then, take the average of these differences → this is the Average Marginal Effect (AME).
  • This allows us to state: “On average, female candidates have a ○ percentage point lower probability of election than male candidates.”
  • The meaning of the coefficient thus becomes even clearer (a hand-rolled sketch of this computation is shown below, before the margins() call).
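  • Before running margins(), here is a minimal sketch of the discrete-change version of this computation (an illustration only; margins() uses a numerical derivative for a numeric dummy, so its value can differ slightly):
# Predict each candidate once as male and once as female, then average the gaps
p_male   <- predict(model_2, newdata = transform(hr21a, female = 0), type = "response")
p_female <- predict(model_2, newdata = transform(hr21a, female = 1), type = "response")
mean(p_female - p_male)  # close to the AME reported below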
# Required packages
library(margins)
library(dplyr)
library(ggplot2)
# --- Function to format p-values ---
format_p <- function(x) {
  ifelse(x < 0.001,
         "p < 0.001",
         formatC(x, format = "e", digits = 3))
}

# --- Calculation of marginal effects ---
mfx <- margins(model_2, variables = "female")

summary(mfx)
 factor     AME     SE       z      p   lower   upper
 female -0.0976 0.0431 -2.2633 0.0236 -0.1821 -0.0131

Results (Comparison of Marginal Effects)

・The “effect of gender” (the difference between male and female candidates) can be identified: the AME shows the average impact on the probability of election when gender changes from male to female.
・AME = -0.0976 → On average, female candidates have about a 9.8 percentage point lower probability of election than male candidates.
・This effect is statistically significant (p-value = 0.0236).

6. Exercise

  • Logistic Regression Analysis Using the 2009 General Election Data (the Democratic Party’s change of government): using the 2009 general election data, in which the Democratic Party achieved a change of government, conduct a logistic regression analysis with whether a candidate won a seat in a single-member district (wlsmd) as the dependent variable, and the number of past wins (previous) and campaign expenditures (expm) as explanatory variables.
  • Answer the following questions.
  • The election data can be downloaded from hr96-24.csv.
  • Q1. Draw a scatter plot with election outcome (wlsmd) on the vertical axis and campaign expenditures (expm) on the horizontal axis, fitting an ordinary regression line (even if inappropriate).
  • Q2. Draw a scatter plot with predicted probability of winning on the vertical axis and campaign expenditures (expm) on the horizontal axis (the curve obtained from the odds of wlsmd).
  • Q3. Conduct a logistic regression analysis with wlsmd as the dependent variable and expm as the explanatory variable. Name this model model_1 and display the results using the tidy() function.
  • Q4. Conduct a logistic regression analysis with wlsmd as the dependent variable and previous and expm as the explanatory variables. Name this model model_2 and display the results using the tidy() function.
  • Q5. Present the results of model_1 and model_2 together in one table using the stargazer() function.
  • Q6. Explain the meaning of the coefficients for expm and previous in model_2, and discuss what can be inferred from the results.
  • Q7. Use the margins::cplot() function to display the estimation results of model_2, and state your conclusion. Present the results using the patchwork package.
  • Q8. Discuss whether the inclusion of previous in model_2 increases or decreases the predictive accuracy compared to model_1.
  • Q9. Compare and present the ROC curves of model_1 and model_2, and state which model is better.
  • Q10. In model_2, estimate the predicted probability of winning for each value of previous and its statistical significance using the margins::cplot() function.