library(corrplot)
library(jtools)
library(margins)
library(ROCR)
library(patchwork)
library(prediction)
library(stargazer)
library(tidyverse)
Vote share (or number of votes) — a continuous variable → Regression Analysis.
Electoral outcome (win/loss) — coded as 1 if elected, 0 if not → Logistic Regression Analysis
Regression Analysis | Logistic Regression Analysis | |
Response Variable | Vote share (%) | Electoral outcome (0 or 1) |
Explanatory Variable | Campaign expenditure (million yen) | Campaign expenditure (million yen) |
・In logistic regression analysis, results are expressed not in terms of “proportions” but in terms of “odds.”
::: {.kakomi-box11} Proportion, Odds, Odds Ratio, Logit Transformation
・Suppose there are 100 candidates who spend 8 million yen on their
campaigns.
・Among them, 60 win and 40 lose.
・In this case, the winning “proportion” and the winning “odds” can be
calculated as follows:
p
: probability that an event occurs
1 - p
: probability that an event does not occur
\[Odds = \frac{p}{1-p}\] If the probability of winning is 75% (p = 0.75), the odds are:
\[Odds = \frac{p}{1-p} =
\frac{0.75}{0.25}= 3\] - The meaning of “odds = 3” is:
→ There is a 3-to-1 chance of winning.
p = 0.01
and
1 - p = 1 - 0.01 = 0.99
into the formula for odds:\[Odds = \frac{p}{1-p} = \frac{0.01}{0.99} = \frac{1}{99} = 0.01\]
・If the probability of an event is 1% (p = 0.01), the odds are
0.01.
- This means “a 0.01-to-1 chance of winning.”
- Equivalently, it can be expressed as “a 1-to-100 chance of
winning.”
- In other words, “there is a chance of winning only once in 100
trials.”
Characteristics of odds ・The larger
the odds, the more likely the event (p) is to occur.
・The minimum value of odds is 0.
・Odds = 1 means that the probability (p) of the event occurring is
50%.
・The maximum value of odds is infinity (∞).
・It is the ratio of the odds between two groups.
\[Odds Ratio = \frac{odds_1}{odds_2}\]
・Odds of winning for male candidates = 3
・Odds of winning for female candidates = 2
\[Odds Ratio = \frac{odds_1}{odds_2}= \frac{odds_{male}}{odds_{female}}=\frac{3}{2} = 1.5\]
・This is interpreted as:
→ Male candidates are 1.5 times more likely to win than female
candidates.
Basic Knowledge of the Logit Model: Logit Transformation
・A logit transformation means taking the logarithm of the odds of a
binary variable (0 or 1).
・Since the lower bound of odds is 0, they are difficult to use directly
as an explanatory variable.
→ By applying a logarithmic transformation to odds, we compute the log
odds.
→ This is what we call the logit =
log(odds) = log-odds
・logit
is the name of a function (commonly used in
statistical modeling)
\[log(odds) =log\frac{p}{1-p}\]
・A logit (logit) can be understood as the logarithm of the odds (log odds).
Characteristics of log-odds (=
logit) ・Both the minimum value and the maximum value are
infinite (∞).
・log(odds) = 0 means that the probability (p) of the event occurring is
50%.
Multiple Regression | Ordinary Least Square (OLS) |
Logistic Regression | Maximum Likelihood Estimation (MLE) |
Probability | The magnitude of the possibility of what will occur in the future |
Likelihood | The magnitude of the possibility underlying what has already occurred |
\[\frac{3}{5} × \frac{2}{5} × \frac{3}{5} = \frac{18}{125}\]
Number of balls | Likelihood |
Case: 1 red, 4 white | \(\frac{1}{5}×\frac{4}{5}×\frac{1}{5}= \frac{4}{125}\) |
Case: 2 red, 3 white | \(\frac{2}{5}×\frac{3}{5}×\frac{2}{5}= \frac{12}{125}\) |
Case: 3 red, 2 white | \(\frac{3}{5}×\frac{2}{5}×\frac{3}{5}= \frac{18}{125}\) |
Case: 4 red, 1 white | \(\frac{4}{5}×\frac{1}{5}×\frac{4}{5}= \frac{16}{125}\) |
Key Point Logistic regression employs an estimation method called Maximum Likelihood Estimation (MLE).
・For example, suppose we want to analyze the relationship Campaign
Expenditure (X) → Electoral Outcome (Y).
・Since campaign expenditure (X) is an observed (fixed) variable, it
cannot be estimated.
Hypothesis 1 The more money a candidate spends on the campaign, the higher their probability of winning in a single-member district.
Null Hypothesis The amount of campaign expenditure has no relationship with the probability of winning in a single-member district.
In hypothesis testing within logistic regression, just as in multiple regression analysis, if the obtained p-value is smaller than the significance level, we reject the null hypothesis and accept the alternative hypothesis.
To test Hypothesis 1, we consider the following model.
hr96-24.csv
)Download the vote share data from the House of Representatives
general elections: hr96-24.csv.
Using the downloaded 2021 House of Representatives election data, display a scatterplot with electoral outcome in single-member districts (wlsmd) on the vertical axis and campaign expenditure (expm) on the horizontal axis.
Once the dataset has been downloaded, load it and assign it to the object hr:
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
select()
function to keep only the four
variables: year, wl, previous, and exp.filter()
function to retain only the data from
the 2021 House of Representatives election.[1] 1 0 2
0 = Lost in single-member district
1 = Won in single-member district
2 = Elected via proportional representation (“revival win”)
hr21 <- hr %>%
select(year, wl, previous, exp) %>%
filter(year == 2021) %>%
mutate(wlsmd = ifelse(wl == 1, 1, 0)) |>
mutate(expm = exp / 1000000) |>
select(year, wlsmd, expm, previous)
hr21
.# A tibble: 857 × 4
year wlsmd expm previous
<dbl> <dbl> <dbl> <dbl>
1 2021 1 13.4 3
2 2021 0 9.62 2
3 2021 0 NA 0
4 2021 1 8.82 8
5 2021 0 11.4 0
6 2021 1 7.42 8
7 2021 0 11.9 3
8 2021 1 13.0 3
9 2021 0 7.48 6
10 2021 0 4.61 0
# ℹ 847 more rows
データ:
Variable | Description |
---|---|
year |
Year in which the House of Representatives election was held |
wlsmd |
Dummy variable for electoral outcome in single-member districts (1 = win, 0 = loss) |
previous |
Number of times previously elected |
expm |
Candidate’s campaign expenditure (in million yen) |
year wlsmd expm previous
Min. :2021 Min. :0.0000 Min. : 0.009319 Min. : 0.000
1st Qu.:2021 1st Qu.:0.0000 1st Qu.: 3.435080 1st Qu.: 0.000
Median :2021 Median :0.0000 Median : 5.899882 Median : 1.000
Mean :2021 Mean :0.3372 Mean : 6.434585 Mean : 2.162
3rd Qu.:2021 3rd Qu.:1.0000 3rd Qu.: 8.692808 3rd Qu.: 3.000
Max. :2021 Max. :1.0000 Max. :27.443685 Max. :17.000
NA's :22
na.omit()
to keep only observations without missing
values: year wlsmd expm previous
Min. :2021 Min. :0.0000 Min. : 0.009319 Min. : 0.000
1st Qu.:2021 1st Qu.:0.0000 1st Qu.: 3.435080 1st Qu.: 0.000
Median :2021 Median :0.0000 Median : 5.899882 Median : 1.000
Mean :2021 Mean :0.3461 Mean : 6.434585 Mean : 2.211
3rd Qu.:2021 3rd Qu.:1.0000 3rd Qu.: 8.692808 3rd Qu.: 3.000
Max. :2021 Max. :1.0000 Max. :27.443685 Max. :17.000
NA's
) have been
removed. Statistic | N | Mean | St. Dev. | Min | Max |
year | 835 | 2,021.000 | 0.000 | 2,021 | 2,021 |
wlsmd | 835 | 0.346 | 0.476 | 0 | 1 |
expm | 835 | 6.435 | 4.287 | 0.009 | 27.444 |
previous | 835 | 2.211 | 2.921 | 0 | 17 |
wlsmd
is a binary variable (0 or 1), use the
jitter()
function to spread out the data points for
display.p <- ggplot(hr21, aes(x = expm, y = wlsmd)) +
geom_jitter(size = 1, #Specification to display the data with jittering
alpha = 1/3,
width = 0,
height = 0.05) +
labs(x = "Campaign Expenditure (measured in units of one million yen)",
y = "Win/loss in single-member districts")
plot(p)
p + geom_smooth(method = "lm", se = FALSE) +
annotate("label",
label = "wlsmd (0 or 1) = a + bexpm",
x = 6, y = 0.75,
size = 5,
colour = "blue",
family = "HiraginoSans-W3")
Apply a logit transformation to
the probability of winning.
The probability of winning, \(P\), takes values between 0 and 1 (a binary outcome).
\(\frac{P}{1-P}\) represents the
odds (see the column “Basic Knowledge of the Logit Model: Odds”).
Odds are defined as the ratio of the probability of winning, \(\mathrm{p}\), to the probability of losing, \(\mathrm{1 - p}\).
The relationship between “probability of winning \(P\)” and “gender” can be expressed as follows:
\[\frac{P}{1-p} = a + (CampaignExpenditure)b\]
p3 <- ggplot(hr21, aes(x = expm, y = wlsmd)) +
geom_jitter(size = 1,
alpha = 1/3,
width = 0,
height = 0.05) +
geom_smooth(method = "glm",
color = "red",
method.args = list(family = binomial(link = "logit"))) +
labs(x = "Campaign Expenditure (measured in units of one million yen)",
y = "Probability of winning in single-member districts")
print(p3)
Logistic Regression Equation (Model 1) ・Here, using the 2021 House of Representatives election data, we consider a model in which campaign expenditure (expm, measured in units of one million yen) explains the electoral outcome in single-member districts (wlsmd).
\[Pr(Winning)=logit^{−1}(𝛼+\beta_1Expenditure + \beta_2PreviousWins)\]
glm()
function, which is designed to fit Generalized Linear
Models (GLMs).glm()
function is widely used not only for logistic
regression but also for other types of models.family = binomial(link = "logit")
.model_1 <- glm(wlsmd ~ expm + previous,
data = hr21,
family = binomial(link = "logit")) # Specify the coefficients in terms of log odds
tidy()
to check the estimation results:# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.65 0.194 -13.7 9.90e-43 -3.04 -2.28
2 expm 0.170 0.0235 7.23 4.86e-13 0.125 0.217
3 previous 0.342 0.0355 9.62 6.38e-22 0.274 0.413
→ The regression coefficients displayed under estimate represent the
logarithm of the odds = Log(odds) = also called Log-Odds.
⇒ Since Log-Odds are difficult to interpret directly, they are usually
converted into odds ratios or probabilities.
\[Pr(Winning) = logit^{-1}(\alpha + \beta_1Expenditure + \beta_2PreviousWins)\]
\[= \frac{1}{1+exp(-[\alpha + \beta_1Expenditure+\beta_2PreviousWins])}\] \[= \frac{1}{1+exp(-[-2.65 + 0.170[expm] + 0.34[previous]])}\]
expm
) is about 0.170.log-odds
= log(odds)
=
logit.
Note!
Log-odds
are difficult to
interpret
→ Convert them into odds ratios
or probabilities
.
tidy()
, let’s display the results of
model_1
in terms of odds ratios
# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0705 0.194 -13.7 9.90e-43 0.0476 0.102
2 expm 1.19 0.0235 7.23 4.86e-13 1.13 1.24
3 previous 1.41 0.0355 9.62 6.38e-22 1.32 1.51
# Format results (convert to odds ratios)
results <- tidy(model_1, conf.int = TRUE, exponentiate = TRUE) %>%
filter(term != "(Intercept)") %>%
mutate(
term = recode(term,
"expm" = "Campaign Expenditure (million yen)",
"previous" = "Previous Wins"),
OR_label = sprintf("%.2f", estimate) # Convert OR to string
)
# Forest plot + numeric labels
ggplot(results, aes(x = estimate, y = term)) +
geom_point(size = 3, color = "blue") +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
geom_vline(xintercept = 1, linetype = "dashed", color = "red") +
geom_text(aes(label = OR_label), # Display OR values
hjust = 0.5,
vjust = -2,
size = 4) +
labs(
title = "Odds Ratios from Logistic Regression",
x = "Odds Ratio (95% CI)",
y = ""
) +
theme_minimal(base_size = 14) + # Add margin to avoid label cutoff
xlim(0, max(results$conf.high) * 1.2) +
theme_bw(base_family = "HiraKakuProN-W3")
\[Odd Ratio = \frac{odds_{expm + 1}}{odds_{expm}} = 1.19\]
\[OddRatios = \frac{odds_{previous + 1}}{odds_{previous}} = 1.41\]
expm
),
and calculate the effect at each
value.How should we interpret the coefficient of
expm? ・Here, the estimated coefficient (=log-odds
)
for expm is 0.170 (p-value = 4.86e-13).
・In multiple regression analysis, we analyze a linear model (a straight
line).
→ The slope and its statistical significance remain the same regardless
of the value of the explanatory variable.
・However, in logistic regression analysis, we analyze a nonlinear model
(a curve).
→ Depending on the size of campaign expenditure, both the magnitude of
the effect (marginal effect) and its statistical significance
change.
→ If campaign expenditure (expm) is statistically significant, then
“overall” campaign expenditure has a
positive effect on electoral outcomes, and that effect is statistically
significant.
→ However, this represents the average relationship between campaign
expenditure and electoral outcome.
→ It does not necessarily mean that the effect is statistically
significant at every level of expenditure.
→ Therefore, statistical significance should be checked for each level
of campaign expenditure.
・The same applies to the other explanatory variable, previous.
\[Pr(Winning) = \frac{1}{1+exp(-(-1.98 + 0.0735expm + 0.285previous))}\]
p_0 <- predict(model_1, type = "response", # option to display predicted winning probability
newdata = data_frame(previous = 3, expm = 0))
p_0
1
0.1642294
1
0.1889165
1
0.02468706
1
0.2163539
1
0.02743738
Summary ・Increasing campaign
expenditure from 0 yen to 1 million yen raises the predicted probability
of winning by 2.47 percentage points.
・Increasing campaign expenditure from 1 million yen to 2 million yen
raises the predicted probability of winning by 2.7 percentage
points.
・Both changes represent the same increment in expenditure (an
additional 1 million yen).
・However, their effects on the response variable (i.e., the probability
of winning) differ.
・The 1-million-yen increase from 1 million to
2 million has a greater effect on the response variable than the
increase from 0 to 1 million.
1 2 3 4 5 6 7 8
0.2163539 0.2465648 0.2794896 0.3149731 0.3527559 0.3924743 0.4336684 0.4757999
9 10 11 12 13 14 15 16
0.5182788 0.5604951 0.6018538 0.6418077 0.6798848 0.7157072 0.7490008 0.7795960
17 18 19 20 21 22 23 24
0.8074208 0.8324876 0.8548781 0.8747262 0.8922019 0.9074976 0.9208156 0.9323590
25 26 27
0.9423251 0.9509001 0.9582566
Summary
・It is difficult to determine exactly how many percentage points the
predicted winning probability will rise without explicitly calculating
the probabilities, simply by looking at the estimated coefficient:
log-odds
(expm = 0.170
).
Dependent variable: | |
wlsmd | |
expm | 0.170*** |
(0.024) | |
previous | 0.342*** |
(0.036) | |
Constant | -2.652*** |
(0.194) | |
Observations | 835 |
Log Likelihood | -397.969 |
Akaike Inf. Crit. | 801.939 |
Note: | p<0.1; p<0.05; p<0.01 |
・This estimated value represents the logarithm of the odds = Log Odds = logit.
→ Therefore, it is necessary to compute predicted probabilities at
specific values of expm and compare the differences.
margins()
function allows us to compute the
marginal effect of campaign expenditure at specific levels of
expenditure. at(previous) at(expm) expm
3 4 0.03424
expm
(0.03424) indicates: → “When
campaign expenditures increase by 1 million yen, how much the
probability of winning changes,” evaluated at expm = 4
and
previous = 3
.margins_1 <- margins(model_1, variables = "expm",
at = list(previous = 3, expm = c(0:28))) # Specify campaign expenditures from 0 yen to 28 million yen
margins_1
at(previous) at(expm) expm
3 0 0.023337
3 1 0.026052
3 2 0.028827
3 3 0.031585
3 4 0.034239
3 5 0.036685
3 6 0.038820
3 7 0.040540
3 8 0.041758
3 9 0.042406
3 10 0.042449
3 11 0.041884
3 12 0.040742
3 13 0.039087
3 14 0.037004
3 15 0.034595
3 16 0.031964
3 17 0.029215
3 18 0.026437
3 19 0.023710
3 20 0.021093
3 21 0.018631
3 22 0.016352
3 23 0.014273
3 24 0.012397
3 25 0.010723
3 26 0.009241
3 27 0.007938
3 28 0.006801
Predicted Winning Probability
differs from
Marginal Effects
margins::cplot()
what = "prediction"
what = "effect"
.previous = 2.211
(the mean value)hr21_me
)hr21_pred <- cplot(model_1,
x = "expm",
what = "prediction") %>% # setting to display predicted winning probability on the Y-axis
as_data_frame() %>%
ggplot(aes(x = xvals,
y = yvals,
ymin = lower,
ymax = upper)) +
geom_ribbon(fill = "gray") +
geom_line() +
labs(x = "Campaign Expenditure (units: 1 million yen)",
y = "Predicted Winning Probability",
title = "Predicted Winning Probability: 2021 House of Representatives Election")
hr21_me <- cplot(model_1,
x = "expm", # variable on the x-axis
dx = "expm", # explanatory variable
what = "effect") %>% # setting to display marginal effect on the y-axis
as_data_frame() %>%
ggplot(aes(x = xvals,
y = yvals,
ymin = lower,
ymax = upper)) +
geom_ribbon(fill = "gray") +
geom_line() +
geom_abline(intercept = 0, slope = 0, linetype = "dashed", color = "red") +
ylim(-0.0001, 0.06) +
labs(x = "Campaign Expenditure",
y = "Marginal Effect of Campaign Expenditure",
title = "Marginal Effect of Campaign Expenditure: 2021 House of Representatives Election")
Conclusion of Hypothesis Testing (Hypothesis 1)
1 2 3 4 5 6 7 8
0.1741271 0.1999460 0.2285339 0.2598815 0.2938910 0.3303645 0.3689990 0.4093895
9 10 11 12 13 14 15 16
0.4510409 0.4933899 0.5358339 0.5777650 0.6186045 0.6578331 0.6950162 0.7298182
17 18 19 20 21 22 23 24
0.7620087 0.7914599 0.8181363 0.8420812 0.8633997 0.8822428 0.8987913 0.9132429
25 26 27
0.9258014 0.9366680 0.9460359
at(previous) at(expm) expm
2.211 0 0.01929
2.211 1 0.02180
2.211 2 0.02445
2.211 3 0.02720
2.211 4 0.02998
2.211 5 0.03270
2.211 6 0.03528
2.211 7 0.03761
2.211 8 0.03959
2.211 9 0.04111
2.211 10 0.04210
2.211 11 0.04250
2.211 12 0.04229
2.211 13 0.04148
2.211 14 0.04011
2.211 15 0.03827
2.211 16 0.03604
2.211 17 0.03353
2.211 18 0.03083
2.211 19 0.02806
2.211 20 0.02530
2.211 21 0.02261
2.211 22 0.02005
2.211 23 0.01766
2.211 24 0.01547
2.211 25 0.01347
2.211 26 0.01168
2.211 27 0.01009
2.211 28 0.00868
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17
390 79 55 106 51 31 33 30 24 16 7 3 5 2 1 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.009319 3.435080 5.899882 6.434585 8.692808 27.443685
df_pre <- expand.grid(previous = seq(0, 17, by = 2),
expm = seq(0, 20, by = 5)) %>%
as_data_frame()
pred <- predict(model_1,
type = "response",
newdata = df_pre,
se.fit = TRUE)
df_pre$fit <- pred$fit
df_pre$lower <- with(pred, fit - 2 * se.fit)
df_pre$upper <- with(pred, fit + 2 * se.fit)
df_pre <- df_pre %>%
mutate(lower = ifelse(lower < 0, 0, lower),
upper = ifelse(upper > 1, 1, upper))
plt_prob <- ggplot(df_pre, aes(x = expm, y = fit)) +
geom_ribbon(aes(ymin = lower, ymax = upper), fill = "gray") +
geom_line() +
facet_wrap(. ~ previous) +
labs(x = "Campaign Expenditure (units: 1 million yen)",
y = "Predicted Winning Probability")
print(plt_prob)
Next, we check whether the effect of campaign expenditure on the
predicted probability of winning remains statistically significant even
when the number of prior wins varies.
We specify how the effect of campaign
expenditure (expm) changes as number of prior wins (previous)
increases
→ x = "previous"
, dx = "expm"
Up to previous = 15, the 95% confidence interval does not touch
the red dashed line at y = 0.
→ This means that, even as the number of prior wins increases, the
effect of campaign expenditure on the predicted probability of winning
remains statistically significant up to previous = 15.
Predicted Probability by Number of Prior Wins: Summary
・For candidates with fewer prior wins, campaign expenditure has a
strong positive effect on the predicted probability of winning.
・As the number of prior wins increases, the effect of campaign
expenditure on the predicted probability of winning becomes
smaller.
→ For candidates with 16 prior wins, the probability of winning is
almost guaranteed, regardless of campaign expenditure.
・This finding holds true for candidates with 0 to 15 prior wins, and
the effect is statistically significant.
・The impact of campaign expenditure on the predicted probability of
winning is greatest for candidates with 5 prior wins.
Statistical Significance of Marginal
Effects Predicted probabilities
and
marginal effects
are distinct concepts.
→ The marginal effect varies depending on the value of the
explanatory variable.
→ It is necessary to present the marginal effects corresponding to the
values of the explanatory variable. ・The 95% confidence interval varies
depending on the number of voters, because the standard error of the
marginal effect changes with the value of the explanatory variable
(campaign expenditure).
・Assessing whether the main explanatory variable has a statistically
significant effect on the outcome
・In the above example, within the observed range of voters, the entire
95% confidence interval lies above 0
→ This indicates that campaign expenditure has a statistically
significant positive effect on electoral outcomes, regardless of the
amount of expenditure.
・Conversely, when the entire 95% confidence interval lies below 0, the
effect is also statistically significant (negative).
・As the figure illustrates, in logistic regression analysis, the effect
of a key explanatory variable can vary not only in magnitude but also in
its statistical significance.
・There may exist both ranges where the effect is statistically
significant and ranges where it is not.
・For example, if the value of the explanatory variable is less than 1
or greater than 2, the effect is statistically significant.
・If the value of the explanatory variable is between 1 and 2, the
effect is not statistically significant.
・It is extremely difficult to determine from the estimated regression
coefficients alone which ranges of the explanatory variable yield
statistically significant marginal effects.
→ Only by visualizing the marginal effects can we see
(a) where they are positive or negative, and
(b) in which ranges they are statistically significant.
★Only through the graphical depiction of marginal effects can it be determined across which ranges they take particular signs and whether they are statistically significant.
fitted()
function, we extract the predicted
values from model_1 and construct the cross-tabulation of “predicted
outcomes” and “actual outcomes.”Pred <- (fitted(model_1) >= 0.5) %>%
factor(levels = c(FALSE, TRUE),
labels = c("Predicted Loss", "Predicted Win"))
table(Pred, hr21$wlsmd) %>% addmargins()
Pred 0 1 Sum
Predicted Loss 476 120 596
Predicted Win 70 169 239
Sum 546 289 835
・Among the 546 candidates who actually lost, 476 were correctly
predicted as losses
・For the remaining 70, the model incorrectly predicted wins
・Among the 289 candidates who actually won, 169 were correctly
predicted as wins
・For the other 120 of the 289 winners, the model incorrectly predicted
losses
・Overall, out of 835 candidates, 645 (476 + 169) were correctly
predicted
・The remaining 190 were misclassified
→ Therefore, the accuracy of this model is 645/835 (about 77%)
- How should we evaluate this 77% accuracy?
- It is not the case that the prediction accuracy increased from 0% to
77% simply because of logistic regression.
Prediction Accuracy
model_1
improved prediction accuracy from 34% to 77%, an
increase of 43 percentage points.
Pred <- (fitted(model_1) >= 0.5) %>%
factor(levels = c(FALSE, TRUE), labels = c("Predicted Loss", "Predicted Win"))
table(Pred, hr21$wlsmd) %>% addmargins()
Pred 0 1 Sum
Predicted Loss 476 120 596
Predicted Win 70 169 239
Sum 546 289 835
→ The curve rises steeply from point (0, 0) to near point (0, 1), then proceeds toward point (1, 1).
→ The ROC curve lies close to the 45-degree diagonal line.
- The two models estimated here are as follows:
pi1 <- predict(model_1, type = "response")
pi2 <- predict(model_2, type = "response")
pr1 <- ROCR::prediction(pi1, labels = hr21$wlsmd == "1")
pr2 <- ROCR::prediction(pi2, labels = hr21$wlsmd == "1")
roc1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
roc2 <- performance(pr2, measure = "tpr", x.measure = "fpr")
df_roc <- data_frame(fpr = c(roc1@x.values[[1]], roc2@x.values[[1]]),
tpr = c(roc1@y.values[[1]], roc2@y.values[[1]])) %>%
mutate(model = rep(c("Model 1", "Model 2"), c(n()/2, n()/2)))
roc <- ggplot(df_roc, aes(x = fpr, y = tpr,
color = model, linetype = model)) +
geom_line() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
scale_linetype_discrete(name = "") +
scale_color_discrete(name = "") +
coord_fixed() +
labs(x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)")
print(roc)
AUC
The
blue ROC curve (model_2) deviates from the 45-degree line (dashed line)
toward the point (0, 1), indicating a good model fit.
If you want to plot only the ROC curve for Model 2:
# Calculate predicted values
pi2 <- predict(model_2, type = "response")
# Create an object for ROCR
pr2 <- ROCR::prediction(pi2, labels = hr21$wlsmd == "1")
roc2 <- performance(pr2, measure = "tpr", x.measure = "fpr")
# Convert to a data frame
df_roc2 <- data.frame(
fpr = roc2@x.values[[1]],
tpr = roc2@y.values[[1]]
)
# Plot the ROC curve with ggplot
roc_plot2 <- ggplot(df_roc2, aes(x = fpr, y = tpr)) +
geom_line(color = "blue") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
coord_fixed() +
labs(
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)",
title = "ROC Curve: Model 2"
)
print(roc_plot2)
[1] 0.7825171
[1] 0.853502
AUC
Judging by the AUC
, the fit of model_2
is
superior.
Hypothesis 2 ・Female candidates have a higher probability of winning in single-member districts.
Null Hypothesis ・A candidate’s gender is unrelated to the probability of winning in single-member districts.
wlsmd
) on the vertical axis and
campaign expenditures (expm
) on the horizontal axis.hr96-24.csv
)if (.Platform$OS.type == "windows") {
if (require(fontregisterer)) {
my_font <- "Yu Gothic"
} else {
my_font <- "Japan1"
}
} else if (capabilities("aqua")) {
my_font <- "HiraginoSans-W3"
} else {
my_font <- "IPAexGothic"
}
theme_set(theme_gray(base_size = 9,
base_family = my_font))
hr96-24.csv
. [1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
select()
function to extract only the four
variables: year, wl, previous, and gender.filter()
function to keep only the data from
the 2021 House of Representatives election.[1] 1 0 2
0・・・Lost in the single-member district
1・・・Won in the single-member district
2・・・Elected through proportional representation (after losing in the
single-member district)
[1] "male" "female"
hr21a <- hr %>%
select(year, wl, previous, gender) %>%
filter(year == 2021) %>%
mutate(wlsmd = ifelse(wl == 1, 1, 0)) |>
mutate(female = ifelse(gender == "female", 1, 0)) |> # 1 if female, 0 if male
select(year, wlsmd, female, previous) # Select only the necessary variables
# A tibble: 857 × 4
year wlsmd female previous
<dbl> <dbl> <dbl> <dbl>
1 2021 1 0 3
2 2021 0 0 2
3 2021 0 1 0
4 2021 1 0 8
5 2021 0 0 0
6 2021 1 0 8
7 2021 0 0 3
8 2021 1 0 3
9 2021 0 0 6
10 2021 0 1 0
# ℹ 847 more rows
Variable | Description |
year | Year of the House of Representatives election |
wlsmd | Dummy for election outcome in the single-member district (1 = won, 0 = lost) |
female | Gender dummy (1 = female, 0 = male) |
previous | Number of times elected |
year wlsmd female previous
Min. :2021 Min. :0.0000 Min. :0.0000 Min. : 0.000
1st Qu.:2021 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000
Median :2021 Median :0.0000 Median :0.0000 Median : 1.000
Mean :2021 Mean :0.3372 Mean :0.1634 Mean : 2.162
3rd Qu.:2021 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 3.000
Max. :2021 Max. :1.0000 Max. :1.0000 Max. :17.000
Statistic | N | Mean | St. Dev. | Min | Max |
year | 857 | 2,021.000 | 0.000 | 2,021 | 2,021 |
wlsmd | 857 | 0.337 | 0.473 | 0 | 1 |
female | 857 | 0.163 | 0.370 | 0 | 1 |
previous | 857 | 2.162 | 2.908 | 0 | 17 |
jitter()
function to spread out the points for
display.p1 <- ggplot(hr21a, aes(x = female, y = wlsmd)) +
geom_jitter(size = 1, # Spread out the points
alpha = 1/3,
width = 0.05,
height = 0.05) +
labs(x = "Candidate's Gender",
y = "Election Outcome in Single-Member District (0: Lost, 1: Won)")
plot(p1)
p1 + geom_smooth(method = "lm", se = FALSE) +
annotate("label",
label = "Winnig (0 or 1) = a + bfemale",
x = 0.5, y = 0.5,
size = 5,
colour = "blue",
family = "HiraginoSans-W3")
p4 <- ggplot(hr21a, aes(x = female, y = wlsmd)) +
geom_jitter(size = 1,
alpha = 1/3,
width = 0.05,
height = 0.05) +
geom_smooth(method = "glm",
color = "red",
method.args = list(family = binomial(link = "logit"))) +
labs(x = "Candidate's Gender",
y = "Election Outcome in Single-Member District (0: Lost, 1: Won)")
print(p4)
wlsmd
).Model 2
\[Pr(wlsmd)=logit^{−1}(𝛼+\beta_1female + \beta_2previous)\]
glm()
function, which fits Generalized Linear Models
(GLMs).glm()
is a function that is frequently used not only
for logistic regression but also for many other types of models.family = binomial(link = "logit")
.log-odds
= log(odds)
.model_2 <- glm(wlsmd ~ female + previous,
data = hr21a,
family = binomial(link = "logit"))
# Specify the coefficients in terms of the logarithm of the odds (“log-odds”)
tidy()
, let’s display the results of
model_1
in terms of odds ratios# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.203 0.122 -13.1 4.79e-39 0.159 0.256
2 female 0.551 0.265 -2.25 2.44e- 2 0.322 0.912
3 previous 1.53 0.0350 12.2 1.89e-34 1.44 1.65
# Format results (convert to odds ratios)
results <- tidy(model_2, conf.int = TRUE, exponentiate = TRUE) %>%
filter(term != "(Intercept)") %>%
mutate(
term = recode(term,
"female" = "Gender (Female = 1, Male = 0)",
"previous" = "Previous Wins"),
OR_label = sprintf("%.2f", estimate) # Convert OR to string
)
# Forest plot + numeric labels
ggplot(results, aes(x = estimate, y = term)) +
geom_point(size = 3, color = "blue") +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
geom_vline(xintercept = 1, linetype = "dashed", color = "red") +
geom_text(aes(label = OR_label), # Display OR values
hjust = 0.5,
vjust = -2,
size = 4) +
labs(
title = "Odds Ratios from Logistic Regression",
x = "Odds Ratio (95% CI)",
y = ""
) +
theme_minimal(base_size = 14) + # Add margin to avoid label cutoff
xlim(0, max(results$conf.high) * 1.2) +
theme_bw(base_family = "HiraKakuProN-W3")
p-value = 2.44e-2
)\[OddRatio = \frac{odds_{female=1}}{odds_{female=0}}=0.55\]
→ The odds for a female candidate are 0.551 times those for a male candidate. → This indicates that women are less likely to be elected.
p-value = 1.89e-34
)\[OddRatio = \frac{odds_{previous + 1}}{odds_{previous}}=1.53\]
p-value
for female = 2.44e-2
value
for previous = 1.89e-34
library(ggplot2)
library(dplyr)
# Fix previous at its mean, set female = 0 / 1
newdata <- data.frame(
female = c(0, 1),
previous = mean(hr21a$previous, na.rm = TRUE)
)
# Predicted values and standard errors
pred <- predict(model_2,
newdata = newdata,
type = "response",
se.fit = TRUE)
# Data for plotting
plot_df <- newdata %>%
mutate(
fit = pred$fit,
se = pred$se.fit,
lower = fit - 1.96 * se,
upper = fit + 1.96 * se,
sex = factor(female, labels = c("male", "female"))
)
# Plot: add p-value as text in the figure
plot_prediction <- ggplot(plot_df, aes(x = sex, y = fit)) +
geom_point(size = 5, color = "black") +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) +
geom_text(aes(label = paste0(round(fit * 100, 1), "%")),
hjust = -0.5, size = 5) +
geom_text(
data = data.frame(x = 1.5, y = max(plot_df$upper) + 0.05),
aes(x = x, y = y, label = "p = 0.024"),
inherit.aes = FALSE,
size = 5
) +
labs(
title = "Predicted Probability of Election by Gender",
x = "Candidate Gender",
y = "Predicted Probability of Election"
) +
theme_bw(base_family = "HiraKakuProN-W3")
plot_prediction
Results ・The figure shows the
difference in predicted probabilities of election by gender.
・Prior experience (previous) is fixed at the mean (2.162 times).
・The predicted probability of election for male candidates is
33.8%.
・The predicted probability of election for female candidates is
22%.
・There is a statistically significant difference between the two (p =
0.024).
・The difference in predicted probability of
election is about 11.8 percentage points: 33.8 − 22 = 11.8
# --- Function to format p-values ---
format_p <- function(x) {
ifelse(x < 0.001,
"p < 0.001",
formatC(x, format = "e", digits = 3))
}
# --- Calculation of marginal effects ---
mfx <- margins(model_2, variables = "female")
summary(mfx)
factor AME SE z p lower upper
female -0.0976 0.0431 -2.2633 0.0236 -0.1821 -0.0131
Results (Comparison of Marginal Effects)
= It shows the average impact on the probability of election when
gender changes
・AME = -0.0976
→ On average, female candidates have
about a 9.8 percentage point lower probability of
election
・This effect is statistically significant
(p-value = 0.0236
)