R packages we use in this section
Research Question:
“Why is there a significant difference in the performance of Italian
regional governments?”
Theory:
Social capital enhances the performance of governments.
- 【Explanatory Variable】 Social capital
- 【Response Variable】 Government performance
Hypothesis:
If this theory is correct, then as cc (Civic Community Index) increases, gov_p (government performance) should also increase.
Variables:
- gov_p (Government Performance)
- cc (Civic Community Index)
- econ (Economic Indicators of Local Governments)
【Considering North-South Disparity in Italy】
Question 1: Is the degree of social capital accumulation (cc)
merely reflecting the regional differences (location) between the North
and South, and unrelated to government performance (gov_p)?
| Variable Type | Variable Name | Details |
|---|---|---|
| | region | Abbreviations for Italian regional governments |
| Response Variable | gov_p | Government Performance |
| Explanatory Variable | cc | Civic Community Index |
| Control Variable | econ | Economic Indicators of Local Governments (higher values indicate better economy) |
| Control Variable | location | North-South dummy ("north" for northern regions, "south" for southern regions) |

Here, a new dummy variable location is added as a control variable.
# A tibble: 20 × 5
region gov_p cc econ location
<chr> <dbl> <dbl> <dbl> <chr>
1 Ab 7.5 8 7 south
2 Ba 7.5 4 3 south
3 Cl 1.5 1 3 south
4 Cm 2.5 2 6.5 south
5 Em 16 18 13 north
6 Fr 12 17 14.5 north
7 La 10 13 12.5 north
8 Li 11 16 15.5 north
9 Lo 11 17 19 north
10 Ma 9 15.5 10.5 north
11 Mo 6.5 3.5 2.5 south
12 Pi 13 15.5 17 north
13 Pu 5.5 3.5 4 south
14 Sa 5.5 8.5 8.5 south
15 Si 4.5 3.5 5.5 south
16 To 13 17.5 14.5 north
17 Tr 11 18 12.5 north
18 Um 15 15.5 11 north
19 Va 10 15 15 north
20 Ve 11 15 13.5 north
Testing for Differences in Government Performance Between North and South Regions
Drawing a box plot:
df1 %>%
  ggplot(aes(x = location, y = gov_p, fill = location)) +
  geom_boxplot() +
  labs(x = "North-South Italy (location)", y = "Government Performance (gov_p)",
       title = "Government Performance by Region in Italy") +
  # No stat_smooth() here: a regression line is not meaningful on a categorical x-axis
  theme_bw(base_family = "HiraKakuProN-W3") # macOS Japanese font
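The northern and southern regions are two unpaired groups, so we conduct an unpaired two-sample t-test.
In R, t.test() does not assume equal variances, so it runs Welch's t-test by default.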
Welch Two Sample t-test
data: df1$gov_p[df1$location == "north"] and df1$gov_p[df1$location == "south"]
t = 6.8253, df = 14.552, p-value = 6.737e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.607777 8.808890
sample estimates:
mean of x mean of y
11.83333 5.12500
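The test above can be reproduced with a short base-R sketch. It assumes the gov_p and location columns are re-typed by hand from the tibble printed earlier (the original file is not shown in this section), and the object names north, south and res are only illustrative.

```r
# gov_p and location re-typed from the 20-row tibble printed above
df1 <- data.frame(
  gov_p = c(7.5, 7.5, 1.5, 2.5, 16, 12, 10, 11, 11, 9,
            6.5, 13, 5.5, 5.5, 4.5, 13, 11, 15, 10, 11),
  location = c("south", "south", "south", "south", "north",
               "north", "north", "north", "north", "north",
               "south", "north", "south", "south", "south",
               "north", "north", "north", "north", "north")
)

north <- df1$gov_p[df1$location == "north"]
south <- df1$gov_p[df1$location == "south"]

# Welch's test is the default; var.equal = TRUE would give the pooled-variance test
res <- t.test(north, south)

res$estimate                 # group means (north, south)
mean(north) - mean(south)    # difference in means, North minus South
res$p.value
```

Because the first argument is the northern group, the reported difference and confidence interval are North minus South.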
Question 1 Result
・Average gov_p in the North: 11.833
・Average gov_p in the South: 5.125
・The difference of 6.708 points (North minus South) is statistically significant at the 1% level (p-value = 6.737e-06)
→ As Goldberg (1996) argues, there is a regional difference in government performance between the North and South
Question 2: Does a higher degree of economic modernization (econ) correlate with higher government performance (gov_p) within both the Northern and Southern regions?
Draw a scatter plot of econ and gov_p:
df1 %>%
  ggplot(aes(econ, gov_p)) +
  geom_point() +
  labs(x = "Economic Conditions (econ)", y = "Government Performance (gov_p)",
       title = "Government Performance and Economic Conditions") +
  stat_smooth(method = lm, se = FALSE) + # se = FALSE → removes 95% confidence interval
  theme_bw(base_family = "HiraKakuProN-W3")
- There is a positive correlation between the degree of economic modernization (econ) and government performance (gov_p).
- Regions with higher economic indicators (econ) tend to have higher government performance (gov_p).
Derive the regression equation.
Call:
lm(formula = gov_p ~ econ, data = df1)
Residuals:
Min 1Q Median 3Q Max
-4.3386 -1.7733 0.0086 0.8336 5.5114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0108 1.3847 2.174 0.043264 *
econ 0.5889 0.1200 4.909 0.000113 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.659 on 18 degrees of freedom
Multiple R-squared: 0.5724, Adjusted R-squared: 0.5487
F-statistic: 24.1 on 1 and 18 DF, p-value: 0.0001131
\[\widehat{gov_p}\ = 3.01 + 0.589 \cdot econ\]
Check the types of variables included in df1.
spc_tbl_ [20 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ region : chr [1:20] "Ab" "Ba" "Cl" "Cm" ...
$ gov_p : num [1:20] 7.5 7.5 1.5 2.5 16 12 10 11 11 9 ...
$ cc : num [1:20] 8 4 1 2 18 17 13 16 17 15.5 ...
$ econ : num [1:20] 7 3 3 6.5 13 14.5 12.5 15.5 19 10.5 ...
$ location: chr [1:20] "south" "south" "south" "south" ...
- attr(*, "spec")=
.. cols(
.. region = col_character(),
.. gov_p = col_double(),
.. cc = col_double(),
.. econ = col_double(),
.. location = col_character()
.. )
- attr(*, "problems")=<externalptr>
The location variable in df1 is of type character, so convert it to a 0/1 dummy variable (data type numeric) and save the result as df2.
df2 <- mutate(df1, location = as.numeric(location == "north"))
# Convert to north = 1, south = 0
head(df2) # Display part of the transformed data
# A tibble: 6 × 5
region gov_p cc econ location
<chr> <dbl> <dbl> <dbl> <dbl>
1 Ab 7.5 8 7 0
2 Ba 7.5 4 3 0
3 Cl 1.5 1 3 0
4 Cm 2.5 2 6.5 0
5 Em 16 18 13 1
6 Fr 12 17 14.5 1
→ A multiple regression analysis is needed by including both econ and the location dummy in the model
Call:
lm(formula = gov_p ~ econ + location, data = df2)
Residuals:
Min 1Q Median 3Q Max
-3.6638 -1.1011 -0.2199 1.2497 4.1464
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.22207 1.34660 3.878 0.00121 **
econ -0.01941 0.22037 -0.088 0.93083
location 6.88386 2.22907 3.088 0.00667 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.19 on 17 degrees of freedom
Multiple R-squared: 0.7261, Adjusted R-squared: 0.6939
F-statistic: 22.53 on 2 and 17 DF, p-value: 1.659e-05
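The contrast between the simple and the multiple regression can be checked side by side. The sketch below assumes the 20 rows of the tibble printed earlier are re-typed by hand, with location already coded north = 1 and south = 0; the names m_simple and m_multi are illustrative and do not come from the original notes.

```r
# gov_p, econ and the 0/1 location dummy re-typed from the printed tibble
df2 <- data.frame(
  gov_p = c(7.5, 7.5, 1.5, 2.5, 16, 12, 10, 11, 11, 9,
            6.5, 13, 5.5, 5.5, 4.5, 13, 11, 15, 10, 11),
  econ = c(7, 3, 3, 6.5, 13, 14.5, 12.5, 15.5, 19, 10.5,
           2.5, 17, 4, 8.5, 5.5, 14.5, 12.5, 11, 15, 13.5),
  location = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
               0, 1, 0, 0, 0, 1, 1, 1, 1, 1)
)

m_simple <- lm(gov_p ~ econ, data = df2)             # econ only
m_multi  <- lm(gov_p ~ econ + location, data = df2)  # econ plus the North dummy

round(coef(m_simple), 3)
round(coef(m_multi), 3)

# Predicted gov_p at the same econ value for a southern (0) and a northern (1) region
predict(m_multi, newdata = data.frame(econ = 10, location = c(0, 1)))
```

Comparing the two coefficient vectors shows how the econ slope shrinks toward zero once location enters the model, while the two predictions differ by exactly the location coefficient.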
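The following sample regression function (SRF) equation is obtained:
\[\widehat{gov_p}\ = 5.22 - 0.019 \cdot econ + 6.88 \cdot location\]
- Economic conditions (econ) do not have a significant impact on gov_p.
- Being located in the North (location = 1) raises gov_p by 6.88 points, which is significant at the 1% level.

For the South (location = 0):
\[\widehat{gov_p}\ = 5.22 - 0.019 \cdot econ\]
For the North (location = 1):
\[\widehat{gov_p}\ = 12.11 - 0.019 \cdot econ\]
- In the simple regression: as the economic indicator (econ) increases, so does government performance (gov_p).
- In the multiple regression: the economic indicator (econ) does not affect government performance (gov_p).
- The location dummy is statistically significant at the 1% level (p-value = 0.00667).
- Government performance (gov_p) is 6.88 points higher in the Northern region than in the Southern region.
- When controlling for the North-South difference (location): the effect of the economic indicator (econ) on government performance (gov_p) disappears.
- The correlation between econ and gov_p is a spurious correlation.

Question 2 Result
・When considering the North-South region dummy (location), the degree of economic modernization within the North and South regions does not explain government performance.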
Question 3: Does the claim by Putnam (1994) that social capital (cc) influences government performance (gov_p) hold true even when considering regional differences (location)?
Draw a scatter plot of cc and gov_p.
To display Japanese characters in ggplot, Mac users should include the theme_bw(base_family = "HiraKakuProN-W3") line, as in the following code:
df1 %>%
ggplot(aes(cc, gov_p)) +
geom_point() +
labs(x = "Civic Community Index", y = "Government Performance",
title = "Government Performance and Civic Community Index") +
stat_smooth(method = lm, se = FALSE) + # se = FALSE → removes 95% confidence interval
theme_bw(base_family = "HiraKakuProN-W3")
- There is a positive correlation between cc and gov_p.
- Regions with a higher Civic Community Index (cc) tend to have higher government performance (gov_p).
Call:
lm(formula = gov_p ~ cc, data = df2)
Residuals:
Min 1Q Median 3Q Max
-2.5043 -1.3481 -0.2087 0.9764 3.4957
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.71115 0.84443 3.211 0.00485 **
cc 0.56730 0.06552 8.658 7.81e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.789 on 18 degrees of freedom
Multiple R-squared: 0.8064, Adjusted R-squared: 0.7956
F-statistic: 74.97 on 1 and 18 DF, p-value: 7.806e-08
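\[\widehat{gov_p}\ = 2.711 + 0.567 \cdot cc\]
- For each one-point increase in the Civic Community Index (cc), government performance (gov_p) increases by 0.567 points.
- This coefficient is statistically significant at the 1% level (p-value is 7.81e-08).

→ A multiple regression analysis is needed by including both cc and the location dummy in the model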
Call:
lm(formula = gov_p ~ cc + location, data = df2)
Residuals:
Min 1Q Median 3Q Max
-2.5003 -1.3445 -0.2058 0.9773 3.4997
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.69850 1.12131 2.407 0.0278 *
cc 0.57094 0.21485 2.657 0.0166 *
location -0.04781 2.67759 -0.018 0.9860
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.841 on 17 degrees of freedom
Multiple R-squared: 0.8064, Adjusted R-squared: 0.7836
F-statistic: 35.4 on 2 and 17 DF, p-value: 8.689e-07
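One way to see that location adds nothing once cc is in the model is to compare the two nested models directly. This is a sketch under the assumption that the printed tibble is re-typed by hand (location coded north = 1, south = 0); m_cc and m_cc_loc are illustrative names, and anova() is used here as an additional check that the original notes do not show.

```r
# gov_p, cc and the 0/1 location dummy re-typed from the printed tibble
df2 <- data.frame(
  gov_p = c(7.5, 7.5, 1.5, 2.5, 16, 12, 10, 11, 11, 9,
            6.5, 13, 5.5, 5.5, 4.5, 13, 11, 15, 10, 11),
  cc = c(8, 4, 1, 2, 18, 17, 13, 16, 17, 15.5,
         3.5, 15.5, 3.5, 8.5, 3.5, 17.5, 18, 15.5, 15, 15),
  location = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
               0, 1, 0, 0, 0, 1, 1, 1, 1, 1)
)

m_cc     <- lm(gov_p ~ cc, data = df2)
m_cc_loc <- lm(gov_p ~ cc + location, data = df2)

# F-test for adding location; with one added term its p-value
# equals the p-value of the t-test on the location coefficient
p_added <- anova(m_cc, m_cc_loc)[["Pr(>F)"]][2]
p_added

# Intercept of the regression line for the North (location = 1)
north_intercept <- sum(coef(m_cc_loc)[c("(Intercept)", "location")])
north_intercept
```

The large p-value for the added term and the nearly identical intercepts for North and South both indicate that the two regional regression lines almost coincide.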
\[\widehat{gov_p}\ = 2.7 + 0.57 \cdot cc - 0.048 \cdot location\]
- Social capital (cc) has an impact on gov_p that is significant at the 5% level.
- The location dummy does not have a significant impact on gov_p.

For the South (location = 0):
\[\widehat{gov_p}\ = 2.7 + 0.57 \cdot cc\]
For the North (location = 1):
\[\widehat{gov_p}\ = 2.65 + 0.57 \cdot cc\]
- In the South: as social capital (cc) increases, government performance (gov_p) also increases.
- In the North: as social capital (cc) increases, government performance (gov_p) also increases.
- location is not statistically significant at the 5% level (p-value = 0.9860).
- cc is statistically significant at the 5% level (p-value = 0.0166).
- Even when controlling for the North-South difference (location), social capital (cc) still has an impact on government performance (gov_p).

Question 3 Result
・In regions with the same level of social capital (cc), there is no statistically significant difference in the level of government performance (gov_p), whether in the North or South.
・Government performance (gov_p) is explained by “social capital” (cc) rather than by “regional differences” (location).
Create a folder named data inside your RProject folder, and manually place the downloaded hr96-21.csv file into the data folder within the RProject directory.
The option na = "." means “treat a period (dot) in the file as a missing value (NA).”
If the dots are not declared as missing values, data originally of “numeric” type can be recognized as “character” type, which can lead to errors.
Therefore, it’s important to handle missing values at the time of reading to avoid such problems.
hr96_21.csv contains data for 9 House of Representatives elections that have been conducted since the introduction of single-member districts in 1996 (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021) in Japan.
 [1] "year" "pref" "ku" "kun"
 [5] "wl" "rank" "nocand" "seito"
 [9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
・hr96_21.csv contains the following variables (the data as loaded has 22 columns; castvote is described here but is not among the loaded columns):
variable | detail |
---|---|
year | Election year (1996-2021) |
pref | Prefecture |
ku | Electoral district name |
kun | Number of electoral district |
rank | Rank by votes received within the district (1 = most votes) |
wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
nocand | Number of candidates in each district |
seito | Candidate’s affiliated party (in Japanese) |
j_name | Candidate’s name (Japanese) |
name | Candidate’s name (English) |
previous | Previous wins |
gender | Candidate’s gender:“male”, “female” |
age | Candidate’s age |
exp | Election expenditure (yen) spent by each candidate |
status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
vote | votes each candidate garnered |
voteshare | Vote share (%) |
eligible | Eligible voters in each district |
turnout | Turnout in each district (%) |
castvote | Total votes cast in each district |
seshu_dummy | 0 = Not-hereditary candidates, 1 = hereditary candidate |
jiban_seshu | Relationship between candidate and his predecessor |
nojiban_seshu | Relationship between candidate and his predecessor |
spc_tbl_ [9,660 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ year : num [1:9660] 1996 1996 1996 1996 1996 ...
$ pref : chr [1:9660] "愛知" "愛知" "愛知" "愛知" ...
$ ku : chr [1:9660] "aichi" "aichi" "aichi" "aichi" ...
$ kun : num [1:9660] 1 1 1 1 1 1 1 2 2 2 ...
$ wl : num [1:9660] 1 0 0 0 0 0 0 1 0 2 ...
$ rank : num [1:9660] 1 2 3 4 5 6 7 1 2 3 ...
$ nocand : num [1:9660] 7 7 7 7 7 7 7 8 8 8 ...
$ seito : chr [1:9660] "新進" "自民" "民主" "共産" ...
$ j_name : chr [1:9660] "河村たかし" "今枝敬雄" "佐藤泰介" "岩中美保子" ...
$ gender : chr [1:9660] "male" "male" "male" "female" ...
$ name : chr [1:9660] "KAWAMURA, TAKASHI" "IMAEDA, NORIO" "SATO, TAISUKE" "IWANAKA, MIHOKO" ...
$ previous : num [1:9660] 2 2 2 0 0 0 0 2 0 0 ...
$ age : num [1:9660] 47 72 53 43 51 51 45 51 71 30 ...
$ exp : num [1:9660] 9828097 9311555 9231284 2177203 NA ...
$ status : num [1:9660] 1 2 1 0 0 0 0 1 2 0 ...
$ vote : num [1:9660] 66876 42969 33503 22209 616 ...
$ voteshare : num [1:9660] 40 25.7 20.1 13.3 0.4 0.3 0.2 32.9 26.4 25.7 ...
$ eligible : num [1:9660] 346774 346774 346774 346774 346774 ...
$ turnout : num [1:9660] 49.2 49.2 49.2 49.2 49.2 49.2 49.2 51.8 51.8 51.8 ...
$ seshu_dummy : num [1:9660] 0 0 0 0 0 0 0 0 1 0 ...
$ jiban_seshu : chr [1:9660] NA NA NA NA ...
$ nojiban_seshu: chr [1:9660] NA NA NA NA ...
- attr(*, "spec")=
.. cols(
.. year = col_double(),
.. pref = col_character(),
.. ku = col_character(),
.. kun = col_double(),
.. wl = col_double(),
.. rank = col_double(),
.. nocand = col_double(),
.. seito = col_character(),
.. j_name = col_character(),
.. gender = col_character(),
.. name = col_character(),
.. previous = col_double(),
.. age = col_double(),
.. exp = col_double(),
.. status = col_double(),
.. vote = col_double(),
.. voteshare = col_double(),
.. eligible = col_double(),
.. turnout = col_double(),
.. seshu_dummy = col_double(),
.. jiban_seshu = col_character(),
.. nojiban_seshu = col_character()
.. )
- attr(*, "problems")=<externalptr>
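The effect of declaring the missing-value marker when reading a file can be seen on a tiny example. The sketch below uses base R's read.csv() with na.strings = ".", which plays the same role as the na = "." option described earlier; the three CSV rows are made up for illustration and are not taken from the election file.

```r
# A tiny CSV in which "." marks a missing value (made-up rows)
csv_text <- "name,exp\nA,100\nB,.\nC,250"

# Without declaring "." as missing, the exp column is not read as numeric
raw <- read.csv(text = csv_text)
is.numeric(raw$exp)

# Declaring "." as missing keeps exp numeric, with NA in the second row
clean <- read.csv(text = csv_text, na.strings = ".")
is.numeric(clean$exp)
clean$exp
```

This is why exp appears as num with NA entries in the structure printed above rather than as chr.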
Numeric values are recognized as numeric, while character values are recognized as character.
Next, create the two variables (exppv and ldp) required for analysis.
Check the types of exp and eligible:
 num [1:9660] 9828097 9311555 9231284 2177203 NA ...
 num [1:9660] 346774 346774 346774 346774 346774 ...
Both are numeric, which is fine.
Create election expenses per voter (exppv) using exp and eligible:
hr <- hr %>%
  dplyr::mutate(exppv = exp/eligible) # eligible represents the number of eligible voters per single-member district
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0013 8.1762 18.7646 23.0907 33.3863 120.8519 2831
[1] "新進" "自民" "民主"
[4] "共産" "文化フォーラム" "国民党"
[7] "無所" "自由連合" "政事公団太平会"
[10] "新社会" "社民" "新党さきがけ"
[13] "沖縄社会大衆党" "市民新党にいがた" "緑の党"
[16] "さわやか神戸・市民の会" "民主改革連合" "青年自由"
[19] "日本新進" "公明" "諸派"
[22] "保守" "無所属の会" "自由"
[25] "改革クラブ" "保守新" "ニューディールの会"
[28] "新党尊命" "世界経済共同体党" "新党日本"
[31] "国民新党" "新党大地" "幸福"
[34] "みんな" "改革" "日本未来"
[37] "日本維新の会" "当たり前" "政治団体代表"
[40] "安楽死党" "アイヌ民族党" "次世"
[43] "維新" "生活" "立憲"
[46] "希望" "緒派" ""
[49] "N党" "国民" "れい"
Create the LDP dummy variable ldp in hr, then check the variable names in hr:
 [1] "year" "pref" "ku" "kun"
 [5] "wl" "rank" "nocand" "seito"
 [9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu" "exppv" "ldp"
We can confirm that ldp has been created.
hr contains the following 24 variables.
variable | detail |
---|---|
year | Election year (1996-2021) |
pref | Prefecture |
ku | Electoral district name |
kun | Number of electoral district |
rank | Rank by votes received within the district (1 = most votes) |
wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
nocand | Number of candidates in each district |
seito | Candidate’s affiliated party (in Japanese) |
j_name | Candidate’s name (Japanese) |
name | Candidate’s name (English) |
previous | Previous wins |
gender | Candidate’s gender:“male”, “female” |
age | Candidate’s age |
exp | Election expenditure (yen) spent by each candidate |
status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
vote | votes each candidate garnered |
voteshare | Vote share (%) |
eligible | Eligible voters in each district |
turnout | Turnout in each district (%) |
castvote | Total votes cast in each district |
seshu_dummy | 0 = Not-hereditary candidates, 1 = hereditary candidate |
jiban_seshu | Relationship between candidate and his predecessor |
nojiban_seshu | Relationship between candidate and his predecessor |
exppv | Election Expenses Per Voter (in Yen) |
ldp | LDP Dummy: 1 for LDP candidates, 0 for others |
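The two derived variables at the bottom of the table can be sketched on a few toy rows. The values below are made up, and the rule used for ldp (seito equal to "自民", the label that appears in the party list above) is an assumption, since the original code for that step is not shown here.

```r
# Toy rows only: made-up values, column names follow hr
hr_toy <- data.frame(
  seito    = c("自民", "民主", "共産", "自民"),
  exp      = c(9000000, 6000000, NA, 3000000),
  eligible = c(300000, 300000, 300000, 400000)
)

# Election expenses per eligible voter (yen); NA in exp propagates to exppv
hr_toy$exppv <- hr_toy$exp / hr_toy$eligible

# LDP dummy: 1 if the party label is "自民", 0 otherwise (assumed rule)
hr_toy$ldp <- as.numeric(hr_toy$seito == "自民")

hr_toy
```

The NA in the third row illustrates why summary(hr$exppv) reports missing values: a candidate with no recorded expenditure has no exppv either.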
We extract the 2005 House of Representatives election data from hr and perform multiple regression analysis using the following three variables (voteshare, exppv, ldp).
variables | detail |
---|---|
Response Variable | voteshare |
Explanatory Variable | exppv |
Control Variable | ldp |
hr05 <- hr %>%
dplyr::filter(year == 2005) %>% # Select only data for the year 2005
dplyr::select(voteshare, exppv, ldp) # Select only 3 variables
voteshare exppv ldp
Min. : 0.60 Min. : 0.148 Min. :0.0000
1st Qu.: 8.80 1st Qu.: 8.352 1st Qu.:0.0000
Median :34.80 Median :22.837 Median :0.0000
Mean :30.33 Mean :24.627 Mean :0.2932
3rd Qu.:46.60 3rd Qu.:35.269 3rd Qu.:1.0000
Max. :73.60 Max. :89.332 Max. :1.0000
NA's :4
Descriptive statistics can also be displayed with stargazer():
==========================================
Statistic N Mean St. Dev. Min Max
------------------------------------------
voteshare 989 30.333 19.230 0.600 73.600
exppv 985 24.627 17.907 0.148 89.332
ldp 989 0.293 0.455 0 1
------------------------------------------
To display the table in R Markdown, specify type = "html" and use the chunk option results = "asis" to render the HTML:
Statistic | N | Mean | St. Dev. | Min | Max |
---|---|---|---|---|---|
voteshare | 989 | 30.333 | 19.230 | 0.600 | 73.600 |
exppv | 985 | 24.627 | 17.907 | 0.148 | 89.332 |
ldp | 989 | 0.293 | 0.455 | 0 | 1 |
Draw a scatter plot of exppv and voteshare:
hr05 %>%
ggplot(aes(exppv, voteshare)) +
geom_point() +
labs(x = "Election Expenses Per Voter (Yen)", y = "Vote Share (%)",
title = "Candidate Vote Share and Election Expenses") +
theme_bw(base_family = "HiraKakuProN-W3") +
geom_smooth(method = lm, se = FALSE) # se = FALSE → Remove 95% confidence interval
Draw a scatter plot of ldp and voteshare:
hr05 %>%
  ggplot(aes(ldp, voteshare)) +
  labs(x = "0 = Non-LDP, 1 = LDP", y = "Vote Share (%)",
       title = "Candidate Vote Share and LDP Affiliation") +
  geom_jitter(width = 0.02) + # Jittered points only (geom_point() would draw every point a second time)
  theme_bw(base_family = "HiraKakuProN-W3") +
  geom_smooth(method = lm, se = FALSE) # se = FALSE → Remove 95% confidence interval
To display the regression table in R Markdown, specify type = "html" and use the chunk option results = "asis":
Dependent variable: voteshare | (1) | (2) |
---|---|---|
exppv | 0.767*** (0.024) | 0.565*** (0.024) |
ldp | | 15.852*** (0.961) |
Constant | 11.453*** (0.727) | 11.779*** (0.644) |
Observations | 985 | 985 |
R2 | 0.512 | 0.618 |
Adjusted R2 | 0.512 | 0.617 |
Residual Std. Error | 13.422 (df = 983) | 11.883 (df = 982) |
F Statistic | 1,031.634*** (df = 1; 983) | 794.203*** (df = 2; 982) |

Note: *p<0.1; **p<0.05; ***p<0.01
\[\widehat{voteshare}\ = 11.78 + 0.57\cdot exppv + 15.85\cdot ldp\]
A one-yen increase in exppv (election expenses per voter) is associated with an increase of 0.57 percentage points in the candidate’s vote share, which is statistically significant at the 1% level.
If the candidate is not an LDP candidate (ldp = 0),
\[\widehat{voteshare}\ = 11.78 + 0.57\cdot exppv\]
If the candidate is an LDP (Liberal Democratic Party) candidate (ldp = 1),
\[\widehat{voteshare}\ = 27.63 + 0.57 \cdot exppv\]
## Create a data frame for predicted values and name it 'pred'
pred <- with(hr05, expand.grid(
exppv = seq(min(exppv, na.rm=TRUE), max(exppv, na.rm=TRUE), length = 100),
ldp = c(0,1)
))
## Use mutate to create and calculate the new variable, 'voteshare' (predictions)
pred <- mutate(pred, voteshare = predict(model_6, newdata = pred))
## Draw dots (points) for observed values (hr05) and lines for regression lines (pred)
p3 <- ggplot(hr05, aes(x = exppv, y = voteshare, color = as.factor(ldp))) +
geom_point(size = 1) + geom_line(data = pred)
p3 <- p3 + labs(x = "Election Expenses Per Voter", y = "Vote Share (%)")
p3 <- p3 + scale_color_discrete(name = "Party Affiliation",
labels = c("Non-LDP","LDP")) + guides(color = guide_legend(reverse = TRUE)) +
theme_bw(base_family = "HiraKakuProN-W3")
print(p3 + ggtitle("Relationship Between Vote Share and Election Expenses (By Candidate's Party Affiliation)"))
Step 1: Simple linear regression with continuous and dummy variables (voteshare, ldp).
Step 2: Simple linear regression with two continuous variables (voteshare, exppv).
Step 3: Multiple regression analysis with dummy variable (voteshare, exppv, ldp).
Using hr05, let’s consider a regression analysis with candidate’s vote share (voteshare) as the response variable and whether the candidate is endorsed by the Liberal Democratic Party (ldp) as the explanatory variable.
- voteshare is a continuous variable, and ldp is a categorical variable.
- Draw a scatter plot with voteshare on the vertical axis and ldp on the horizontal axis (on the x-axis).

ggplot(hr05, aes(ldp, voteshare)) +
  labs(x = "0 = Non-LDP, 1 = LDP", y = "Vote Share (%)",
       title = "Candidate Vote Share and LDP Affiliation") +
  geom_jitter(width = 0.02) + # Jittered points only (geom_point() would draw every point a second time)
  geom_smooth(method = lm, se = FALSE) + # se = FALSE → Remove 95% confidence interval
  theme_bw(base_family = "HiraKakuProN-W3")
The model to be estimated is:
\[voteshare_i \sim \mathrm{N}(\alpha_0 + \alpha_1 \cdot ldp_i, \sigma^2)\]
Estimate this model with the lm() function and name it model_7.
Display the results with summary().
Call:
lm(formula = voteshare ~ ldp, data = hr05)
Residuals:
Min 1Q Median 3Q Max
-30.223 -14.313 -0.423 12.387 46.187
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.4132 0.5593 40.07 <2e-16 ***
ldp 27.0096 1.0329 26.15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.79 on 987 degrees of freedom
Multiple R-squared: 0.4093, Adjusted R-squared: 0.4087
F-statistic: 683.8 on 1 and 987 DF, p-value: < 2.2e-16
\[\widehat{voteshare}\ = 22.41 + 27.01 \cdot ldp\]
- Null hypothesis: the coefficient of ldp is 0.
- The Pr(>|t|) value of “<2e-16” is the p-value.
- Since the p-value is smaller than 0.01, the null hypothesis is rejected at the significance level of α = 0.01 (1%).
- Being an LDP candidate (ldp = 1) is associated with a 27 percentage point increase in vote share.
- The adjusted R-squared is 0.4087, indicating that 40.87% of the variance in the response variable (voteshare) can be explained by ldp.
- Signif. codes indicate the significance level, and the number of asterisks to the right of the p-value indicates the level of significance at which the null hypothesis is rejected. Three asterisks indicate significance at α = 0.001 (0.1%), two asterisks at α = 0.01 (1%), and one asterisk at α = 0.05 (5%).
- The stargazer() function can be used to display the results in a more readable format.
- In R Markdown, specify type = "html" and use the chunk option results = "asis".
stargazer(model_7,
type = "html",
dep.var.labels = "Voteshare (%)", # Response variable name
title = "Table 1: LDP Affiliation and Voteshare in the 2005HR Election") # Title
Dependent variable: | Voteshare (%) |
---|---|
ldp | 27.010*** (1.033) |
Constant | 22.413*** (0.559) |
Observations | 989 |
R2 | 0.409 |
Adjusted R2 | 0.409 |
Residual Std. Error | 14.788 (df = 987) |
F Statistic | 683.759*** (df = 1; 987) |

Note: *p<0.1; **p<0.05; ***p<0.01
Substituting ldp = 0 into the equation, we obtain the average vote share (predicted vote share) for non-LDP candidates:
\[22.41 + 27.01 \cdot 0 = 22.41\]
Substituting ldp = 1 into the prediction equation, we obtain the average vote share (predicted vote share) for LDP candidates:
\[22.41 + 27.01 \cdot 1 = 49.42\]
- Let’s calculate the average vote share of each group of candidates in R and verify that they match the two predicted values obtained earlier.
- Average vote share (predicted vote share) for non-LDP candidates (ldp = 0):

hr05 %>%
  filter(ldp == 0) %>% # Use filter to limit the data to ldp = 0 only
  with(mean(voteshare)) %>% # Use with to calculate the mean of voteshare
  round(2) # Display up to 2 decimal places
[1] 22.41
- Average vote share (predicted vote share) for LDP candidates (ldp = 1):

hr05 %>%
  filter(ldp == 1) %>% # Use filter to limit the data to ldp = 1 only
  with(mean(voteshare)) %>% # Use with to calculate the mean of voteshare
  round(2) # Display up to 2 decimal places
[1] 49.42
The difference between the two group means can also be tested with a t-test.
Welch Two Sample t-test
data: hr05$voteshare[hr05$ldp == 1] and hr05$voteshare[hr05$ldp == 0]
t = 32.446, df = 897.39, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
25.37584 28.64335
sample estimates:
mean of x mean of y
49.42276 22.41316
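The link between a regression on a 0/1 dummy and the two group means can be checked on simulated data. The sketch below does not use hr05; the data frame toy and its columns d and y are made up for illustration. It also shows that the t value reported by lm() corresponds to the pooled-variance t-test, which is why it need not equal the Welch statistic printed by t.test() by default.

```r
set.seed(1)

# Simulated data: d is a 0/1 dummy, y a continuous outcome
toy <- data.frame(d = rep(c(0, 1), times = c(60, 40)))
toy$y <- 20 + 25 * toy$d + rnorm(nrow(toy), sd = 10)

fit <- lm(y ~ d, data = toy)

m0 <- mean(toy$y[toy$d == 0])  # mean of the d = 0 group
m1 <- mean(toy$y[toy$d == 1])  # mean of the d = 1 group

coef(fit)        # intercept = m0, slope = m1 - m0
c(m0, m1 - m0)

# t value for d from lm() versus the pooled-variance two-sample t-test
t_lm <- summary(fit)$coefficients["d", "t value"]
t_pooled <- t.test(toy$y[toy$d == 1], toy$y[toy$d == 0],
                   var.equal = TRUE)$statistic
c(t_lm, unname(t_pooled))
```

The intercept reproduces the mean of the reference group and the slope reproduces the difference in means, exactly as in the 22.41 and 27.01 of model_7.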
Using hr05, let’s consider a regression analysis with candidate’s vote share (voteshare) as the response variable and campaign expenses per voter (exppv: in yen) as the explanatory variable.
- Both voteshare and exppv are continuous variables.
- Draw a scatter plot with voteshare on the vertical axis and exppv on the horizontal axis to visualize the correlation between these two variables.

ggplot(hr05, aes(exppv, voteshare)) +
geom_point() +
labs(x = "Campaign Expenses per Voter (yen)", y = "Vote Share (%)",
title = "Candidate's Vote Share and Campaign Expenses") +
theme_bw(base_family = "HiraKakuProN-W3") +
geom_smooth(method = lm, se = FALSE) # se = FALSE to remove 95% confidence intervals
Call:
lm(formula = voteshare ~ exppv, data = hr05)
Residuals:
Min 1Q Median 3Q Max
-43.596 -9.661 -3.493 9.909 42.851
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.45334 0.72744 15.74 <2e-16 ***
exppv 0.76745 0.02389 32.12 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.42 on 983 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.5121, Adjusted R-squared: 0.5116
F-statistic: 1032 on 1 and 983 DF, p-value: < 2.2e-16
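The line "4 observations deleted due to missingness" reflects how lm() handles NA values: by default, rows with a missing value in any model variable are dropped before fitting. A minimal sketch on simulated data (the data frame toy is made up and unrelated to hr05):

```r
set.seed(2)

# Simulated data with two missing values in the explanatory variable
toy <- data.frame(x = rnorm(22), y = rnorm(22))
toy$x[c(3, 10)] <- NA

fit <- lm(y ~ x, data = toy)   # default na.action drops incomplete rows

nrow(toy)                      # rows in the data
sum(!complete.cases(toy))      # rows with at least one NA
nobs(fit)                      # rows actually used in the fit
```

The same logic explains why the models that include exppv are fitted on 985 observations while the model with ldp alone uses all 989.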
- The F-test result is shown at the bottom; its p-value is smaller than 2.2e-16, indicating that the F-test is significant at the 1% level.
- The “<2e-16” under “Pr(>|t|)” is the p-value.
- Since the p-value is smaller than 0.01, the null hypothesis (that the coefficient of exppv is 0) is rejected at the significance level of α = 0.01 (1%).
- The adjusted R-squared is 0.5116, indicating that 51.16% of the variance in the response variable (voteshare) can be explained by exppv.
- Use the stargazer() function to display the results in an easily readable format.
- When specifying type = "html", use the chunk option results = "asis".

Point: Did being a Liberal Democratic Party (LDP) candidate affect a candidate’s vote share?
Consider a multiple regression (model_9) with voteshare (vote share) as the response variable and two explanatory variables, ldp (LDP candidate) and exppv (campaign expenses per eligible voter).
Assumption of model_9:
“Even if being an LDP candidate affects vote share, the impact (size of the slope) of campaign expenses on vote share is the same.”
Call:
lm(formula = voteshare ~ ldp + exppv, data = hr05)
Residuals:
Min 1Q Median 3Q Max
-42.990 -8.828 -2.427 9.284 46.290
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.77885 0.64431 18.28 <2e-16 ***
ldp 15.85224 0.96087 16.50 <2e-16 ***
exppv 0.56538 0.02444 23.13 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.88 on 982 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.618, Adjusted R-squared: 0.6172
F-statistic: 794.2 on 2 and 982 DF, p-value: < 2.2e-16
The regression equation for model_9 is as follows:
\[\widehat{voteshare}\ = 11.78 + 15.85 \cdot ldp\ + 0.565 \cdot exppv\]
The F-test result is shown at the bottom, and the p-value is smaller than 2.2e-16, indicating that the F-test is significant at the 1% level.
Null hypotheses are as follows:
\(H_0\): “The coefficient of the variable ldp (whether the candidate is an LDP candidate) is 0.”
\(H_0\): “The coefficient of the variable exppv (campaign expenses per eligible voter) is 0.”
The “<2e-16” under “Pr(>|t|)” is the p-value.
→ Both p-values for these variables are smaller than 0.01, so the null hypotheses are rejected at the significance level of α = 0.01 (1%).
→ When exppv is held constant, being an LDP candidate (ldp = 1) is associated with a vote share that is 15.85 percentage points higher.
→ When ldp is held constant, spending one additional yen on campaign expenses per eligible voter (exppv) is associated with a vote share that is 0.565 percentage points higher.
The coefficient of the intercept is 11.78, so when both exppv and ldp are 0, the predicted vote share is 11.78%.
The adjusted R-squared value is 0.6172, indicating that 61.72% of the variance in the response variable (voteshare) can be explained by exppv and ldp.
What does the coefficient of ldp (15.85) mean?
→ “If the ldp dummy variable increases by 1 unit, the vote share increases by 15.85 percentage points.”
→ “LDP-endorsed candidates have a vote share that is 15.85 percentage points higher than non-LDP candidates with the same campaign expenses per voter.”
When visualizing the results of model_9, it looks like this:
When visualizing the results of model_9, it looks like this:
## Create a data frame for predicted values and name it 'pred'
pred <- with(hr05, expand.grid(
exppv = seq(min(exppv, na.rm=TRUE), max(exppv, na.rm=TRUE), length = 100),
ldp = c(0,1)
))
## Use 'mutate' to create a new variable 'voteshare' by calculating predictions
pred <- mutate(pred, voteshare = predict(model_9, newdata = pred))
## Scatterplot points (data) are plotted with observed values (hr05), and regression lines are plotted with predicted values (pred)
p9 <- ggplot(hr05, aes(x = exppv, y = voteshare, color = as.factor(ldp))) +
geom_point(size = 1) + geom_line(data = pred)
p9 <- p9 + labs(x = "Campaign Expenses Per Eligible Voter (¥)", y = "Vote Share (%)")
p9 <- p9 + scale_color_discrete(name = "LDP Candidate",
labels = c("No","Yes")) + guides(color = guide_legend(reverse = TRUE)) +
theme_bw(base_family = "HiraKakuProN-W3")
print(p9 + ggtitle("Campaign Expenses and Vote Share (By LDP Affiliation)"))
Regression Equation for model_9
\[\widehat{voteshare} = 11.78 + 15.85 \cdot ldp\ + 0.565 \cdot exppv\]
Substituting the value for non-LDP candidates (ldp = 0) into the model_9 regression equation, we obtain the tomato-colored regression equation.
\[\widehat{voteshare}\ = 11.78 + 0.565 \cdot exppv\]
Substituting the value for LDP candidates (ldp = 1) into the model_9 regression equation, we obtain the blue-colored regression equation.
\[\widehat{voteshare}\ = 27.63 + 0.565 \cdot exppv\]
The results of the three models can be displayed together using stargazer().
In R Markdown, specify type = "html" and set the chunk option results = "asis".
stargazer(model_7, model_8, model_9,
type = "html",
dep.var.labels = "Voteshare (%)", # Dependent variable name
title = "Table 4: Campaign Expenses and Vote Share (By LDP Affiliation)") # Title
Dependent variable: Voteshare (%) | (1) | (2) | (3) |
---|---|---|---|
ldp | 27.010*** (1.033) | | 15.852*** (0.961) |
exppv | | 0.767*** (0.024) | 0.565*** (0.024) |
Constant | 22.413*** (0.559) | 11.453*** (0.727) | 11.779*** (0.644) |
Observations | 989 | 985 | 985 |
R2 | 0.409 | 0.512 | 0.618 |
Adjusted R2 | 0.409 | 0.512 | 0.617 |
Residual Std. Error | 14.788 (df = 987) | 13.422 (df = 983) | 11.883 (df = 982) |
F Statistic | 683.759*** (df = 1; 987) | 1,031.634*** (df = 1; 983) | 794.203*** (df = 2; 982) |

Note: *p<0.1; **p<0.05; ***p<0.01
To add 95% confidence intervals to the regression lines, use qt() to find the range within which 95% of the distribution falls.
## Calculating lower and upper bounds of the confidence interval
## First, calculate predicted values (pred) and standard error (err)
err <- predict(model_9, newdata = pred, se.fit = TRUE)
## Using predicted values and standard error to calculate confidence interval (pred$lower, pred$upper)
pred$lower <- err$fit + qt(0.025, df = err$df) * err$se.fit
pred$upper <- err$fit + qt(0.975, df = err$df) * err$se.fit
p9.ci95 <- p9 +
geom_smooth(data = pred, aes(ymin = lower, ymax = upper), stat ="identity")
# Display the created plot
print(p9.ci95 + ggtitle("Campaign Expenses and Vote Share (By LDP Affiliation)"))
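The manual qt() calculation can be cross-checked against the interval that predict() computes itself. The sketch below uses simulated data rather than hr05 or model_9; the names toy, fit and grid are illustrative.

```r
set.seed(3)

# Simulated data: continuous x, 0/1 dummy d, outcome y
toy <- data.frame(x = runif(50, 0, 90), d = rep(c(0, 1), 25))
toy$y <- 10 + 0.5 * toy$x + 15 * toy$d + rnorm(50, sd = 8)

fit <- lm(y ~ x + d, data = toy)

# Grid of values at which to predict
grid <- expand.grid(x = seq(0, 90, length.out = 5), d = c(0, 1))

# Manual interval: fitted value plus/minus t quantile times standard error
err <- predict(fit, newdata = grid, se.fit = TRUE)
lower <- err$fit + qt(0.025, df = err$df) * err$se.fit
upper <- err$fit + qt(0.975, df = err$df) * err$se.fit

# Built-in 95% confidence interval for the mean prediction
ci <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

all.equal(unname(lower), unname(ci[, "lwr"]))
all.equal(unname(upper), unname(ci[, "upr"]))
```

Both routes give the same bounds, so the manual version is mainly useful for seeing where the interval comes from; interval = "confidence" is the shorter option when only the bounds are needed.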
Perform the same analysis using the following three variables (voteshare, exppv, jcp):
Variable Type | Variable Name | Description |
---|---|---|
Response Variable | voteshare | Vote share (%) |
Predictor Variable | exppv | Election expenses per eligible voter (in yen) |
Control Variable | jcp | Communist Party Dummy (Communist Party candidate = 1, others = 0) |

- Use stargazer() to display the descriptive statistics for the three variables (voteshare, exppv, jcp) for the 2005 House of Representatives election.
- Draw a scatter plot of exppv and voteshare. Also, include the regression line.
- Draw a scatter plot of jcp and voteshare. Also, include the regression line.
- Run a multiple regression with voteshare as the response variable and exppv and jcp as predictor variables. Show the multiple regression equation and interpret the results.
- Visualize the multiple regression of voteshare with exppv and jcp as predictor variables, drawing separate regression lines by jcp. If you are plotting in black and white, consider changing the shape of the scatter plot dots or using two different types of regression lines.