• R packages we use in this section
library(tidyverse)
library(stargazer)

1. Understanding Dummy Variables

1.1 What are Dummy Variables?

  • Dummy variables are binary variables used to indicate the presence or absence of a certain attribute.
  • They take a value of 1 if a specific category is present, and 0 if it is not.
  • Examples include:
    ・Gender (Female = 1, Male = 0)
    ・Election results (Elected = 1, Not Elected = 0)
    ・State of society (Wartime = 1, Peacetime = 0)
    ・Geographical location of states (North = 1, South = 0)
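
A minimal sketch of how such a 0/1 variable can be coded in R. The vector gender and its values are made up for illustration and are not part of the course data; only base R is used.

```r
# Hypothetical example data (not from putnam.csv or hr96-21.csv)
gender <- c("Female", "Male", "Female", "Male", "Male")

# The comparison returns TRUE/FALSE; as.numeric() converts it to 1/0
female <- as.numeric(gender == "Female")
female

# The mean of a 0/1 dummy is the share of observations coded 1
mean(female)
```

The same pattern, as.numeric(variable == "category"), is the one used later in this section to build the location and ldp dummies.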

1.2 Insights from Dummy Variables

  • Referring to “9. Regression Analysis 1 (Simple and Multiple Regression)” and the example of Robert Putnam:
  • There is a positive correlation between economic conditions (x-axis) and government performance… shown in the left figure.
  • Introducing a “location dummy variable” for Italian regional governments (North/South):
    → This results in a negative (or no) correlation between economic conditions (x-axis) and government performance… shown in the right figure.

2. North-South Disparity Dummy and Economic Modernization

2.1 Theory and Hypothesis

Research Question:
“Why is there a significant difference in the performance of Italian regional governments?”

Theory:
Social Capital enhances the performance of governments

  • Differences in the performance of local governments can be explained by the extent of social capital accumulation in the area.
  • “Social Capital” refers to individual connections
    → networks and norms of reciprocity.
    (It helps in building cooperative relationships even with strangers.)
  • Regions with high accumulation of social capital:
    → Trust and cooperate with each other.
    → Enhance the performance of the government.

【Response Variable】

  • Operationalizing “government performance”
    gov_p: Composed of 12 indicators, including:
    ・Stability of local government cabinets.
    ・Speed of budget passage.
    ・Provision of statistical-information services.

【Explanatory Variable】

  • Operationalizing the “degree of accumulation of social capital”
    cc: Civic Community Index, built from:
    ・Proportion of named individual votes in proportional representation = Degree of Clientelism (political patronage).
    ・Voter turnout in referendums = Level of community engagement.
    ・Proportion of newspaper subscribers = Degree of civic deliberative capacity.
    ・Proportion of sports and cultural organizations = Extent of civic social life.

Hypothesis
If this theory is correct, then as cc (Civic Community Index) increases, gov_p (government performance) should also increase

  • Response Variable: gov_p (Government Performance)
  • Explanatory Variable: cc (Civic Community Index)
  • Control Variable: econ (Economic Indicators of Local Governments)

2.2 Goldberg’s Critique of Putnam and Its Verification

【Considering North-South Disparity in Italy】

  • Goldberg’s (1996) criticism of Putnam (1994).
  • In Italy, the North and the South have entirely different histories, traditions, and cultures.
  • Differences in political, economic, and social states can all be explained by the regional differences between the North and South.
  • The degree of social capital accumulation reflects the differences between the Northern and Southern regions.
  • In analyses, it’s necessary to consider regional differences between North and South.
  • North → High social capital → High government performance.
  • South → Low social capital → Low government performance.

Verification 1:

  • Does government performance differ between the North and South?

Question 1: Is the degree of social capital accumulation (cc) merely reflecting the regional differences (location) between the North and South, and unrelated to government performance (gov_p)?

Data (putnam.csv):
Variable Type          Variable Name   Details
-                      region          Abbreviations for Italian regional governments
Response Variable      gov_p           Government Performance
Explanatory Variable   cc              Civic Community Index
Control Variable       econ            Economic Indicators of Local Governments (higher values indicate better economy)
Control Variable       location        North-South indicator ("north" or "south")

Here, a new dummy variable location is added as a control variable

  • Download the data putnam.csv and save it in your RProject Folder.
df1 <- read_csv("data/putnam.csv")
df1
# A tibble: 20 × 5
   region gov_p    cc  econ location
   <chr>  <dbl> <dbl> <dbl> <chr>   
 1 Ab       7.5   8     7   south   
 2 Ba       7.5   4     3   south   
 3 Cl       1.5   1     3   south   
 4 Cm       2.5   2     6.5 south   
 5 Em      16    18    13   north   
 6 Fr      12    17    14.5 north   
 7 La      10    13    12.5 north   
 8 Li      11    16    15.5 north   
 9 Lo      11    17    19   north   
10 Ma       9    15.5  10.5 north   
11 Mo       6.5   3.5   2.5 south   
12 Pi      13    15.5  17   north   
13 Pu       5.5   3.5   4   south   
14 Sa       5.5   8.5   8.5 south   
15 Si       4.5   3.5   5.5 south   
16 To      13    17.5  14.5 north   
17 Tr      11    18    12.5 north   
18 Um      15    15.5  11   north   
19 Va      10    15    15   north   
20 Ve      11    15    13.5 north   
  • Testing for Differences in Government Performance Between North and South Regions

  • Drawing a box plot:

df1 %>% 
  ggplot(aes(x = location, y = gov_p, fill = location)) +
  geom_boxplot() +
  labs(x = "North-South Italy (location)", y = "Government Performance (gov_p)",
       title = "Government Performance by Region in Italy") + 
  # No stat_smooth() here: a regression line is not meaningful on a discrete x-axis
  theme_bw(base_family = "HiraKakuProN-W3")

  • Visually, there is a clear difference in government performance between the North and South.
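  • The two groups (North and South) are independent (unpaired), so we compare their means with a two-sample t-test. By default, t.test() runs Welch's t-test, which does not assume equal variances in the two groups.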
t.test(df1$gov_p[df1$location == "north"],
       df1$gov_p[df1$location == "south"])

    Welch Two Sample t-test

data:  df1$gov_p[df1$location == "north"] and df1$gov_p[df1$location == "south"]
t = 6.8253, df = 14.552, p-value = 6.737e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.607777 8.808890
sample estimates:
mean of x mean of y 
 11.83333   5.12500 

Question 1 Result
・Average gov_p in the North: 11.833
・Average gov_p in the South: 5.125
・The difference of 6.708 (North minus South) is statistically significant at the 1% level (p-value = 6.737e-06)
→ As Goldberg (1996) argues, there is a regional difference in government performance between the North and South
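
The same mean difference can also be recovered from a simple regression on a 0/1 dummy. The sketch below re-enters the gov_p and location columns printed above by hand, so it runs without putnam.csv, and uses only base R. The object names putnam, north, and fit_dummy are introduced here for illustration.

```r
# gov_p and location copied from the printed df1 (rows 1 to 20)
putnam <- data.frame(
  gov_p = c(7.5, 7.5, 1.5, 2.5, 16, 12, 10, 11, 11, 9,
            6.5, 13, 5.5, 5.5, 4.5, 13, 11, 15, 10, 11),
  location = c("south", "south", "south", "south", "north",
               "north", "north", "north", "north", "north",
               "south", "north", "south", "south", "south",
               "north", "north", "north", "north", "north")
)

# north = 1, south = 0
putnam$north <- as.numeric(putnam$location == "north")

# Intercept = mean gov_p in the South; slope = North mean minus South mean
fit_dummy <- lm(gov_p ~ north, data = putnam)
coef(fit_dummy)
```

The point estimates agree with the group means reported by t.test(). The standard error and t value differ slightly, because lm() assumes equal variances in both groups while Welch's test does not.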

Verification 2:

  • Can “Economy” Explain “Government Performance” Even Within North and South Regions?

Question 2: Does a higher degree of economic modernization (econ) correlate with higher government performance (gov_p) within both the Northern and Southern regions? 

  • Draw a scatter plot of econ and gov_p  
df1 %>% 
  ggplot(aes(econ, gov_p)) +
  geom_point() +
  labs(x = "Economic Conditions (econ)", y = "Government Performance (gov_p)",
       title = "Government Performance and Economic Conditions") + 
  stat_smooth(method = lm, se = FALSE) +  # se = FALSE → removes 95% confidence interval
  theme_bw(base_family = "HiraKakuProN-W3")

  • There is a positive correlation between the degree of economic modernization (econ) and government performance (gov_p).

  • Regions with higher economic indicators (econ) tend to have higher government performance (gov_p).

  • Derive the regression equation.

model_1 <- lm(gov_p ~ econ, data = df1)

summary(model_1)

Call:
lm(formula = gov_p ~ econ, data = df1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3386 -1.7733  0.0086  0.8336  5.5114 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.0108     1.3847   2.174 0.043264 *  
econ          0.5889     0.1200   4.909 0.000113 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.659 on 18 degrees of freedom
Multiple R-squared:  0.5724,    Adjusted R-squared:  0.5487 
F-statistic:  24.1 on 1 and 18 DF,  p-value: 0.0001131

\[\widehat{gov_p}\ = 3.01 + 0.589econ\]

  • Check the types of variables included in df1.

str(df1)
spc_tbl_ [20 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ region  : chr [1:20] "Ab" "Ba" "Cl" "Cm" ...
 $ gov_p   : num [1:20] 7.5 7.5 1.5 2.5 16 12 10 11 11 9 ...
 $ cc      : num [1:20] 8 4 1 2 18 17 13 16 17 15.5 ...
 $ econ    : num [1:20] 7 3 3 6.5 13 14.5 12.5 15.5 19 10.5 ...
 $ location: chr [1:20] "south" "south" "south" "south" ...
 - attr(*, "spec")=
  .. cols(
  ..   region = col_character(),
  ..   gov_p = col_double(),
  ..   cc = col_double(),
  ..   econ = col_double(),
  ..   location = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
  • The variable location in df1 is of type character, so convert it to a 0/1 dummy variable (data type numeric).
  • Name the transformed dataframe as df2.
df2 <- mutate(df1, location = as.numeric(location == "north" )) 
                               # Convert to north = 1, south = 0 
head(df2) # Display part of the transformed data
# A tibble: 6 × 5
  region gov_p    cc  econ location
  <chr>  <dbl> <dbl> <dbl>    <dbl>
1 Ab       7.5     8   7          0
2 Ba       7.5     4   3          0
3 Cl       1.5     1   3          0
4 Cm       2.5     2   6.5        0
5 Em      16      18  13          1
6 Fr      12      17  14.5        1
  • Regions with a higher degree of economic modernization (econ) tend to have higher government performance (gov_p)
  • To verify if this is also observed within the North and South regions,

→ A multiple regression analysis is needed by including both econ and the location dummy in the model

model_2 <- lm(gov_p ~ econ + location, data = df2)
  • Display the analysis results.
summary(model_2)

Call:
lm(formula = gov_p ~ econ + location, data = df2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6638 -1.1011 -0.2199  1.2497  4.1464 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  5.22207    1.34660   3.878  0.00121 **
econ        -0.01941    0.22037  -0.088  0.93083   
location     6.88386    2.22907   3.088  0.00667 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.19 on 17 degrees of freedom
Multiple R-squared:  0.7261,    Adjusted R-squared:  0.6939 
F-statistic: 22.53 on 2 and 17 DF,  p-value: 1.659e-05
  • From these results, the following sample regression function (SRF) equation is obtained:

\[\widehat{gov_p}\ = 5.22 - 0.019econ + 6.88location\]

  • The degree of economic modernization (econ) does not have a significant impact on gov_p.
  • Being in the Northern region increases gov_p by 6.88 points, which is significant at the 1% level.
  • Controlling with the location dummy results in the following two (with equal slopes) regression equations:

For the South (location = 0):

\[\widehat{gov_p}\ = 5.22 - 0.019econ\]

For the North (location = 1):

\[\widehat{gov_p}\ = 12.11 - 0.019econ\]
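
The two intercepts can be computed directly from the fitted coefficients. This is a sketch in base R that re-enters gov_p, econ, and the 0/1 location dummy from the data printed earlier so that it runs on its own; putnam and fit_econ are names chosen here for illustration.

```r
# Columns re-entered from the printed data (location: north = 1, south = 0)
putnam <- data.frame(
  gov_p = c(7.5, 7.5, 1.5, 2.5, 16, 12, 10, 11, 11, 9,
            6.5, 13, 5.5, 5.5, 4.5, 13, 11, 15, 10, 11),
  econ = c(7, 3, 3, 6.5, 13, 14.5, 12.5, 15.5, 19, 10.5,
           2.5, 17, 4, 8.5, 5.5, 14.5, 12.5, 11, 15, 13.5),
  location = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
               0, 1, 0, 0, 0, 1, 1, 1, 1, 1)
)

fit_econ <- lm(gov_p ~ econ + location, data = putnam)
b <- coef(fit_econ)

# The dummy shifts the intercept; the slope on econ is shared by both groups
south_intercept <- b[["(Intercept)"]]
north_intercept <- b[["(Intercept)"]] + b[["location"]]
common_slope    <- b[["econ"]]

round(c(south = south_intercept, north = north_intercept, slope = common_slope), 3)
```

The rounded values should agree with the two equations above: the North line is the South line shifted upward by the location coefficient.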

  • Let’s represent these results in a scatter plot.

  • Looking at the overall picture (top left figure):
    → As the degree of economic modernization (econ) increases, so does government performance (gov_p).
  • When looking at the North and South regions separately (top right figure):
    → The degree of economic modernization (econ) does not affect government performance (gov_p).
  • location is statistically significant at the 1% level (p-value = 0.00667).
    → Government performance (gov_p) is 6.88 points higher in the Northern region than in the Southern region.
  • econ is not statistically significant even at the 10% level (p-value = 0.931).
    → When considering the North-South region dummy (location):
    → The impact of economic modernization (econ) on government performance (gov_p) disappears.
    → The relationship between the degree of economic modernization (econ) and government performance (gov_p) is a spurious correlation.

Question 2 Result ・When considering the North-South region dummy (location), the degree of economic modernization within the North and South regions does not explain government performance.

Verification 3:

  • What Explains Government Performance - “Economics,” “North-South,” or “Social Capital”?

Question 3: Does the claim by Putnam (1994) that social capital (cc) influences government performance (gov_p) hold true even when considering regional differences (location)?

  • Draw a scatter plot of cc and gov_p.
  • To display Japanese text in ggplot on a Mac, add the following layer to the plot, as in the code below:
theme_bw(base_family = "HiraKakuProN-W3")
df1 %>% 
  ggplot(aes(cc, gov_p)) +
  geom_point() +
  labs(x = "Civic Community Index", y = "Government Performance",
       title = "Government Performance and Civic Community Index") + 
  stat_smooth(method = lm, se = FALSE)  + # se = FALSE → removes 95% confidence interval
  theme_bw(base_family = "HiraKakuProN-W3")

  • There is a positive correlation between cc and gov_p.
  • Regions with higher social capital (cc) tend to have higher government performance (gov_p).
  • The regression equation is obtained as follows:
model_3 <- lm(gov_p ~ cc, data = df2)

summary(model_3)

Call:
lm(formula = gov_p ~ cc, data = df2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5043 -1.3481 -0.2087  0.9764  3.4957 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.71115    0.84443   3.211  0.00485 ** 
cc           0.56730    0.06552   8.658 7.81e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.789 on 18 degrees of freedom
Multiple R-squared:  0.8064,    Adjusted R-squared:  0.7956 
F-statistic: 74.97 on 1 and 18 DF,  p-value: 7.806e-08

\[\widehat{gov_p}\ = 2.711 + 0.567cc\]

  • For each unit increase in social capital (cc), government performance (gov_p) increases by 0.567 points.
  • This is not likely to be a coincidence (← p-value is 7.81e-08).
  • To verify if this is observed within both North and South regions:
    → A multiple regression analysis is required, including both cc and the location dummy in the model
model_4 <- lm(gov_p ~ cc + location, data = df2)
  • Display the analysis results.
summary(model_4)

Call:
lm(formula = gov_p ~ cc + location, data = df2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5003 -1.3445 -0.2058  0.9773  3.4997 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  2.69850    1.12131   2.407   0.0278 *
cc           0.57094    0.21485   2.657   0.0166 *
location    -0.04781    2.67759  -0.018   0.9860  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.841 on 17 degrees of freedom
Multiple R-squared:  0.8064,    Adjusted R-squared:  0.7836 
F-statistic:  35.4 on 2 and 17 DF,  p-value: 8.689e-07
  • From these results, the following sample regression function (SRF) equation is obtained:

\[\widehat{gov_p}\ = 2.7 + 0.57cc - 0.048location\]

  • Social capital (cc) has a positive effect on gov_p that is statistically significant at the 5% level (p-value = 0.0166).
  • The location dummy has no statistically significant effect on gov_p (p-value = 0.986).
  • Controlling with the location dummy results in the following two (with equal slopes) regression equations:

For the South (location = 0):

\[\widehat{gov_p}\ = 2.7 + 0.57cc\]

For the North (location = 1):

\[\widehat{gov_p}\ = 2.65 + 0.57cc\]
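
To see what the two equations imply, one can predict gov_p for a Southern and a Northern region with the same cc. A minimal sketch in base R; the data are re-entered from the printed table, cc = 10 is an arbitrary illustrative value, and the names putnam, fit_cc, and new_regions are introduced here.

```r
# Columns re-entered from the printed data (location: north = 1, south = 0)
putnam <- data.frame(
  gov_p = c(7.5, 7.5, 1.5, 2.5, 16, 12, 10, 11, 11, 9,
            6.5, 13, 5.5, 5.5, 4.5, 13, 11, 15, 10, 11),
  cc = c(8, 4, 1, 2, 18, 17, 13, 16, 17, 15.5,
         3.5, 15.5, 3.5, 8.5, 3.5, 17.5, 18, 15.5, 15, 15),
  location = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
               0, 1, 0, 0, 0, 1, 1, 1, 1, 1)
)

fit_cc <- lm(gov_p ~ cc + location, data = putnam)

# Two hypothetical regions with identical cc, one South (0) and one North (1)
new_regions <- data.frame(cc = c(10, 10), location = c(0, 1))
pred <- predict(fit_cc, newdata = new_regions)

# The North-South gap at equal cc is exactly the location coefficient
gap <- pred[[2]] - pred[[1]]
gap
```

Because the model has no interaction term, this gap is the same at every value of cc, and here it is close to zero.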

  • Overall (top left figure):
    → As social capital (cc) increases, government performance (gov_p) also increases.
  • Even when looking at the North and South regions separately (top right figure):
    → As social capital (cc) increases, government performance (gov_p) also increases.
  • location is not statistically significant at the 5% level (p-value = 0.9860).
  • cc is statistically significant at the 5% level (p-value = 0.0166).
  • Even when considering the North-South region dummy (location),
    → Social capital (cc) still has an impact on government performance (gov_p).

Question 3 Result
・Among regions with the same level of social capital (cc), there is no difference in the level of government performance (gov_p) between the North and the South.
・Government performance (gov_p) is explained by “social capital” (cc) rather than by “regional differences” (location).

3. What can be understood through dummy variables (general election)?

  • Let’s consider the following model using the results data of the Japanese House of Representatives election.

3.1 Data Preparation (hr96-21.csv)

3.1.1 Downloading Data

  • Click hr96-21.csv and download the election file to your computer.

3.1.2 How to load Data

  • Create a folder named data inside your RProject folder, and manually place the downloaded hr96-21.csv file into the data folder within the RProject directory.

  • The argument na = "." tells read_csv() to treat a period (dot) in the file as a missing value (NA).

  • If the missing-value marker is not declared, a column that should be “numeric” can be read as “character”, which leads to errors in later calculations.

  • Therefore, it’s important to declare how missing values are coded at the time of reading.

hr <- read_csv("data/hr96-21.csv",
               na = ".")  
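
The effect of declaring the missing-value marker can be seen on a tiny example. This sketch uses base R's read.csv() with na.strings, which plays the same role as the na argument of read_csv(); the three-row CSV text is made up for illustration.

```r
# Hypothetical three-row file in which "." marks a missing expenditure
csv_text <- "name,exp\nA,100\nB,.\nC,300"

# Without declaring ".", the whole exp column is read as text
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings = ".", the dot becomes NA and exp stays numeric
clean <- read.csv(text = csv_text, na.strings = ".", stringsAsFactors = FALSE)

class(raw$exp)
class(clean$exp)
sum(is.na(clean$exp))  # one missing value
```

Arithmetic such as exp / eligible only works on the second version, which is why the marker is declared when the file is read.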

3.1.3 Inspecting the Read Election Data

  • hr96-21.csv contains data for the 9 House of Representatives elections conducted in Japan since the introduction of single-member districts in 1996 (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021).
  • To display the variable names included in hr, you can use the following R code:
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"

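hr96-21.csv is documented with the following variables. The loaded data frame hr has 22 of them; castvote is listed below but does not appear in the names(hr) output above: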

variable detail
year Election year (1996-2021)
pref Prefecture
ku Electoral district name
kun Number of electoral district
rank Rank by votes received in the district (1 = most votes)
wl 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
nocand Number of candidates in each district
seito Candidate’s affiliated party (in Japanese)
j_name Candidate’s name (Japanese)
name Candidate’s name (English)
previous Previous wins
gender Candidate’s gender:“male”, “female”
age Candidate’s age
exp Election expenditure (yen) spent by each candidate
status 0 = challenger / 1 = incumbent / 2 = former incumbent
vote votes each candidate garnered
voteshare Vote share (%)
eligible Eligible voters in each district
turnout Turnout in each district (%)
castvote Total votes cast in each district
seshu_dummy 0 = Not-hereditary candidates, 1 = hereditary candidate
jiban_seshu Relationship between candidate and his predecessor
nojiban_seshu Relationship between candidate and his predecessor
  • Checking the Data Types.
  • You can check the data types of the variables in the dataset using the following R code:
str(hr)
spc_tbl_ [9,660 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ year         : num [1:9660] 1996 1996 1996 1996 1996 ...
 $ pref         : chr [1:9660] "愛知" "愛知" "愛知" "愛知" ...
 $ ku           : chr [1:9660] "aichi" "aichi" "aichi" "aichi" ...
 $ kun          : num [1:9660] 1 1 1 1 1 1 1 2 2 2 ...
 $ wl           : num [1:9660] 1 0 0 0 0 0 0 1 0 2 ...
 $ rank         : num [1:9660] 1 2 3 4 5 6 7 1 2 3 ...
 $ nocand       : num [1:9660] 7 7 7 7 7 7 7 8 8 8 ...
 $ seito        : chr [1:9660] "新進" "自民" "民主" "共産" ...
 $ j_name       : chr [1:9660] "河村たかし" "今枝敬雄" "佐藤泰介" "岩中美保子" ...
 $ gender       : chr [1:9660] "male" "male" "male" "female" ...
 $ name         : chr [1:9660] "KAWAMURA, TAKASHI" "IMAEDA, NORIO" "SATO, TAISUKE" "IWANAKA, MIHOKO" ...
 $ previous     : num [1:9660] 2 2 2 0 0 0 0 2 0 0 ...
 $ age          : num [1:9660] 47 72 53 43 51 51 45 51 71 30 ...
 $ exp          : num [1:9660] 9828097 9311555 9231284 2177203 NA ...
 $ status       : num [1:9660] 1 2 1 0 0 0 0 1 2 0 ...
 $ vote         : num [1:9660] 66876 42969 33503 22209 616 ...
 $ voteshare    : num [1:9660] 40 25.7 20.1 13.3 0.4 0.3 0.2 32.9 26.4 25.7 ...
 $ eligible     : num [1:9660] 346774 346774 346774 346774 346774 ...
 $ turnout      : num [1:9660] 49.2 49.2 49.2 49.2 49.2 49.2 49.2 51.8 51.8 51.8 ...
 $ seshu_dummy  : num [1:9660] 0 0 0 0 0 0 0 0 1 0 ...
 $ jiban_seshu  : chr [1:9660] NA NA NA NA ...
 $ nojiban_seshu: chr [1:9660] NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   year = col_double(),
  ..   pref = col_character(),
  ..   ku = col_character(),
  ..   kun = col_double(),
  ..   wl = col_double(),
  ..   rank = col_double(),
  ..   nocand = col_double(),
  ..   seito = col_character(),
  ..   j_name = col_character(),
  ..   gender = col_character(),
  ..   name = col_character(),
  ..   previous = col_double(),
  ..   age = col_double(),
  ..   exp = col_double(),
  ..   status = col_double(),
  ..   vote = col_double(),
  ..   voteshare = col_double(),
  ..   eligible = col_double(),
  ..   turnout = col_double(),
  ..   seshu_dummy = col_double(),
  ..   jiban_seshu = col_character(),
  ..   nojiban_seshu = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
  • The output shows the data type of each variable: numeric values are stored as numeric (num) and text values as character (chr).

3.2 Data Creation and Descriptive Statistics.

  • hr96-21.csv contains result data for 9 House of Representatives elections conducted since the introduction of single-member districts in 1996 (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021).
  • Confirm the variables included in hr using the following R code:
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • It is found that hr contains 22 variables.
  • Using these variables, create new variables (exppv and ldp) required for analysis.

Creating exppv (Election Expenses Per Voter).

  • Check the data types of exp and eligible:
str(hr$exp)
 num [1:9660] 9828097 9311555 9231284 2177203 NA ...
str(hr$eligible)
 num [1:9660] 346774 346774 346774 346774 346774 ...
  • Both of them are numeric, which is fine.
  • Create the election expenses per voter (exppv) using exp and eligible:
hr <- hr %>% 
  dplyr::mutate(exppv = exp/eligible) # eligible represents the number of eligible voters per single-member district
summary(hr$exppv)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
  0.0013   8.1762  18.7646  23.0907  33.3863 120.8519     2831 
  • There are 2831 NA (missing) values, as shown in the NA's column of the summary above.
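
Missing values arise here because a division involving NA returns NA. A small sketch with made-up numbers (the vectors exp_yen and eligible_n are hypothetical and only mimic the structure of hr):

```r
# Hypothetical expenditures (yen); NA means the amount is not available
exp_yen    <- c(9828097, 9311555, NA, 2177203)
eligible_n <- c(346774, 346774, 346774, 346774)

# Any calculation involving NA returns NA
exppv_demo <- exp_yen / eligible_n
exppv_demo

sum(is.na(exppv_demo))          # number of missing values
mean(exppv_demo)                # NA, because one element is missing
mean(exppv_demo, na.rm = TRUE)  # mean of the non-missing values
```

lm() drops rows with missing values by default, so a regression that uses exppv is fitted on fewer observations than the data frame has rows.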

Creating ldp (LDP Dummy).

  • Examine the values in seito:
unique(hr$seito)
 [1] "新進"                   "自民"                   "民主"                  
 [4] "共産"                   "文化フォーラム"         "国民党"                
 [7] "無所"                   "自由連合"               "政事公団太平会"        
[10] "新社会"                 "社民"                   "新党さきがけ"          
[13] "沖縄社会大衆党"         "市民新党にいがた"       "緑の党"                
[16] "さわやか神戸・市民の会" "民主改革連合"           "青年自由"              
[19] "日本新進"               "公明"                   "諸派"                  
[22] "保守"                   "無所属の会"             "自由"                  
[25] "改革クラブ"             "保守新"                 "ニューディールの会"    
[28] "新党尊命"               "世界経済共同体党"       "新党日本"              
[31] "国民新党"               "新党大地"               "幸福"                  
[34] "みんな"                 "改革"                   "日本未来"              
[37] "日本維新の会"           "当たり前"               "政治団体代表"          
[40] "安楽死党"               "アイヌ民族党"           "次世"                  
[43] "維新"                   "生活"                   "立憲"                  
[46] "希望"                   "緒派"                   ""                      
[49] "N党"                   "国民"                   "れい"                  
  • For analysis, we need an LDP dummy variable.
  • Create a dummy variable with LDP = 1 for the LDP and 0 for other parties, then add it to the data frame hr:
  • 自民 means LDP in Japanese.
hr <- hr %>% 
  mutate(ldp = as.numeric(seito =="自民"))
  • Confirm the variables included in hr:
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu" "exppv"         "ldp"          
  • Finally, confirm that ldp has been created.
  • With the addition of these 2 variables, the total number of variables in hr is now 24.
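  • hr now contains the 24 variables shown by names(hr) above. They are described below (castvote appears in the description but is not among them).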
variable detail
year Election year (1996-2021)
pref Prefecture
ku Electoral district name
kun Number of electoral district
rank Rank by votes received in the district (1 = most votes)
wl 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
nocand Number of candidates in each district
seito Candidate’s affiliated party (in Japanese)
j_name Candidate’s name (Japanese)
name Candidate’s name (English)
previous Previous wins
gender Candidate’s gender:“male”, “female”
age Candidate’s age
exp Election expenditure (yen) spent by each candidate
status 0 = challenger / 1 = incumbent / 2 = former incumbent
vote votes each candidate garnered
voteshare Vote share (%)
eligible Eligible voters in each district
turnout Turnout in each district (%)
castvote Total votes cast in each district
seshu_dummy 0 = Not-hereditary candidates, 1 = hereditary candidate
jiban_seshu Relationship between candidate and his predecessor
nojiban_seshu Relationship between candidate and his predecessor
exppv Election Expenses Per Voter (in Yen)
ldp LDP Dummy: 1 for LDP candidates, 0 for others
  • Extract only the 2005 House of Representatives election data from the data frame hr and run a multiple regression analysis using the following three variables (voteshare, exppv, ldp).
variables detail
Response Variable voteshare
Explanatory Variable exppv
Control Variable ldp
hr05 <- hr %>%
  dplyr::filter(year == 2005) %>%   # Select only data for the year 2005
  dplyr::select(voteshare, exppv, ldp) # Select only 3 variables
  • To inspect the contents of the variables:
DT::datatable(hr05)
  • To display the data summary:
summary(hr05)
   voteshare         exppv             ldp        
 Min.   : 0.60   Min.   : 0.148   Min.   :0.0000  
 1st Qu.: 8.80   1st Qu.: 8.352   1st Qu.:0.0000  
 Median :34.80   Median :22.837   Median :0.0000  
 Mean   :30.33   Mean   :24.627   Mean   :0.2932  
 3rd Qu.:46.60   3rd Qu.:35.269   3rd Qu.:1.0000  
 Max.   :73.60   Max.   :89.332   Max.   :1.0000  
                 NA's   :4                        
  • To display descriptive statistics using stargazer()
  • For text output:
stargazer(as.data.frame(hr05), type = "text")

==========================================
Statistic  N   Mean  St. Dev.  Min   Max  
------------------------------------------
voteshare 989 30.333  19.230  0.600 73.600
exppv     985 24.627  17.907  0.148 89.332
ldp       989 0.293   0.455     0     1   
------------------------------------------
  • For HTML output:
  • When displaying in R-Markdown, specify type = "html" and use chunk option ```{r, results = "asis"} to render the HTML
stargazer(as.data.frame(hr05), type = "html")
Statistic N Mean St. Dev. Min Max
voteshare 989 30.333 19.230 0.600 73.600
exppv 985 24.627 17.907 0.148 89.332
ldp 989 0.293 0.455 0 1
  • To get a well-formatted, centered table when displaying HTML, paste the following CSS “outside of a chunk” in the R Markdown file.
<style>
table, td, th {
  border: none;
  padding-left: 1em;
  padding-right: 1em;
  min-width: 50%;
  margin-left: auto;
  margin-right: auto;
  margin-top: 1em;
  margin-bottom: 1em;
}
</style>

3.3 Displaying Scatterplots

  • Display a scatterplot of exppv and voteshare.
  • To display Japanese characters in ggplot on a Mac, add the following layer to the plot, as in the code below:
theme_bw(base_family = "HiraKakuProN-W3")
hr05 %>% 
  ggplot(aes(exppv, voteshare)) +
  geom_point() +
  labs(x = "Election Expenses Per Voter (Yen)", y = "Vote Share (%)",
         title = "Candidate Vote Share and Election Expenses") +
  theme_bw(base_family = "HiraKakuProN-W3") + 
  geom_smooth(method = lm, se = FALSE)  # se = FALSE → Remove 95% confidence interval

  • Display a scatterplot of ldp and voteshare.  
hr05 %>% 
  ggplot(aes(ldp, voteshare)) +
  # geom_jitter() alone draws the points; adding geom_point() would plot each one twice
  geom_jitter(width = 0.02) + # Scatter data for display
  labs(x = "0 = Non-LDP, 1 = LDP", y = "Vote Share (%)",
       title = "Candidate Vote Share and LDP Affiliation") + 
  theme_bw(base_family = "HiraKakuProN-W3") +
  geom_smooth(method = lm, se = FALSE)  # se = FALSE → Remove 95% confidence interval

3.4 Two Regression Analyses (Model_5 & Model_6).

  • Simple linear regression between voteshare and exppv:
model_5 <- lm(voteshare ~ exppv, 
              data = hr05)
  • Multiple regression between voteshare and exppv + ldp:
model_6 <- lm(voteshare ~ exppv + ldp, 
              data = hr05)
  • Display the results of the two models.
  • When displaying in R-Markdown, specify type = "html" and use chunk option ```{r, results = "asis"}:
stargazer(model_5, model_6, type = "html")
                      Dependent variable: voteshare
                      (1)                          (2)
exppv                 0.767***                     0.565***
                      (0.024)                      (0.024)
ldp                                                15.852***
                                                   (0.961)
Constant              11.453***                    11.779***
                      (0.727)                      (0.644)
Observations          985                          985
R2                    0.512                        0.618
Adjusted R2           0.512                        0.617
Residual Std. Error   13.422 (df = 983)            11.883 (df = 982)
F Statistic           1,031.634*** (df = 1; 983)   794.203*** (df = 2; 982)
Note: *p<0.1; **p<0.05; ***p<0.01
  • From the results of model_6 (right column), we obtain the following sample regression function (SRF) equation.

\[\widehat{voteshare}\ = 11.78 + 0.57\cdot exppv + 15.85\cdot ldp\]

  • Holding party affiliation constant, an increase of 1 yen in exppv (election expenses per voter) is associated with an increase of 0.57 percentage points in the candidate’s vote share, which is statistically significant at the 1% level.
  • Holding election expenses constant, LDP (Liberal Democratic Party) candidates have vote shares that are on average 15.85 percentage points higher than those of non-LDP candidates, which is statistically significant at the 1% level.

If the candidate is not an LDP candidate (ldp = 0),

\[\widehat{voteshare}\ = 11.78 + 0.57\cdot exppv\]

If the candidate is an LDP (Liberal Democratic Party) candidate (ldp = 1),

\[\widehat{voteshare}\ = 27.63 + 0.57 \cdot exppv\]
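
As a worked example, the fitted equation can be evaluated by hand for two candidates who spend the same amount. The sketch below hard-codes the model_6 coefficients from the table above; the function name predict_voteshare and the spending level of 20 yen per voter are chosen here for illustration only.

```r
# Coefficients of model_6 as reported in the regression table
b_const <- 11.779
b_exppv <- 0.565
b_ldp   <- 15.852

# Hypothetical helper: predicted vote share (%) for given spending and party dummy
predict_voteshare <- function(exppv, ldp) {
  b_const + b_exppv * exppv + b_ldp * ldp
}

non_ldp <- predict_voteshare(exppv = 20, ldp = 0)  # 11.779 + 0.565 * 20
ldp_can <- predict_voteshare(exppv = 20, ldp = 1)  # same spending, LDP candidate

c(non_ldp = non_ldp, ldp = ldp_can, gap = ldp_can - non_ldp)
```

Whatever spending level is plugged in, the gap between the two predictions equals the ldp coefficient, which is what the two parallel lines express.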

  • When visualized, these results would yield two parallel lines with a slope of 0.57.
## Create a data frame for predicted values and name it 'pred'
pred <- with(hr05, expand.grid(
  exppv = seq(min(exppv, na.rm=TRUE), max(exppv, na.rm=TRUE), length = 100),
  ldp = c(0,1)
))

## Use mutate to create and calculate the new variable, 'voteshare' (predictions)
pred <- mutate(pred, voteshare = predict(model_6, newdata = pred))

## Draw dots (points) for observed values (hr05) and lines for regression lines (pred)
p3 <- ggplot(hr05, aes(x = exppv, y = voteshare, color = as.factor(ldp))) +
  geom_point(size = 1) + geom_line(data = pred)
p3 <- p3 + labs(x = "Election Expenses Per Voter", y = "Vote Share (%)")
p3 <- p3 + scale_color_discrete(name = "Party Affiliation", 
                                labels = c("Non-LDP","LDP")) + guides(color = guide_legend(reverse = TRUE)) +
  theme_bw(base_family = "HiraKakuProN-W3")

print(p3 + ggtitle("Relationship Between Vote Share and Election Expenses (By Candidate's Party Affiliation)"))

  • This model is set up to have two parallel lines (with the same slope).
  • From this graph and the analysis results, it can be observed that higher election expenses per voter are associated with a higher vote share, and that LDP candidates have a vote share about 16 percentage points higher than non-LDP candidates with the same election expenses per voter.
  • However, this model assumes that the magnitude of the effect of election expenses (i.e., the slope) is the same for both LDP and non-LDP candidates.
  • By adding interaction terms to the model, it is possible to determine the differences in the magnitude of impact (i.e., slope) between LDP candidates and non-LDP candidates.
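As a rough preview of what such an interaction model looks like, here is a minimal sketch. It does not use hr05: the data frame `sim` and all of its values are simulated and purely hypothetical, and the true slopes are set to differ between the two groups so that the interaction term has something to pick up.

```r
# Hypothetical simulated data (NOT hr05): the true slopes differ by group
set.seed(1)
n <- 500
ldp <- rbinom(n, size = 1, prob = 0.3)   # dummy: 1 = LDP, 0 = non-LDP
exppv <- runif(n, min = 0, max = 50)     # expenses per voter
voteshare <- 10 + 0.8 * exppv + 20 * ldp - 0.4 * ldp * exppv + rnorm(n, sd = 5)
sim <- data.frame(voteshare, exppv, ldp)

# In an lm() formula, exppv * ldp expands to exppv + ldp + exppv:ldp
fit_int <- lm(voteshare ~ exppv * ldp, data = sim)
b <- coef(fit_int)

slope_non_ldp <- unname(b["exppv"])                   # slope when ldp = 0
slope_ldp     <- unname(b["exppv"] + b["exppv:ldp"])  # slope when ldp = 1
round(c(non_ldp = slope_non_ldp, ldp = slope_ldp), 2)
```

The coefficient on `exppv:ldp` is the estimated difference between the two slopes, so the regression lines are no longer forced to be parallel.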

3.5 Mechanism of Regression Analysis with Dummy Variables (Model_6)

  • To understand the mechanism of the multiple regression analysis with dummy variables included in the above analysis (model_6), let’s try three stages of regression analysis as follows:

3 Stages of Regression Analysis

Step 1: Simple linear regression with continuous and dummy variables (voteshare, ldp).
Step 2: Simple linear regression with two continuous variables (voteshare, exppv).
Step 3: Multiple regression analysis with dummy variable (voteshare, exppv, ldp).

Step 1: Simple Linear Regression with Continuous and Dummy Variables.

  • Consider a regression analysis with candidate’s vote share (voteshare) as the response variable and whether the candidate is endorsed by the Liberal Democratic Party (ldp) as the explanatory variable.
  • voteshare is a continuous variable, and ldp is a categorical variable.
  • Create a scatterplot to visualize the correlation between voteshare (on the y-axis) and ldp (on the x-axis).
ggplot(hr05, aes(ldp, voteshare)) +
  geom_jitter(width = 0.02) + # Jitter points horizontally to reduce overplotting (replaces geom_point, which would draw every point twice)
  geom_smooth(method = lm, se = FALSE) + # se = FALSE → Remove 95% confidence interval
  labs(x = "0 = Non-LDP, 1 = LDP", y = "Vote Share (%)",
         title = "Candidate Vote Share and LDP Affiliation") + 
  theme_bw(base_family = "HiraKakuProN-W3")

  • The regression equation for this model is as follows:

\[voteshare_i \sim \mathrm{N}(\alpha_0 + \alpha_1 \cdot ldp_i, \sigma^2)\]

  • Calculate the regression equation using lm() function and name it model_7.
model_7<- lm(voteshare ~ ldp, data = hr05)
  • Display the regression analysis results using summary().
summary(model_7)

Call:
lm(formula = voteshare ~ ldp, data = hr05)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.223 -14.313  -0.423  12.387  46.187 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  22.4132     0.5593   40.07   <2e-16 ***
ldp          27.0096     1.0329   26.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.79 on 987 degrees of freedom
Multiple R-squared:  0.4093,    Adjusted R-squared:  0.4087 
F-statistic: 683.8 on 1 and 987 DF,  p-value: < 2.2e-16
  • From the results, we obtain \(\widehat{\alpha}_0 = 22.41\), \(\widehat{\alpha}_1 = 27.01\), and \(\widehat{\sigma} = 14.79\).
  • Based on these results, the regression equation for model_7 is as follows:

\[\widehat{voteshare}\ = 22.41 + 27.01 \cdot ldp\]

Interpretation of Model_7 Analysis Results

  • The F-test result is shown at the bottom of the output; its p-value is less than 2.2e-16, so the F-test is significant at the 1% level.
  • The null hypothesis is that the coefficient for ldp is 0.
  • The “<2e-16” under Pr(>|t|) is the p-value for this test.
    → Since the p-value is smaller than 0.01, the null hypothesis is rejected at the significance level of α = 0.01 (1%).
    → Being an LDP candidate (ldp = 1) is associated with a vote share that is 27.01 percentage points higher.
  • The Adjusted R-squared value is 0.4087, which means that about 40.87% of the variance in the response variable (voteshare) can be explained by ldp.
  • In R, Signif. codes is the key to the asterisks printed to the right of each p-value, which show the significance level at which the null hypothesis is rejected.
  • Three asterisks indicate significance at α = 0.001 (0.1%).
  • Two asterisks indicate significance at α = 0.01 (1%).
  • One asterisk indicates significance at α = 0.05 (5%).
  • The stargazer() function can be used to display the results in a more readable format.
  • When displaying in R-Markdown, specify type = "html" and use chunk option ```{r, results = "asis"}
stargazer(model_7,
          type = "html",
          dep.var.labels = "Voteshare (%)", # Response variable name
          title = "Table 1: LDP Affiliation and Voteshare in the 2005HR Election") # Title
Table 1: LDP Affiliation and Voteshare in the 2005HR Election
Dependent variable:
Voteshare (%)
ldp 27.010***
(1.033)
Constant 22.413***
(0.559)
Observations 989
R2 0.409
Adjusted R2 0.409
Residual Std. Error 14.788 (df = 987)
F Statistic 683.759*** (df = 1; 987)
Note: *p<0.1; **p<0.05; ***p<0.01
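As a check on the fit statistics reported for model_7, the adjusted R-squared can be reproduced by hand from the multiple R-squared (0.4093), the number of observations (n = 989), and the number of explanatory variables (k = 1). The formula below is the standard adjustment and is not stated in the output itself:

\[\bar{R}^2 = 1 - (1 - R^2)\cdot\frac{n - 1}{n - k - 1} = 1 - (1 - 0.4093)\cdot\frac{988}{987} \approx 0.4087\]

The denominator n − k − 1 = 987 is the residual degrees of freedom shown in the output.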

Simple Linear Regression with Dummy Variable and t-test

  • The intercept of this regression line, 22.41, represents the average vote share (predicted vote share) for non-LDP candidates, calculated by substituting ldp = 0 into the equation:

\[22.41 + 27.01 \cdot 0 = 22.41\]

  • On the other hand, by substituting ldp = 1 into the prediction equation, we obtain the average vote share (predicted vote share) for LDP candidates.

\[22.41 + 27.01 \cdot 1 = 49.42\]

  • Let’s calculate the average vote share for each group in R and verify that they match the two predicted values obtained above.
  • Average vote share (predicted vote share) for non-LDP candidates (ldp = 0):

hr05 %>%
  filter(ldp == 0) %>%  # Use filter to limit the data to ldp = 0 only
  with(mean(voteshare)) %>%    # Use with to calculate the mean of voteshare
  round(2)                     # Display up to 2 decimal places
[1] 22.41
  • Average vote share (predicted vote share) for LDP candidates (ldp = 1).
hr05 %>%
  filter(ldp == 1) %>%         # Use filter to limit the data to ldp = 1 only
  with(mean(voteshare)) %>%    # Use with to calculate the mean of voteshare
  round(2)                     # Display up to 2 decimal places
[1] 49.42
  • From the above calculations, we can see that the predicted values are the mean values of the response variable when given the values of the explanatory variable.
  • In fact, when performing a t-test comparing the means of voteshare for LDP and non-LDP candidates, we obtain results similar to the regression analysis.
  • Since the two groups are independent (unpaired) samples, we use the default settings of t.test(), which runs Welch’s two-sample t-test.
t.test(hr05$voteshare[hr05$ldp == 1], 
       hr05$voteshare[hr05$ldp == 0])

    Welch Two Sample t-test

data:  hr05$voteshare[hr05$ldp == 1] and hr05$voteshare[hr05$ldp == 0]
t = 32.446, df = 897.39, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 25.37584 28.64335
sample estimates:
mean of x mean of y 
 49.42276  22.41316 
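The difference in means from the t-test (49.42 − 22.41 = 27.01) equals the ldp coefficient of model_7, but the Welch t statistic (32.45) is not the regression t-value (26.15). This is because lm() assumes a common error variance for both groups, whereas Welch’s test does not. The sketch below illustrates the point on simulated data; the data frame `sim` and its values are hypothetical and are not hr05.

```r
# Hypothetical simulated data standing in for hr05
set.seed(42)
ldp <- rep(c(0, 1), times = c(300, 150))
voteshare <- 22 + 27 * ldp + rnorm(length(ldp), sd = 15)
sim <- data.frame(voteshare, ldp)

fit <- lm(voteshare ~ ldp, data = sim)
tt_pooled <- t.test(voteshare ~ ldp, data = sim, var.equal = TRUE)  # Student's t-test
tt_welch  <- t.test(voteshare ~ ldp, data = sim)                    # Welch's t-test (default)

# The dummy coefficient is the difference between the two group means
group_means <- tapply(sim$voteshare, sim$ldp, mean)
all.equal(unname(coef(fit)["ldp"]),
          unname(group_means["1"] - group_means["0"]))

# The regression t-value equals the pooled t statistic up to sign
# (t.test computes mean(ldp = 0) - mean(ldp = 1), so its sign is flipped)
t_lm <- summary(fit)$coefficients["ldp", "t value"]
all.equal(t_lm, -unname(tt_pooled$statistic))

# The Welch statistic is generally close to, but not equal to, the pooled one
c(pooled = unname(tt_pooled$statistic), welch = unname(tt_welch$statistic))
```

Under these assumptions, calling t.test() with var.equal = TRUE is the variant that matches a simple regression on a dummy variable exactly.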

Step 2: Simple Regression Analysis with Continuous Variables.

  • Using the data frame hr05, let’s consider a regression analysis with candidate’s vote share (voteshare) as the response variable and campaign expenses per voter (exppv: in yen) as the explanatory variable.
  • Both voteshare and exppv are continuous variables.
  • We’ll create a scatter plot with voteshare on the vertical axis and exppv on the horizontal axis to visualize the correlation between these two variables.
ggplot(hr05, aes(exppv, voteshare)) +
  geom_point() +
  labs(x = "Campaign Expenses per Voter (yen)", y = "Vote Share (%)",
         title = "Candidate's Vote Share and Campaign Expenses") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  geom_smooth(method = lm, se = FALSE)  # se = FALSE to remove 95% confidence intervals

  • We’ll use lm() to calculate the regression equation and name it model_8.
model_8 <- lm(voteshare ~ exppv, data = hr05)
  • Let’s display the results of the regression analysis.
summary(model_8)

Call:
lm(formula = voteshare ~ exppv, data = hr05)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.596  -9.661  -3.493   9.909  42.851 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.45334    0.72744   15.74   <2e-16 ***
exppv        0.76745    0.02389   32.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.42 on 983 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.5121,    Adjusted R-squared:  0.5116 
F-statistic:  1032 on 1 and 983 DF,  p-value: < 2.2e-16
  • Based on these results, the regression equation for model_8 is as follows:

\[\widehat{voteshare}\ = 11.45 + 0.77 \cdot exppv\]

Interpretation of model_8 Results

  • The F-test result is shown at the bottom of the output; its p-value is less than 2.2e-16, so the F-test is significant at the 1% level.
  • The null hypothesis is that “the coefficient of exppv is 0.”
  • The “<2e-16” under “Pr(>|t|)” is the p-value for this test.
    → Since the p-value is smaller than 0.01, the null hypothesis is rejected at the significance level of α = 0.01 (1%).
    → An increase of 1 yen in campaign expenses per voter is associated with an increase of 0.767 percentage points in vote share.
  • The adjusted R-squared value is 0.512, indicating that 51.2% of the variance in the response variable (voteshare) can be explained by exppv.
  • You can use the stargazer() function to display the results in an easily readable format.
  • When displaying in R-Markdown and specifying type = "html", use the chunk option ```{r, results = "asis"}.

Step 3: Multiple Regression Analysis with Dummy Variables.

Point: Did being a Liberal Democratic Party (LDP) candidate affect a candidate’s vote share?

  • In this step, we attempt a multiple regression analysis with voteshare (vote share) as the response variable and two explanatory variables, ldp (LDP candidate) and exppv (campaign expenses per eligible voter).
  • We will name this model model_9.

Assumption of model_9:
“Even if being an LDP candidate affects vote share, the impact (size of the slope) of campaign expenses on vote share is the same.”

model_9 <- lm(voteshare ~ ldp + exppv, data = hr05)

# Display the analysis results
summary(model_9)

Call:
lm(formula = voteshare ~ ldp + exppv, data = hr05)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.990  -8.828  -2.427   9.284  46.290 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.77885    0.64431   18.28   <2e-16 ***
ldp         15.85224    0.96087   16.50   <2e-16 ***
exppv        0.56538    0.02444   23.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.88 on 982 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.618, Adjusted R-squared:  0.6172 
F-statistic: 794.2 on 2 and 982 DF,  p-value: < 2.2e-16
  • Based on these results, the regression equation for model_9 is as follows:

\[\widehat{voteshare}\ = 11.78 + 15.85 \cdot ldp\ + 0.565 \cdot exppv\]

Interpretation of model_9 Results

  • The F-test result is shown at the bottom of the output; its p-value is less than 2.2e-16, so the F-test is significant at the 1% level.

  • The null hypotheses are as follows:

  • \(H_0\): “The coefficient of the variable ldp (whether the candidate is an LDP candidate) is 0.”

  • \(H_0\): “The coefficient of the variable exppv (campaign expenses per eligible voter) is 0.”

  • The “<2e-16” under “Pr(>|t|)” is the p-value for each coefficient.
    → Both p-values are smaller than 0.01, so both null hypotheses are rejected at the significance level of α = 0.01 (1%).
    → Holding exppv constant, being an LDP candidate (ldp = 1) is associated with a vote share that is 15.85 percentage points higher.
    → Holding ldp constant, an increase of 1 yen in campaign expenses per eligible voter (exppv) is associated with an increase of 0.565 percentage points in vote share.

  • The intercept is 11.78, so when both exppv and ldp are 0, the predicted vote share is 11.78%.

  • The adjusted R-squared value is 0.6172, indicating that 61.72% of the variance in the response variable (voteshare) can be explained by exppv and ldp.

  • What does the coefficient of ldp (15.85) mean?
    → “If the ldp dummy variable increases by 1 unit, the vote share increases by 15.85 percentage points.”
    → “LDP-endorsed candidates have a vote share that is 15.85 percentage points higher than non-LDP candidates.”

  • When visualizing the results of model_9, it looks like this:

## Create a data frame for predicted values and name it 'pred'
pred <- with(hr05, expand.grid(
  exppv = seq(min(exppv, na.rm=TRUE), max(exppv, na.rm=TRUE), length = 100),
  ldp = c(0,1)
))

## Use 'mutate' to create a new variable 'voteshare' by calculating predictions
pred <- mutate(pred, voteshare = predict(model_9, newdata = pred))

## Scatterplot points (data) are plotted with observed values (hr05), and regression lines are plotted with predicted values (pred)
p9 <- ggplot(hr05, aes(x = exppv, y = voteshare, color = as.factor(ldp))) +
  geom_point(size = 1) + geom_line(data = pred)
p9 <- p9 + labs(x = "Campaign Expenses Per Eligible Voter (¥)", y = "Vote Share (%)")
p9 <- p9 + scale_color_discrete(name = "LDP Candidate", 
                                labels = c("No","Yes")) + guides(color = guide_legend(reverse = TRUE)) +
theme_bw(base_family = "HiraKakuProN-W3") 
  
print(p9 + ggtitle("Campaign Expenses and Vote Share (By LDP Affiliation)"))

  • This model is designed to have two parallel lines (equal slopes).

Regression Equation for model_9

\[\widehat{voteshare} = 11.78 + 15.85 \cdot ldp\ + 0.565 \cdot exppv\]

  • Substituting ldp = 0 (non-LDP candidates) into the model_9 regression equation gives the equation of the tomato-colored regression line.

\[\widehat{voteshare}\ = 11.78 + 0.565 \cdot exppv\]

  • Substituting ldp = 1 (LDP candidates) into the model_9 regression equation gives the equation of the blue-colored regression line.

\[\widehat{voteshare}\ = 27.63 + 0.565 \cdot exppv\]
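The two substitutions above imply that the vertical gap between the two lines is the same at every level of exppv and equals the ldp coefficient. A minimal sketch of that check follows, using simulated data rather than hr05; the data frame `sim`, the grid values, and the coefficients used to generate the data are all hypothetical.

```r
# Hypothetical simulated data (NOT hr05) with a common slope for both groups
set.seed(7)
n <- 400
ldp <- rbinom(n, size = 1, prob = 0.3)
exppv <- runif(n, min = 0, max = 50)
voteshare <- 12 + 0.57 * exppv + 16 * ldp + rnorm(n, sd = 10)
sim <- data.frame(voteshare, exppv, ldp)

fit <- lm(voteshare ~ ldp + exppv, data = sim)

# Predict for both groups at several expense levels
grid <- expand.grid(exppv = c(0, 10, 25, 40), ldp = c(0, 1))
grid$pred <- unname(predict(fit, newdata = grid))

# Gap between the LDP line and the non-LDP line at each expense level
gap <- grid$pred[grid$ldp == 1] - grid$pred[grid$ldp == 0]
gap

# The gap is constant and equal to the coefficient on the dummy
all.equal(gap, rep(unname(coef(fit)["ldp"]), 4))
```

Because the model contains no interaction term, the gap cannot vary with exppv; that restriction is what makes the lines parallel.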

  • When comparing the two equations with different ldp dummy values, the only difference is the intercept, and the slopes are the same.
  • While the two regression lines are parallel, their intercepts are different (11.78 and 27.63).
  • The factor of being an LDP candidate shifts the regression line upward. (= On average, LDP-endorsed candidates have a vote share that is 15.85 percentage points higher compared to non-LDP candidates)
  • Overall, the vote share of LDP-endorsed candidates (blue) is higher than that of non-LDP candidates (tomato).
  • These two regression lines capture the relationship between campaign expenses and vote share for both LDP-endorsed candidates (blue) and non-LDP candidates (tomato).
  • Adding dummy variables as explanatory variables allows us to capture changes where regression lines shift in parallel (i.e., whether being an LDP candidate affected a candidate’s vote share).
  • You can also present the analysis results of the three statistical models (model_7, model_8, model_9) in a single table using stargazer().
  • You can display the results with a simple command: stargazer(model_7, model_8, model_9, type = "html").
  • When displaying in R-Markdown, specify type = "html" in stargazer() and use the chunk option ```{r, results = "asis"}.
stargazer(model_7, model_8, model_9, 
          type = "html",                     
          dep.var.labels = "Voteshare (%)", # Dependent variable name
          title = "Table 4: Campaign Expenses and Vote Share (By LDP Affiliation)") # Title
Table 4: Campaign Expenses and Vote Share (By LDP Affiliation)
                     Dependent variable: Voteshare (%)
                     (1)                        (2)                          (3)
ldp                  27.010***                                               15.852***
                     (1.033)                                                 (0.961)
exppv                                           0.767***                     0.565***
                                                (0.024)                      (0.024)
Constant             22.413***                  11.453***                    11.779***
                     (0.559)                    (0.727)                      (0.644)
Observations         989                        985                          985
R2                   0.409                      0.512                        0.618
Adjusted R2          0.409                      0.512                        0.617
Residual Std. Error  14.788 (df = 987)          13.422 (df = 983)            11.883 (df = 982)
F Statistic          683.759*** (df = 1; 987)   1,031.634*** (df = 1; 983)   794.203*** (df = 2; 982)
Note:                *p<0.1; **p<0.05; ***p<0.01

3.6 Adding a 95 Percent Confidence Interval to the Scatter Plot

  • Let’s add a 95 percent confidence interval to the scatter plot above.
  • To calculate the confidence interval, we use the standard error.
  • Since the standardized estimate (the deviation of the estimate from the true value, divided by its standard error) follows a t-distribution, we use qt() to find the quantiles that bound the central 95% of that distribution.
## Calculating lower and upper bounds of the confidence interval
## First, calculate predicted values (pred) and standard error (err)
err <- predict(model_9, newdata = pred, se.fit = TRUE)

## Using predicted values and standard error to calculate confidence interval (pred$lower, pred$upper)
pred$lower <- err$fit + qt(0.025, df = err$df) * err$se.fit
pred$upper <- err$fit + qt(0.975, df = err$df) * err$se.fit

p9.ci95 <- p9 +
  geom_smooth(data = pred, aes(ymin = lower, ymax = upper), stat ="identity")

# Display the created plot
print(p9.ci95 + ggtitle("Campaign Expenses and Vote Share (By LDP Affiliation)"))
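The manual calculation with qt() can be cross-checked against the interval that predict() returns when called with interval = "confidence". The sketch below does this on simulated data; the data frame `sim`, the grid `newdat`, and all values are hypothetical and are not hr05.

```r
# Hypothetical simulated data (NOT hr05)
set.seed(123)
n <- 300
ldp <- rbinom(n, size = 1, prob = 0.3)
exppv <- runif(n, min = 0, max = 50)
voteshare <- 12 + 0.57 * exppv + 16 * ldp + rnorm(n, sd = 10)
sim <- data.frame(voteshare, exppv, ldp)
fit <- lm(voteshare ~ ldp + exppv, data = sim)

newdat <- expand.grid(exppv = seq(0, 50, length.out = 5), ldp = c(0, 1))

# Manual interval: fitted value +/- t quantile * standard error of the fitted mean
err   <- predict(fit, newdata = newdat, se.fit = TRUE)
lower <- err$fit + qt(0.025, df = err$df) * err$se.fit
upper <- err$fit + qt(0.975, df = err$df) * err$se.fit

# Built-in 95% confidence interval for the mean response
ci <- predict(fit, newdata = newdat, interval = "confidence", level = 0.95)

all.equal(unname(lower), unname(ci[, "lwr"]))
all.equal(unname(upper), unname(ci[, "upr"]))
```

Computing the bounds by hand, as in the section above, makes the role of the standard error and the t quantile explicit; the built-in option is a shortcut to the same numbers.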

4. Exercise

  • hr96-21.csv is a dataset containing the results of 9 House of Representatives elections held in Japan since the introduction of single-member districts in 1996 (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021).
  • Extract the data for the 2005 House of Representatives election from this dataset and perform a multiple regression analysis using the following 3 variables (voteshare, exppv, jcp):
Variable Type        Variable Name   Description
Response Variable    voteshare       Vote share (%)
Predictor Variable   exppv           Election expenses per eligible voter (in yen)
Control Variable     jcp             Communist Party dummy (Communist Party candidate = 1, others = 0)
  • Note: You need to create the jcp dummy yourself.
    Q1: Use stargazer() to display the descriptive statistics for the three variables (voteshare, exppv, jcp) for the 2005 House of Representatives election.
    Q2: Create a scatter plot for exppv and voteshare. Also, include the regression line.
    Q3: Create a scatter plot for jcp and voteshare. Also, include the regression line.
    Q4: Perform a multiple regression analysis with voteshare as the response variable and exppv and jcp as predictor variables. Show the multiple regression equation and interpret the results.
    Q5: Create a scatter plot for voteshare with exppv and jcp as predictor variables.
    Differentiate the observations and regression lines by the dummy variable jcp. If you are plotting in black and white, consider changing the shape of the scatter plot dots or using two different types of regression lines.
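As a hint for creating the jcp dummy, here is a minimal sketch on a toy data frame. The column names `year` and `party` and the party label "JCP" are assumptions made for illustration only; check the actual column names and party labels in hr96-21.csv before adapting it. The vote shares in the toy data are made up.

```r
# Toy stand-in for hr96-21.csv; column names, labels, and values are hypothetical
toy <- data.frame(
  year      = c(2003, 2005, 2005, 2005, 2009),
  party     = c("LDP", "JCP", "LDP", "DPJ", "JCP"),
  voteshare = c(48.2, 7.5, 51.3, 39.9, 6.8)
)

# Keep only the 2005 election
toy05 <- subset(toy, year == 2005)

# Dummy: 1 for Communist Party candidates, 0 for everyone else
toy05$jcp <- as.integer(toy05$party == "JCP")

# Confirm that the dummy only takes the values 0 and 1
table(toy05$jcp)
```

A logical comparison converted with as.integer() is one common way to build a 0/1 dummy; dplyr's mutate() with if_else() would work equally well.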