• R packages we use in this section
library(broom)
library(car)
library(jtools)
library(patchwork)
library(QuantPsyc)
library(sjPlot)
library(stargazer)
library(summarytools)
library(tidyverse)

1. Least Squares

1.1. Simple linear regression

  • The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems by minimizing the sum of the squares of the residuals made in the results of every single equation.
  • As shown in Section 13, correlation describes a linear relationship between two variables, x and y.

  • Multiple notations on X and Y
X (Cause) Y (Effect)
explanatory variable response variable
independent variable dependent variable
predictor outcome
explanatory variable explained variable
regressor regressand
  • This relationship is best characterized using the following linear model:

  • Y is the outcome or response variable and X is the predictor or independent (explanatory) variable.
  • Any line can be defined by the intercept \(α\) and the slope parameter \(β\).
  • The intercept \(α\) represents the average value of Y when X is zero.
  • The slope \(β\) measures the average increase in Y when X increases by one unit.
  • The intercept and slope parameters are together called coefficients.
  • The error (or disturbance) term, \(ε\), allows an observation to deviate from a perfect linear relationship.
  • Since the values of \(α\) and \(β\) in this equation are unknown to researchers, they must be estimated from the data.
  • The estimates of parameters are indicated by “hats,” where \(\hat{α}\) and \(\hat{β}\) represent the estimates of \(α\) and \(β\), respectively.

\[\hat{Y} = \hat{α} + \hat{β}x\]

  • Most likely, the predicted value will not equal the observed value.
  • The difference between the observed outcome and its predicted value is called the residual.
  • Formally, we can write the residual as:

\[\hat{ε} = Y - \hat{Y}\]

  • The least squares estimates of intercept (\(\hat{α}\)) and slope (\(\hat{β}\)) parameters are given as follows:

  • The outcome variable, \(\hat{Y}\), is given as follow:

\[\hat{Y} = \hat{α} + \hat{β}x \]

  • Let’s check the values of \(x\) and \(Y\) when \(x\) and \(Y\) equal to their mean

\[\hat{Y} = \hat{α} + \hat{β}\bar{X}\]

  • Substitute \(\hat{α}\) in (4.6) for \(\hat{α}\) in the equation

\[\hat{Y} = (\bar{Y} - \hat{β}\bar{X}) + \hat{β}\bar{X} = \bar{Y}\]

\[\bar{Y} - \hat{β}\bar{X} = \hat{α}\]

  • When the value of \(x\) is equal to its mean (\(\bar{X}\)), the value of \(Y\) is also equal to its mean (\(\bar{Y}\)).

  • In the above plot, we see that this is indeed the case.

  • The regression line runs through the intersection of the vertical and horizontal dotted lines, which represent the means of X and Y, \((\bar{X}, \bar{Y})\), respectively.

  • A linear regression model always has zero average prediction error across all data points in the sample, but this does not necessarily mean that the linear regression model accurately represents the actual data-generating process.

Source: Imai, Kosuke. Quantitative Social Science: An Introduction (p.141-147). Princeton University Press. Kindle.

1.2 Manual calculation for \(α\) and \(β\)

  • Let’s calculate \(α\) and \(β\) by hand using the following dataset.
  • This video helps you undersand how to calculate simple regression function by hand.
  • Although the video uses the different dataset, it helps you have a better understanding of simple linear regression.

Regression equation

\[\hat{Y} = \hat{α} + \hat{β}x\]

Slope (\(β\)) is difined as follows:

\[\hat{β} = r \frac{s_y}{s_x}\]

\(r\) → ピアソンの相関係数
\(s_y\) → y の標準偏差
\(s_x\) → x の標準偏差

Intercept (\(α\)) is difined as follows:

\[\hat{α} = \bar{Y} - β\bar{x}\]

  • We can calculate Pearson correlation coefficient (\(r\)) as follows:

\[r = \frac{\sum((x-\bar{x})(y-\bar{y}))}{\sqrt{Σ(x-\bar{x})^2Σ(y-\bar{y})^2}}= \frac{422.95}{\sqrt{(745.55)(297.55)}}= \frac{422.95}{{471}}= 0.89\]

\(s_y\) →  y’s standard deviation

\[s_y= \sqrt{\frac{Σ(y-\bar{y})^2}{{n-1}}}= \sqrt{\frac{297.55}{{19}}}=3.96\]

\(s_x\) →  x’s standard deviasion

\[s_x= \sqrt{\frac{Σ(x-\bar{x})^2}{{n-1}}}= \sqrt{\frac{745.55}{{19}}}=6.26\]
\[\hat{β} = r \frac{s_y}{s_x}= 0.89*\frac{3.96}{6.26}=0.56\]

\[\hat{α} = \bar{Y} - β\bar{x}= 9.15 - 0.55*11.35 =2.9\]

\[\hat{Y}=2.79+0.56x\]

2. 4 Steps in empirical analysis

① Find a puzzle
② Present your theory to explain the puzzle
③ Present a testable hypothesis drawn from your theory
④ Test your hypothesis with data

2.1 Find a puzzle

Research Question
Why do we witness the differences on local government’s performance in Italy?

Data

  • region: Abbreviation of Italian local governments
  • gov_p: Performance of Italian local governments

出典:飯田健『計量政治分析』p.28

2.2 Present your theory to explain the puzzle

Theory: Social capital enhances local government’s performance.

Source: Robert Putnam, (1994) Making Democracy Work: Civic Traditions in Modern Italy,
Princeton, NJ: Princeton University Press)

Theory

  • The differences on local government’s performance can be explained by the degree of social capital in each local government.

  • Social capital can be defined as “the networks of relationships among people who live and work in a particular society, enabling that society to function effectively”.

  • It involves the effective functioning of social groups through interpersonal relationships, a shared sense of identity, a shared understanding, shared norms, shared values, trust, cooperation, and reciprocity.

  • Social capital help people build cooperation one another.
    In the area with more social capital, the more people trust and cooperate one another, which leads to high quality government performance.

2.4 Present a testable hypothesis drawn from your theory

Hypothesis

  • If the theory is correct, then we should see that the larger civic community index (cc) leads to higher government performance (gov_p)

3. Test your hypothesis with data

3.1 Data preparation(putnam.csv)

  • Download (putnam.csv)
  • Put the file (putnam.csv) into data folder in your RProject folder
  • Load the data and name it putnam
putnam <- read_csv("data/putnam.csv") 
  • Show the list of variables the data contains
names(putnam)
[1] "region"   "gov_p"    "cc"       "econ"     "location"

Data

Types of variables Variables Details
Outcome gov_p Performance of Italian local governments
Predictor region Abbreviation of Italian local governments
Predictor cc Civic Community Index
Predictor econ Economy Index (the larger, the better)
Predictor location Area dummy (north,south
  • Check putnum
DT::datatable(putnam)

3.2 Descriptive statistics

2 ways of showing desciptive statistics

  • summary()
summary(putnam)
    region              gov_p             cc              econ      
 Length:20          Min.   : 1.50   Min.   : 1.000   Min.   : 2.50  
 Class :character   1st Qu.: 6.25   1st Qu.: 3.875   1st Qu.: 6.25  
 Mode  :character   Median :10.00   Median :15.000   Median :11.75  
                    Mean   : 9.15   Mean   :11.350   Mean   :10.43  
                    3rd Qu.:11.25   3rd Qu.:16.250   3rd Qu.:14.50  
                    Max.   :16.00   Max.   :18.000   Max.   :19.00  
   location        
 Length:20         
 Class :character  
 Mode  :character  
                   
                   
                   
  • stargazer()
library("stargazer")
stargazer(as.data.frame(putnam), 
          type = "text", 
          title = "Descriptve Statistics on Local Government's Performance", 
          digits = 2)

Descriptve Statistics on Local Government's Performance
======================================
Statistic N  Mean  St. Dev. Min   Max 
--------------------------------------
gov_p     20 9.15    3.96   1.50 16.00
cc        20 11.35   6.26   1.00 18.00
econ      20 10.43   5.08   2.50 19.00
--------------------------------------

3.3 Draw a scatter plot

  • Draw a scatter plot between cc and gov_p
ggplot(putnam, aes(cc, gov_p)) +
  geom_point() +
  labs(x = "Civic Community Index (cc)", y = "Performance of Italian local governments (gov_p)",
         title = "erformance of Italian local governments & Civic Community Index") + 
  stat_smooth(method = lm) +  
  geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)

  • Calculate the correlation coefficient between gov_p and cc
cor.test(putnam$gov_p, putnam$cc)

    Pearson's product-moment correlation

data:  putnam$gov_p and putnam$cc
t = 8.6583, df = 18, p-value = 7.806e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7558102 0.9593028
sample estimates:
      cor 
0.8979883 
  • Correlation coefficient between gov_p and cc is 0.8979883
  • A stronger correlation
  • However, we need to use a regression analysis to seek for a causality between them

3.4 Simple Regression

Model_1<- lm(gov_p ~ cc, data = putnam)

3.4.1 How to show regression results

(1) Summary()

summary(Model_1)

Call:
lm(formula = gov_p ~ cc, data = putnam)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5043 -1.3481 -0.2087  0.9764  3.4957 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.71115    0.84443   3.211  0.00485 ** 
cc           0.56730    0.06552   8.658 7.81e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.789 on 18 degrees of freedom
Multiple R-squared:  0.8064,    Adjusted R-squared:  0.7956 
F-statistic: 74.97 on 1 and 18 DF,  p-value: 7.806e-08

(2) tidy()

tidy(Model_1)
# A tibble: 2 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)    2.71     0.844       3.21 0.00485     
2 cc             0.567    0.0655      8.66 0.0000000781

(3) stargazer()

  • Note: replace {r} with {r, results = "asis"} as the chunk option
stargazer(Model_1, type = "html")
Dependent variable:
gov_p
cc 0.567***
(0.066)
Constant 2.711***
(0.844)
Observations 20
R2 0.806
Adjusted R2 0.796
Residual Std. Error 1.789 (df = 18)
F Statistic 74.967*** (df = 1; 18)
Note: p<0.1; p<0.05; p<0.01
  • From this result, we get the Sample Regression Function for Model_1 as follows:

\[\hat{gov_p}\ = 2.7 + .567cc\]

3.4.2 Interpretation of Regression Results

tidy(Model_1)
# A tibble: 2 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)    2.71     0.844       3.21 0.00485     
2 cc             0.567    0.0655      8.66 0.0000000781
  • Coefficient on cc (.567):
  • When cc increases one unit, then gov_p increases by .567 points
  • What we want to know it if this is the case in population, too.
  • p-value is 0.0000000781
  • This p-valus is smaller than 0.05
    → We can reject the null hypothesis (coefficient on cc = 0 in population) with 95% confidence interval
    → Statistically significant with the 5% significant level
  • In general, “statistically significant” means statistically significant with the 5% significant level.
  • However, in this case, p-value (0.0000000781) is way smaller than 0.05.
  • It is also smaller than 0.01
    → We can also reject the null hypothesis (coefficient on cc = 0 in population) with 99% confidence interval
    Statistically significant with the 1% significant level
  • In social sciences, we visually report statistical significance using catapillar plots (I will talk about this later).
    → The coefficient on cc is not zero in population

Null Hypothesis \(H_0: β_1 = 0\)
Alternative Hypothesis \(H_a: β_1 ≠ 0\)

  • We reject the null hypothesis and take the alternative hypothesis
    cc is positively related to gov_p in population
Conclusion The larger civic community index (cc) leads to higher government performance (gov_p)

Other stuff to consider:

\(R^2\) = 0.8064
  • 81% of the variance of gov_p is explained by cc in Model_1
    → The other 19% remained unexplained
  • To increase the value of R-squared, we need to add another variable which affects the ourcome, gov_p

4. Multiple Regression

  • In 3.4.2, we see that when cc increases one unit, then gov_p increases by .567 points
  • However, it is not realistic that one outcome is caused by ONE variable.
  • It is natural to think that most of the outcome is caused by MULTPLE variables!
  • If we fail to include a variable (which affects the outcome for sure), then we face a selection bias problem.
    → We need to include another independent variable(s) to avoid selection bias.

A Solution to avoid selection bias → Multiple Regression Analysis
Multiple Regression analysis enables us to simultaneously control the effect of other variables
→ We can avoid selection bias problem to a certain degree
→ Multiple regression analysis is a most basic way of avoiding selection bias

  • We add another variable, econ as the control variable to our model

Data

Types of variables Variables Details
Outcome gov_p Performance of Italian local governments
Predictor region Abbreviation of Italian local governments
Predictor cc Civic Community Index
Predictor econ Economy Index (the larger, the better)
Predictor location Area dummy (north,south

4.1 Six steps in Multiple Regression

  1. Draw a scatter plot between X and Y
  2. Sample Regression Function formula
  3. Partial regression coefficient(\(α, β_1, β_2\))
  4. F test
  5. Adjusted Coefficient of Determination (\(Adjusted R^2\))
  6. Standard partial regression coefficient (\(β\))

(1) Draw a scatter plot between X and Y

  • Check the data we use
DT::datatable(putnam)
stargazer(as.data.frame(putnam), 
          type = "text", 
          title = "Descriptive Statistics on putnam", 
          digits = 2)

Descriptive Statistics on putnam
======================================
Statistic N  Mean  St. Dev. Min   Max 
--------------------------------------
gov_p     20 9.15    3.96   1.50 16.00
cc        20 11.35   6.26   1.00 18.00
econ      20 10.43   5.08   2.50 19.00
--------------------------------------
  • Draw a scatter plot between gov_p(Y-axis) and cc (X-axis)
plt_1 <- ggplot(putnam, aes(cc, gov_p)) +
  geom_point() +
  labs(x = "Civic Community Index (cc)", y = "Performance of Italian local governments (gov_p)",
         title = "Performance of Italian local governments & Civic Community Index") + 
  stat_smooth(method = lm) +  
  geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)
plt_2 <-  ggplot(putnam, aes(econ, gov_p)) +
  geom_point() +
  labs(x = "Economy Index (econ)", y = "Performance of Italian local governments (gov_p)",
         title = "Performance of Italian local governments & Economy Index") + 
  stat_smooth(method = lm) +  
  geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)
plt_1 + plt_2

  • We see that both have positive correlation

(2) Sample Regression Function formula

Model_1 <- lm(gov_p ~ cc, data = putnam)
Model_2 <- lm(gov_p ~ econ, data = putnam)
stargazer(Model_1, Model_2,     
          digits = 3,        
          style = "ajps", 
          dep.var.caption = "Outcome", 
          dep.var.labels = "Gov.Performance", 
          title = "Results of Model_1 & Model_2", 
          type ="html")
Results of Model& Model
Gov.Performance
Model 1 Model 2
cc 0.567***
(0.066)
econ 0.589***
(0.120)
Constant 2.711*** 3.011**
(0.844) (1.385)
N 20 20
R-squared 0.806 0.572
Adj. R-squared 0.796 0.549
Residual Std. Error (df = 18) 1.789 2.659
F Statistic (df = 1; 18) 74.967*** 24.097***
p < .01; p < .05; p < .1
  • Stargazer() automatically add asterisks to show the level of statistical significant levels (1%, 5%, and 10%)

  • p < .01(1%)- - - 「\(***\)

  • p < .05(5%)- - - 「\(**\)

  • p < .1 (10%)- - - 「\(*\)

Results (Model_1)

  • Coefficient on cc (.567):
  • When cc increases one unit, then gov_p increases by .567 points
  • What we want to know it if this is the case in population, too.
  • Three asterisks on the coefficient of cc
    → We can reject the null hypothesis (coefficient on cc = 0 in population) with 99% confidence interval
    Statistically significant with the 1% significant level

Results (Model_2)

  • Coefficient on econ (.589):
  • When econ increases one unit, then gov_p increases by .589 points
  • What we want to know it if this is the case in population, too.
  • Three asterisks on the coefficient of econ
    → We can reject the null hypothesis (coefficient on cc = 0 in population) with 99% confidence interval
    Statistically significant with the 1% significant level

 

Both cc and econ are statistically significant

Note

  • If you want to check t-value and p-value, you should use summary() or tidy() as follows:
tidy(Model_1, conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic      p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>    <dbl>     <dbl>
1 (Intercept)    2.71     0.844       3.21 0.00485         0.937     4.49 
2 cc             0.567    0.0655      8.66 0.0000000781    0.430     0.705
tidy(Model_2, conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    3.01      1.38       2.17 0.0433      0.102     5.92 
2 econ           0.589     0.120      4.91 0.000113    0.337     0.841

(3) Partial regression coefficient (α, β1, β2)

Model_3 <- lm(gov_p ~ cc + econ, data = putnam)
stargazer(Model_1, Model_2, Model_3,    
          digits = 3,        
          style = "ajps", 
          dep.var.caption = "Outcome", 
          dep.var.labels = "Gov.Performance", 
          title = "Results of Model_1 & Model_2 & Model_3", 
          type ="html")
Results of Model& Model& Model
Gov.Performance
Model 1 Model 2 Model 3
cc 0.567*** 0.754***
(0.066) (0.152)
econ 0.589*** -0.253
(0.120) (0.187)
Constant 2.711*** 3.011** 3.236***
(0.844) (1.385) (0.912)
N 20 20 20
R-squared 0.806 0.572 0.825
Adj. R-squared 0.796 0.549 0.805
Residual Std. Error 1.789 (df = 18) 2.659 (df = 18) 1.749 (df = 17)
F Statistic 74.967*** (df = 1; 18) 24.097*** (df = 1; 18) 40.126*** (df = 2; 17)
p < .01; p < .05; p < .1

Sample Regression Function formula on Model_3

\[\hat{gov_p}\ = 3.236 + 0.754cc - 0.253econ\]

Interpretation :Model_2, Model_2, and Model_3

  • In Model_1, cc is statistically significant.
  • In Model_2, econ is statistically significant.
  • However, in Model_3, cc is statistically significant, but econ is not.
    Why?
    → The relationship between gov_p and econ is spurious correlation
  • Coefficient on cc (.754):
  • When cc increases one unit, then gov_p increases by .754 points
    → Statistically significant with the 1% significant level

(4) F test

  • An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis.

  • It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled.

  • Check the values on F-statistics on Model_3

 F-statistic: 40.13 on 2 and 17 DF,  p-value: 3.645e-07
  • F-statistic enables us to test whether the coefficient of cc and econ are simultaneously equal to 0
  • Null hypothesis: the coefficient of cc and econ are simultaneously equal to 0
  • F-statistic = 40.13, p-value = 3.645e-07
    → We can reject the null hypothesis.
  • Statistically significant with the 1% significant level → We can conclude that this is an appropriate model

(5) Adjusted Coefficient of Determination (\(adjR^2\))

  • In multiple regression, you use \(adjR^2\) instead of \(R^2\)
  • Check the values on F-statistics on Model_3
 Adjusted R-squared:  0.805  
  • 80.5% of the variance of gov_p is explained by cc and econ in Model_3
    → The other 19.5% remained unexplained

(6) Standard partial regression coefficient (\(β\))  

  • Usually, the unit of each variable differs, such as cm, kg, US dollors, yen, %, etc.
    → It is not possible to see how each variable has an impact on the ourcome over the other variables by simply comparing the size of coefficient.
  • By comparing the size of \(β\) (Standardized Beta Coefficient), we can compare the relative impact of each variable on the outcome.  
library("QuantPsyc")  
Model_3 <- lm(gov_p ~ cc + econ, data = putnam)
lm.beta(Model_3)
        cc       econ 
 1.1932019 -0.3255233 

Interpretation of β (Model_3)

  • The coefficient (1.1932019) → when cc increases one standard deviation, gov_p increases its standard deviation by 1.19
    → Statistically significant with the 1% significant level

  • The coefficient (-0.3255233) → when econ increases one standard deviation, gov_p decreases its standard deviation by 0.33
    → Not statistically significant

4.2.8 How to present regression results

Tables

stargazer(Model_1, Model_2, Model_3,    
          digits = 3,        
          style = "ajps", 
          dep.var.caption = "Outcome", 
          dep.var.labels = "Gov.Performance", 
          title = "Results of Model_1 & Model_2 & Model_3", 
          type ="html")
Results of Model& Model& Model
Gov.Performance
Model 1 Model 2 Model 3
cc 0.567*** 0.754***
(0.066) (0.152)
econ 0.589*** -0.253
(0.120) (0.187)
Constant 2.711*** 3.011** 3.236***
(0.844) (1.385) (0.912)
N 20 20 20
R-squared 0.806 0.572 0.825
Adj. R-squared 0.796 0.549 0.805
Residual Std. Error 1.789 (df = 18) 2.659 (df = 18) 1.749 (df = 17)
F Statistic 74.967*** (df = 1; 18) 24.097*** (df = 1; 18) 40.126*** (df = 2; 17)
p < .01; p < .05; p < .1

Caterpillar plots
(1) sjPlot

coef_sjplot <- sjPlot::plot_model(Model_3, 
                          show.values = T,      
                          show.p = T,              
                          vline.color = "black",   
                          order.terms = c(1, 2), 
                          axis.labels = c("Economy Index (econ)", "Civic Community Index (cc)"),
                          axis.title = "Estimates", title = "Performance of Italian local governments, Civic Community, and Economy",
                          digits = 3) 

coef_sjplot

【How to interpret sjPlot】

  • dot (●) is the estimate (= coefficient) for each explanatory variable

  • Holizontal lines (blue and red) show the 95% confidence intervals with α = 0.05 (= 5%)

  • Blue line・・・The blue line does not cross the black vertical line
    → Statistically significant with 5% significant level

  • Red line・・・The blue line does cross the black vertical line
    → Not statistically significant with 5% significant level
     

(2) jtools

jtools::plot_summs(Model_3)

6. Excercise

Excercise 6.1

  • Download hr96_21.csv
  • Japanese lower house election results (1996-2021)
  • Put the file (hr96_21.csv) into data folder in your RProject folder
  • Load the data and name it hr
hr <- read_csv("data/hr96-21.csv", 
               na = ".")
  • Check the variable names in hr
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • df1 contains the following 23 variables
variable detail
year Election year (1996-2017)
pref Prefecture
ku Electoral district name
kun Number of electoral district
rank Ascending order of votes
wl 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
nocand Number of candidates in each district
seito Candidate’s affiliated party (in Japanese)
j_name Candidate’s name (Japanese)
name Candidate’s name (English)
previous Previous wins
gender Candidate’s gender:“male”, “female”
age Candidate’s age
exp Election expenditure (yen) spent by each candidate
status 0 = challenger / 1 = incumbent / 2 = former incumbent
vote votes each candidate garnered
voteshare Voteshare (%)
eligible Eligible voters in each district
turnout Turnout in each district (%)
castvote Total votes cast in each district
seshu_dummy 0 = Not-hereditary candidates, 1 = hereditary candidate
jiban_seshu Relationship between candidate and his predecessor
nojiban_seshu Relationship between candidate and his predecessor
  • Select the 2009 election data from hr96-21.csv, and answer the following questions.

Q1: Using seito variable, make a party variable (DPJ dummy: dpj) where 民主党 = 1, others = 0. Add dpj to the dataframe, hr, as a new variable.

Q2: Using exp variable, make a new variable (expm) where the unit is not “Yen” but “million Yen”. Add expm to the dataframe, hr, as a new variable.

Q3: Using status variable, make a new variable, inc, where incumbent = 1, challenger and former incumbent = 0. Add inc to the dataframe, hr, as a new variable.

Q4: Show the descriptive statistics of hr using stargazer package.

Q5: Draw a scatter plot between expm (x-axis) and voteshare (y-axis)

Q6: Briely explain the relationship between expm (x-axis) and voteshare (y-axis)

Q7: Suppose you want to know whether campaign money (expm) affects votes (voteshare).

○ Explanatory Variable — expm
○ Response Variable — voteshare

  • Select eight control variables from the list below, run a multiple regression, and show the results using summary(), tidy(), and stargazer package.
  • You need to briefly explain why you think the eight control variables you select affect voteshare.
variable detail
1. mag District magnitude (Number of candidate elected)
2. rank Ascending order of votes
3. nocand Number of candidates in each district
4. j_name Candidate’s name (Japanese)
5. previous Previous wins
6. gender Candidate’s gender:“male”, “female”
7. age Candidate’s age
8. wl 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
9. vote votes each candidate garnered
10. eligible Eligible voters in each district
11.turnout Turnout in each district (%)
12.castvote Total votes cast in each district
13. seshu_dummy 0 = Not-hereditary candidates, 1 = hereditary candidate
14. dpj_dummy 0 = Not-dpj candidate, 1 = dpj candidate
15. inc challenger and former incumbent = 0, incumbent = 1

Q8: Present your statistical results using sjPlot or jtools.

Q9: Interpret the results of multiple regression analysis.

References
  • 飯田健『計量政治分析』共立出版、2013年.
  • ロバート・パットナム(河田潤ー訳)『哲学する民主主義』NTT 出版、2001年.
  • 森田 果『実証分析入門』日本評論社、2014年.
  • 宋財泫 (Jaehyun Song)・矢内勇生 (Yuki Yanai)「私たちのR: ベストプラクティスの探究」
  • 土井翔平(北海道大学公共政策大学院)「Rで計量政治学入門」
  • 矢内勇生(高知工科大学)授業一覧
  • 浅野正彦, 矢内勇生.『Rによる計量政治学』オーム社、2018年
  • 浅野正彦, 中村公亮.『初めてのRStudio』オーム社、2018年
  • Winston Chang, R Graphics Cookbook, O’Reilly Media, 2012.
  • Kieran Healy, DATA VISUALIZATION, Princeton, 2019
  • Kosuke Imai, Quantitative Social Science: An Introduction, Princeton University Press, 2017