Quantify relation between two numerical variables (interval or ratio scale)
\(-1 \leq r \leq 1\) indicates strength (effect size) and direction
Obtain significance of correlation
data(iris) # load iris data set
cor(iris$Sepal.Length, iris$Petal.Length) # provides r
[1] 0.87175
cor.test(iris$Sepal.Length, iris$Petal.Length) # provides statistical test
Pearson's product-moment correlation
data: x and y
t = 21.6, df = 148, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.82704 0.90551
sample estimates:
cor
0.87175
Correlation for ordinal data: Spearman \(\rho\)
# Also when residuals not normally distributed
cor.test(iris$Sepal.Length, iris$Petal.Length, method='spearman')
Spearman's rank correlation rho
data: x and y
S = 66429, p-value <2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8819
# Similar result
cor.test(rank(iris$Sepal.Length), rank(iris$Petal.Length))$estimate
cor
0.8819
Reporting results
The Pearson correlation between the sepal length and petal length was positive with \(r = 0.87\), \(df = 148\), \(p_{two-tailed} < 0.001\).
Note that the number of degrees of freedom for a correlation is: \(N - 2\)
To assess relationship between numerical dependent variable and one (simple regression) or more (multiple regression) quantitative or categorical predictor variables
Measures impact of each individual variable on dependent variable, while controlling for other variables in the model
Note that regression is equivalent to ANOVA, but the focus is different: relation between numerical variables vs. group comparisons
Linear regression: formula
Linear regression captures a linear relationship between dependent variable (DV) and independent variable(s) (IV)
Regression formula: \(y = \beta_0 + \beta_1 x + \epsilon\)
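As a minimal illustration of this formula in R (the model name m.simple is only used for this sketch; it predicts petal length from sepal length, as in the correlation example above):
m.simple <- lm(Petal.Length ~ Sepal.Length, data=iris) # beta_0: intercept, beta_1: slope
coef(m.simple) # estimated intercept and slope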
Interpretation of intercept with centered predictor
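The coefficients below can be reproduced along these lines (a sketch: Petal.Area is a derived petal-area variable assumed to be constructed on an earlier slide, and m0.c is the model name used in the AIC comparison further down):
iris$Sepal.Length.c <- iris$Sepal.Length - mean(iris$Sepal.Length) # centered predictor
m0.c <- lm(Petal.Area ~ Sepal.Length.c, data=iris)
coef(m0.c)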
(Intercept) Sepal.Length.c
2.90 2.44
For a centered sepal length of 0 (an average sepal length of about 5.8), the predicted petal area is: 2.44 \(\times\) 0 + 2.9 = 2.9
The estimate for the intercept is now more natural: it reflects the value of the dependent variable (petal area) when the predictor (sepal length) is at its average value!
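The multiple regression summarized below adds centered sepal width; a sketch (msummary() is from library(mosaic), as used later in these slides; summary() shows the same coefficient table):
iris$Sepal.Width.c <- iris$Sepal.Width - mean(iris$Sepal.Width) # centered predictor
m1 <- lm(Petal.Area ~ Sepal.Length.c + Sepal.Width.c, data=iris)
msummary(m1)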
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8970 0.0928 31.23 < 2e-16 ***
Sepal.Length.c 2.3757 0.1132 20.99 < 2e-16 ***
Sepal.Width.c -1.0285 0.2150 -4.78 4.1e-06 ***
Residual standard error: 1.14 on 147 degrees of freedom
Multiple R-squared: 0.771, Adjusted R-squared: 0.768
F-statistic: 247 on 2 and 147 DF, p-value: <2e-16
Note that we use centered variables to facilitate interpretation: when both sepal length and width are at their average value, the predicted petal area equals 2.897
For a centered sepal length of about -0.7 (sepal length of 5.1) and a centered sepal width of about 0.4 (sepal width of 3.5), the predicted (fitted) petal area is: 2.38 \(\times\) -0.7 + -1.03 \(\times\) 0.4 + 2.9 = 0.68
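The model comparison table below contrasts the single-predictor model with the two-predictor model; a sketch assuming the model names m0.c and m1 from the AIC comparison further down:
anova(m0.c, m1) # does adding Sepal.Width.c significantly reduce the residual sum of squares?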
Analysis of Variance Table
Model 1: Petal.Area ~ Sepal.Length.c
Model 2: Petal.Area ~ Sepal.Length.c + Sepal.Width.c
Res.Df RSS Df Sum of Sq F Pr(>F)
1 148 219
2 147 190 1 29.5 22.9 4.1e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m0.c) - AIC(m1) # AIC decrease >= 2: more complex model should be used
[1] 19.703
Evidence ratio: the model with lower AIC is \(e^{\frac{\Delta AIC}{2}}\) times more likely than the model with higher AIC (\(e^{9.85} \approx\) 19,000)
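The evidence ratio can be checked directly, and the next comparison adds the interaction between the two predictors; a sketch (m2 is the model name used in the AIC comparison below):
exp((AIC(m0.c) - AIC(m1)) / 2) # evidence ratio, approximately 19,000
m2 <- lm(Petal.Area ~ Sepal.Length.c * Sepal.Width.c, data=iris) # includes the interaction
anova(m1, m2)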
Analysis of Variance Table
Model 1: Petal.Area ~ Sepal.Length.c + Sepal.Width.c
Model 2: Petal.Area ~ Sepal.Length.c * Sepal.Width.c
Res.Df RSS Df Sum of Sq F Pr(>F)
1 147 190
2 146 180 1 9.82 7.97 0.0054 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m1) - AIC(m2) # AIC decrease >= 2: more complex model should be used
[1] 5.9696
The more complex model is necessary (\(e^\frac{6}{2} \approx\) 20 times as likely)
Model comparison when including an interaction
Note that a model with an interaction term (A:B) should be compared to the best model without that term
This is the model including A + B, if both terms are significant (determined via model comparison)
This is the model including A, if A is significant and B is not
This is the model including B, if B is significant and A is not
This is the model without A and B, if neither is significant
It is important to get this comparison right: when neither A nor B is significant, the interaction model may appear to improve on the model including A + B, while it is not an improvement over the model without A and B
In that case one should stick with the model without A, B, or their interaction
For a centered sepal length of about -0.7 (sepal length of 5.1) and a centered sepal width of about 0.4 (sepal width of 3.5), the predicted (fitted) petal area is: 2.33 \(\times\) -0.7 + -0.87 \(\times\) 0.4 + 0.84 \(\times\) -0.7 \(\times\) 0.4 + 2.93 = 0.53
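The summary below is for a model that adds Species as a categorical predictor; a sketch (the coefficient names suggest versicolor is the reference level, so the releveling here is an assumption; m5 is the name used in the diagnostics further down):
iris$Species <- relevel(iris$Species, ref='versicolor') # assumed coding: versicolor as reference level
m5 <- lm(Petal.Area ~ Sepal.Width.c + Species, data=iris)
msummary(m5)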
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.150 0.104 30.38 < 2e-16 ***
Sepal.Width.c 1.008 0.160 6.32 3.1e-09 ***
Speciessetosa -3.341 0.168 -19.85 < 2e-16 ***
Speciesvirginica 2.582 0.135 19.06 < 2e-16 ***
Residual standard error: 0.657 on 146 degrees of freedom
Multiple R-squared: 0.924, Adjusted R-squared: 0.922
F-statistic: 589 on 3 and 146 DF, p-value: <2e-16
Note that the estimate for Sepal.Width.c is now positive: including Species controls for the fact that setosa irises have the highest sepal width but the lowest petal area
Homoscedasticity of error variance
plot(fitted(m5), resid(m5))
Clear heteroscedasticity (more variance for larger fitted values)
Non-constant variance test
ncvTest(m5) # from library(car)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 32.18, Df = 1, p = 1.41e-08
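Given this clear heteroscedasticity, a common remedy is a power transformation of the dependent variable; a sketch of how the model summarized below may have been obtained (\(\lambda = 0.3\), as mentioned in the reporting example later; Petal.Area.0.3 and m6 are the names used on these slides, while Petal.Area itself is assumed to exist):
iris$Petal.Area.0.3 <- iris$Petal.Area^0.3 # power-transformed dependent variable
m6 <- lm(Petal.Area.0.3 ~ Sepal.Width.c + Species, data=iris)
msummary(m6)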
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4012 0.0130 108.01 < 2e-16 ***
Sepal.Width.c 0.1360 0.0200 6.81 2.4e-10 ***
Speciessetosa -0.8642 0.0211 -41.04 < 2e-16 ***
Speciesvirginica 0.2848 0.0169 16.80 < 2e-16 ***
Residual standard error: 0.0823 on 146 degrees of freedom
Multiple R-squared: 0.97, Adjusted R-squared: 0.969
F-statistic: 1.56e+03 on 3 and 146 DF, p-value: <2e-16
ncvTest(m6)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.48356, Df = 1, p = 0.487
No heteroscedasticity per explanatory variable
ncvTest(m6, ~ Sepal.Width.c)
Non-constant Variance Score Test
Variance formula: ~ Sepal.Width.c
Chisquare = 0.63821, Df = 1, p = 0.424
ncvTest(m6, ~ Species)
Non-constant Variance Score Test
Variance formula: ~ Species
Chisquare = 0.31587, Df = 2, p = 0.854
Normally distributed residuals
shapiro.test(resid(m6))$p.value
[1] 0.65129
qqnorm(resid(m6))
qqline(resid(m6))
Question 5
Linear regression: diagnostics
Model building and hypothesis testing
With a clear hypothesis, you should only fit the model reflecting the hypothesis
For exploratory research (without a clear hypothesis), model comparison is necessary to determine the optimal model
Stepwise model comparison may be problematic, as the importance of additional predictors is assessed when other predictors are already in the model and some variables may be correlated
A hypothesis-testing analysis may also be followed by an exploratory analysis, but make this explicit in your report
My approach
For exploratory models, I generally start from the simpler model and then build a more complex model
The advantage of this forward selection approach is that it also builds up your knowledge of the model
But backwards selection (starting with the most complex model) has a lower risk of missing out on important variables
In practice, I tend to combine both approaches by also assessing if variables excluded at an earlier stage need to be included after other variables have been added
Best model
Using model comparison in an exploratory analysis, we obtain the best model, which only includes the two selected predictors (referred to as bestmodel below)
\(\eta^2_p\): 0.01 (small), 0.06 (medium), > 0.14 (large); however, I generally tend to interpret effect sizes in terms of actual influence on the dependent variable
The effects of individual predictors can also be compared directly when all numerical predictors are \(z\)-transformed
The effect size of the full model is the adjusted \(R^2\): 0.9714 (shown in model summary)
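A sketch of how \(\eta^2_p\) per predictor can be computed from the model's ANOVA decomposition (assuming the fitted model object is called bestmodel, the name used further down; note that anova() uses sequential sums of squares):
aov_tab <- anova(bestmodel)
ss <- aov_tab[['Sum Sq']]
eta_p2 <- ss[-length(ss)] / (ss[-length(ss)] + ss[length(ss)]) # SS_effect / (SS_effect + SS_residual)
names(eta_p2) <- rownames(aov_tab)[-nrow(aov_tab)]
round(eta_p2, 2)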
Intermezzo: visualizing a three-way interaction
tmp <- lm(Petal.Area.0.3 ~ Species * Sepal.Length.c * Sepal.Width.c, data=iris) # model used for demonstration purposes
par(mfrow=c(1,3))
visreg2d(tmp, 'Sepal.Length.c', 'Sepal.Width.c', cond=list(Species='versicolor'), plot.type='persp', main='versicolor', nn=50)
visreg2d(tmp, 'Sepal.Length.c', 'Sepal.Width.c', cond=list(Species='setosa'), plot.type='persp', main='setosa', nn=50)
visreg2d(tmp, 'Sepal.Length.c', 'Sepal.Width.c', cond=list(Species='virginica'), plot.type='persp', main='virginica', nn=50)
Model criticism for final model
Idea: check outlier characteristics and refit model with large outliers excluded to verify your effects are not “carried” by these outliers
Important: no a priori exclusion of outliers without a clear reason
A good reason is not that the value is over 2.5 SD above the mean
A good reason would be, for example, that a response is faster than physically possible (so it must have been a lucky guess)
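A sketch of the trimming step used below (the 2-SD threshold follows the reporting example later in these slides; iris2 and bestmodel are the names used on these slides):
std_res <- scale(resid(bestmodel))[, 1] # standardized residuals of the untrimmed model
iris2 <- iris[abs(std_res) < 2, ] # exclude observations with |standardized residual| >= 2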
bestmodel2 <- lm(Petal.Area.0.3 ~ Sepal.Width.c * Species, data=iris2)
msummary(bestmodel2) # only small changes in p-values: same pattern as before
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4228 0.0129 110.25 < 2e-16 ***
Sepal.Width.c 0.2549 0.0317 8.04 3.8e-13 ***
Speciessetosa -0.8666 0.0186 -46.50 < 2e-16 ***
Speciesvirginica 0.2629 0.0162 16.24 < 2e-16 ***
Sepal.Width.c:Speciessetosa -0.1842 0.0410 -4.49 1.5e-05 ***
Sepal.Width.c:Speciesvirginica -0.0578 0.0436 -1.33 0.19
Residual standard error: 0.0666 on 137 degrees of freedom
Multiple R-squared: 0.98, Adjusted R-squared: 0.98
F-statistic: 1.37e+03 on 5 and 137 DF, p-value: <2e-16
Distribution of residuals before and after model criticism
Question 6
Alternative to identifying outliers: influence plot
# x-axis: hat values (leverage) - the potential influence of an observation on its own fitted value
# y-axis: standardized residuals
# circle size: Cook's distance - effect of removing the observation on coefficients & fitted values
# influential outliers: observations with large absolute residuals and large Cook's distance
outliers <- influencePlot(bestmodel) # from library(car)
Validation: bootstrap sampling
summary(bestmodel.boot <- Boot(bestmodel2, R=1000)) # from library(car), 1000 bootstrap samples
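The bootstrap object can also be used to extract confidence intervals for the coefficients (car provides a confint() method for objects returned by Boot(); the call below is a sketch using its defaults):
confint(bestmodel.boot) # bootstrap confidence intervals for the coefficients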
In our exploratory analysis, we determined the optimal model via model comparison on the basis of AIC. To counter heteroscedasticity, we transformed the dependent variable using a power transformation (\(\lambda = 0.3\), determined using a Box-Cox profile-likelihood procedure). After determining the optimal model specification, we applied model criticism by excluding all observations with absolute residuals larger than 2 standard deviations above the mean (2% of the observations; see Baayen, 2008). The final model (see Table 1) reflects the results on the basis of this trimmed dataset. Table 1 shows that the variables of interest are significant at a significance threshold \(\alpha\) of 0.05 (all \(|t|\)'s > 2, \(p\) < 0.05), together with their effect sizes (\(\eta^2_p\)). The interpretation of the effects is as follows: … [take note of the interpretation of interactions]. The residuals followed a normal distribution and showed no autocorrelation or heteroscedasticity. Table 1 also shows the 95% confidence intervals, which were determined via bootstrap sampling. The adjusted \(R^2\) of the model was 0.98.
Logistic regression
Dependent variable is binary (1: success, 0: failure), not continuous
Transform to continuous variable via log odds: \(\log(\frac{p}{1-p})\) = logit\((p)\)
This happens automatically in R by using glm() (or lrm() from library(rms)) with family="binomial"
Transforming the dependent variable in this way yields a generalized linear model
Coefficients are interpreted as logits (log odds of success); in R, a logit x can be converted back to a probability with plogis(x)
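A sketch of the model behind the output below (the binary variable SpeciesVirginica also appears in the lrm() call further down, but its construction here and the model name m.log are assumptions):
iris$SpeciesVirginica <- as.numeric(iris$Species == 'virginica') # 1: virginica, 0: other species
m.log <- glm(SpeciesVirginica ~ Petal.Area, data=iris, family='binomial')
plogis(predict(m.log, newdata=data.frame(Petal.Area=4))) # convert the predicted logit to a probability
summary(m.log)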
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -19.49 5.24 -3.72 0.00020 ***
Petal.Area 4.88 1.33 3.67 0.00024 ***
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 190.954 on 149 degrees of freedom
Residual deviance: 22.333 on 148 degrees of freedom
AIC: 26.33
Number of Fisher Scoring iterations: 9
Logistic Regression Model
lrm(formula = SpeciesVirginica ~ Petal.Area, data = iris)
                       Model Likelihood     Discrimination    Rank Discrim.
                             Ratio Test            Indexes          Indexes
Obs           150    LR chi2     168.62     R2        0.938    C        0.997
 0            100    d.f.             1     R2(1,150) 0.673    Dxy      0.994
 1             50    Pr(> chi2) <0.0001     R2(1,100) 0.813    gamma    0.995
max |deriv| 3e-08                           Brier     0.025    tau-a    0.445
Coef S.E. Wald Z Pr(>|Z|)
Intercept -19.4916 5.2409 -3.72 0.0002
Petal.Area 4.8802 1.3296 3.67 0.0002
Index of concordance \(C\) is a good measure of discrimination (0.5: no discrimination; 0.7 - 0.8: acceptable; 0.8 - 0.9: excellent; > 0.9 outstanding)
Logistic regression assumptions
Independent observations (✓)
Relation between logit and quantitative predictors is linear
Function crPlot from library(car) can be used to assess this
No strong multicollinearity between predictors
Function vif from library(car) can be used to assess this
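A sketch of these two checks, using the hypothetical m.log model from the earlier sketch:
library(car)
crPlot(m.log, variable='Petal.Area') # component + residual plot: is the logit linear in the predictor?
# vif(m.log) # variance inflation factors; only meaningful with two or more predictors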
Final remarks about logistic regression
Overfitting can be evaluated with bootstrap analysis
Outliers can be identified with the function influencePlot()
Instead of having a dependent variable with value 0 or 1, a dependent variable containing the number of successes and the number of failures can be used
For example, rather than having a column “correct” in a dataframe which indicates if someone answered a specific question correctly (1) or not (0), it may contain the number of correctly and incorrectly answered questions on the whole text in two columns “ncorrect” and “nincorrect”
The model specification then becomes: glm(cbind(ncorrect,nincorrect) ~ …) as opposed to glm(correct ~ …)
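A minimal sketch of this aggregated specification with hypothetical data (all column names and values below are made up):
d <- data.frame(ncorrect = c(8, 5, 9, 4), nincorrect = c(2, 5, 1, 6), age = c(21, 34, 28, 45))
m.agg <- glm(cbind(ncorrect, nincorrect) ~ age, data=d, family='binomial')
summary(m.agg)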
Question 7
Alternative regression models
If the dependent variable consists of count data, Poisson regression is necessary
This is not covered in this lecture, but you can fit these types of models with glm() using family="poisson"
If the dependent variable has more than two levels, multinomial (polytomous) logistic regression may be used
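A minimal sketch of the Poisson call pattern mentioned above (the data frame and variable names are hypothetical):
d <- data.frame(errors = c(0, 2, 1, 4, 3), age = c(21, 34, 28, 45, 52)) # hypothetical count data
m.pois <- glm(errors ~ age, data=d, family='poisson')
summary(m.pois)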