Statistiek I

(Multiple) linear regression

Martijn Wieling

Question 1: last lecture

Last lecture

  • Correlation
  • Linear regression with a single numerical independent variable
    • Formula: \(y = \beta_0 + \beta_1 \times x\)
    • Example: m <- lm(ratingNL ~ BAC, data=lls)
      • Dependent variable: ratingNL
      • Independent variable: BAC
    • Interpretation of intercept and slope (influence centering and \(z\)-transformation)

This lecture

  • Part I: Linear regression with a single nominal independent variable
  • Part II: Multiple linear regression with multiple independent variables
  • Part III: Introduction to multiple linear regression with an interaction

Data: Lowlands Science data from last lecture

  • RQ: Does alcohol (BAC) influence L1 vs. L2 language proficiency differently?
    • Hypothesis (alternative): BAC has a stronger negative effect on L1 than L2
  • BAC measurements and L1 (Dutch) and L2 (English) speech samples were collected for about 90 native Dutch speakers
  • Dutch speech samples were rated by native Dutch speakers (at Lowlands)
  • English speech samples were rated by native American English speakers (online)
  • Recall from lecture and lab of last week:
    • Significant negative effect of BAC for Dutch
    • Small (non-significant) negative effect of BAC for English
  • Here we assess whether this really represents a difference
    • Note that we again (incorrectly!) ignore variability associated with individual speakers (each speaker received two ratings)
  • We start simple, by first comparing ratings across the two languages

Part I: Comparing group means using regression

  • Regression allows for the inclusion of nominal variables
  • Behind the scenes, several dummy variables (binary: value 0 or 1) are created for this
  • Number of dummy variables created equals the number of levels of the nominal variable minus 1
    • At most a single dummy variable associated with a nominal variable can be set to 1
    • Intercept represents one of the levels (all dummy variables are 0) = reference level
  • Example: for modeling 3 colors (blue, green, red), 2 dummy variables are needed
    • colorGreen equals 1 when green, 0 otherwise
    • colorRed equals 1 when red, 0 otherwise
    • When both colorGreen and colorRed equal 0, color equals blue
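
The treatment (dummy) coding described above can also be sketched outside of R; a minimal Python illustration (the `dummy_code` helper is hypothetical, purely to show the coding scheme):

```python
# Treatment (dummy) coding for a 3-level nominal variable (blue, green, red):
# the reference level (blue) is the one for which all dummy variables are 0.
def dummy_code(color):
    return {"colorGreen": int(color == "green"),
            "colorRed": int(color == "red")}

print(dummy_code("blue"))   # {'colorGreen': 0, 'colorRed': 0}
print(dummy_code("green"))  # {'colorGreen': 1, 'colorRed': 0}
print(dummy_code("red"))    # {'colorGreen': 0, 'colorRed': 1}
```

Note that at most one dummy variable is 1 at a time, exactly as described above.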

Comparing group means using regression

  • Hypotheses:
    • \(H_0\): no differences between group means
    • \(H_a\): group means differ
    • Specifically for comparing two groups \(a\) and \(b\)
      • \(H_0\): \(\mu_a = \mu_b\), \(H_a\): \(\mu_a \neq \mu_b\) (for two-tailed hypothesis)

First example: comparing Dutch vs English proficiency

m <- lm(rating ~ lang, data=lls) # (Intercept) is included by default
summary(m) # significant difference between EN and NL

Call:
lm(formula = rating ~ lang, data = lls)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9843 -0.4177  0.0753  0.4134  1.6346 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.0128     0.0689   -0.19   0.8525   
langNL        0.2939     0.0980    3.00   0.0031 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.665 on 182 degrees of freedom
Multiple R-squared:  0.0471,    Adjusted R-squared:  0.0418 
F-statistic: 8.99 on 1 and 182 DF,  p-value: 0.0031

Question 2

Interpretation

summary(m)$coef 
             Estimate Std. Error  t value  Pr(>|t|)
(Intercept) -0.012838   0.068947 -0.18621 0.8524890
langNL       0.293941   0.098040  2.99818 0.0030957
  • Regression formula: \(y = \beta_0 + \beta_1 \times x \implies\) rating = -0.01 + 0.29 * langNL
    • For EN (langNL == 0) fitted rating of: -0.01 + 0.29 \(\times\) 0 = -0.01 (= Intercept)
    • For NL (langNL == 1) fitted rating of: -0.01 + 0.29 \(\times\) 1 = 0.28

Examining dummy variables and releveling

lls$lang <- as.factor(lls$lang) # ensuring it is a factor (nominal) variable
contrasts(lls$lang) # shows the levels mapped to the single dummy variable (alphabetic)
   NL
EN  0
NL  1
  • We can change the reference level using the function relevel
lls$lang <- relevel(lls$lang, 'NL') # set reference level to NL
contrasts(lls$lang) # shows changed contrasts
   EN
NL  0
EN  1

Model using releveled variable

m2 <- lm(rating ~ lang, data=lls) 
summary(m2)$coef
            Estimate Std. Error t value   Pr(>|t|)
(Intercept)  0.28110    0.06970  4.0330 8.0925e-05
langEN      -0.29394    0.09804 -2.9982 3.0957e-03
  • Regression formula: \(y = \beta_0 + \beta_1 \times x \implies\) rating = 0.28 + -0.29 * langEN
    • For EN (langEN == 1) fitted rating of: 0.28 + -0.29 \(\times\) 1 = -0.01
    • For NL (langEN == 0) fitted rating of: 0.28 + -0.29 \(\times\) 0 = 0.28 (= Intercept)

Question 3

Side-by-side comparison: identical predictions

summary(m)$coef
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.013      0.069   -0.19   0.8525
langNL         0.294      0.098    3.00   0.0031
visreg(m)

summary(m2)$coef
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.281     0.0697    4.03 8.09e-05
langEN        -0.294     0.0980   -3.00 3.10e-03
visreg(m2)

Effect size

summary(m)$r.squared
[1] 0.0471
  • Effect size is small to medium

Second example: comparing males and females

m3 <- lm(rating ~ sex, data=lls)
summary(m3) # significant difference between males and females

Call:
lm(formula = rating ~ sex, data = lls)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7478 -0.4059  0.0715  0.4830  1.5074 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.2218     0.0657    3.38   0.0009 ***
sexM         -0.2080     0.1003   -2.07   0.0395 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.673 on 182 degrees of freedom
Multiple R-squared:  0.0231,    Adjusted R-squared:  0.0177 
F-statistic:  4.3 on 1 and 182 DF,  p-value: 0.0395

Interpretation

summary(m3)$coef 
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.222     0.0657    3.38 0.000898
sexM          -0.208     0.1003   -2.07 0.039455
  • Regression formula: \(y = \beta_0 + \beta_1 \times x \implies\) rating = 0.22 + -0.21 * sexM
    • For (sexM == 0) fitted rating of: 0.22 + -0.21 \(\times\) 0 = 0.22 (= Intercept)
    • For (sexM == 1) fitted rating of: 0.22 + -0.21 \(\times\) 1 = 0.01

Releveling

lls$sex <- as.factor(lls$sex) # ensuring it is a factor (nominal) variable
contrasts(lls$sex) # shows the levels mapped to the single dummy variable (alphabetic)
  M
F 0
M 1
  • We now change the reference level again
lls$sex <- relevel(lls$sex, 'M') # set reference level to M
contrasts(lls$sex) # shows changed contrasts
  F
M 0
F 1

Model using releveled variable: identical predictions

m4 <- lm(rating ~ sex, data=lls) 
summary(m4)$coef
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.0138     0.0757   0.183   0.8551
sexF          0.2080     0.1003   2.074   0.0395
  • Regression formula: \(y = \beta_0 + \beta_1 \times x \implies\) rating = 0.01 + 0.21 * sexF
    • For (sexF == 1) fitted rating of: 0.01 + 0.21 \(\times\) 1 = 0.22
    • For (sexF == 0) fitted rating of: 0.01 + 0.21 \(\times\) 0 = 0.01 (= Intercept)

Visualization and effect size

visreg(m4) 

summary(m4)$r.squared # small effect size
[1] 0.0231

Third example: comparing educational levels

lls$edu = relevel(lls$edu, 'OTH') # set OTH as reference level
summary(m5 <- lm(rating ~ edu, data=lls)) # fit model and show summary in one line

Call:
lm(formula = rating ~ edu, data = lls)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0038 -0.4183  0.0858  0.4763  1.4945 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.1221     0.0992   -1.23  0.21999    
eduHBO        0.1489     0.1395    1.07  0.28731    
eduWO         0.4227     0.1200    3.52  0.00054 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.658 on 181 degrees of freedom
Multiple R-squared:  0.0715,    Adjusted R-squared:  0.0613 
F-statistic: 6.97 on 2 and 181 DF,  p-value: 0.00121

Interpretation

summary(m5)$coef
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.12      0.099    -1.2  0.21999
eduHBO          0.15      0.140     1.1  0.28731
eduWO           0.42      0.120     3.5  0.00054
  • Regression formula: rating = -0.12 + 0.15 * eduHBO + 0.42 * eduWO
    • rating for OTH: -0.12 + 0.15 \(\times\) 0 + 0.42 \(\times\) 0 = -0.12 (= Intercept)
    • rating for HBO: -0.12 + 0.15 \(\times\) 1 + 0.42 \(\times\) 0 = 0.03
    • rating for WO: -0.12 + 0.15 \(\times\) 0 + 0.42 \(\times\) 1 = 0.30  
  • Note that WO has a significantly higher score than OTH, but what about WO vs. HBO?

Pairwise comparisons

  • If you want to statistically compare all levels of a nominal variable, you effectively need to run multiple statistical tests (i.e. set the reference level to each of the levels in turn)
  • Using multiple tests/comparisons risks finding significance through sheer chance
  • Suppose we need to run three comparisons, always using \(\alpha\) = 0.05
    • Chance of finding one or more significant values (family-wise error rate) is: \(1 - (1 - \alpha)^3\) = \(1 - 0.95^3 = 0.143\) (nearly three times as high as we’d like!)
  • To guarantee a family-wise error rate of 0.05, we should correct for the number of tests: the Bonferroni correction does this
    • This correction effectively multiplies the original \(p\)-values by the number of tests (capped at a maximum of 1)
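
The family-wise error rate and the Bonferroni correction are easy to verify directly; a minimal Python sketch (the `bonferroni` helper is illustrative only — in R you would use `p.adjust` or `emmeans`):

```python
# Family-wise error rate when running k tests at alpha = 0.05 each:
alpha, k = 0.05, 3
fwer = 1 - (1 - alpha) ** k
print(round(fwer, 3))  # 0.143

# Bonferroni correction: multiply each p-value by the number of tests, cap at 1
def bonferroni(p_values):
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

print(bonferroni([0.5, 0.4, 0.01]))  # only the third value stays below alpha
```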

Correction for multiple comparisons in R

library(emmeans)
emmeans(m5, list(pairwise ~ edu), adjust='bonferroni')
$`emmeans of edu`
 edu emmean    SE  df lower.CL upper.CL
 OTH -0.122 0.099 181    -0.32     0.07
 HBO  0.027 0.098 181    -0.17     0.22
 WO   0.301 0.068 181     0.17     0.43

Confidence level used: 0.95 

$`pairwise differences of edu`
 1         estimate    SE  df t.ratio p.value
 OTH - HBO    -0.15 0.140 181  -1.100  0.8600
 OTH - WO     -0.42 0.120 181  -3.500  <.0001
 HBO - WO     -0.27 0.119 181  -2.300  0.0700

P value adjustment: bonferroni method for 3 tests 

Question 4

Visualization and effect size

visreg(m5)

summary(m5)$r.squared # small to medium effect size
[1] 0.072

Part II: Multiple regression

  • Multiple regression captures a linear relationship between dependent variable (DV) and independent variables (IVs)
    • Regression formula \(y = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n + \epsilon\)
      • \(y\): dependent variable (= predicted or fitted values), \(x_i\): independent variables
      • \(\beta_0\): intercept (value of \(y\) when all \(x_i's\) are \(0\))
      • \(\beta_i\): influence (slope) of \(x_i\) on \(y\)
      • \(\epsilon\): errors (residuals)
  • In a multiple regression model, the effect (estimate) of each IV is determined while controlling for the effect of all other variables in the model
  • Unless there are interactions (covered later), each IV’s effect is assumed to be independent of the other IVs

Our first multiple regression model

summary(m6 <- lm(rating ~ lang + sex, data=lls)) # ref. levels lang=EN, sex=F

Call:
lm(formula = rating ~ lang + sex, data = lls)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8657 -0.3810  0.0659  0.3684  1.6524 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.0764     0.0803    0.95   0.3424   
langNL        0.2936     0.0971    3.02   0.0029 **
sexM         -0.2075     0.0981   -2.12   0.0358 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.66 on 181 degrees of freedom
Multiple R-squared:  0.0701,    Adjusted R-squared:  0.0598 
F-statistic: 6.82 on 2 and 181 DF,  p-value: 0.0014

Whole model effect size: adjusted \(R^2\)

  • We have used \(R^2\) as a measure of effect size in simple regression
  • Adding additional independent variables will generally increase the \(R^2\)
  • Adjusted \(R^2\) corrects for the number of predictors in the model
  • A large difference between the \(R^2\) value and the adjusted \(R^2\) value indicates the model contains independent variables which are not useful
  • Adjusted \(R^2\) value is the best effect size measure (of a multiple regression model)
summary(m6)$adj.r.squared # model explains about 6% of the variation in the ratings
[1] 0.06

Model assumptions for multiple regression

  • Identical to linear regression
    • Distribution of residuals is assumed to follow a normal distribution
    • Variance of residuals should be equal along the regression line (homoscedastic)
  • Additional requirement: no collinearity among multiple independent variables

Collinearity

  • When including multiple variables, these may be correlated
  • High correlations are problematic, as the regression estimates become harder to determine reliably
  • Collinearity can be measured via the variance inflation factor (VIF)
    • \(VIF_i = \frac{1}{1-R^2_i}\), with \(R^2_i\) from the model with \(x_i\) as DV and the other \(x's\) as the IVs
    • VIF \(\geq\) 10 is problematic
library(car)
vif(m6) # VIF OK
lang  sex 
   1    1 
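
The VIF formula itself can be checked with a few lines of code; a small Python sketch (illustrative \(R^2\) values, not computed from lls):

```python
# VIF_i = 1 / (1 - R^2_i), where R^2_i is the R-squared of a regression
# with x_i as the DV and the other independent variables as the IVs.
def vif(r_squared):
    return 1 / (1 - r_squared)

print(round(vif(0.0), 2))  # 1.0  -> no collinearity (as for lang and sex in m6)
print(round(vif(0.9), 2))  # 10.0 -> problematic (VIF >= 10)
```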

Model comparison: both predictors necessary?

anova(m, m6) 
Analysis of Variance Table

Model 1: rating ~ lang
Model 2: rating ~ lang + sex
  Res.Df  RSS Df Sum of Sq    F Pr(>F)  
1    182 80.5                           
2    181 78.5  1      1.94 4.48  0.036 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The additional model complexity is supported by the improved fit to the data

Model comparison vs. summary

summary(m6)$coef
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.076      0.080    0.95   0.3424
langNL         0.294      0.097    3.02   0.0029
sexM          -0.208      0.098   -2.12   0.0358
anova(m, m6)
Analysis of Variance Table

Model 1: rating ~ lang
Model 2: rating ~ lang + sex
  Res.Df   RSS Df Sum of Sq     F Pr(>F)  
1    182 80.46                            
2    181 78.52  1     1.942 4.475 0.0358 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The two \(p\)-values are identical whenever the models differ by a single degree of freedom (Df equals 1)

Comparing the strength of the effect of the two variables

summary(m6)$coef
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept)  0.076419  0.0802789  0.951919 0.3424074
langNL       0.293622  0.0971169  3.023386 0.0028626
sexM        -0.207524  0.0980954 -2.115528 0.0357529
  • Both predictors are nominal and have two levels, so estimates can be compared
  • lang is the more important predictor, as the absolute value of the langNL estimate is larger than that of sexM

Question 5

Interpreting the regression model: intercept

summary(m6)$coef
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.076      0.080    0.95   0.3424
langNL         0.294      0.097    3.02   0.0029
sexM          -0.208      0.098   -2.12   0.0358
  • Females (sexM == 0) who speak English (langNL == 0) have an expected rating of 0.076

Interpreting the regression model: slopes (1)

summary(m6)$coef 
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.076      0.080    0.95   0.3424
langNL         0.294      0.097    3.02   0.0029
sexM          -0.208      0.098   -2.12   0.0358
  • Summary shows \(\beta_\textrm{langNL}\) = 0.29
    • When the language changes from English to Dutch, the rating increases by 0.29
  • Summary shows \(\beta_\textrm{sexM}\) = -0.21
    • Males have a rating that is 0.21 lower than females
  • Fitted (predicted) value of the model can be determined using regression formula
    • \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \implies\) rating = 0.08 + 0.29*langNL + -0.21*sexM
    • langNL equals 1 for NL and 0 for EN, and sexM equals 1 for M, and 0 for F

Interpreting the regression model: slopes (2)

summary(m6)$coef 
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.076      0.080    0.95   0.3424
langNL         0.294      0.097    3.02   0.0029
sexM          -0.208      0.098   -2.12   0.0358
  • rating = 0.08 + 0.29*langNL + -0.21*sexM
    • For lang EN and sex F: 0.08 + 0.29 \(\times\) 0 + -0.21 \(\times\) 0 = 0.08 (= Intercept)
    • For lang EN and sex M: 0.08 + 0.29 \(\times\) 0 + -0.21 \(\times\) 1 = -0.13
    • For lang NL and sex F: 0.08 + 0.29 \(\times\) 1 + -0.21 \(\times\) 0 = 0.37
    • For lang NL and sex M: 0.08 + 0.29 \(\times\) 1 + -0.21 \(\times\) 1 = 0.16
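
The four fitted values above can be verified programmatically; a short Python sketch using the rounded estimates (the `fitted` helper is illustrative, not the model itself):

```python
# Fitted values for m6: rating = 0.08 + 0.29*langNL + -0.21*sexM
# (rounded estimates copied from summary(m6)$coef)
def fitted(langNL, sexM, b0=0.08, b_lang=0.29, b_sex=-0.21):
    # dummy variables: langNL is 1 for NL / 0 for EN; sexM is 1 for M / 0 for F
    return round(b0 + b_lang * langNL + b_sex * sexM, 2)

print(fitted(0, 0))  # 0.08  (EN, F: the intercept)
print(fitted(0, 1))  # -0.13 (EN, M)
print(fitted(1, 0))  # 0.37  (NL, F)
print(fitted(1, 1))  # 0.16  (NL, M)
```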

Visualization

par(mfrow=c(1,2))
visreg(m6, 'lang')
visreg(m6, 'sex')

Multiple regression: combining nominal and numeric IVs

m7 <- lm(rating ~ lang + BAC.z, data=lls) # sex is excluded due to space restrictions on my slides 
summary(m7)

Call:
lm(formula = rating ~ lang + BAC.z, data = lls)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8354 -0.3981  0.0604  0.4021  1.6615 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.1431     0.0825   -1.74   0.0843 . 
langNL        0.2966     0.0963    3.08   0.0024 **
BAC.z        -0.3505     0.1265   -2.77   0.0062 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.653 on 181 degrees of freedom
Multiple R-squared:  0.0858,    Adjusted R-squared:  0.0757 
F-statistic:  8.5 on 2 and 181 DF,  p-value: 0.000298

Model comparison: both IVs necessary?

anova(m, m7) # Yes!
Analysis of Variance Table

Model 1: rating ~ lang
Model 2: rating ~ lang + BAC.z
  Res.Df  RSS Df Sum of Sq    F Pr(>F)   
1    182 80.5                            
2    181 77.2  1      3.27 7.67 0.0062 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Checking assumptions

vif(m7) # VIF OK
 lang BAC.z 
    1     1 
  • Collinearity is OK
  • Residuals are normally distributed and homoscedastic (not shown)

How good is the model: Adjusted \(R^2\)

summary(m7)$adj.r.squared 
[1] 0.0757
  • The model explains about 7.5% of the variation in the ratings

Which variable is most important?

summary(m7)$coef
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.14      0.082    -1.7   0.0843
langNL          0.30      0.096     3.1   0.0024
BAC.z          -0.35      0.127    -2.8   0.0062
  • Effect of BAC.z is largest, as each increase of 1 results in a decrease of rating by 0.35
    • Changing the language (lang) only changes the rating by 0.3

Question 6

Interpreting the regression model: intercept

summary(m7)$coef 
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.14      0.082    -1.7   0.0843
langNL          0.30      0.096     3.1   0.0024
BAC.z          -0.35      0.127    -2.8   0.0062
  • A person speaking English (langNL == 0) with an average BAC (BAC.z == 0) has an expected rating of -0.14

Interpreting the regression model: slopes (1)

summary(m7)$coef 
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.14      0.082    -1.7   0.0843
langNL          0.30      0.096     3.1   0.0024
BAC.z          -0.35      0.127    -2.8   0.0062
  • Summary shows \(\beta_\textrm{langNL}\) = 0.3
    • When the language changes from English to Dutch, the rating increases by 0.3
  • Summary shows \(\beta_\textrm{BAC.z}\) = -0.35
    • For every increase of BAC.z with 1 (standard deviation) the rating decreases by 0.35
  • Fitted (predicted) value of the model can be determined using regression formula
    • \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \implies\) rating = -0.14 + 0.30*langNL + -0.35*BAC.z
    • langNL equals 1 for NL and 0 for EN

Interpreting the regression model: slopes (2)

summary(m7)$coef 
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.14      0.082    -1.7   0.0843
langNL          0.30      0.096     3.1   0.0024
BAC.z          -0.35      0.127    -2.8   0.0062
  • rating = -0.14 + 0.30*langNL + -0.35*BAC.z
    • For lang EN and BAC.z equals 0: -0.14 + 0.30 \(\times\) 0 + -0.35 \(\times\) 0 = -0.14 (= Intercept)
    • For lang NL and BAC.z equals 2: -0.14 + 0.30 \(\times\) 1 + -0.35 \(\times\) 2 = -0.54
  • Note that the effects are independent and do not influence each other!

Visualization shows independence

visreg(m7,'BAC.z', by='lang', overlay=T) # overlay=T visualizes the lines in one graph

Part III: Multiple regression with an interaction

  • Multiple regression captures a linear relationship between dependent variable (DV) and independent variables (IVs)
  • An interaction indicates that the effect of one interacting IV on the DV depends on the value of the other IV
    • Regression formula \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{1,2} x_1 x_2 + \epsilon\)
    • \(y\): dependent variable (= predicted or fitted values), \(x_i\): independent variables, \(\epsilon\): resid.
      • \(\beta_0\): intercept (value of \(y\) when all \(x_i's\) are \(0\))
      • \(\beta_1\): influence (slope) of \(x_1\) on \(y\) when \(x_2\) equals 0
      • \(\beta_2\): influence (slope) of \(x_2\) on \(y\) when \(x_1\) equals 0
      • \(\beta_{1,2}\): interaction effect (slope) of \(x_1\) and \(x_2\)

Be careful when interpreting an interaction!

  • Many people interpret an interaction in regression incorrectly
  • In an interaction, the terms involved in the interaction cannot be interpreted by themselves anymore
    • They have to be interpreted jointly (i.e. always in the context of the other variable)
  • Visualization helps interpretation

Our first interaction: assessing our RQ

summary(m8 <- lm(rating ~ BAC.z * lang, data=lls))

Call:
lm(formula = rating ~ BAC.z * lang, data = lls)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7972 -0.3765  0.0637  0.3826  1.6101 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -0.1029     0.0974   -1.06    0.292  
BAC.z         -0.2421     0.1881   -1.29    0.200  
langNL         0.2236     0.1345    1.66    0.098 .
BAC.z:langNL  -0.1982     0.2544   -0.78    0.437  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.654 on 180 degrees of freedom
Multiple R-squared:  0.0889,    Adjusted R-squared:  0.0737 
F-statistic: 5.85 on 3 and 180 DF,  p-value: 0.000776

Question 7

Better than simpler model?

anova(m7, m8) # note that the order of terms in each model is irrelevant (A * B == B * A)
Analysis of Variance Table

Model 1: rating ~ lang + BAC.z
Model 2: rating ~ BAC.z * lang
  Res.Df  RSS Df Sum of Sq    F Pr(>F)
1    181 77.2                         
2    180 76.9  1     0.259 0.61   0.44
  • Our hypothesis was one-tailed, but the \(p\)-values of anova and summary are two-tailed
    • One-tailed \(p\)-value (as direction in line with hypothesis): \(\frac{0.44}{2} = 0.22\) (\(> \alpha\) of 0.05)
    • Not enough support for a significant difference between the negative BAC effect on L1 vs. L2
      • Despite our earlier finding that the effect was significant for Dutch, but not English
  • Next, we will learn how to interpret the (non-significant) interaction pattern

Interpreting the interaction numerically

             Estimate Std. Error t value Pr(>|t|)
(Intercept)     -0.10      0.097   -1.06    0.292
BAC.z           -0.24      0.188   -1.29    0.200
langNL           0.22      0.134    1.66    0.098
BAC.z:langNL    -0.20      0.254   -0.78    0.437
  • Regression formula \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{1,2} x_1 x_2 \implies\)
    rating = -0.10 + -0.24 * BAC.z + 0.22 * langNL + -0.20 * BAC.z * langNL
    • For BAC.z of 0 and lang EN: -0.10 + -0.24\(\times\)0 + 0.22\(\times\)0 + -0.20\(\times\)0\(\times\)0 = -0.10
    • For BAC.z of 0 and lang NL: -0.10 + -0.24\(\times\)0 + 0.22\(\times\)1 + -0.20\(\times\)0\(\times\)1 = 0.12
    • For BAC.z of 2 and lang NL: -0.10 + -0.24\(\times\)2 + 0.22\(\times\)1 + -0.20\(\times\)2\(\times\)1 = -0.76
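
The same arithmetic for the interaction model, as a short Python sketch (rounded estimates, illustrative only):

```python
# Fitted values for m8:
# rating = -0.10 + -0.24*BAC.z + 0.22*langNL + -0.20*BAC.z*langNL
def fitted(bac_z, langNL):
    # langNL is 1 for NL and 0 for EN; the last term is the interaction
    y = -0.10 + -0.24 * bac_z + 0.22 * langNL + -0.20 * bac_z * langNL
    return round(y, 2)

print(fitted(0, 0))  # -0.1  (EN, average BAC: the intercept)
print(fitted(0, 1))  # 0.12  (NL, average BAC)
print(fitted(2, 1))  # -0.76 (NL, BAC.z of 2)
```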

Interpreting the interaction logically

             Estimate Std. Error t value Pr(>|t|)
(Intercept)     -0.10      0.097   -1.06    0.292
BAC.z           -0.24      0.188   -1.29    0.200
langNL           0.22      0.134    1.66    0.098
BAC.z:langNL    -0.20      0.254   -0.78    0.437
  • For the EN lang, each unit increase of BAC.z changes rating by -0.24
  • For the NL lang, each unit increase of BAC.z changes rating by -0.24 + -0.20 = -0.44
  • In addition, the whole regression line for NL is shifted upwards by 0.22 (the langNL estimate)
  • So: the effect of BAC.z is more negative for L1 than L2 (but not significantly so)

Interpreting the interaction visually

             Estimate Std. Error t value Pr(>|t|)
(Intercept)     -0.10      0.097   -1.06    0.292
BAC.z           -0.24      0.188   -1.29    0.200
langNL           0.22      0.134    1.66    0.098
BAC.z:langNL    -0.20      0.254   -0.78    0.437
visreg(m8,"BAC.z",by="lang", overlay=TRUE) # visualization shows dependency of BAC.z and lang

Question 8

Recap

  • In this lecture, we’ve covered
    • Linear regression with a single nominal independent variable
    • Multiple linear regression with multiple independent variables
    • Introduction to multiple linear regression with an interaction
  • Next lecture: Multiple linear regression with interactions, Cronbach’s alpha and recap

Please evaluate this lecture!

Exam question

Questions?

Thank you for your attention!

 

https://www.martijnwieling.nl

m.b.wieling@rug.nl