Martijn Wieling

University of Groningen

- Introduction
- Recap: multiple regression
- Mixed-effects regression analysis: explanation
- Methodological issues
- Case-study: Lexical decision latencies (Baayen, 2008: 7.5.1)
- Conclusion

- Consider the following situation (taken from Clark, 1973):
- Mr. A and Mrs. B study reading latencies of verbs and nouns
- Each randomly selects 20 words and tests 50 subjects
- Mr. A finds (using a sign test) verbs to have faster responses
- Mrs. B finds nouns to have faster responses

- How is this possible?

- The problem is that Mr. A and Mrs. B disregard the (huge) variability in the words
- Mr. A included a difficult noun, but Mrs. B included a difficult verb
- Their sets of words do not constitute the complete population of nouns and verbs; their results are therefore limited to the specific words they selected

- This is known as the language-as-fixed-effect fallacy (LAFEF)
- Fixed-effect factors have a small number of repeatable levels
- Word is a random-effect factor (a non-repeatable random sample from a larger population)

- The LAFEF occurred frequently in linguistic research until the 1970s
- Many reported significant results are wrong (the method is anti-conservative)!

- Clark (1973) combined a by-subject (\(F_1\)) analysis and a by-item (\(F_2\)) analysis in the measure *min F'*
- Results are significant and generalizable across subjects and items when *min F'* is significant
- Unfortunately, many researchers (>50%!) incorrectly interpreted this study and may have reported wrong results (Raaijmakers et al., 1999)
- E.g., they only use \(F_1\) and \(F_2\) and not *min F'*, or they use \(F_2\) when it is unnecessary (e.g., in a counterbalanced design)

- Apparently, analyzing this type of data is difficult...
- Fortunately, using mixed-effects regression models solves these problems!
- The method is easier than using the approach of Clark (1973)
- Results can be generalized across subjects and items
- Mixed-effects models are robust to missing data (Baayen, 2008, p. 266)
- We can easily test if it is necessary to treat item as a random effect
- No balanced design necessary (as in repeated-measures ANOVA)!

- But first some words about regression...

- Multiple regression: predict one numerical variable on the basis of other independent variables (numerical or categorical)
- We can write a regression formula as \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \epsilon\)
- \(y\): (value of the) dependent variable
- \(x_i\): (value of the) predictor
- \(\beta_0\): intercept, value of \(y\) when all \(x_i = 0\)
- \(\beta_i\): slope, change in \(y\) when the value of \(x_i\) increases by 1
- \(\epsilon\): residuals, difference between observed values and predicted (fitted) values

- Factorial predictors are (automatically) represented by binary-valued predictors: \(x_i = 0\) (reference level) or \(x_i = 1\) (alternative level)
- Factor with \(n\) levels: \(n-1\) binary predictors
- Interpretation of factorial \(\beta_i\): change from reference to alternative level
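A minimal sketch of how this dummy coding looks in R (the factor and its levels are made up for illustration):

```
# A made-up factor with 3 levels; "adjective" is the reference level
word.type <- factor(c("adjective", "noun", "verb", "noun"))
model.matrix(~word.type)  # 3 levels: 2 binary predictors (plus intercept)
#   (Intercept) word.typenoun word.typeverb
# 1           1             0             0
# 2           1             1             0
# 3           1             0             1
# 4           1             1             0
```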

- Example of regression formula:
- Predict the reaction time of a subject on the basis of word frequency, word length, and subject age:
**RT = 200 - 5WF + 3WL + 10SA**
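- For instance, filling in hypothetical values: a word with frequency 10 and length 4 read by a 20-year-old subject gives a predicted RT of \(200 - 5 \cdot 10 + 3 \cdot 4 + 10 \cdot 20 = 362\)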


- Mixed-effects regression distinguishes fixed effects and random-effect factors
- Fixed effects:
- All numerical predictors
- Factorial predictors with a repeatable and small number of levels (e.g., Gender)

- Random-effect **factors**:
- Only factorial predictors!
- Levels are a non-repeatable random sample from a larger population
- Generally a large number of levels (e.g., Subject, Item)

- Random-effect factors are factors which are likely to introduce systematic variation (here: **subject** and **item**)
- Some **subjects** have a slow response (RT), while others are fast
  = Random Intercept for Subject (i.e. \(\beta_0\) varies per subject)
- Some **items** are easy to recognize, others hard
  = Random Intercept for Item (i.e. \(\beta_0\) varies per item)
- The effect of item frequency on RT might be higher for one **subject** than another: e.g., non-native participants might benefit more from frequent words than native participants
  = Random Slope for Item Frequency per Subject (i.e. \(\beta_{\textrm{WF}}\) varies per subject)
- The effect of subject age on RT might be different for one **item** than another: e.g., modern words might be recognized faster by younger participants
  = Random Slope for Subject Age per Item (i.e. \(\beta_{\textrm{SA}}\) varies per item)

- Note that it is essential to test for random slopes!

```
Estimate Std. Error t value Pr(>|t|)
Linear regression DistOrigin -6.418e-05 1.808e-06 -35.49 <2e-16
+ Random intercepts DistOrigin -2.224e-05 6.863e-06 -3.240 <0.001
+ Random slopes DistOrigin -1.478e-05 1.519e-05 -0.973 n.s.
```

(This example is explained at the HLP/Jaeger lab blog)
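A minimal sketch of how these three models could be specified (the data frame `d`, the response `diversity`, and the grouping factor `family` are hypothetical placeholders; only `DistOrigin` appears in the table above):

```
library(lme4)
# Ordinary linear regression: ignores that observations are grouped
m0 <- lm(diversity ~ DistOrigin, data = d)
# Random intercepts: each group may have a higher or lower baseline
m1 <- lmer(diversity ~ DistOrigin + (1 | family), data = d)
# Random slopes: the effect of DistOrigin may also differ per group
m2 <- lmer(diversity ~ DistOrigin + (1 + DistOrigin | family), data = d)
```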

- Mixed-effects regression allows us to use random intercepts and slopes (i.e. adjustments to the population intercept and slopes) to include the variance structure in our data
- Parsimony: a single parameter (standard deviation) models this variation for every random slope or intercept (a normal distribution with mean 0 is assumed)
- The slope and intercept adjustments are Best Linear Unbiased Predictors (BLUPs)
- Model comparison determines the inclusion of random intercepts and slopes

- Mixed-effects regression is only required when each level of the random-effect factor has multiple observations (e.g., participants respond to multiple items)
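A quick check of this requirement (assuming a data frame `dat` with `Subject` and `Word` columns, as in the case study later on):

```
# Every subject and every word should contribute multiple observations
table(dat$Subject)
table(dat$Word)
```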

- RT = 200 - 5WF + 3WL + 10SA (general model)
- The intercepts and slopes may vary (according to the estimated variation for each parameter) and this influences the word- and subject-specific values

- RT = 400 - 5WF + 3WL - 2SA (**word**: scythe)
- RT = 300 - 5WF + 3WL + 15SA (**word**: twitter)
- RT = 300 - 7WF + 3WL + 10SA (**subject**: non-native)
- RT = 150 - 5WF + 3WL + 10SA (**subject**: fast)

- And it is not hard to use! `lmer( RT ~ WF + WL + SA + (1+SA|Wrd) + (1+WF|Subj) )`
- (`lmer` automatically discovers the random-effects structure: nested/crossed; see the sketch below)
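A sketch of crossed versus nested random-effects specifications (`School`, `Pupil`, `score`, and `schooldat` are hypothetical; `lmer` infers the structure from how the factor levels co-occur in the data):

```
library(lme4)
# Crossed random effects: every subject responds to (potentially) every word
lmer(RT ~ WF + WL + SA + (1 + SA | Wrd) + (1 + WF | Subj), data = dat)
# Nested random effects: pupils within schools; the explicit nesting syntax
# (1 | School/Pupil) is shorthand for (1 | School) + (1 | School:Pupil)
lmer(score ~ 1 + (1 | School/Pupil), data = schooldat)
```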

- For example:

Subject | Intercept | WF slope
---|---|---
S1 | 525 | -2
S2 | 400 | -1
S3 | 500 | -2
S4 | 550 | -3
S5 | 450 | -2
S6 | 600 | -4
S7 | 300 | 0

- The estimates of `lmer` do not suffer from shrinkage
- The BLUPs (i.e. the adjustments to the model estimates per item/participant) are close to the real adjustments, as `lmer` takes regression towards the mean into account (fast subjects will be slower next time, and slow subjects will be faster), thereby avoiding overfitting and improving prediction - see Efron & Morris (1977)
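The BLUPs and the resulting per-group coefficients can be inspected directly from a fitted model (a sketch; `dat.lmer` stands for any model fitted with `lmer`, such as the ones in the case study below):

```
library(lme4)
ranef(dat.lmer)$Subject  # BLUPs: per-subject adjustments to intercept and slopes
coef(dat.lmer)$Subject   # population estimates + per-subject adjustments combined
```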

- In the following, numerical predictors are centered (indicated by the suffix `.c`)
- Otherwise random slopes and intercepts may show a spurious correlation
- Centering also helps the interpretation of interactions (see this lecture)
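Centering itself is a one-liner (a sketch; an uncentered `BaseFrequency` column is assumed here, whereas the case-study data below only contains the centered `BaseFrequency.c`):

```
# Center a numerical predictor by subtracting its mean
dat$BaseFrequency.c <- dat$BaseFrequency - mean(dat$BaseFrequency)
# equivalently: as.numeric(scale(dat$BaseFrequency, scale = FALSE))
```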

- Independent observations within each level of the random-effect factor
- Relation between dependent and independent variables linear
- No strong multicollinearity
- Residuals are not autocorrelated
- Homoscedasticity of variance in residuals
- Residuals are normally distributed
- (Similar assumptions as for regression)

- Check the distribution of the residuals: if the residuals are not normally distributed and/or heteroscedastic, transform the dependent variable or use generalized linear mixed-effects regression modeling
- Check outlier characteristics and refit the model with large outliers excluded to verify that your effects are not 'carried' by these outliers

- "The data analyst knows more than the computer." (Henderson & Velleman, 1981)
- Get to know your data!

- There is no adequate *automatic* procedure to find the best model
- Fitting the most complex random-effects structure (Barr et al., 2013) is mostly not a good idea
- The model may not converge, or is uninterpretable (see Baayen et al., 2017)
- The power of your method is reduced (Matuschek et al., 2017)

- My stepwise variable-selection procedure (for exploratory analysis):
- Include random intercepts
- Add other potential explanatory variables one-by-one
- Insignificant predictors are dropped
- Test predictors for inclusion which were excluded at an earlier stage
- Test possible interactions (don't make it too complex)
- Try to break the model by including significant predictors as random slopes
- Only choose a more complex model if supported by model comparison

- For a hypothesis-driven analysis, stepwise selection is problematic
- Confidence intervals too narrow: \(p\)-values too low (multiple comparisons)
- See, e.g., Burnham & Anderson (2002)

- Solutions:
- Careful specification of potential *a priori* models lining up with the hypotheses (including the optimal random-effects structure) and evaluating only these models
  - This may be followed by an exploratory procedure
- Validating a stepwise procedure via cross-validation (e.g., bootstrap analysis)

- De Vaan, Schreuder & Baayen (The Mental Lexicon, 2007)
- Design
- Long-distance priming (39 intervening items)
- Neologism condition: the neologism followed its base word as prime (fluffy - fluffiness)
- Identity condition: the neologism followed the same word as prime (fluffiness - fluffiness)
- Only items judged to be real words are included

- Prediction
- Responses in the identity condition are faster than those in the neologism condition

- Dependent variable is log-transformed reaction time (similar results using GLMM)

```
head(dat)
```

```
# Subject Word Condition RT.log PrevRT.log.c RTtoPrime.log.c
# 261 pp1 basaalheid identity 6.69 0.56968 0.2739
# 285 pp1 markantheid neologism 6.81 0.35185 -0.0524
# 299 pp1 ontroerdheid neologism 6.51 -0.23613 0.0410
# 333 pp1 contentheid identity 6.58 0.17521 0.3498
# 334 pp1 riantheid identity 6.86 0.00174 0.2159
# 26 pp1 tembaarheid neologism 6.35 0.33002 0.2138
# ResponseToPrime BaseFrequency.c Trial.c
# 261 correct -0.334 -127.7
# 285 correct 1.276 -125.7
# 299 correct 1.660 -119.7
# 333 correct 0.611 -113.7
# 334 correct 0.643 -112.7
# 26 incorrect -3.889 -97.7
```

(`lme4` version 1.1.17)

```
library(lme4)
dat.lmer1 <- lmer(RT.log ~ Condition + (1 | Word) + (1 | Subject), data = dat)
```

- This model represents our hypothesis test

```
summary(dat.lmer1, cor = F) # suppress expected correlation of regression coefficients
```

```
# Linear mixed model fit by REML ['lmerMod']
# Formula: RT.log ~ Condition + (1 | Word) + (1 | Subject)
# Data: dat
#
# REML criterion at convergence: -102
#
# Scaled residuals:
# Min 1Q Median 3Q Max
# -2.359 -0.691 -0.134 0.590 4.261
#
# Random effects:
# Groups Name Variance Std.Dev.
# Word (Intercept) 0.00341 0.0584
# Subject (Intercept) 0.04084 0.2021
# Residual 0.04408 0.2100
# Number of obs: 832, groups: Word, 40; Subject, 26
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 6.6342 0.0421 157.55
# Conditionneologism -0.0313 0.0147 -2.13
```

- Counterintuitive inhibition: the identity condition is (unexpectedly) slower than the neologism condition
- But various potential factors are not accounted for in the model
- Longitudinal effects
- RT to prime as predictor
- Response to the prime (correct/incorrect): a yes response to a target associated with a previously rejected prime may take longer
- The presence of atypical outliers

- Next step: exploratory analysis

```
dat.lmer2 <- lmer(RT.log ~ Trial.c + Condition + (1 | Subject) + (1 | Word), data = dat)
summary(dat.lmer2)$coef # show only fixed effects
```

```
# Estimate Std. Error t value
# (Intercept) 6.633867 4.21e-02 157.44
# Trial.c -0.000146 9.62e-05 -1.52
# Conditionneologism -0.030977 1.47e-02 -2.11
```

```
anova(dat.lmer1, dat.lmer2) # Trial.c does not improve the model significantly
```

```
# Data: dat
# Models:
# dat.lmer1: RT.log ~ Condition + (1 | Word) + (1 | Subject)
# dat.lmer2: RT.log ~ Trial.c + Condition + (1 | Subject) + (1 | Word)
# Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
# dat.lmer1 5 -104 -79.9 56.8 -114
# dat.lmer2 6 -104 -75.5 57.9 -116 2.3 1 0.13
```

```
dat.lmer3 <- lmer(RT.log ~ PrevRT.log.c + Condition + (1 | Subject) + (1 | Word), data = dat)
summary(dat.lmer3)$coef
```

```
# Estimate Std. Error t value
# (Intercept) 6.6297 0.0386 171.70
# PrevRT.log.c 0.1212 0.0334 3.63
# Conditionneologism -0.0279 0.0146 -1.90
```

```
anova(dat.lmer1, dat.lmer3)$P[2] # extract p-value from anova: PrevRT.log.c necessary
```

```
# [1] 0.000332
```

```
dat.lmer4 <- lmer(RT.log ~ RTtoPrime.log.c + PrevRT.log.c + Condition + (1 | Subject) +
(1 | Word), data = dat)
summary(dat.lmer4)$coef
```

```
# Estimate Std. Error t value
# (Intercept) 6.60782 0.0341 193.992
# RTtoPrime.log.c 0.16379 0.0319 5.141
# PrevRT.log.c 0.11901 0.0330 3.605
# Conditionneologism 0.00612 0.0160 0.383
```

```
anova(dat.lmer3, dat.lmer4)$P[2] # RTtoPrime.log.c necessary
```

```
# [1] 5.97e-07
```

```
dat.lmer5 <- lmer(RT.log ~ RTtoPrime.log.c + ResponseToPrime + PrevRT.log.c + Condition +
(1 | Subject) + (1 | Word), data = dat)
summary(dat.lmer5)$coef
```

```
# Estimate Std. Error t value
# (Intercept) 6.5870 0.0341 193.44
# RTtoPrime.log.c 0.1650 0.0315 5.24
# ResponseToPrimeincorrect 0.1004 0.0226 4.45
# PrevRT.log.c 0.1142 0.0327 3.49
# Conditionneologism 0.0178 0.0161 1.11
```

```
anova(dat.lmer4, dat.lmer5)$P[2] # ResponseToPrime necessary
```

```
# [1] 1.16e-05
```

```
dat.lmer6 <- lmer(RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c + Condition +
(1 | Subject) + (1 | Word), data = dat)
summary(dat.lmer6)$coef
```

```
# Estimate Std. Error t value
# (Intercept) 6.5807 0.0331 199.10
# RTtoPrime.log.c 0.2276 0.0359 6.33
# ResponseToPrimeincorrect 0.1162 0.0228 5.09
# PrevRT.log.c 0.1183 0.0325 3.64
# Conditionneologism 0.0266 0.0162 1.64
# RTtoPrime.log.c:ResponseToPrimeincorrect -0.2025 0.0606 -3.34
```

```
anova(dat.lmer5, dat.lmer6)$P[2] # interaction necessary
```

```
# [1] 0.000903
```

```
library(visreg)
visreg(dat.lmer6, "RTtoPrime.log.c", by = "ResponseToPrime", overlay = T, points = list(cex = 0.1))
```

- Interpretation: the RT to the prime is only predictive for the RT of the target word when the prime was judged to be a correct word

```
dat.lmer7 <- lmer(RT.log ~ BaseFrequency.c + RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c +
Condition + (1 | Subject) + (1 | Word), data = dat)
summary(dat.lmer7)$coef
```

```
# Estimate Std. Error t value
# (Intercept) 6.58197 0.03319 198.33
# BaseFrequency.c -0.00924 0.00437 -2.11
# RTtoPrime.log.c 0.21824 0.03615 6.04
# ResponseToPrimeincorrect 0.11468 0.02276 5.04
# PrevRT.log.c 0.11542 0.03246 3.56
# Conditionneologism 0.02466 0.01618 1.52
# RTtoPrime.log.c:ResponseToPrimeincorrect -0.19399 0.06055 -3.20
```

```
anova(dat.lmer6, dat.lmer7)$P[2] # BaseFrequency.c seems necessary...
```

```
# [1] 0.0357
```

```
summary(dat.lmer1)$varcor # original
```

```
# Groups Name Std.Dev.
# Word (Intercept) 0.0584
# Subject (Intercept) 0.2021
# Residual 0.2100
```

```
summary(dat.lmer7)$varcor # new
```

```
# Groups Name Std.Dev.
# Word (Intercept) 0.0339
# Subject (Intercept) 0.1549
# Residual 0.2055
```

- Note the lower standard deviation of the random intercept for word

```
dat.lmer7a <- lmer(RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c + BaseFrequency.c +
Condition + (1 | Subject) + (0 + BaseFrequency.c | Subject) + (1 | Word), data = dat)
# alternative notation: (1 + BaseFrequency.c || Subject)
summary(dat.lmer7a)$varcor
```

```
# Groups Name Std.Dev.
# Word (Intercept) 0.0342
# Subject BaseFrequency.c 0.0131
# Subject.1 (Intercept) 0.1563
# Residual 0.2037
```

```
anova(dat.lmer7, dat.lmer7a, refit = F) # refit=F: don't refit with ML, but use REML (lmer default)
```

```
# Data: dat
# Models:
# dat.lmer7: RT.log ~ BaseFrequency.c + RTtoPrime.log.c * ResponseToPrime +
# dat.lmer7: PrevRT.log.c + Condition + (1 | Subject) + (1 | Word)
# dat.lmer7a: RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c + BaseFrequency.c +
# dat.lmer7a: Condition + (1 | Subject) + (0 + BaseFrequency.c | Subject) +
# dat.lmer7a: (1 | Word)
# Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
# dat.lmer7 10 -126 -78.3 72.8 -146
# dat.lmer7a 11 -126 -74.4 74.2 -148 2.78 1 0.096 .
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

- Restricted Maximum Likelihood (REML; the default `lmer` fitting method) is appropriate when comparing 2 models differing exclusively in the random-effects structure
- The REML score depends on which fixed effects are in the model, so REML values are not comparable if the fixed effects change
- As REML is considered to give better estimates for the random effects, it is recommended to fit your final model (for reporting and inference) using REML

- ML is appropriate when comparing models differing only in the fixed effects
- `anova` automatically refits the models with ML, unless `refit=F` is specified (see the sketch below)

- Never compare 2 models differing in both fixed and random effects
- Only compare models which differ minimally (e.g., only in 1 random slope)
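In terms of code (a sketch re-using models fitted above; `update()` refits a model with a changed setting):

```
# Fixed-effects comparison (dat.lmer1 vs. dat.lmer2 differ only in Trial.c):
# anova() refits both models with ML automatically...
anova(dat.lmer1, dat.lmer2)
# ...which is equivalent to refitting them with REML = FALSE yourself
anova(update(dat.lmer1, REML = FALSE), update(dat.lmer2, REML = FALSE))

# Random-effects comparison (dat.lmer7 vs. dat.lmer7a differ only in a random slope):
# keep the (default) REML fits by preventing the refit
anova(dat.lmer7, dat.lmer7a, refit = FALSE)
```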

```
dat.lmer7b <- lmer(RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c + BaseFrequency.c +
Condition + (1 + BaseFrequency.c | Subject) + (1 | Word), data = dat)
summary(dat.lmer7b)$varcor
```

```
# Groups Name Std.Dev. Corr
# Word (Intercept) 0.0344
# Subject (Intercept) 0.1583
# BaseFrequency.c 0.0136 0.67
# Residual 0.2035
```

```
anova(dat.lmer7, dat.lmer7b, refit = F)$P[2] # correlated random slope necessary
```

```
# [1] 0.0386
```

```
plot(coef(dat.lmer7b)$Subject[, "(Intercept)"], coef(dat.lmer7b)$Subject[, "BaseFrequency.c"],
xlab = "Random intercept for subject", ylab = "By-subject random slope for BaseFrequency.c")
```

- Slower subjects (i.e. those with higher intercepts) experience less benefit from word frequency (i.e. the BLUPs are less negative, sometimes even positive)

```
summary(dat.lmer7b)$coef
```

```
# Estimate Std. Error t value
# (Intercept) 6.58280 0.03378 194.90
# RTtoPrime.log.c 0.21680 0.03590 6.04
# ResponseToPrimeincorrect 0.11586 0.02263 5.12
# PrevRT.log.c 0.10763 0.03231 3.33
# BaseFrequency.c -0.00764 0.00515 -1.48
# Conditionneologism 0.02389 0.01604 1.49
# RTtoPrime.log.c:ResponseToPrimeincorrect -0.19693 0.06010 -3.28
```

```
dat.lmer8 <- lmer(RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c + Condition +
(1 + BaseFrequency.c | Subject) + (1 | Word), data = dat) # w/o BaseFrequency.c as fixed effect
anova(dat.lmer8, dat.lmer7b)$P[2] # not enough support for more complex model (dat.lmer7b)
```

```
# [1] 0.144
```

```
dat.lmer8a <- lmer(RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c + Condition +
(1 + BaseFrequency.c + Condition | Subject) + (1 | Word), data = dat)
anova(dat.lmer8, dat.lmer8a, refit = F) # no improvement
```

```
# Data: dat
# Models:
# dat.lmer8: RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c + Condition +
# dat.lmer8: (1 + BaseFrequency.c | Subject) + (1 | Word)
# dat.lmer8a: RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c + Condition +
# dat.lmer8a: (1 + BaseFrequency.c + Condition | Subject) + (1 | Word)
# Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
# dat.lmer8 11 -137 -84.7 79.3 -159
# dat.lmer8a 14 -131 -64.5 79.3 -159 0.02 3 1
```

- Other predictors are also not supported as random slopes: `dat.lmer8` is the best model!

```
library(car)
vif(dat.lmer8) # OK: all values < 5
```

```
# RTtoPrime.log.c ResponseToPrime
# 1.61 1.07
# PrevRT.log.c Condition
# 1.01 1.27
# RTtoPrime.log.c:ResponseToPrime
# 1.38
```

```
acf(resid(dat.lmer8)) # no autocorrelation
```

```
qqnorm(resid(dat.lmer8)) # not entirely normal
qqline(resid(dat.lmer8))
```

```
plot(fitted(dat.lmer8), resid(dat.lmer8)) # some heteroscedasticity
```

```
dat2 <- dat[abs(scale(resid(dat.lmer8))) < 2.5, ] # 98% of original data
dat2.lmer8 <- lmer(RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c + Condition +
(1 + BaseFrequency.c | Subject) + (1 | Word), data = dat2)
summary(dat2.lmer8)$coef
```

```
# Estimate Std. Error t value
# (Intercept) 6.5785 0.0312 211.10
# RTtoPrime.log.c 0.2394 0.0317 7.55
# ResponseToPrimeincorrect 0.1312 0.0201 6.54
# PrevRT.log.c 0.0942 0.0291 3.23
# Conditionneologism 0.0373 0.0143 2.61
# RTtoPrime.log.c:ResponseToPrimeincorrect -0.2224 0.0529 -4.20
```

- Items in the `neologism` condition are responded to **significantly** slower than those in the `identity` condition: in line with the original hypothesis (note: \(p\)-value from `anova`: 0.01)
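A sketch of how this \(p\)-value can be obtained via model comparison (the model without `Condition` is not shown in the original analysis):

```
# Compare dat2.lmer8 against the same model without Condition
dat2.lmer8.noCond <- lmer(RT.log ~ RTtoPrime.log.c * ResponseToPrime + PrevRT.log.c +
    (1 + BaseFrequency.c | Subject) + (1 | Word), data = dat2)
anova(dat2.lmer8.noCond, dat2.lmer8)$P[2]  # p-value for Condition
```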

```
par(mfrow = c(1, 2))
qqnorm(resid(dat2.lmer8)) # better
qqline(resid(dat2.lmer8))
plot(fitted(dat2.lmer8), resid(dat2.lmer8)) # better
```

```
library(boot)
(bs.lmer8 <- confint(dat2.lmer8, method = "boot", nsim = 1000, level = 0.95))
```

```
# 2.5 % 97.5 %
# .sig01 0.00000 0.0425
# .sig02 0.11087 0.2049
# .sig03 0.11135 1.0000
# .sig04 0.00408 0.0218
# .sigma 0.16985 0.1886
# (Intercept) 6.51524 6.6375
# RTtoPrime.log.c 0.17331 0.3026
# ResponseToPrimeincorrect 0.09269 0.1716
# PrevRT.log.c 0.04144 0.1512
# Conditionneologism 0.00815 0.0654
# RTtoPrime.log.c:ResponseToPrimeincorrect -0.31859 -0.1181
```

- The 95% confidence interval of `Conditionneologism` does not contain 0: **significant**

- We have learned:
- How to interpret mixed-effects regression results
- How to use `lmer` to conduct mixed-effects regression
- How to include random intercepts and random slopes in `lmer` and why these are essential when you have multiple responses per subject or item

- Associated lab session:
- http://www.let.rug.nl/wieling/Statistics/Mixed-Effects/lab
- The lab session contains additional information: how to do multiple comparisons, using other optimizers, and conducting logistic regression

Thank you for your attention!