# Introduction to mixed-effects regression for (psycho)linguists

## Lecture 1 of advanced regression for linguists

Martijn Wieling
Computational Linguistics Research Group

## Course setup (1)

• Five sessions (2 - 3 hours each):
• Mon. Jan. 23, 10 AM - 1 PM: Intro to mixed-effects regression with reaction time data
• Tue. Jan. 24, 10 AM - 1 PM: Generalized mixed-effects regression and eye-tracking data
• Mon. Jan. 30, 10 AM - 1 PM: Intro to generalized additive modeling with articulography data
• Wed. Feb. 1, 10 AM - 1 PM: Generalized additive modeling with EEG data
• Thu. Feb. 2, 10 AM - 1 PM: Generalized additive logistic modeling with dialect data

## Course setup (2)

• User-centered, so each lecture:
• Part I: introductory lecture (max. 90 minutes)
• Short break
• Part II: hands-on lab session (max. 90 minutes)
• You probably won't be able to finish all exercises from the lab session during the lecture
• To get the most out of the course, try to finish them by yourself (additional information is present in the lab sessions)
• Questions: ask immediately when something is unclear!

## This lecture

• Introduction
• Recap: multiple regression
• Mixed-effects regression analysis: explanation
• Methodological issues
• Case-study: Lexical decision latencies (Baayen, 2008: 7.5.1)
• Conclusion

## Introduction

• Consider the following situation (taken from Clark, 1973):
• Mr. A and Mrs. B study reading latencies of verbs and nouns
• Each randomly selects 20 words and tests 50 participants
• Mr. A finds (using a sign test) verbs to have faster responses
• Mrs. B finds nouns to have faster responses
• How is this possible?

## The language-as-fixed-effect fallacy

• The problem is that Mr. A and Mrs. B disregard the variability in the words (which is huge)
• Mr. A included a difficult noun, but Mrs. B included a difficult verb
• Their set of words does not constitute the complete population of nouns and verbs, therefore their results are limited to their words
• This is known as the language-as-fixed-effect fallacy (LAFEF)
• Fixed-effect factors have repeatable and a small number of levels
• Word is a random-effect factor (a non-repeatable random sample from a larger population)

## Why linguists are not always good statisticians

• LAFEF occurs frequently in linguistic research until the 1970's
• Many reported significant results are wrong (the method is anti-conservative)!
• Clark (1973) combined a by-subject ($F_1$) analysis and by-item ($F_2$) analysis in measure min F'
• Results are significant and generalizable across subjects and items when min F' is significant
• Unfortunately many researchers (>50%!) incorrectly interpreted this study and may report wrong results (Raaijmakers et al., 1999)
• E.g., they only use $F_1$ and $F_2$ and not min F' or they use $F_2$ while unneccesary (e.g., counterbalanced design)

## Our problems solved...

• Apparently, analyzing this type of data is difficult...
• Fortunately, using mixed-effects regression models solves these problems!
• The method is easier than using the approach of Clark (1973)
• Results can be generalized across subjects and items
• Mixed-effects models are robust to missing data (Baayen, 2008, p. 266)
• We can easily test if it is necessary to treat item as a random effect
• But first some words about regression...

## Regression vs. ANOVA (1)

• Most people either use ANOVA or regression
• ANOVA: categorical predictor variables
• Regression: continuous predictor variables
• Both can be used for the same thing!
• ANCOVA: continuous and categorical predictors
• Regression: categorical (dummy coding) and continuous predictors

## Regression vs. ANOVA (2)

• Why I use regression as opposed to ANOVA
• No temptation to dichotomize continuous predictors
• Intuitive interpretation (your mileage may vary)
• Mixed-effects analysis is relatively easy to do and does not require a balanced design (which is generally necessary for repeated-measures ANOVA)
• This course will focus on regression

## Recap: multiple regression (1)

• Multiple regression: predict one numerical variable on the basis of other independent variables (numerical or categorical)
• (logistic regression is used to predict a binary dependent variable)
• We can write a regression formula as $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \epsilon$
• $y$: (value of the) dependent variable
• $x_i$: (value of the) predictor
• $\beta_0$: intercept, where the regression line crosses the $y$-axis (i.e. value of $y$ when all $x_i = 0$)
• $\beta_i$: change of the dependent variable when the predictor value increases with 1 (i.e. slope)
• $\epsilon$: residuals (difference between observed values and predicted values)

## Recap: multiple regression (2)

• Example of regression formula:
• Predict the reaction time of a participant on the basis of word frequency, word length, and subject age: RT = 200 - 5WF + 3WL + 10SA
• Categorical (i.e. factorial) predictors are represented by binary-valued predictors: $x_i = 0$ (reference level) or $x_i = 1$ (the alternative level)
• Factor with $n$ levels: $n-1$ binary predictors
• Interpretation of $\beta_i$ for categorical predictors: change from reference level to alternative level

## Mixed-effects regression modeling: introduction

• Mixed-effects regression modeling distinguishes fixed-effect and random-effect factors
• Fixed-effect factors:
• Repeatable levels
• Small number of levels (e.g., Gender, Word Category)
• Same treatment as in multiple regression (treatment coding)
• Random-effect factors:
• Levels are a non-repeatable random sample from a larger population
• Often large number of levels (e.g., Subject, Item)

## What are random-effects factors?

• Random-effect factors are factors which are likely to introduce systematic variation
• Some participants have a slow response (RT), while others are fast
= Random Intercept for Subject
• Some words are easy to recognize, others hard
= Random Intercept for Item
• The effect of word frequency on RT might be higher for one participant than another: e.g., non-native participants might benefit more from frequent words than native participants
= Random Slope for Item Frequency per Subject
• The effect of subject age on RT might be different for one word than another: e.g., modern words might be recognized faster by younger participants
= Random Slope for Subject Age per Item
• Note that it is essential to test for random slopes!

## Random slopes are necessary!

Estimate  Std. Error  t value  Pr(>|t|)
Linear regression     DistOrigin  -6.418e-05   1.808e-06   -35.49    <2e-16
+ Random intercepts   DistOrigin  -2.224e-05   6.863e-06   -3.240    <0.001
+ Random slopes       DistOrigin  -1.478e-05   1.519e-05   -0.973    n.s.

This example is explained at the HLP/Jaeger lab blog

## Specific models for every observation

• Mixed-effects regression allow us to use random intercepts and slopes (i.e. adjustments to the population intercept and slopes) to include the variance structure in our data
• Parsimony: a single parameter (standard deviation) models this variation for every random slope or intercept (a normal distribution with mean 0 is assumed)
• The adjustments to population slopes and intercepts are Best Linear Unbiased Predictors
• AIC comparisons assess whether the inclusion of random intercepts and slopes is warranted
• Note that multiple observations for each level of a random effect are necessary for mixed-effects analysis to be useful (e.g., participants respond to multiple items)

## Specific models for every observation

• RT = 200 - 5WF + 3WL + 10SA (general model)
• The intercepts and slopes may vary (according to the estimated standard variation for each parameter) and this influences the word- and subject-specific values
• RT = 400 - 5WF + 3WL - 2SA (word: scythe)
• RT = 300 - 5WF + 3WL + 15SA (word: twitter)
• RT = 300 - 7WF + 3WL + 10SA (subject: non-native)
• RT = 150 - 5WF + 3WL + 10SA (subject: fast)
• And it is not hard to use!
• lmer( RT ~ WF + WL + SA + (1+SA|Wrd) + (1+WF|Subj) )
• lmer automatically discovers random-effects structure (nested/crossed)

## BLUPs of lmer do not suffer from shrinkage

• The BLUPS (i.e. adjustment to the model estimates per item/participant) are close to the real adjustments, as lmer takes into account regression towards the mean (fast subjects will be slower next time, and slow subjects will be faster) thereby avoiding overfitting and improving prediction - see Efron & Morris (1977)

## Methodological issues

• Parsimony
• Centering
• Model criticism
• How to select the "best" model?

## Parsimony

• All models are wrong
• The correct model can never be known with certainty
• Some models are better than others
• Simpler models explaining the data are better than complex models

## Center your variables (i.e. subtract the mean)

• Otherwise random slopes and intercepts may show a spurious correlation
• Also helps the interpretation of factorial predictors in model (marking differences at means of other variables, rather than at values equal to 0)

## Residuals: normally distributed and homoskedastic

• In linear mixed-effects regression, the errors should follow a normal distribution with mean zero and the same standard deviation for any cell in your design, and for any covariate
• If not then transforming the dependent variable often helps: log(Y), or -1000/Y
• In some cases it is better to separate the distribution from the relationship between the dependent and independent variables by using generalized linear mixed-effects regression (e.g., Gamma distribution with identity link) as the distribution

## Residuals: no trial-by-trial dependencies

• Residuals should be independent
• With trial-by-trial dependencies, this assumption is violated
• Possible remedies:
• Include trial as a predictor in your model
• Include the value of the dependent variable at the previous trial as a predictor
• (Note that GAMs have other facilities to correct for these dependencies)

## Trial-by-trial dependencies in a word naming task

• Word naming (reading aloud) of Dutch verbs
• Trial-by-trial dependencies in the residuals of the model
acf(resid(model), main=" ", ylab="autocorrelation function")

## Model criticism

• Check the distribution of residuals: if not normally distributed and/or hetereoscedastic residuals then use generalized linear mixed-effects regression modeling
• Check outlier characteristics and refit the model when large outliers are excluded to verify that your effects are not 'carried' by these outliers
• Important: no a priori exclusion of outliers without a clear reason
• A good reason is not that the value is over 2.5 SD above the mean
• A good reason (e.g.,) is that the response is faster than possible

## Model selection II

• My stepwise variable-selection procedure (for exploratory analysis):
• Include random intercepts
• Add other potential explanatory variables one-by-one
• Insignificant predictors are dropped
• Test predictors for inclusion which were excluded at an earlier stage
• Test possible interactions (don't make it too complex)
• Try to break the model by including significant predictors as random slopes
• Only choose a more complex model if it decreases AIC by at least 2
• Then the lowest AIC model is at least 2.7 ($e^{\frac{\Delta\textrm{AIC}}{2}}$) times more likely to represent the data

## Model selection III

• For a hypothesis-driven analysis, stepwise selection is problematic
• Solutions:
• Careful specification of potential a priori models lining up with the hypotheses (including random intercepts and slopes) and evaluating only these models (e.g., via AIC)
• This may be followed by an exploratory procedure
• Validating a stepwise procedure via cross validation (e.g., bootstrap analysis)

## Case study: long-distance priming

• De Vaan, Schreuder & Baayen (The Mental Lexicon, 2007)
• Design
• long-distance priming (39 intervening items)
• base condition (baseheid): base preceded neologism (fluffy - fluffiness)
• derived condition (heid): identity priming (fluffiness - fluffiness)
• only items judged to be real words are included
• Prediction
• Subjects in derived condition (heid) faster than those in base condition (baseheid)
• Dependent variable is log-transformed (but similar results using a GLMM)

## A first model: counterintuitive results!

#### (note $t > 2 \Rightarrow p < 0.05$, for $N \gg 100$; R version 3.3.2 (2016-10-31), lme4 version 1.1.12)

library(lme4)
dat.lmer1 <- lmer(RT.log ~ Condition + (1 | Word) + (1 | Subject), data = dat)
summary(dat.lmer1, cor = F)
# Linear mixed model fit by REML ['lmerMod']
# Formula: RT.log ~ Condition + (1 | Word) + (1 | Subject)
#    Data: dat
#
# REML criterion at convergence: -102
#
# Scaled residuals:
#    Min     1Q Median     3Q    Max
# -2.359 -0.691 -0.134  0.590  4.261
#
# Random effects:
#  Groups   Name        Variance Std.Dev.
#  Word     (Intercept) 0.00341  0.0584
#  Subject  (Intercept) 0.04084  0.2021
#  Residual             0.04408  0.2100
# Number of obs: 832, groups:  Word, 40; Subject, 26
#
# Fixed effects:
#               Estimate Std. Error t value
# (Intercept)     6.6030     0.0421   156.7
# Conditionheid   0.0313     0.0147     2.1

## Evaluation

• Counterintuitive inhibition
• But various potential factors are not accounted for in the model
• Longitudinal effects
• RT to prime as predictor
• Response to the prime (correct/incorrect): a yes response to a target associated with a previously rejected prime may take longer
• The presence of atypical outliers

## An effect of trial?

dat.lmer2 <- lmer(RT.log ~ Trial + Condition + (1 | Subject) + (1 | Word), data = dat)
summary(dat.lmer2)$coef # Estimate Std. Error t value # (Intercept) 6.633384 4.67e-02 142.15 # Trial -0.000146 9.62e-05 -1.52 # Conditionheid 0.030977 1.47e-02 2.11 ## An effect of previous trial RT? dat.lmer3 <- lmer(RT.log ~ PrevRT.log + Condition + (1 | Subject) + (1 | Word), data = dat) summary(dat.lmer3)$coef
#               Estimate Std. Error t value
# (Intercept)     5.8047     0.2230   26.03
# PrevRT.log      0.1212     0.0334    3.63
# Conditionheid   0.0279     0.0146    1.90

## An effect of RT to prime?

dat.lmer4 <- lmer(RT.log ~ RTtoPrime.log + PrevRT.log + Condition + (1 | Subject) + (1 | Word), data = dat)
summary(dat.lmer4)$coef # Estimate Std. Error t value # (Intercept) 4.74877 0.2953 16.080 # RTtoPrime.log 0.16379 0.0319 5.141 # PrevRT.log 0.11901 0.0330 3.605 # Conditionheid -0.00612 0.0160 -0.383 ## An effect of the decision for the prime? dat.lmer5 <- lmer(RT.log ~ RTtoPrime.log + ResponseToPrime + PrevRT.log + Condition + (1 | Subject) + (1 | Word), data = dat) summary(dat.lmer5)$coef
#                          Estimate Std. Error t value
# (Intercept)                4.7634     0.2923   16.30
# RTtoPrime.log              0.1650     0.0315    5.24
# ResponseToPrimeincorrect   0.1004     0.0226    4.45
# PrevRT.log                 0.1142     0.0327    3.49
# Conditionheid             -0.0178     0.0161   -1.11

## An effect of the decision for the prime (in interaction)?

dat.lmer6 <- lmer(RT.log ~ RTtoPrime.log * ResponseToPrime + PrevRT.log + Condition + (1 | Subject) +
(1 | Word), data = dat)
summary(dat.lmer6)$coef # Estimate Std. Error t value # (Intercept) 4.3244 0.3152 13.72 # RTtoPrime.log 0.2276 0.0359 6.33 # ResponseToPrimeincorrect 1.4548 0.4052 3.59 # PrevRT.log 0.1183 0.0325 3.64 # Conditionheid -0.0266 0.0162 -1.64 # RTtoPrime.log:ResponseToPrimeincorrect -0.2025 0.0606 -3.34 • Interpretation: the RT to the prime is only predictive for the RT of the target word when the prime was judged to be a correct word ## An effect of base frequency? #### (Note the lower standard deviation of the random intercept for word: initial value was 0.058) dat.lmer7 <- lmer(RT.log ~ RTtoPrime.log * ResponseToPrime + PrevRT.log + BaseFrequency + Condition + (1 | Subject) + (1 | Word), data = dat) summary(dat.lmer7)$varcor
#  Groups   Name        Std.Dev.
#  Word     (Intercept) 0.0339
#  Subject  (Intercept) 0.1549
#  Residual             0.2055
summary(dat.lmer7)$coef # Estimate Std. Error t value # (Intercept) 4.44098 0.31961 13.90 # RTtoPrime.log 0.21824 0.03615 6.04 # ResponseToPrimeincorrect 1.39705 0.40516 3.45 # PrevRT.log 0.11542 0.03246 3.56 # BaseFrequency -0.00924 0.00437 -2.11 # Conditionheid -0.02466 0.01618 -1.52 # RTtoPrime.log:ResponseToPrimeincorrect -0.19399 0.06055 -3.20 ## Testing random slopes: no main frequency effect! dat.lmer7a <- lmer(RT.log ~ RTtoPrime.log * ResponseToPrime + PrevRT.log + BaseFrequency + Condition + (1 | Subject) + (0 + BaseFrequency | Subject) + (1 | Word), data = dat) summary(dat.lmer7a)$coef
#                                        Estimate Std. Error t value
# (Intercept)                             4.48231    0.31736   14.12
# RTtoPrime.log                           0.21812    0.03595    6.07
# ResponseToPrimeincorrect                1.41675    0.40206    3.52
# PrevRT.log                              0.10849    0.03235    3.35
# BaseFrequency                          -0.00795    0.00535   -1.48
# Conditionheid                          -0.02453    0.01603   -1.53
# RTtoPrime.log:ResponseToPrimeincorrect -0.19667    0.06008   -3.27

## The model with random slope is better

AIC(dat.lmer7) - AIC(dat.lmer7a)  # compare AIC
# [1] 3.65
# Alternative to AIC comparison: Likelihood Ratio Test
anova(dat.lmer7, dat.lmer7a, refit = F)  # compares models
# Data: dat
# Models:
# dat.lmer7: RT.log ~ RTtoPrime.log * ResponseToPrime + PrevRT.log + BaseFrequency +
# dat.lmer7:     Condition + (1 | Subject) + (1 | Word)
# dat.lmer7a: RT.log ~ RTtoPrime.log * ResponseToPrime + PrevRT.log + BaseFrequency +
# dat.lmer7a:     Condition + (1 | Subject) + (0 + BaseFrequency | Subject) +
# dat.lmer7a:     (1 | Word)
#            Df  AIC   BIC logLik deviance Chisq Chi Df Pr(>Chisq)
# dat.lmer7  10 -126 -78.3   72.8     -146
# dat.lmer7a 11 -129 -77.2   75.6     -151  5.65      1      0.017 *
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Testing for correlation parameters in random effects

dat.lmer7b <- lmer(RT.log ~ RTtoPrime.log * ResponseToPrime + PrevRT.log + BaseFrequency + Condition +
(1 + BaseFrequency | Subject) + (1 | Word), data = dat)
summary(dat.lmer7b)$varcor # Groups Name Std.Dev. Corr # Word (Intercept) 0.0344 # Subject (Intercept) 0.1291 # BaseFrequency 0.0136 0.41 # Residual 0.2035 AIC(dat.lmer7a) - AIC(dat.lmer7b) # correlation parameter not necessary # [1] -1.14 anova(dat.lmer7a, dat.lmer7b, refit = F)$P[2]  # p-value from anova
# [1] 0.353

## Model criticism: normality

qqnorm(resid(dat.lmer7a))
qqline(resid(dat.lmer7a))

## Model criticism: heteroscedasticity

plot(fitted(dat.lmer7a), resid(dat.lmer7a))

## The trimmed model

dat2 <- dat[abs(scale(resid(dat.lmer7a))) < 2.5, ]
dat2.lmer7a <- lmer(RT.log ~ RTtoPrime.log * ResponseToPrime + PrevRT.log + BaseFrequency + Condition +
(1 | Subject) + (0 + BaseFrequency | Subject) + (1 | Word), data = dat2)
summary(dat2.lmer7a)$coef # Estimate Std. Error t value # (Intercept) 4.44735 0.28598 15.55 # RTtoPrime.log 0.23511 0.03200 7.35 # ResponseToPrimeincorrect 1.56001 0.35551 4.39 # PrevRT.log 0.09554 0.02926 3.27 # BaseFrequency -0.00815 0.00459 -1.78 # Conditionheid -0.03814 0.01435 -2.66 # RTtoPrime.log:ResponseToPrimeincorrect -0.21616 0.05316 -4.07 • items in the heid condition are responded faster to than those in the baseheid condition ## The trimmed model • Just 2% of the data removed and an improved fit (explained variance) (noutliers <- sum(abs(scale(resid(dat.lmer7a))) >= 2.5)) # [1] 17 noutliers/nrow(dat) # [1] 0.0204 cor(dat$RT, fitted(dat.lmer7a))^2
# [1] 0.521
cor(dat2\$RT, fitted(dat2.lmer7a))^2
# [1] 0.572

## Checking the residuals of the trimmed model

library(car)
par(mfrow = c(1, 2))
qqp(resid(dat2.lmer7a))
plot(fitted(dat2.lmer7a), resid(dat2.lmer7a))

## Bootstrap sampling to validate results

library(boot)
(bs.lmer7a <- confint(dat2.lmer7a, method = "boot", nsim = 1000, level = 0.95))
#                                            2.5 %    97.5 %
# .sig01                                  0.00e+00  0.040543
# .sig02                                  1.41e-05  0.021509
# .sig03                                  9.75e-02  0.190691
# .sigma                                  1.71e-01  0.189190
# (Intercept)                             3.83e+00  4.994796
# RTtoPrime.log                           1.74e-01  0.302521
# ResponseToPrimeincorrect                8.53e-01  2.255276
# PrevRT.log                              3.66e-02  0.157534
# BaseFrequency                          -1.76e-02  0.000579
# Conditionheid                          -6.73e-02 -0.010342
# RTtoPrime.log:ResponseToPrimeincorrect -3.19e-01 -0.112836

## Recap

• We have learned:
• how to interpret regression results
• how to use lmer to conduct mixed-effects regression
• how to include random intercepts and random slopes in lmer and why these are essential when you have multiple responses per subject or item
• how to conduct model criticism
• that ANOVA is less flexible than mixed-effecs regression
• After the break:
• Next lecture: more about mixed-effects regression using eye-tracking data