# Mixed-effects regression

Martijn Wieling
University of Groningen

## Introduction

• Consider the following situation (taken from Clark, 1973):
• Mr. A and Mrs. B study reading latencies of verbs and nouns
• Each randomly selects 20 words and tests 50 subjects
• Mr. A finds (using a sign test) verbs to have faster responses
• Mrs. B finds nouns to have faster responses
• How is this possible?

## The language-as-fixed-effect fallacy

• The problem is that Mr. A and Mrs. B disregard the (huge) variability in the words
• Mr. A included a difficult noun, but Mrs. B included a difficult verb
• Their set of words does not constitute the complete population of nouns and verbs, therefore their results are limited to their words
• This is known as the language-as-fixed-effect fallacy (LAFEF)
• Fixed-effect factors have a small number of repeatable levels
• Word is a random-effect factor (a non-repeatable random sample from a larger population)

## Why linguists are not always good statisticians

• LAFEF occurred frequently in linguistic research until the 1970s
• Many reported significant results are wrong (the method is anti-conservative)!
• Clark (1973) combined a by-subject ($$F_1$$) analysis and by-item ($$F_2$$) analysis in measure min F'
• Results are significant and generalizable across subjects and items when min F' is significant
• Unfortunately, many researchers (>50%!) interpreted this study incorrectly and may have reported wrong results (Raaijmakers et al., 1999)
• E.g., they use only $$F_1$$ and $$F_2$$ and not min F', or they use $$F_2$$ when it is unnecessary (e.g., in a counterbalanced design)

## Our problems solved...

• Apparently, analyzing this type of data is difficult...
• Fortunately, using mixed-effects regression models solves these problems!
• The method is easier to apply than the approach of Clark (1973)
• Results can be generalized across subjects and items
• Mixed-effects models are robust to missing data (Baayen, 2008, p. 266)
• We can easily test if it is necessary to treat item as a random effect
• No balanced design necessary (as in repeated-measures ANOVA)!
• But first some words about regression...

## Recap: multiple regression (1)

• Multiple regression: predict one numerical variable on the basis of other independent variables (numerical or categorical)
• We can write a regression formula as $$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \epsilon$$
• $$y$$: (value of the) dependent variable
• $$x_i$$: (value of the) predictor
• $$\beta_0$$: intercept, value of $$y$$ when all $$x_i = 0$$
• $$\beta_i$$: slope, change in $$y$$ when the value of $$x_i$$ increases by 1
• $$\epsilon$$: residuals, difference between observed values and predicted (fitted) values

## Recap: multiple regression (2)

• Factorial predictors are (automatically) represented by binary-valued predictors: $$x_i = 0$$ (reference level) or $$x_i = 1$$ (alternative level)
• Factor with $$n$$ levels: $$n-1$$ binary predictors
• Interpretation of factorial $$\beta_i$$: change from reference to alternative level
• Example of regression formula:
• Predict the reaction time of a subject on the basis of word frequency, word length, and subject age: RT = 200 - 5WF + 3WL + 10SA
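The example formula above can be evaluated directly; a minimal sketch in R (the coefficients and predictor values are hypothetical, and the last line illustrates the dummy coding of factors described above):

```r
# The example model: RT = 200 - 5*WF + 3*WL + 10*SA
rt <- function(wf, wl, sa) 200 - 5 * wf + 3 * wl + 10 * sa
rt(wf = 2, wl = 5, sa = 20)  # 200 - 10 + 15 + 200 = 405

# Dummy coding: a factor with 3 levels becomes 2 binary predictors
model.matrix(~ f, data = data.frame(f = factor(c("a", "b", "c"))))
```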

## Mixed-effects regression modeling: introduction

• Mixed-effects regression distinguishes fixed effects and random-effect factors
• Fixed effects:
• All numerical predictors
• Factorial predictors with a repeatable and small number of levels (e.g., Gender)
• Random-effect factors:
• Only factorial predictors!
• Levels are a non-repeatable random sample from a larger population
• Generally a large number of levels (e.g., Subject, Item)

## What are random-effects factors?

• Random-effect factors are factors which are likely to introduce systematic variation (here: subject and item)
• Some subjects have a slow response (RT), while others are fast
= Random Intercept for Subject (i.e. $$\beta_0$$ varies per subject)
• Some items are easy to recognize, others hard
= Random Intercept for Item (i.e. $$\beta_0$$ varies per item)
• The effect of item frequency on RT might be higher for one subject than another: e.g., non-native participants might benefit more from frequent words than native participants
= Random Slope for Item Frequency per Subject (i.e. $$\beta_{\textrm{WF}}$$ varies per subject)
• The effect of subject age on RT might be different for one item than another: e.g., modern words might be recognized faster by younger participants
= Random Slope for Subject Age per Item (i.e. $$\beta_{\textrm{SA}}$$ varies per item)
• Note that it is essential to test for random slopes!

## Random slopes are necessary!

| Model | Predictor | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|---|
| Linear regression | DistOrigin | -6.418e-05 | 1.808e-06 | -35.49 | <2e-16 |
| + Random intercepts | DistOrigin | -2.224e-05 | 6.863e-06 | -3.240 | <0.001 |
| + Random slopes | DistOrigin | -1.478e-05 | 1.519e-05 | -0.973 | n.s. |


(This example is explained at the HLP/Jaeger lab blog)
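A sketch of why this happens, using simulated data (all names are hypothetical): when the effect of a predictor varies per subject, a model without the corresponding random slope treats correlated observations as independent and underestimates the standard error.

```r
library(lme4)
set.seed(1)
# 20 subjects, 30 observations each; the DistOrigin effect varies per subject
dat <- expand.grid(Subject = factor(1:20), Obs = 1:30)
dat$DistOrigin <- rnorm(nrow(dat))
subj_slope <- rnorm(20, mean = 0, sd = 2)
dat$y <- subj_slope[dat$Subject] * dat$DistOrigin + rnorm(nrow(dat))

coef(summary(lm(y ~ DistOrigin, data = dat)))                                 # no random effects
coef(summary(lmer(y ~ DistOrigin + (1 + DistOrigin | Subject), data = dat)))  # with random slope
# The fixed-effect estimate is similar, but the standard error is much larger
# (and the t value much smaller) once the per-subject variation is modeled
```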

## Modeling the variance structure

• Mixed-effects regression allows us to use random intercepts and slopes (i.e. adjustments to the population intercept and slopes) to include the variance structure of our data in the model
• Parsimony: a single parameter (standard deviation) models this variation for every random slope or intercept (a normal distribution with mean 0 is assumed)
• The slope and intercept adjustments are Best Linear Unbiased Predictors
• Model comparison determines the inclusion of random intercepts and slopes
• Mixed-effects regression is only required when each level of the random-effect factor has multiple observations (e.g., participants respond to multiple items)
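As an illustration (using lme4's built-in sleepstudy data rather than the reading-time example): each random intercept or slope is summarized by a single standard deviation, and the per-subject adjustments (BLUPs) can be inspected directly.

```r
library(lme4)
m <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
VarCorr(m)              # one SD per random intercept/slope, plus their correlation
head(ranef(m)$Subject)  # per-subject adjustments (BLUPs), centred around 0
```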

## Specific models for every observation

• RT = 200 - 5WF + 3WL + 10SA (general model)
• The intercepts and slopes may vary (according to the estimated variation for each parameter) and this influences the word- and subject-specific values
• RT = 400 - 5WF + 3WL - 2SA (word: scythe)
• RT = 300 - 5WF + 3WL + 15SA (word: twitter)
• RT = 300 - 7WF + 3WL + 10SA (subject: non-native)
• RT = 150 - 5WF + 3WL + 10SA (subject: fast)
• And it is not hard to use!
• lmer( RT ~ WF + WL + SA + (1+SA|Wrd) + (1+WF|Subj) )
• (lmer automatically discovers random-effects structure: nested/crossed)

## Random slopes and intercepts may be (cor)related

• For example:
| Subject | Intercept | WF slope |
|---|---|---|
| S1 | 525 | -2 |
| S2 | 400 | -1 |
| S3 | 500 | -2 |
| S4 | 550 | -3 |
| S5 | 450 | -2 |
| S6 | 600 | -4 |
| S7 | 300 | 0 |
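With the (hypothetical) values above, the correlation is easy to verify; lmer estimates this correlation as part of the random-effects structure.

```r
intercept <- c(525, 400, 500, 550, 450, 600, 300)
wf_slope  <- c(-2, -1, -2, -3, -2, -4, 0)
cor(intercept, wf_slope)  # about -0.96: slower subjects show a stronger WF effect
```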

## BLUPs of lmer benefit from shrinkage

• The BLUPs (i.e. the adjustments to the model estimates per item/participant) are close to the real adjustments: lmer takes regression towards the mean into account (fast subjects will be slower next time, and slow subjects faster), thereby avoiding overfitting and improving prediction (see Efron & Morris, 1977)
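A sketch of shrinkage, again with lme4's sleepstudy data: the raw per-subject deviations span a wider range than the BLUPs, which are pulled towards 0.

```r
library(lme4)
m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
# Raw per-subject mean deviations vs. the shrunken BLUPs
raw  <- with(sleepstudy, tapply(Reaction, Subject, mean)) - mean(sleepstudy$Reaction)
blup <- ranef(m)$Subject[, "(Intercept)"]
range(raw)   # raw per-subject deviations
range(blup)  # narrower range: shrunk towards 0
```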

## Center your variables (i.e. subtract the mean)!

• Otherwise random slopes and intercepts may show a spurious correlation
• Also helps the interpretation of interactions (see this lecture)
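Centering is a one-liner; a minimal sketch:

```r
x   <- c(10, 12, 14, 20)
x_c <- x - mean(x)        # centred: mean is exactly 0
all.equal(mean(x_c), 0)
# equivalently: scale(x, scale = FALSE)
```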

## Mixed-effects regression assumptions

• Independent observations within each level of the random-effect factor
• Relation between dependent and independent variables linear
• No strong multicollinearity
• Residuals are not autocorrelated
• Homoscedasticity of variance in residuals
• Residuals are normally distributed
• (Similar assumptions as for regression)

## Model criticism

• Check the distribution of residuals: if not normally distributed and/or heteroscedastic residuals then transform dependent variable or use generalized linear mixed-effects regression modeling
• Check outlier characteristics and refit the model when large outliers are excluded to verify that your effects are not 'carried' by these outliers
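These checks might look as follows (a sketch with lme4's sleepstudy data; the 2.5 SD cutoff is one common choice, not the only one):

```r
library(lme4)
m <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
qqnorm(resid(m)); qqline(resid(m))  # residuals approximately normal?
plot(fitted(m), resid(m))           # homoscedastic?
# Refit without large outliers and compare the estimates
keep   <- abs(resid(m)) < 2.5 * sd(resid(m))
m_trim <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy[keep, ])
```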

## Model selection II

• My stepwise variable-selection procedure (for exploratory analysis):
• Include random intercepts
• Add other potential explanatory variables one-by-one
• Insignificant predictors are dropped
• Test predictors for inclusion which were excluded at an earlier stage
• Test possible interactions (don't make it too complex)
• Try to break the model by including significant predictors as random slopes
• Only choose a more complex model if supported by model comparison
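Model comparison can be carried out with a likelihood-ratio test (a sketch with sleepstudy data; anova() refits the models with ML when needed):

```r
library(lme4)
m0 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
m1 <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
anova(m0, m1)  # keep the random slope only if the more complex model fits significantly better
```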

## Model selection III

• For a hypothesis-driven analysis, stepwise selection is problematic
• Solutions:
• Careful specification of potential a priori models lining up with the hypotheses (including optimal random-effects structure) and evaluating only these models
• This may be followed by an exploratory procedure
• Validating a stepwise procedure via cross-validation (e.g., a bootstrap analysis)

## Case study: influence of alcohol on L1 and L2

• Reported in Wieling et al. (2019) and Offrede et al. (2020)
• Assess influence of alcohol (BAC) on L1 (clarity) vs. L2 (nativelikeness) ratings
• Prediction: higher BAC has negative effect on L1, but positive effect on L2 nativelikeness (based on Renner et al., 2018)
• ~80 recordings from native Dutch speakers included (all BACs < 0.8, no drugs)
• Dutch ratings were given by >100 native (sober) Dutch speakers (at Lowlands)
• English ratings were given by >100 native American English speakers (online)
• Dependent variable is $$z$$-transformed rating (5-point Likert scale)
• Numerical variables are centered (not $$z$$-transformed)

## Dataset

load("lls.rda")

#           SID   BAC Gender BirthYear L2cnt SelfEN LivedEN L2anxiety   Edu      LID Lang Rating
# 1 S0045188-17 0.392      F     -5.66 -1.04  0.463       Y     0.357 0.817 L0637009   NL  0.946
# 2 S0045188-17 0.392      F     -5.66 -1.04  0.463       Y     0.357 0.817     L196   EN  0.330
# 3 S0045188-17 0.392      F     -5.66 -1.04  0.463       Y     0.357 0.817      L86   EN -0.298
# 4 S0045188-17 0.392      F     -5.66 -1.04  0.463       Y     0.357 0.817 L0614758   NL  0.325
# 5 S0045188-17 0.392      F     -5.66 -1.04  0.463       Y     0.357 0.817     L220   EN -0.435
# 6 S0045188-17 0.392      F     -5.66 -1.04  0.463       Y     0.357 0.817     L225   EN -0.239


## Fitting our first model

#### (fitted in R version 4.0.5 (2021-03-31), lme4 version 1.1.27)

library(lme4)
m <- lmer(Rating ~ BAC + (1 | SID) + (1 | LID), data = lls)  # yields a singular fit

# boundary (singular) fit: see ?isSingular

summary(m)$coef  # show fixed effects

#             Estimate Std. Error t value
# (Intercept)    0.144     0.0638   2.252
# BAC           -0.221     0.3262  -0.677

summary(m)$varcor  # show random-effects part only: no variability per listener (due to z-scaling)

#  Groups   Name        Std.Dev.
#  LID      (Intercept) 0.000
#  SID      (Intercept) 0.560
#  Residual             0.738


## Evaluating our hypothesis (1)

#### (note: random intercept for LID excluded)

m1 <- lmer(Rating ~ Lang * BAC + (1 | SID), data = lls)

• This model represents our hypothesis test (but likely without the correct random-effects structure)

## Results (likely overestimating significance)

summary(m1, cor = F)  # suppress expected correlation of regression coefficients; Note: |t| > 2 => p < .05 (N > 100)

# Linear mixed model fit by REML ['lmerMod']
# Formula: Rating ~ Lang * BAC + (1 | SID)
#    Data: lls
#
# REML criterion at convergence: 5853
#
# Scaled residuals:
#    Min     1Q Median     3Q    Max
# -3.668 -0.647  0.032  0.707  3.187
#
# Random effects:
#  Groups   Name        Variance Std.Dev.
#  SID      (Intercept) 0.305    0.552
#  Residual             0.533    0.730
# Number of obs: 2541, groups:  SID, 82
#
# Fixed effects:
#             Estimate Std. Error t value
# (Intercept)   0.0883     0.0635    1.39
# LangNL        0.2430     0.0358    6.78
# BAC          -0.0592     0.3255   -0.18
# LangNL:BAC   -0.7286     0.1938   -3.76


## Visualization helps interpretation

library(visreg)
visreg(m1, "BAC", by = "Lang", overlay = T, xlab = "BAC (centered)", ylab = "Rating")