Generalized additive modeling

Martijn Wieling
University of Groningen

This lecture

  • Introduction
    • Generalized additive modeling
    • Articulography
    • Using articulography to study L2 pronunciation differences
  • Design
  • Methods: R code
  • Results
  • Discussion

Generalized additive modeling (1)

  • Generalized additive model (GAM): relaxing assumption of linear relation between dependent variable and predictor
  • Relationship between individual predictors and (possibly transformed) dependent variable is estimated by a non-linear smooth function: \(g(y) = s(x_1) + s(x_2,x_3) + \beta_4x_4 + ...\)
    • Multiple predictors can be combined in a (hyper)surface smooth (other lecture)
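As a minimal illustration (simulated data, not the lecture data), a GAM combining one smooth and one linear term can be fitted with mgcv's gam(); an edf well above 1 signals a non-linear effect:

```r
# Sketch: a GAM with a smooth of x1 and a parametric (linear) term x4
library(mgcv)

set.seed(1)
n  <- 200
x1 <- runif(n)
x4 <- runif(n)
y  <- sin(2 * pi * x1) + 0.5 * x4 + rnorm(n, sd = 0.2)  # non-linear in x1, linear in x4

m <- gam(y ~ s(x1) + x4)  # s() estimates the non-linearity; x4 stays parametric
summary(m)$s.table        # edf > 1 indicates a non-linear effect of x1
```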

Question 1

Generalized additive modeling (2)

  • Advantage of GAM over manual specification of non-linearities: the optimal shape of the non-linearity is determined automatically
  • The appropriate degree of smoothness is determined automatically by minimizing a combination of model error and "wiggliness" (guarding against overfitting)
  • The maximum number of basis functions limits how non-linear the smooth can become
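A small sketch on simulated data of how the number of basis functions caps the flexibility of a smooth: with k = 5 the edf can never exceed 4, regardless of how wiggly the data are, while a larger k gives the penalty room to choose more non-linearity only if the data demand it:

```r
# Sketch: k caps the effective degrees of freedom (edf) of a smooth
library(mgcv)

set.seed(1)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.2)

m_k5  <- gam(y ~ s(x, k = 5))   # at most k - 1 = 4 effective degrees of freedom
m_k20 <- gam(y ~ s(x, k = 20))  # room for more wiggliness if needed
c(k5 = summary(m_k5)$s.table[1, "edf"], k20 = summary(m_k20)$s.table[1, "edf"])
```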

First ten basis functions


Generalized additive modeling (3)

  • Choosing a smoothing basis
    • Single predictor or isotropic predictors: thin plate regression spline (this lecture)
      • Efficient approximation of the optimal (thin plate) spline
    • Combining non-isotropic predictors: tensor product spline
  • Generalized Additive Mixed Modeling:
    • Random effects can be treated as smooths as well (Wood, 2008)
    • R: gam and bam (package mgcv)
  • For more (mathematical) details, see Wood (2006) and Wood (2017)
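The basis choices above can be sketched in mgcv syntax (simulated data; bs = "tp" is the default single-predictor basis, and te() builds a tensor product for predictors on different scales):

```r
# Sketch: thin plate regression spline vs. tensor product smooth
library(mgcv)

set.seed(1)
n <- 400
x <- runif(n)            # e.g. normalized time, 0..1
z <- runif(n, 0, 100)    # a predictor on a very different scale (non-isotropic)
y <- sin(2 * pi * x) * cos(z / 20) + rnorm(n, sd = 0.2)

m_tp <- gam(y ~ s(x, bs = "tp"))  # thin plate regression spline (the default)
m_te <- gam(y ~ te(x, z))         # scale-invariant tensor product surface
</gam_placeholder>
```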

Articulography

Obtaining data

Recorded data

Present study: goal and setup

  • 19 native Dutch speakers from Groningen
  • 22 native Standard Southern British English speakers from London
  • Material: 10 minimal pairs [t]:[θ] repeated twice:
    • 'tent'-'tenth', 'fate'-'faith', 'fort'-'forth', 'kit'-'kith', 'mitt'-'myth'
    • 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
    • Note that the sound [θ] does not exist in the Dutch language
  • Goal: compare how the two speaker groups realize this sound contrast
  • Preprocessing:
    • Articulatory segmentation: gestural onset to offset (within /ə/ context)
    • Positions \(z\)-transformed per axis and time normalized (from 0 to 1) per speaker
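The preprocessing steps above can be sketched as follows (hypothetical column names mirroring the data overview below; this is not the actual preprocessing script):

```r
# Sketch: z-transform positions and normalize time to 0..1, per speaker
set.seed(1)
d <- data.frame(Participant = rep(c("P1", "P2"), each = 50),
                Time = c(seq(0, 0.8, length.out = 50),   # raw durations differ
                         seq(0, 1.2, length.out = 50)),
                TTfront = rnorm(100))

# z-transform the position per speaker
d$TTfront <- ave(d$TTfront, d$Participant, FUN = function(v) as.numeric(scale(v)))

# normalize time to the interval [0, 1] per speaker
d$Time <- ave(d$Time, d$Participant,
              FUN = function(t) (t - min(t)) / (max(t) - min(t)))
```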

Data: much individual variation and noisy data


Data overview

load("full.rda")  # only includes anterior-posterior position of tongue tip sensor
head(full)
#   Participant Group Word Sound  Loc Trial   Time TTfront
# 1   VENI_EN_1    EN tick     T Init     1 0.0000  -0.392
# 2   VENI_EN_1    EN tick     T Init     1 0.0161  -0.440
# 3   VENI_EN_1    EN tick     T Init     1 0.0323  -0.440
# 4   VENI_EN_1    EN tick     T Init     1 0.0484  -0.503
# 5   VENI_EN_1    EN tick     T Init     1 0.0645  -0.513
# 6   VENI_EN_1    EN tick     T Init     1 0.0806  -0.677
dim(full)
# [1] 126177      8

First model: tongue frontness for "tenth" and "tent"

(R version 3.5.1 (2018-07-02), mgcv version 1.8.24, itsadug version 2.3)

library(mgcv)
library(itsadug)
tt <- full[full$Word %in% c("tenth", "tent"), ]

# sort data per individual trajectory (necessary to detect autocorrelation)
tt <- start_event(tt, event = c("Participant", "Trial"))

# fit first GAM
m0 <- bam(TTfront ~ s(Time), data = tt)

Model summary

summary(m0)
# 
# Family: gaussian 
# Link function: identity 
# 
# Formula:
# TTfront ~ s(Time)
# 
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.40281    0.00825    48.8   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Approximate significance of smooth terms:
#         edf Ref.df   F p-value    
# s(Time) 8.4    8.9 170  <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# R-sq.(adj) =  0.106   Deviance explained = 10.6%
# fREML =  17375  Scale est. = 0.8739    n = 12839

Visualizing the non-linear time pattern of the model

(Interpreting GAM results always involves visualization)

plot(m0, rug = F, shade = T, main = "Partial effect", ylab = "TTfront", ylim = c(-1, 1))
plot_smooth(m0, view = "Time", rug = F, main = "Full effect", ylim = c(-1, 1))


Check if number of basis functions is adequate

(k may be too low if the p-value is low and the edf is close to k')

gam.check(m0)
# 
# Method: fREML   Optimizer: perf newton
# full convergence after 9 iterations.
# Gradient range [-1.25e-07,1.02e-07]
# (score 17375 & scale 0.874).
# Hessian positive definite, eigenvalue range [3.46,6419].
# Model rank =  10 / 10 
# 
# Basis dimension (k) checking results. Low p-value (k-index<1) may
# indicate that k is too low, especially if edf is close to k'.
# 
#          k' edf k-index p-value
# s(Time) 9.0 8.4    1.02    0.84

Increasing the number of basis functions with \(k\)

(double \(k\) when more basis functions are needed, but \(k\) cannot exceed the number of data points)

m0b <- bam(TTfront ~ s(Time, k = 20), data = tt)
summary(m0b)$s.table  # extract smooths from summary: new edf
#          edf Ref.df   F   p-value
# s(Time) 11.2   13.5 112 1.91e-304
summary(m0)$s.table  # original edf
#         edf Ref.df   F   p-value
# s(Time) 8.4    8.9 170 1.24e-310

Effect of increasing \(k\)

plot_smooth(m0, view = "Time", rug = F, main = "m0 (k=10)")
plot_smooth(m0b, view = "Time", rug = F, main = "m0b (k=20)")


Including individual variation: random intercepts

m1 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re"), data = tt)
summary(m1)  # starting from here, slides only show the relevant part of the summary
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    0.432     0.0648    6.67 2.75e-11 ***
# 
# Approximate significance of smooth terms:
#                  edf Ref.df     F p-value    
# s(Time)         8.49   8.93 206.1  <2e-16 ***
# s(Participant) 40.41  41.00  65.6  <2e-16 ***
# 
# Deviance explained = 26.3%
  • Deviance explained of model without random intercept (\(r^2\)): 10.6%

Effect of including a random intercept

plot_smooth(m0, view = "Time", rug = F, main = "m0", ylim = c(-0.5, 1.5))
plot_smooth(m1, view = "Time", rug = F, main = "m1", ylim = c(-0.5, 1.5), rm.ranef = T)


Including individual variation: random slopes

m2 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re") + s(Participant, Time, bs = "re"), 
    data = tt)
summary(m2)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    0.432      0.106    4.08  4.5e-05 ***
# 
# Approximate significance of smooth terms:
#                       edf Ref.df     F p-value    
# s(Time)              8.53   8.94   163  <2e-16 ***
# s(Participant)      39.91  41.00  9294  <2e-16 ***
# s(Participant,Time) 39.12  41.00 10292  <2e-16 ***
# 
# Deviance explained = 31.7%

Effect of including a random slope

plot_smooth(m1, view = "Time", rug = F, rm.ranef = T, main = "m1", ylim = c(-0.5, 1.5))
plot_smooth(m2, view = "Time", rug = F, rm.ranef = T, main = "m2", ylim = c(-0.5, 1.5))


Random effects in lme4 vs. mgcv

  • (1|Subject): s(Subject, bs="re")
  • (0 + WF|Subject): s(Subject, WF, bs="re")
  • (1 + WF|Subject): correlated random intercepts and slopes are not possible in mgcv (combining the two terms above yields the uncorrelated equivalent)
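The first correspondence can be sketched on simulated data (not the lecture data): a random intercept per subject is just a smooth with basis "re", and gam.vcomp() reports its variance component, comparable to lme4's random-effect standard deviations:

```r
# Sketch: random intercepts via bs = "re" in mgcv, ~ (1|subj) in lme4
library(mgcv)

set.seed(1)
subj <- factor(rep(1:10, each = 30))
x    <- runif(300)
y    <- 2 + rnorm(10)[subj] + sin(2 * pi * x) + rnorm(300, sd = 0.3)
d    <- data.frame(y, x, subj)

m_re <- gam(y ~ s(x) + s(subj, bs = "re"), data = d)
gam.vcomp(m_re)  # standard deviations of the random effect and the residual
```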

Including individual variation: factor smooths

(i.e. including a non-linear random effect instead of a random intercept and random slope)

m3 <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt)
summary(m3)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    0.414     0.0802    5.16 2.54e-07 ***
# 
# Approximate significance of smooth terms:
#                        edf Ref.df  F p-value    
# s(Time)               8.15   8.52 24  <2e-16 ***
# s(Time,Participant) 287.25 377.00 16  <2e-16 ***
# 
# Deviance explained = 40.6%

Visualization of individual variation

plot(m3, select = 2)


Effect of including a random non-linear effect

plot_smooth(m2, view = "Time", rug = F, main = "m2", ylim = c(-0.5, 1.2), rm.ranef = T)
plot_smooth(m3, view = "Time", rug = F, main = "m3", ylim = c(-0.5, 1.2), rm.ranef = T)


Influence of random effects

Speeding up computation

(via discrete and nthreads, only for default method="fREML")

system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt))
#    user  system elapsed 
#   20.04   32.52    4.44
system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, 
    discrete = T))
#    user  system elapsed 
#    7.06   12.21    2.18
system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, 
    discrete = T, nthreads = 2))
#    user  system elapsed 
#    5.33    6.90    2.33
  • Speed improvement with nthreads is larger for more complex models

Comparing "tenth" and "tent" in one model

(smooths are centered, so the factorial predictor also needs to be included in the fixed effects)

m4 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1), 
    data = tt, discrete = T, nthreads = 2)
summary(m4)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    0.072     0.0833    0.86    0.388    
# Wordtenth      0.690     0.0118   58.66   <2e-16 ***
# 
# Approximate significance of smooth terms:
#                        edf Ref.df     F  p-value    
# s(Time):Wordtent      7.28   7.93  9.37 1.09e-12 ***
# s(Time):Wordtenth     8.51   8.79 43.37   <2e-16 ***
# s(Time,Participant) 306.18 377.00 22.54   <2e-16 ***
  • Does the word distinction improve the model?

Assessing model improvement

(when models differ in their fixed effects, use method='ML' and do not use discrete=T)

m3.ml <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, method = "ML")
m4.ml <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1), 
    data = tt, method = "ML")
compareML(m3.ml, m4.ml)  # model m4.ml is much better!
# m3.ml: TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1)
# 
# m4.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", 
#     m = 1)
# 
# Chi-square test of ML scores
# -----
#   Model Score Edf Difference    Df  p.value Sig.
# 1 m3.ml 15292   5                               
# 2 m4.ml 13298   8   1993.465 3.000  < 2e-16  ***
# 
# AIC difference: 4105.39, model m4.ml has lower AIC.
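compareML() comes from the itsadug package; with mgcv alone, ML-fitted models differing in their fixed effects can also be compared by AIC, sketched here on simulated data (not the lecture data):

```r
# Sketch: AIC comparison of ML-fitted models with and without a group distinction
library(mgcv)

set.seed(1)
g <- factor(rep(c("a", "b"), each = 150))
x <- runif(300)
y <- sin(2 * pi * x) + ifelse(g == "b", 0.8, 0) + rnorm(300, sd = 0.3)
d <- data.frame(y, x, g)

m0.ml <- bam(y ~ s(x), data = d, method = "ML")
m1.ml <- bam(y ~ s(x, by = g) + g, data = d, method = "ML")  # by-smooth + fixed effect
AIC(m0.ml, m1.ml)  # lower AIC = preferred; here m1.ml wins by construction
```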

Visualizing the two patterns

plot_smooth(m4, view = "Time", rug = F, plot_all = "Word", main = "m4", ylim = c(-0.5, 
    2), rm.ranef = T)


Visualizing the difference

plot_diff(m4, view = "Time", comp = list(Word = c("tenth", "tent")), ylim = c(-0.5, 2), 
    rm.ranef = T)


Question 2

Including individual variation for "tenth" vs. "tent"

m5 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs", 
    m = 1), data = tt, discrete = T, nthreads = 2)
summary(m5)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    0.091     0.0973    0.94    0.349    
# Wordtenth      0.598     0.1431    4.18 2.92e-05 ***
# 
# Approximate significance of smooth terms:
#                                  edf Ref.df     F  p-value    
# s(Time):Wordtent                7.56   7.97  9.62 4.01e-13 ***
# s(Time):Wordtenth               8.40   8.56 22.61   <2e-16 ***
# s(Time,Participant):Wordtent  315.77 377.00 38.05   <2e-16 ***
# s(Time,Participant):Wordtenth 327.33 368.00 43.14   <2e-16 ***

More uncertainty in the difference

plot_diff(m4, view="Time", comp=list(Word=c("tenth","tent")), ylim=c(-0.5,2), 
    main="m4: difference", rm.ranef=T) 
plot_diff(m5, view="Time", comp=list(Word=c("tenth","tent")), ylim=c(-0.5,2), 
    main="m5: difference", rm.ranef=T) 


Autocorrelation in the data is a problem!

(residuals should be independent, otherwise the standard errors and p-values are wrong)

m5acf <- acf_resid(m5)  # show autocorr.

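The mechanism can be sketched end-to-end on simulated data with AR(1) noise (not the articulography data): estimate the lag-1 residual autocorrelation from an uncorrected model, then refit with rho and AR.start, exactly as done for m6 below:

```r
# Sketch: detecting and correcting AR(1) autocorrelation in bam
library(mgcv)

set.seed(1)
n <- 500
x <- seq(0, 1, length.out = n)
e <- as.numeric(arima.sim(list(ar = 0.9), n))   # strongly autocorrelated noise
y <- sin(2 * pi * x) + e
d <- data.frame(y, x, start.event = c(TRUE, rep(FALSE, n - 1)))  # one trajectory

m_no <- bam(y ~ s(x), data = d)                 # uncorrected model
rho1 <- acf(resid(m_no), plot = FALSE)$acf[2]   # lag-1 residual autocorrelation
m_ar <- bam(y ~ s(x), data = d, rho = rho1, AR.start = d$start.event)
```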

Correcting for autocorrelation

rhoval <- m5acf[2]  # correlation of residuals at time t with those at time t-1 (about 0.93)
m6 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs", 
    m = 1), data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T, nthreads = 2)
summary(m6)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.0855     0.0878    0.97    0.330    
# Wordtenth     0.5905     0.1294    4.56 5.05e-06 ***
# 
# Approximate significance of smooth terms:
#                                  edf Ref.df     F  p-value    
# s(Time):Wordtent                7.41   8.08  8.89 2.15e-12 ***
# s(Time):Wordtenth               8.32   8.60 22.08   <2e-16 ***
# s(Time,Participant):Wordtent  228.84 377.00  2.87   <2e-16 ***
# s(Time,Participant):Wordtenth 267.20 368.00  3.82   <2e-16 ***

Autocorrelation has been removed

par(mfrow = c(1, 2))
acf_resid(m5, main = "ACF of m5")
acf_resid(m6, main = "ACF of m6")


Clear model improvement

# m5.ml and m6.ml: as m5 and m6, but refitted with method = "ML" (not shown)
compareML(m5.ml, m6.ml)
# m5.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, 
#     bs = "fs", m = 1)
# 
# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, 
#     bs = "fs", m = 1)
# 
# Model m6.ml preferred: lower ML score (12683.931), and equal df (0.000).
# -----
#   Model Score Edf Difference    Df
# 1 m5.ml  9393  10                 
# 2 m6.ml -3291  10 -12683.931 0.000
# 
# AIC difference: 24357.55, model m6.ml has lower AIC.

Question 3

Random effects and autocorrelation

Distinguishing the two speaker groups

tt$WordGroup <- interaction(tt$Word, tt$Group)
m7 <- bam(TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant, by = Word, bs = "fs", 
    m = 1), data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T, nthreads = 2)
summary(m7)
# Parametric coefficients:
#                   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)        -0.0955      0.119   -0.80    0.423    
# WordGrouptenth.EN   0.7452      0.174    4.27 1.94e-05 ***
# WordGrouptent.NL    0.3731      0.174    2.14    0.032 *  
# WordGrouptenth.NL   0.8047      0.182    4.42    1e-05 ***
# 
# Approximate significance of smooth terms:
#                                  edf Ref.df     F  p-value    
# s(Time):WordGrouptent.EN        3.86   4.56  2.30    0.043 *  
# s(Time):WordGrouptenth.EN       7.94   8.36 16.08   <2e-16 ***
# s(Time):WordGrouptent.NL        7.46   8.13 10.86 9.74e-16 ***
# s(Time):WordGrouptenth.NL       7.74   8.21 12.04   <2e-16 ***
# s(Time,Participant):Wordtent  217.47 376.00  2.66   <2e-16 ***
# s(Time,Participant):Wordtenth 254.62 367.00  3.46   <2e-16 ***

Visualizing the patterns

plot_smooth(m7, view = "Time", rug = F, plot_all = "WordGroup", main = "m7", rm.ranef = T, ylim = c(-1, 2))
plot_diff(m7, view = "Time", rm.ranef = T, comp = list(WordGroup = c("tenth.EN", "tent.EN")), main = "EN difference", 
    ylim = c(-1, 2))
plot_diff(m7, view = "Time", rm.ranef = T, comp = list(WordGroup = c("tenth.NL", "tent.NL")), main = "NL difference", 
    ylim = c(-1, 2))


Including language is necessary

compareML(m6.ml, m7.ml)  # both refitted with method = "ML" (not shown)
# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, 
#     bs = "fs", m = 1)
# 
# m7.ml: TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant, 
#     by = Word, bs = "fs", m = 1)
# 
# Chi-square test of ML scores
# -----
#   Model Score Edf Difference    Df   p.value Sig.
# 1 m6.ml -3291  10                                
# 2 m7.ml -3306  16     14.985 6.000 3.982e-05  ***
# 
# AIC difference: 7.61, model m7.ml has lower AIC.

The full tongue tip model: all words

(Word is now a random-effect factor, Sound distinguishes T from TH words)

full$SoundGroup <- interaction(full$Sound, full$Group)
full <- start_event(full, event = c("Participant", "Trial"))

# duration discrete=F: 105 sec., discrete=T: 54 s. (1 thr.) / 37 s. (2 thr.)
system.time(model <- bam(TTfront ~ s(Time, by = SoundGroup) + SoundGroup + s(Time, Participant, 
    by = Sound, bs = "fs", m = 1) + s(Time, Word, by = Group, bs = "fs", m = 1), data = full, 
    rho = rhoval, AR.start = full$start.event, discrete = T, nthreads = 4))
#    user  system elapsed 
#    74.8    17.1    21.6

The full tongue tip model: results

summary(model, re.test = FALSE)  # exclude random effects from summary (much faster)
# Parametric coefficients:
#                 Estimate Std. Error t value Pr(>|t|)    
# (Intercept)      -0.0602     0.0969   -0.62    0.535    
# SoundGroupTH.EN   0.6397     0.1581    4.05  5.2e-05 ***
# SoundGroupT.NL    0.1052     0.1461    0.72    0.472    
# SoundGroupTH.NL   0.4340     0.1642    2.64    0.008 ** 
# 
# Approximate significance of smooth terms:
#                          edf Ref.df    F  p-value    
# s(Time):SoundGroupT.EN  5.16   5.49 2.96    0.014 *  
# s(Time):SoundGroupTH.EN 6.61   6.86 6.54 1.39e-07 ***
# s(Time):SoundGroupT.NL  7.25   7.44 4.38 2.68e-05 ***
# s(Time):SoundGroupTH.NL 6.90   7.12 5.88  1.9e-06 ***
# 
# Deviance explained = 53.5%

L1-based differences

plot_diff(model, view = 'Time', rm.ranef = T, ylim=c(-0.5, 1.5),
    comp = list(SoundGroup = c('TH.EN', 'T.EN')))
plot_diff(model, view = 'Time', rm.ranef = T, ylim=c(-0.5, 1.5),
    comp = list(SoundGroup = c('TH.NL', 'T.NL')))


Discussion

  • Native English speakers appear to make the /t/-/θ/ distinction, whereas Dutch L2 speakers of English generally do not
    • But this does not mean the difference between these distinctions is significant (see this tutorial)

Question 4

How to report?

Recap

  • We have applied GAMs to articulography data and learned how to:
    • use s(Time) to model a non-linearity over time
    • use the k parameter to control the number of basis functions
    • use the plotting functions plot_smooth and plot_diff (rm.ranef=T!)
    • use the parameter setting bs="re" to add random intercepts and slopes
    • add non-linear random effects using s(Time,...,bs="fs",m=1)
    • use the by-parameter to obtain separate non-linearities
      • Note that this factorial predictor also needs to be included in fixed effects!
    • use compareML to compare models (comparing fixed effects: method='ML')
  • Associated lab session:

Evaluation

Questions?

Thank you for your attention!

http://www.martijnwieling.nl
wieling@gmail.com