Generalized additive modeling

Martijn Wieling
University of Groningen

This lecture

  • Introduction
    • Generalized additive modeling
    • Articulography
    • Using articulography to study L2 pronunciation differences
  • Design
  • Methods: R code
  • Results
  • Discussion

Generalized additive modeling (1)

  • Generalized additive model (GAM): relaxing assumption of linear relation between dependent variable and predictor
  • Relationship between individual predictors and the (possibly transformed) dependent variable is estimated by non-linear smooth functions: \(g(y) = s(x_1) + s(x_2, x_3) + \beta_4 x_4 + ...\)
    • Multiple predictors can be combined in a (hyper)surface smooth (next lecture)
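As a minimal sketch of this idea (simulated data, unrelated to the lecture's data set; variable names are illustrative), mgcv estimates the shape of the smooth automatically:

```r
library(mgcv)  # gam/bam and thin plate regression splines

set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)  # clearly non-linear relation plus noise

m <- gam(y ~ s(x))   # s(x): shape of the non-linearity is estimated from the data
summary(m)$s.table   # edf well above 1 signals a non-linear effect
```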

Question 1

Generalized additive modeling (2)

  • Advantage of GAM over manual specification of non-linearities: the optimal shape of the non-linearity is determined automatically
  • Appropriate degree of smoothness is determined automatically by minimizing combined error and "wiggliness" (preventing overfitting)
  • Maximum number of basis functions limits the maximum amount of non-linearity
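A small simulated sketch (illustrative variable names) of how the basis dimension \(k\) caps the non-linearity: the edf of a single thin plate smooth can never exceed \(k' = k - 1\):

```r
library(mgcv)

set.seed(1)
x <- runif(300)
y <- sin(4 * pi * x) + rnorm(300, sd = 0.2)  # very wiggly underlying pattern

m_small <- gam(y ~ s(x, k = 4))   # at most k' = 3 effective degrees of freedom
m_large <- gam(y ~ s(x, k = 20))  # enough room for the wiggly pattern

c(small = summary(m_small)$s.table[1, "edf"],
  large = summary(m_large)$s.table[1, "edf"])
```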

First ten basis functions


Generalized additive modeling (3)

  • Choosing a smoothing basis
    • Single predictor or isotropic predictors: thin plate regression spline (this lecture)
      • Efficient approximation of the optimal (thin plate) spline
    • Combining non-isotropic predictors: tensor product spline
  • Generalized Additive Mixed Modeling:
    • Random effects can be treated as smooths as well (Wood, 2008)
    • R: gam and bam (package mgcv)
  • For more (mathematical) details, see Wood (2006) and Wood (2017)
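As a sketch of these options on simulated data (all names illustrative): a tensor product smooth `te()` combines two non-isotropic predictors, and `bs = "re"` adds a random intercept as a smooth:

```r
library(mgcv)

set.seed(1)
d <- data.frame(x = runif(400), z = runif(400),
                subj = factor(sample(letters[1:8], 400, replace = TRUE)))
subj_eff <- rnorm(8)  # per-subject offsets
d$y <- sin(2 * pi * d$x) * d$z + subj_eff[as.integer(d$subj)] +
       rnorm(400, sd = 0.3)

m <- bam(y ~ te(x, z)             # tensor product: x and z on different scales
           + s(subj, bs = "re"),  # random intercept as a smooth (Wood, 2008)
         data = d)
summary(m)$s.table
```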

Articulography

Obtaining data

Recorded data

Differences between native and non-native English

  • 19 native Dutch speakers from Groningen
  • 22 native Southern Standard British English speakers from London
  • Material: 10 minimal pairs [t]:[θ] repeated twice:
    • 'tent'-'tenth', 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth'
    • 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
    • Note that the sound [θ] does not exist in the Dutch language
  • Goal: compare how this sound contrast is realized by the two groups
  • Preprocessing:
    • Articulatory segmentation: gestural onset to offset (within /ə/ context)
    • Positions \(z\)-transformed per axis and time normalized per speaker between 0 and 1
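These two preprocessing steps might be sketched in base R as follows (column names follow the `art` data frame shown in the next slide; whether to z-transform per axis only or per axis and speaker is a design choice assumed here):

```r
# z-transform the positions per axis
z_per_axis <- function(d) {
  d$Position <- ave(d$Position, d$Axis,
                    FUN = function(p) (p - mean(p)) / sd(p))
  d
}

# normalize time to [0, 1] within each trajectory
# (assumes every trajectory spans more than one time point)
norm_time <- function(d) {
  d$Time <- ave(d$Time, interaction(d$Participant, d$Trial),
                FUN = function(t) (t - min(t)) / (max(t) - min(t)))
  d
}
```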

Data overview

load("art.rda")
head(art)
#   Participant Group Word Sound  Loc Trial Sensor Axis   Time Position
# 1   VENI_EN_1    EN tick     T Init     1     TB    X 0.0000   -0.567
# 2   VENI_EN_1    EN tick     T Init     1     TB    X 0.0161   -0.533
# 3   VENI_EN_1    EN tick     T Init     1     TB    X 0.0323   -0.503
# 4   VENI_EN_1    EN tick     T Init     1     TB    X 0.0484   -0.424
# 5   VENI_EN_1    EN tick     T Init     1     TB    X 0.0645   -0.403
# 6   VENI_EN_1    EN tick     T Init     1     TB    X 0.0806   -0.164
dim(art)
# [1] 798942     10

Much individual variation and noisy data


First model: tongue frontness for "tenth" and "tent"

(R version 3.4.3 (2017-11-30), mgcv version 1.8.23, itsadug version 2.3)

library(mgcv)
library(itsadug)
tt <- art[art$Word %in% c("tenth", "tent") & art$Sensor == "TT" & art$Axis == "X", ]
tt$TTfront <- 0 - tt$Position  # flip sign: higher values now indicate a more front position

# sort data per individual trajectory
tt <- start_event(tt, event = c("Participant", "Trial"))

# fit first GAM
m0 <- bam(TTfront ~ s(Time), data = tt)

Model summary

summary(m0)
# 
# Family: gaussian 
# Link function: identity 
# 
# Formula:
# TTfront ~ s(Time)
# 
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.44155    0.00735    60.1   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Approximate significance of smooth terms:
#          edf Ref.df   F p-value    
# s(Time) 8.45   8.91 187  <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# R-sq.(adj) =  0.0968   Deviance explained = 9.73%
# fREML =  20687  Scale est. = 0.8381    n = 15530

Visualizing the non-linear time pattern of the model

(Interpreting GAM results always involves visualization)

plot(m0, rug = F, shade = T, main = "Partial effect", ylab = "TTfront", ylim = c(-1, 1))
plot_smooth(m0, view = "Time", rug = F, main = "Full effect", ylim = c(-1, 1))


Check if number of basis functions is adequate

(k may be too low when the p-value is low and edf is close to k')

gam.check(m0)
# 
# Method: fREML   Optimizer: perf newton
# full convergence after 9 iterations.
# Gradient range [-5.89e-07,5.05e-07]
# (score 20687 & scale 0.838).
# Hessian positive definite, eigenvalue range [3.55,7764].
# Model rank =  10 / 10 
# 
# Basis dimension (k) checking results. Low p-value (k-index<1) may
# indicate that k is too low, especially if edf is close to k'.
# 
#           k'  edf k-index p-value  
# s(Time) 9.00 8.45    0.98    0.08 .
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Increasing the number of basis functions with \(k\)

(double \(k\) when a higher \(k\) is needed, but do not exceed the number of data points)

m0b <- bam(TTfront ~ s(Time, k = 20), data = tt)
summary(m0b)$s.table  # extract smooths from summary: new edf
#          edf Ref.df   F p-value
# s(Time) 11.5   13.9 120       0
summary(m0)$s.table  # original edf
#          edf Ref.df   F p-value
# s(Time) 8.45   8.91 187       0

Effect of increasing \(k\)

plot_smooth(m0, view = "Time", rug = F, main = "m0 (k=10)")
plot_smooth(m0b, view = "Time", rug = F, main = "m0b (k=20)")


Including individual variation: random intercepts

m1 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re"), data = tt)
summary(m1)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.444     0.0558    7.94 2.09e-15
# 
# Approximate significance of smooth terms:
#                  edf Ref.df     F p-value
# s(Time)         8.52   8.93 215.6  <2e-16
# s(Participant) 41.32  42.00  56.9  <2e-16
  • Explained variance of model without random intercept (\(r^2\)): 0.1
  • Explained variance of model with random intercept (\(r^2\)): 0.22

Effect of including a random intercept

plot_smooth(m0, view = "Time", rug = F, main = "m0", ylim = c(-0.5, 1.5))
plot_smooth(m1, view = "Time", rug = F, main = "m1", ylim = c(-0.5, 1.5), rm.ranef = T)


Including individual variation: random slopes

m2 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re") + s(Participant, Time, bs = "re"), 
    data = tt)
summary(m2)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.444      0.103    4.29 1.81e-05
# 
# Approximate significance of smooth terms:
#                       edf Ref.df     F p-value
# s(Time)              8.54   8.94   177  <2e-16
# s(Participant)      41.06  42.00  9914  <2e-16
# s(Participant,Time) 40.08  42.00 10890  <2e-16

Effect of including a random slope

plot_smooth(m1, view = "Time", rug = F, rm.ranef = T, main = "m1", ylim = c(-0.5, 1.5))
plot_smooth(m2, view = "Time", rug = F, rm.ranef = T, main = "m2", ylim = c(-0.5, 1.5))


Including individual variation: factor smooths

(i.e. including a non-linear random effect instead of a random intercept and random slope)

m3 <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt)
summary(m3)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.425     0.0646    6.57 5.23e-11
# 
# Approximate significance of smooth terms:
#                        edf Ref.df    F p-value
# s(Time)               8.18   8.53 26.5  <2e-16
# s(Time,Participant) 292.81 386.00 14.3  <2e-16

Visualization of individual variation

plot(m3, select = 2)


Effect of including a random non-linear effect

plot_smooth(m2, view = "Time", rug = F, main = "m2", ylim = c(-0.5, 1.5), rm.ranef = T)
plot_smooth(m3, view = "Time", rug = F, main = "m3", ylim = c(-0.5, 1.5), rm.ranef = T)


Explore random effects yourself!

Speeding up computation

(via discrete and nthreads, only for default method="fREML")

system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt))
#    user  system elapsed 
#   23.63   37.74    5.24
system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, 
    discrete = T))
#    user  system elapsed 
#    7.69   13.99    2.52
system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, 
    discrete = T, nthreads = 2))
#    user  system elapsed 
#    5.66    7.67    2.43
  • Speed improvement with nthreads is larger for more complex models

Comparing "tenth" and "tent" in one model

(smooths are centered, so the factorial predictor also needs to be included in the fixed effects)

m4 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1), 
    data = tt, discrete = T, nthreads = 2)
summary(m4)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.186     0.0672    2.77    0.006
# Wordtenth      0.580     0.0119   48.72   <2e-16
# 
# Approximate significance of smooth terms:
#                        edf Ref.df    F p-value
# s(Time):Wordtent      7.46   8.07 12.3  <2e-16
# s(Time):Wordtenth     8.51   8.81 49.0  <2e-16
# s(Time,Participant) 306.10 386.00 18.5  <2e-16
  • Does the word distinction improve the model?

Assessing model improvement

(comparing fixed effects, so method='ML' and no discrete=T)

m3.ml <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, method = "ML")
m4.ml <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1), 
    data = tt, method = "ML")
compareML(m3.ml, m4.ml)  # model m4.ml is much better (AIC decrease > 2)!
# m3.ml: TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1)
# 
# m4.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", 
#     m = 1)
# 
# Chi-square test of ML scores
# -----
#   Model Score Edf Difference    Df  p.value Sig.
# 1 m3.ml 18724   5                               
# 2 m4.ml 17156   8   1567.911 3.000  < 2e-16  ***
# 
# AIC difference: 3222.06, model m4.ml has lower AIC.

Visualizing the two patterns

plot_smooth(m4, view = "Time", rug = F, plot_all = "Word", main = "m4", ylim = c(-0.5, 
    2), rm.ranef = T)


Visualizing the difference

plot_diff(m4, view = "Time", comp = list(Word = c("tenth", "tent")), ylim = c(-0.5, 2), 
    rm.ranef = T)


Question 2

Including individual variation for "tenth" vs. "tent"

m5 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs", 
    m = 1), data = tt, discrete = T, nthreads = 2)
summary(m5)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.171     0.0803    2.12    0.034
# Wordtenth      0.529     0.1314    4.02 5.77e-05
# 
# Approximate significance of smooth terms:
#                                  edf Ref.df    F  p-value
# s(Time):Wordtent                7.64   8.05 11.4 2.15e-16
# s(Time):Wordtenth               8.35   8.55 22.6   <2e-16
# s(Time,Participant):Wordtent  316.56 386.00 31.4   <2e-16
# s(Time,Participant):Wordtenth 316.59 368.00 31.4   <2e-16

More uncertainty in the difference

plot_diff(m4, view="Time", comp=list(Word=c("tenth","tent")), ylim=c(-0.5,2), 
    main="m4: difference", rm.ranef=T) 
plot_diff(m5, view="Time", comp=list(Word=c("tenth","tent")), ylim=c(-0.5,2), 
    main="m5: difference", rm.ranef=T) 


Autocorrelation in the data is a problem!
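Why this matters: in densely sampled trajectories, the residual at time \(t\) is almost identical to the one at \(t-1\), so the model effectively sees far fewer independent observations than the nominal \(n\), and confidence bands become too narrow. A base-R sketch of the lag-1 correlation that the rho correction (next slide) is based on (simulated AR(1) noise here, not the actual model residuals):

```r
set.seed(1)
n <- 1000
e <- as.numeric(arima.sim(list(ar = 0.9), n = n))  # AR(1) noise, true rho = 0.9
rho1 <- cor(e[-1], e[-n])  # lag-1 autocorrelation: residual t vs. residual t-1
rho1  # close to 0.9
```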

Correcting for autocorrelation

m5acf <- acf_resid(m5)  # autocorrelation function of m5's residuals
rhoval <- m5acf[2]  # correlation of residuals at time t with those at time t-1 (about 0.93)
m6 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs", 
    m = 1), data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T, nthreads = 2)
summary(m6)
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.168     0.0729    2.31    0.021
# Wordtenth      0.512     0.1181    4.33 1.48e-05
# 
# Approximate significance of smooth terms:
#                                  edf Ref.df     F p-value
# s(Time):Wordtent                7.80   8.32 11.86  <2e-16
# s(Time):Wordtenth               8.31   8.61 21.87  <2e-16
# s(Time,Participant):Wordtent  237.49 387.00  2.29  <2e-16
# s(Time,Participant):Wordtenth 252.39 369.00  2.73  <2e-16

Autocorrelation has been removed

par(mfrow = c(1, 2))
acf_resid(m5, main = "ACF of m5")
acf_resid(m6, main = "ACF of m6")


Clear model improvement

(m5.ml and m6.ml: the models m5 and m6 refit with method = "ML")

compareML(m5.ml, m6.ml)
# m5.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, 
#     bs = "fs", m = 1)
# 
# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, 
#     bs = "fs", m = 1)
# 
# Model m6.ml preferred: lower ML score (17362.348), and equal df (0.000).
# -----
#   Model Score Edf Difference    Df
# 1 m5.ml 13459  10                 
# 2 m6.ml -3903  10 -17362.348 0.000
# 
# AIC difference: 33779.22, model m6.ml has lower AIC.

Explore autocorrelation yourself!

Question 3

Distinguishing the two speaker groups

tt$WordGroup <- interaction(tt$Word, tt$Group)
m7 <- bam(TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant, by = Word, bs = "fs", 
    m = 1), data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T, nthreads = 2)
summary(m7)
# Parametric coefficients:
#                   Estimate Std. Error t value Pr(>|t|)
# (Intercept)        -0.0915     0.0983   -0.93    0.352
# WordGrouptenth.EN   0.7445     0.1584    4.70  2.6e-06
# WordGrouptent.NL    0.4958     0.1384    3.58 0.000341
# WordGrouptenth.NL   0.8053     0.1669    4.83 1.41e-06
# 
# Approximate significance of smooth terms:
#                                  edf Ref.df     F p-value
# s(Time):WordGrouptent.EN        3.88   4.66  2.29   0.044
# s(Time):WordGrouptenth.EN       7.94   8.38 15.81  <2e-16
# s(Time):WordGrouptent.NL        7.87   8.33 14.45  <2e-16
# s(Time):WordGrouptenth.NL       7.72   8.22 11.93  <2e-16
# s(Time,Participant):Wordtent  223.12 386.00  1.93  <2e-16
# s(Time,Participant):Wordtenth 239.21 367.00  2.46  <2e-16

Visualizing the patterns

plot_smooth(m7, view = "Time", rug = F, plot_all = "WordGroup", main = "m7", rm.ranef = T, ylim = c(-1, 
    2))
plot_diff(m7, view = "Time", rm.ranef = T, comp = list(WordGroup = c("tenth.EN", "tent.EN")), main = "EN difference", 
    ylim = c(-1, 2))
plot_diff(m7, view = "Time", rm.ranef = T, comp = list(WordGroup = c("tenth.NL", "tent.NL")), main = "NL difference", 
    ylim = c(-1, 2))


Including language is necessary

(m6.ml and m7.ml: the models m6 and m7 refit with method = "ML")

compareML(m6.ml, m7.ml)
# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, 
#     bs = "fs", m = 1)
# 
# m7.ml: TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant, 
#     by = Word, bs = "fs", m = 1)
# 
# Chi-square test of ML scores
# -----
#   Model Score Edf Difference    Df   p.value Sig.
# 1 m6.ml -3903  10                                
# 2 m7.ml -3919  16     15.420 6.000 2.720e-05  ***
# 
# AIC difference: 0.21, model m7.ml has lower AIC.
  • But this does not mean the difference smooths are significant (see this tutorial)

The full tongue tip model: all words and both axes

(Word is now a random-effect factor, Sound distinguishes T from TH words)

t1 <- art[art$Sensor == "TT", ]
t1$GroupSoundAxis <- interaction(t1$Group, t1$Sound, t1$Axis)
t1$GroupAxis <- interaction(t1$Group, t1$Axis)
t1$SoundAxis <- interaction(t1$Sound, t1$Axis)
t1 <- start_event(t1, event = c("Participant", "Sensor", "Axis", "Trial"))
t1$TTfront <- 0 - t1$Position

# run time with discrete=F: 1500 s; with discrete=T: 450 s (1 thread) / 260 s (2 threads)
system.time(model <- bam(TTfront ~ s(Time, by = GroupSoundAxis) + GroupSoundAxis + s(Time, 
    Participant, by = SoundAxis, bs = "fs", m = 1) + s(Time, Word, by = GroupAxis, bs = "fs", 
    m = 1), data = t1, rho = rhoval, AR.start = t1$start.event, discrete = T, nthreads = 4))
#    user  system elapsed 
#  440.02    1.59  123.38

The full tongue tip model: all words and both axes (1)

smry <- summary(model, re.test = FALSE)  # exclude random effects from summary (much faster)
smry$p.table  # show fixed effects of summary
#                       Estimate Std. Error t value Pr(>|t|)
# (Intercept)            -0.0615     0.0867  -0.710 4.78e-01
# GroupSoundAxisNL.T.X    0.1322     0.1287   1.027 3.04e-01
# GroupSoundAxisEN.TH.X   0.6376     0.1515   4.208 2.58e-05
# GroupSoundAxisNL.TH.X   0.4321     0.1565   2.761 5.77e-03
# GroupSoundAxisEN.T.Z    0.0567     0.1436   0.395 6.93e-01
# GroupSoundAxisNL.T.Z    0.0842     0.1462   0.576 5.65e-01
# GroupSoundAxisEN.TH.Z   0.4012     0.1353   2.966 3.02e-03
# GroupSoundAxisNL.TH.Z   0.2911     0.1400   2.079 3.76e-02

The full tongue tip model: all words and both axes (2)

smry$s.table  # show smooths of summary (excluding random effects)
#                                edf Ref.df     F  p-value
# s(Time):GroupSoundAxisEN.T.X  5.16   5.49  2.85 1.82e-02
# s(Time):GroupSoundAxisNL.T.X  7.27   7.44  4.55 1.56e-05
# s(Time):GroupSoundAxisEN.TH.X 6.62   6.86  6.22 3.64e-07
# s(Time):GroupSoundAxisNL.TH.X 7.08   7.29  7.92 5.39e-09
# s(Time):GroupSoundAxisEN.T.Z  8.24   8.32 34.60 1.00e-56
# s(Time):GroupSoundAxisNL.T.Z  8.21   8.29 21.41 7.58e-34
# s(Time):GroupSoundAxisEN.TH.Z 7.01   7.20  8.55 4.35e-08
# s(Time):GroupSoundAxisNL.TH.Z 7.37   7.54 11.33 1.43e-12

L1-based differences

plot_diff(model, view = 'Time', rm.ranef = T, ylim=c(-0.5, 1.5),
    comp = list(GroupSoundAxis = c('EN.TH.X', 'EN.T.X')))
plot_diff(model, view = 'Time', rm.ranef = T, ylim=c(-0.5, 1.5),
    comp = list(GroupSoundAxis = c('NL.TH.X', 'NL.T.X')))


Discussion

  • Native English speakers appear to make the /t/-/θ/ distinction, whereas Dutch L2 speakers of English generally do not
    • But this does not mean the difference between the two languages is significant (see this tutorial)
  • Note that the model could be improved by separating minimal pairs into two categories

Question 4

How to report?

Recap

  • We have applied GAMs to articulography data and learned how to:
    • use s(Time) to model a non-linearity over time
    • use the k parameter to control the number of basis functions
    • use the plotting functions plot_smooth and plot_diff (rm.ranef=T!)
    • use the parameter setting bs="re" to add random intercepts and slopes
    • add non-linear random effects using s(Time,...,bs="fs",m=1)
    • use the by-parameter to obtain separate non-linearities
      • Note that this factorial predictor needs to be included as fixed-effect factor as well!
    • use compareML to compare models (method='ML' when comparing fixed effects)
  • After the break:

Evaluation

Questions?

Thank you for your attention!

http://www.martijnwieling.nl
wieling@gmail.com