Martijn Wieling
University of Groningen

## This lecture

• Introduction
• Articulography
• Using articulography to study L2 pronunciation differences
  • Design
  • Methods: R code
  • Results
  • Discussion

• Generalized additive model (GAM): relaxing assumption of linear relation between dependent variable and predictor
• Relationship between individual predictors and (possibly transformed) dependent variable is estimated by a non-linear smooth function: $g(y) = s(x_1) + s(x_2, x_3) + \beta_4 x_4 + \ldots$
• Multiple predictors can be combined in a (hyper)surface smooth (next lecture)
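As a minimal illustration of this idea (simulated data, not the articulography dataset used later in this lecture), a non-linear relationship can be recovered with a single smooth:

```r
# Minimal sketch on simulated data: s() estimates the shape of the
# non-linearity automatically
library(mgcv)
set.seed(1)
x <- seq(0, 1, length = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)  # non-linear pattern plus noise

m <- gam(y ~ s(x))     # fit a smooth of x
summary(m)             # edf > 1 indicates a non-linear relationship
plot(m, shade = TRUE)  # visualize the estimated smooth
```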

## Question 1

• Advantage of GAM over manual specification of non-linearities: the optimal shape of the non-linearity is determined automatically
• Appropriate degree of smoothness is automatically determined by minimizing combined error and "wiggliness" (no overfitting)
• Maximum number of basis functions limits the maximum amount of non-linearity
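A small sketch (simulated data) of how the basis dimension $k$ caps the non-linearity: with too few basis functions the fit is oversmoothed, while a generous $k$ does not overfit because the smoothness penalty keeps the effective degrees of freedom (edf) in check.

```r
# Sketch: k limits the maximum wiggliness; the penalty prevents overfitting
library(mgcv)
set.seed(1)
x <- seq(0, 1, length = 500)
y <- sin(8 * pi * x) + rnorm(500, sd = 0.3)  # very wiggly underlying pattern

m_low  <- gam(y ~ s(x, k = 4))   # too few basis functions: oversmoothed
m_high <- gam(y ~ s(x, k = 30))  # ample basis functions: edf stays well below k

summary(m_low)$s.table   # edf bounded by k - 1 = 3
summary(m_high)$s.table  # edf determined by the data, not by k
```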

## First ten basis functions

• Choosing a smoothing basis
• Single predictor or isotropic predictors: thin plate regression spline (this lecture)
• Efficient approximation of the optimal (thin plate) spline
• Combining non-isotropic predictors: tensor product spline
• Random effects can be treated as smooths as well (Wood, 2008)
• R: gam and bam (package mgcv)
• For more (mathematical) details, see Wood (2006) and Wood (2017)

## Differences between native and non-native English

• 19 native Dutch speakers from Groningen
• 22 native Southern Standard British English speakers from London
• Material: 10 minimal pairs [t]:[θ] repeated twice:
• 'tent'-'tenth', 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth'
• 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
• Note that the sound [θ] does not exist in the Dutch language
• Goal: compare how the two groups realize this sound contrast
• Preprocessing:
• Articulatory segmentation: gestural onset to offset (within /ə/ context)
• Positions $z$-transformed per axis and time normalized per speaker between 0 and 1
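One possible reading of these normalization steps as R code; the grouping choices (z-transforming per participant and axis, time-normalizing per trajectory) and the use of the `art` columns below are assumptions for illustration, not the authors' exact preprocessing script.

```r
# Hedged sketch of the preprocessing, assuming a data frame 'art' with
# columns Participant, Axis, Trial, Time and Position
art$Position <- ave(art$Position, interaction(art$Participant, art$Axis),
    FUN = function(x) (x - mean(x)) / sd(x))  # z-transform positions per axis

art$Time <- ave(art$Time, interaction(art$Participant, art$Trial),
    FUN = function(t) (t - min(t)) / (max(t) - min(t)))  # normalize time to [0, 1]
```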

## Data overview

load("art.rda")
head(art)

#   Participant Group Word Sound  Loc Trial Sensor Axis   Time Position
# 1   VENI_EN_1    EN tick     T Init     1     TB    X 0.0000   -0.567
# 2   VENI_EN_1    EN tick     T Init     1     TB    X 0.0161   -0.533
# 3   VENI_EN_1    EN tick     T Init     1     TB    X 0.0323   -0.503
# 4   VENI_EN_1    EN tick     T Init     1     TB    X 0.0484   -0.424
# 5   VENI_EN_1    EN tick     T Init     1     TB    X 0.0645   -0.403
# 6   VENI_EN_1    EN tick     T Init     1     TB    X 0.0806   -0.164

dim(art)

# [1] 798942     10


## First model: tongue frontness for "tenth" and "tent"

#### (R version 3.4.3 (2017-11-30), mgcv version 1.8.23, itsadug version 2.3)

library(mgcv)
tt <- art[art$Word %in% c("tenth", "tent") & art$Sensor == "TT" & art$Axis == "X", ] tt$TTfront <- 0 - tt$Position # higher values more front (instead of lower values) # sort data per individual trajectory tt <- start_event(tt, event = c("Participant", "Trial")) # fit first GAM m0 <- bam(TTfront ~ s(Time), data = tt)  ## Model summary summary(m0)  # # Family: gaussian # Link function: identity # # Formula: # TTfront ~ s(Time) # # Parametric coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 0.44155 0.00735 60.1 <2e-16 *** # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # # Approximate significance of smooth terms: # edf Ref.df F p-value # s(Time) 8.45 8.91 187 <2e-16 *** # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # # R-sq.(adj) = 0.0968 Deviance explained = 9.73% # fREML = 20687 Scale est. = 0.8381 n = 15530  ## Visualizing the non-linear time pattern of the model #### (Interpreting GAM results always involves visualization) plot(m0, rug = F, shade = T, main = "Partial effect", ylab = "TTfront", ylim = c(-1, 1)) plot_smooth(m0, view = "Time", rug = F, main = "Full effect", ylim = c(-1, 1))  ## Check if number of basis functions is adequate #### (if p-value is low and edf close to k') gam.check(m0)  # # Method: fREML Optimizer: perf newton # full convergence after 9 iterations. # Gradient range [-5.89e-07,5.05e-07] # (score 20687 & scale 0.838). # Hessian positive definite, eigenvalue range [3.55,7764]. # Model rank = 10 / 10 # # Basis dimension (k) checking results. Low p-value (k-index<1) may # indicate that k is too low, especially if edf is close to k'. # # k' edf k-index p-value # s(Time) 9.00 8.45 0.98 0.08 . # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1  ## Increasing the number of basis functions with $k$ #### (double $k$ if higher $k$ is needed, but not higher than number of data points) m0b <- bam(TTfront ~ s(Time, k = 20), data = tt) summary(m0b)$s.table  # extract smooths from summary: new edf

#          edf Ref.df   F p-value
# s(Time) 11.5   13.9 120       0

summary(m0)$s.table  # original edf

#          edf Ref.df   F p-value
# s(Time) 8.45   8.91 187       0


## Effect of increasing $k$

plot_smooth(m0, view = "Time", rug = F, main = "m0 (k=10)")
plot_smooth(m0b, view = "Time", rug = F, main = "m0b (k=20)")


## Including individual variation: random intercepts

m1 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re"), data = tt)

summary(m1)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.444     0.0558    7.94 2.09e-15
#
# Approximate significance of smooth terms:
#                  edf Ref.df     F p-value
# s(Time)         8.52   8.93 215.6  <2e-16
# s(Participant) 41.32  42.00  56.9  <2e-16

• Explained variance of model without random intercept ($r^2$): 0.1
• Explained variance of model with random intercept ($r^2$): 0.22

## Effect of including a random intercept

plot_smooth(m0, view = "Time", rug = F, main = "m0", ylim = c(-0.5, 1.5))
plot_smooth(m1, view = "Time", rug = F, main = "m1", ylim = c(-0.5, 1.5), rm.ranef = T)


## Including individual variation: random slopes

m2 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re") + s(Participant, Time, bs = "re"),
    data = tt)

summary(m2)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.444      0.103    4.29 1.81e-05
#
# Approximate significance of smooth terms:
#                       edf Ref.df     F p-value
# s(Time)              8.54   8.94   177  <2e-16
# s(Participant)      41.06  42.00  9914  <2e-16
# s(Participant,Time) 40.08  42.00 10890  <2e-16


## Effect of including a random slope

plot_smooth(m1, view = "Time", rug = F, rm.ranef = T, main = "m1", ylim = c(-0.5, 1.5))
plot_smooth(m2, view = "Time", rug = F, rm.ranef = T, main = "m2", ylim = c(-0.5, 1.5))


## Including individual variation: factor smooths

#### (i.e. including a non-linear random effect instead of a random intercept and random slope)

m3 <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt)

summary(m3)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.425     0.0646    6.57 5.23e-11
#
# Approximate significance of smooth terms:
#                        edf Ref.df    F p-value
# s(Time)               8.18   8.53 26.5  <2e-16
# s(Time,Participant) 292.81 386.00 14.3  <2e-16


## Visualization of individual variation

plot(m3, select = 2)


## Effect of including a random non-linear effect

plot_smooth(m2, view = "Time", rug = F, main = "m2", ylim = c(-0.5, 1.5), rm.ranef = T)
plot_smooth(m3, view = "Time", rug = F, main = "m3", ylim = c(-0.5, 1.5), rm.ranef = T)


## Explore random effects yourself!

## Speeding up computation

#### (via discrete and nthreads, only for default method="fREML")

system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt))

#    user  system elapsed
#   23.63   37.74    5.24

system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt,
    discrete = T))

#    user  system elapsed
#    7.69   13.99    2.52

system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt,
    discrete = T, nthreads = 2))

#    user  system elapsed
#    5.66    7.67    2.43

• Speed improvement with nthreads is larger for more complex models

## Comparing "tenth" and "tent" in one model

#### (smooths are centered, so the factorial predictor also needs to be included in the fixed effects)

m4 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1),
    data = tt, discrete = T, nthreads = 2)

summary(m4)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.186     0.0672    2.77    0.006
# Wordtenth      0.580     0.0119   48.72   <2e-16
#
# Approximate significance of smooth terms:
#                        edf Ref.df    F p-value
# s(Time):Wordtent      7.46   8.07 12.3  <2e-16
# s(Time):Wordtenth     8.51   8.81 49.0  <2e-16
# s(Time,Participant) 306.10 386.00 18.5  <2e-16

• Does the word distinction improve the model?
## Assessing model improvement

#### (comparing fixed effects, so method='ML' and no discrete=T)

m3.ml <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt,
    method = "ML")
m4.ml <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1),
    data = tt, method = "ML")
compareML(m3.ml, m4.ml)  # model m4.ml is much better (AIC decrease > 2)!

# m3.ml: TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1)
#
# m4.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs",
#     m = 1)
#
# Chi-square test of ML scores
# -----
#   Model Score Edf Difference    Df p.value Sig.
# 1 m3.ml 18724   5
# 2 m4.ml 17156   8   1567.911 3.000 < 2e-16  ***
#
# AIC difference: 3222.06, model m4.ml has lower AIC.


## Visualizing the two patterns

plot_smooth(m4, view = "Time", rug = F, plot_all = "Word", main = "m4", ylim = c(-0.5, 2),
    rm.ranef = T)


## Visualizing the difference

plot_diff(m4, view = "Time", comp = list(Word = c("tenth", "tent")), ylim = c(-0.5, 2),
    rm.ranef = T)


## Question 2

## Including individual variation for "tenth" vs. "tent"

m5 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs",
    m = 1), data = tt, discrete = T, nthreads = 2)

summary(m5)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.171     0.0803    2.12    0.034
# Wordtenth      0.529     0.1314    4.02 5.77e-05
#
# Approximate significance of smooth terms:
#                                  edf Ref.df    F  p-value
# s(Time):Wordtent                7.64   8.05 11.4 2.15e-16
# s(Time):Wordtenth               8.35   8.55 22.6   <2e-16
# s(Time,Participant):Wordtent  316.56 386.00 31.4   <2e-16
# s(Time,Participant):Wordtenth 316.59 368.00 31.4   <2e-16


## More uncertainty in the difference

plot_diff(m4, view = "Time", comp = list(Word = c("tenth", "tent")), ylim = c(-0.5, 2),
    main = "m4: difference", rm.ranef = T)
plot_diff(m5, view = "Time", comp = list(Word = c("tenth", "tent")), ylim = c(-0.5, 2),
    main = "m5: difference", rm.ranef = T)


## Autocorrelation in the data is a problem!
#### (residuals should be independent, otherwise the standard errors and p-values are wrong)

m5acf <- acf_resid(m5)  # show autocorrelation


## Correcting for autocorrelation

rhoval <- m5acf[2]  # correlation of residuals at time t with those at time t-1 (about 0.93)
m6 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs",
    m = 1), data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T, nthreads = 2)

summary(m6)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.168     0.0729    2.31    0.021
# Wordtenth      0.512     0.1181    4.33 1.48e-05
#
# Approximate significance of smooth terms:
#                                  edf Ref.df     F p-value
# s(Time):Wordtent                7.80   8.32 11.86  <2e-16
# s(Time):Wordtenth               8.31   8.61 21.87  <2e-16
# s(Time,Participant):Wordtent  237.49 387.00  2.29  <2e-16
# s(Time,Participant):Wordtenth 252.39 369.00  2.73  <2e-16


## Autocorrelation has been removed

par(mfrow = c(1, 2))
acf_resid(m5, main = "ACF of m5")
acf_resid(m6, main = "ACF of m6")


## Clear model improvement

# m5.ml and m6.ml: as m5 and m6 above, but fitted with method = "ML"
compareML(m5.ml, m6.ml)

# m5.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word,
#     bs = "fs", m = 1)
#
# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word,
#     bs = "fs", m = 1)
#
# Model m6.ml preferred: lower ML score (17362.348), and equal df (0.000).
# -----
#   Model Score Edf Difference    Df
# 1 m5.ml 13459  10
# 2 m6.ml -3903  10 -17362.348 0.000
#
# AIC difference: 33779.22, model m6.ml has lower AIC.


## Distinguishing the two speaker groups

tt$WordGroup <- interaction(tt$Word, tt$Group)
m7 <- bam(TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant, by = Word,
    bs = "fs", m = 1), data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T,
    nthreads = 2)
summary(m7)

# Parametric coefficients:
#                   Estimate Std. Error t value Pr(>|t|)
# (Intercept)        -0.0915     0.0983   -0.93    0.352
# WordGrouptenth.EN   0.7445     0.1584    4.70  2.6e-06
# WordGrouptent.NL    0.4958     0.1384    3.58 0.000341
# WordGrouptenth.NL   0.8053     0.1669    4.83 1.41e-06
#
# Approximate significance of smooth terms:
#                                  edf Ref.df     F p-value
# s(Time):WordGrouptent.EN        3.88   4.66  2.29   0.044
# s(Time):WordGrouptenth.EN       7.94   8.38 15.81  <2e-16
# s(Time):WordGrouptent.NL        7.87   8.33 14.45  <2e-16
# s(Time):WordGrouptenth.NL       7.72   8.22 11.93  <2e-16
# s(Time,Participant):Wordtent  223.12 386.00  1.93  <2e-16
# s(Time,Participant):Wordtenth 239.21 367.00  2.46  <2e-16


## Visualizing the patterns

plot_smooth(m7, view = "Time", rug = F, plot_all = "WordGroup", main = "m7", rm.ranef = T,
    ylim = c(-1, 2))
plot_diff(m7, view = "Time", rm.ranef = T, comp = list(WordGroup = c("tenth.EN", "tent.EN")),
    main = "EN difference", ylim = c(-1, 2))
plot_diff(m7, view = "Time", rm.ranef = T, comp = list(WordGroup = c("tenth.NL", "tent.NL")),
    main = "NL difference", ylim = c(-1, 2))


## Including language is necessary

# m6.ml and m7.ml: as m6 and m7 above, but fitted with method = "ML"
compareML(m6.ml, m7.ml)

# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word,
#     bs = "fs", m = 1)
#
# m7.ml: TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant,
#     by = Word, bs = "fs", m = 1)
#
# Chi-square test of ML scores
# -----
#   Model Score Edf Difference    Df   p.value Sig.
# 1 m6.ml -3903  10
# 2 m7.ml -3919  16     15.420 6.000 2.720e-05  ***
#
# AIC difference: 0.21, model m7.ml has lower AIC.

• But this does not mean the difference smooths are significant (see this tutorial)

## The full tongue tip model: all words and both axes

#### (Word is now a random-effect factor, Sound distinguishes T from TH words)

t1 <- art[art$Sensor == "TT", ]
t1$GroupSoundAxis <- interaction(t1$Group, t1$Sound, t1$Axis)
t1$GroupAxis <- interaction(t1$Group, t1$Axis)
t1$SoundAxis <- interaction(t1$Sound, t1$Axis)
t1 <- start_event(t1, event = c("Participant", "Sensor", "Axis", "Trial"))
t1$TTfront <- 0 - t1$Position
# duration discrete=F: 1500 sec., discrete=T: 450 s. (1 thr.) / 260 s. (2 thr.)
system.time(model <- bam(TTfront ~ s(Time, by = GroupSoundAxis) + GroupSoundAxis +
    s(Time, Participant, by = SoundAxis, bs = "fs", m = 1) +
    s(Time, Word, by = GroupAxis, bs = "fs", m = 1), data = t1, rho = rhoval,
    AR.start = t1$start.event, discrete = T, nthreads = 4))

#    user  system elapsed
#  440.02    1.59  123.38


## The full tongue tip model: all words and both axes (1)

smry <- summary(model, re.test = FALSE)  # exclude random effects from summary (much faster)
smry$p.table  # show fixed effects of summary

#                       Estimate Std. Error t value Pr(>|t|)
# (Intercept)            -0.0615     0.0867  -0.710 4.78e-01
# GroupSoundAxisNL.T.X    0.1322     0.1287   1.027 3.04e-01
# GroupSoundAxisEN.TH.X   0.6376     0.1515   4.208 2.58e-05
# GroupSoundAxisNL.TH.X   0.4321     0.1565   2.761 5.77e-03
# GroupSoundAxisEN.T.Z    0.0567     0.1436   0.395 6.93e-01
# GroupSoundAxisNL.T.Z    0.0842     0.1462   0.576 5.65e-01
# GroupSoundAxisEN.TH.Z   0.4012     0.1353   2.966 3.02e-03
# GroupSoundAxisNL.TH.Z   0.2911     0.1400   2.079 3.76e-02


## The full tongue tip model: all words and both axes (2)

smry$s.table  # show smooths of summary (excluding random effects)

#                                edf Ref.df     F  p-value
# s(Time):GroupSoundAxisEN.T.X  5.16   5.49  2.85 1.82e-02
# s(Time):GroupSoundAxisNL.T.X  7.27   7.44  4.55 1.56e-05
# s(Time):GroupSoundAxisEN.TH.X 6.62   6.86  6.22 3.64e-07
# s(Time):GroupSoundAxisNL.TH.X 7.08   7.29  7.92 5.39e-09
# s(Time):GroupSoundAxisEN.T.Z  8.24   8.32 34.60 1.00e-56
# s(Time):GroupSoundAxisNL.T.Z  8.21   8.29 21.41 7.58e-34
# s(Time):GroupSoundAxisEN.TH.Z 7.01   7.20  8.55 4.35e-08
# s(Time):GroupSoundAxisNL.TH.Z 7.37   7.54 11.33 1.43e-12


## L1-based differences

plot_diff(model, view = 'Time', rm.ranef = T, ylim=c(-0.5, 1.5),
comp = list(GroupSoundAxis = c('EN.TH.X', 'EN.T.X')))
plot_diff(model, view = 'Time', rm.ranef = T, ylim=c(-0.5, 1.5),
comp = list(GroupSoundAxis = c('NL.TH.X', 'NL.T.X')))


## Discussion

• Native English speakers appear to make the /t/-/θ/ distinction, whereas Dutch L2 speakers of English generally do not
• But this does not mean the difference between the two languages is significant (see this tutorial)
• Note that the model could be improved by separating minimal pairs into two categories

## Recap

• We have applied GAMs to articulography data and learned how to:
• use s(Time) to model a non-linearity over time
• use the k parameter to control the number of basis functions
• use the plotting functions plot_smooth and plot_diff (rm.ranef=T!)
• use the parameter setting bs="re" to add random intercepts and slopes
• add non-linear random effects using s(Time,...,bs="fs",m=1)
• use the by-parameter to obtain separate non-linearities
• Note that this factorial predictor needs to be included as fixed-effect factor as well!
• use compareML to compare models (method='ML' when comparing fixed effects)
• After the break: