Martijn Wieling
University of Groningen

## This lecture

• Introduction
• Articulography
• Using articulography to study L2 pronunciation differences
• Design
• Methods: R code
• Results
• Discussion

• Generalized additive model (GAM): relaxing assumption of linear relation between dependent variable and predictor
• Relationship between individual predictors and the (possibly transformed) dependent variable is estimated by non-linear smooth functions: $g(y) = s(x_1) + s(x_2, x_3) + \beta_4 x_4 + \ldots$
• Multiple predictors can be combined in a (hyper)surface smooth (other lecture)
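As a minimal, self-contained sketch of such a model (simulated data, not from this study; variable names `x1` and `x4` merely echo the formula above):

```r
library(mgcv)  # ships with standard R installations

set.seed(1)
n  <- 500
x1 <- runif(n)
x4 <- runif(n)
# non-linear in x1, linear in x4
y  <- sin(2 * pi * x1) + 0.5 * x4 + rnorm(n, sd = 0.2)

# s(x1) is estimated as a penalized non-linear smooth; x4 enters linearly
m <- gam(y ~ s(x1) + x4)
summary(m)$s.table  # edf clearly above 1 indicates a non-linear relationship
```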

## Question 1

• Advantage of GAM over manual specification of non-linearities: the optimal shape of the non-linearity is determined automatically
• Appropriate degree of smoothness is automatically determined by minimizing combined error and "wigglyness" (no overfitting)
• Maximum number of basis functions limits the maximum amount of non-linearity
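A small simulated illustration of this point: increasing $k$ only raises the ceiling on wiggliness, while the penalization selects a comparable effective degrees of freedom (edf) either way.

```r
library(mgcv)

set.seed(1)
x <- runif(400)
y <- sin(2 * pi * x) + rnorm(400, sd = 0.3)

m10 <- gam(y ~ s(x, k = 10))  # at most 9 basis functions for the smooth
m30 <- gam(y ~ s(x, k = 30))  # higher ceiling, same penalized estimation

# edf is determined by the penalty, not by k (as long as k is large enough)
c(edf_k10 = summary(m10)$s.table[1, "edf"],
  edf_k30 = summary(m30)$s.table[1, "edf"])
```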

## First ten basis functions

• Choosing a smoothing basis
• Single predictor or isotropic predictors: thin plate regression spline (this lecture)
• Efficient approximation of the optimal (thin plate) spline
• Combining non-isotropic predictors: tensor product spline
• Random effects can be treated as smooths as well (Wood, 2008)
• R: gam and bam (package mgcv)
• For more (mathematical) details, see Wood (2006) and Wood (2017)
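In mgcv syntax, the three options above look as follows (a sketch on simulated data; the study's own models appear later in this lecture):

```r
library(mgcv)

set.seed(1)
d <- data.frame(x = runif(300), z = runif(300),
                subj = factor(rep(1:10, each = 30)))
d$y <- sin(2 * pi * d$x) + d$z + rnorm(300, sd = 0.3)

m_tp <- gam(y ~ s(x), data = d)        # thin plate regression spline (default, bs = "tp")
m_te <- gam(y ~ te(x, z), data = d)    # tensor product smooth for non-isotropic predictors
m_re <- gam(y ~ s(x) + s(subj, bs = "re"), data = d)  # random intercept treated as a smooth
```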

## Present study: goal and setup

• 19 native Dutch speakers from Groningen
• 22 native Standard Southern British English speakers from London
• Material: 10 minimal pairs [t]:[θ] repeated twice:
• 'tent'-'tenth', 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth'
• 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
• Note that the sound [θ] does not exist in the Dutch language
• Goal: compare how clearly each group distinguishes the two sounds of this contrast
• Preprocessing:
• Articulatory segmentation: gestural onset to offset (within /ə/ context)
• Positions $z$-transformed per axis and time normalized (from 0 to 1) per speaker

## Data overview

load("full.rda")  # only includes anterior-posterior position of tongue tip sensor

#   Participant Group Word Sound  Loc Trial   Time TTfront
# 1   VENI_EN_1    EN tick     T Init     1 0.0000  -0.392
# 2   VENI_EN_1    EN tick     T Init     1 0.0161  -0.440
# 3   VENI_EN_1    EN tick     T Init     1 0.0323  -0.440
# 4   VENI_EN_1    EN tick     T Init     1 0.0484  -0.503
# 5   VENI_EN_1    EN tick     T Init     1 0.0645  -0.513
# 6   VENI_EN_1    EN tick     T Init     1 0.0806  -0.677

dim(full)

# [1] 126177      8


## First model: tongue frontness for "tenth" and "tent"

#### (R version 3.5.2 (2018-12-20), mgcv version 1.8.27, itsadug version 2.3)

library(mgcv)
library(itsadug)  # for start_event, plot_smooth, plot_diff, compareML
tt <- full[full$Word %in% c("tenth", "tent"), ]

# sort data per individual trajectory (necessary to detect autocorrelation)
tt <- start_event(tt, event = c("Participant", "Trial"))

# fit first GAM
m0 <- bam(TTfront ~ s(Time), data = tt)


## Model summary

summary(m0)

# Family: gaussian
# Link function: identity
#
# Formula:
# TTfront ~ s(Time)
#
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  0.40281    0.00825    48.8   <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Approximate significance of smooth terms:
#         edf Ref.df   F p-value
# s(Time) 8.4    8.9 170  <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# R-sq.(adj) = 0.106   Deviance explained = 10.6%
# fREML = 17375  Scale est. = 0.8739  n = 12839


## Visualizing the non-linear time pattern of the model

#### (Interpreting GAM results always involves visualization)

plot(m0, rug = F, shade = T, main = "Partial effect", ylab = "TTfront", ylim = c(-1, 1))
plot_smooth(m0, view = "Time", rug = F, main = "Full effect", ylim = c(-1, 1))


## Check if number of basis functions is adequate

#### (if p-value is low and edf close to k')

gam.check(m0)

# Method: fREML   Optimizer: perf newton
# full convergence after 9 iterations.
# Gradient range [-1.25e-07,1.02e-07]
# (score 17375 & scale 0.874).
# Hessian positive definite, eigenvalue range [3.46,6419].
# Model rank = 10 / 10
#
# Basis dimension (k) checking results. Low p-value (k-index<1) may
# indicate that k is too low, especially if edf is close to k'.
#
#          k' edf k-index p-value
# s(Time) 9.0 8.4    1.02    0.93


## Increasing the number of basis functions with $k$

#### (double $k$ if higher $k$ is needed, but not higher than number of data points)

m0b <- bam(TTfront ~ s(Time, k = 20), data = tt)
summary(m0b)$s.table  # extract smooths from summary: new edf

#          edf Ref.df   F   p-value
# s(Time) 11.2   13.5 112 1.91e-304

summary(m0)$s.table  # original edf

#         edf Ref.df   F   p-value
# s(Time) 8.4    8.9 170 1.24e-310


## Effect of increasing $k$

plot_smooth(m0, view = "Time", rug = F, main = "m0 (k=10)")
plot_smooth(m0b, view = "Time", rug = F, main = "m0b (k=20)")


## Individual variation: random effects in mgcv vs. lme4

• (1|Participant): s(Participant, bs="re")
• (0 + WF|Participant): s(Participant, WF, bs="re")
• (1 + WF|Participant): not possible in mgcv


## Including individual variation: random intercepts

m1 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re"), data = tt)
summary(m1)  # starting from here, slides only show the relevant part of the summary

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.432     0.0648    6.67 2.75e-11 ***
#
# Approximate significance of smooth terms:
#                  edf Ref.df     F p-value
# s(Time)         8.49   8.93 206.1  <2e-16 ***
# s(Participant) 40.41  41.00  65.6  <2e-16 ***
#
# Deviance explained = 26.3%

• Deviance explained of model without random intercept ($r^2$): 10.6%


## Effect of including a random intercept

plot_smooth(m0, view = "Time", rug = F, main = "m0", ylim = c(-0.5, 1.5))
plot_smooth(m1, view = "Time", rug = F, main = "m1", ylim = c(-0.5, 1.5), rm.ranef = T)


## Including individual variation: random slopes

m2 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re") + s(Participant, Time, bs = "re"), data = tt)
summary(m2)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.432      0.106    4.08  4.5e-05 ***
#
# Approximate significance of smooth terms:
#                       edf Ref.df     F p-value
# s(Time)              8.53   8.94   163  <2e-16 ***
# s(Participant)      39.91  41.00  9294  <2e-16 ***
# s(Participant,Time) 39.12  41.00 10292  <2e-16 ***
#
# Deviance explained = 31.7%


## Effect of including a random slope

plot_smooth(m1, view = "Time", rug = F, rm.ranef = T, main = "m1", ylim = c(-0.5, 1.5))
plot_smooth(m2, view = "Time", rug = F, rm.ranef = T, main = "m2", ylim = c(-0.5, 1.5))


## Including individual variation: factor smooths

#### (i.e. including a non-linear random effect instead of a random intercept and random slope)

m3 <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt)
summary(m3)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.414     0.0802    5.16 2.54e-07 ***
#
# Approximate significance of smooth terms:
#                        edf Ref.df  F p-value
# s(Time)               8.15   8.52 24  <2e-16 ***
# s(Time,Participant) 287.25 377.00 16  <2e-16 ***
#
# Deviance explained = 40.6%


## Visualization of individual variation

plot(m3, select = 2)


## Effect of including a random non-linear effect

plot_smooth(m2, view = "Time", rug = F, main = "m2", ylim = c(-0.5, 1.2), rm.ranef = T)
plot_smooth(m3, view = "Time", rug = F, main = "m3", ylim = c(-0.5, 1.2), rm.ranef = T)


## Influence of random effects

#### http://eolomea.let.rug.nl/GAM/RandomEffects (login: f112300 and ShinyDem0)


## Speeding up computation

#### (via discrete and nthreads, only for default method="fREML")

system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt))

#    user  system elapsed
#   20.08   29.22    4.15

system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, discrete = T))

#    user  system elapsed
#    6.16    9.79    1.84

system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, discrete = T, nthreads = 2))

#    user  system elapsed
#   14.53   31.66    2.65

• Speed improvement with nthreads is larger for more complex models


## Comparing "tenth" and "tent" in one model

#### (smooths are centered, so the factorial predictor also needs to be included in the fixed effects)

m4 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1),
    data = tt, discrete = T, nthreads = 2)
summary(m4)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.072     0.0833    0.86    0.388
# Wordtenth      0.690     0.0118   58.66   <2e-16 ***
#
# Approximate significance of smooth terms:
#                        edf Ref.df     F  p-value
# s(Time):Wordtent      7.28   7.93  9.37 1.09e-12 ***
# s(Time):Wordtenth     8.51   8.79 43.37   <2e-16 ***
# s(Time,Participant) 306.18 377.00 22.54   <2e-16 ***

• Does the word distinction improve the model?


## Assessing model improvement

#### (comparing fixed effects, so method='ML' and no discrete=T)

m3.ml <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, method = "ML")
m4.ml <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1),
    data = tt, method = "ML")
compareML(m3.ml, m4.ml)  # model m4.ml is much better!

# m3.ml: TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1)
#
# m4.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs",
#     m = 1)
#
# Chi-square test of ML scores
# -----
#   Model Score Edf Difference    Df p.value Sig.
# 1 m3.ml 15292   5
# 2 m4.ml 13298   8   1993.465 3.000 < 2e-16  ***
#
# AIC difference: 4105.39, model m4.ml has lower AIC.


## Visualizing the two patterns

plot_smooth(m4, view = "Time", rug = F, plot_all = "Word", main = "m4", ylim = c(-0.5, 2), rm.ranef = T)


## Visualizing the difference

plot_diff(m4, view = "Time", comp = list(Word = c("tenth", "tent")), ylim = c(-0.5, 2), rm.ranef = T)


## Question 2


## Including individual variation for "tenth" vs. "tent"

m5 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs", m = 1),
    data = tt, discrete = T, nthreads = 2)
summary(m5)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    0.091     0.0973    0.94    0.349
# Wordtenth      0.598     0.1431    4.18 2.92e-05 ***
#
# Approximate significance of smooth terms:
#                                  edf Ref.df     F  p-value
# s(Time):Wordtent                7.56   7.97  9.62 4.01e-13 ***
# s(Time):Wordtenth               8.40   8.56 22.61   <2e-16 ***
# s(Time,Participant):Wordtent  315.77 377.00 38.05   <2e-16 ***
# s(Time,Participant):Wordtenth 327.33 368.00 43.14   <2e-16 ***


## More uncertainty in the difference

plot_diff(m4, view = "Time", comp = list(Word = c("tenth", "tent")), ylim = c(-0.5, 2),
    main = "m4: difference", rm.ranef = T)
plot_diff(m5, view = "Time", comp = list(Word = c("tenth", "tent")), ylim = c(-0.5, 2),
    main = "m5: difference", rm.ranef = T)


## Autocorrelation in the data is a problem!

#### (residuals should be independent, otherwise the standard errors and p-values are wrong)

m5acf <- acf_resid(m5)  # show autocorrelation


## Correcting for autocorrelation

rhoval <- m5acf[2]  # correlation of residuals at time t with those at time t-1 (about 0.93)
m6 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs", m = 1),
    data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T, nthreads = 2)
summary(m6)

# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.0855     0.0878    0.97    0.330
# Wordtenth     0.5905     0.1294    4.56 5.05e-06 ***
#
# Approximate significance of smooth terms:
#                                  edf Ref.df     F  p-value
# s(Time):Wordtent                7.41   8.08  8.89 2.15e-12 ***
# s(Time):Wordtenth               8.32   8.60 22.08   <2e-16 ***
# s(Time,Participant):Wordtent  228.84 377.00  2.87   <2e-16 ***
# s(Time,Participant):Wordtenth 267.20 369.00  3.81   <2e-16 ***


## Autocorrelation has been removed

par(mfrow = c(1, 2))
acf_resid(m5, main = "ACF of m5")
acf_resid(m6, main = "ACF of m6")
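For intuition, the lag-1 autocorrelation that rho corrects for can be computed directly with base R's acf (a toy AR(1) series, not the study data):

```r
set.seed(1)
n <- 1000
e <- rnorm(n)
x <- numeric(n)
x[1] <- e[1]
for (t in 2:n) x[t] <- 0.9 * x[t - 1] + e[t]  # AR(1) process with true rho = 0.9

# lag-1 autocorrelation (acf index 1 is lag 0), the kind of value passed to bam's rho
acf(x, plot = FALSE)$acf[2]
```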


## Clear model improvement

# m5.ml and m6.ml: m5 and m6 refitted with method = "ML" (as done above for m3.ml and m4.ml)
compareML(m5.ml, m6.ml)

# m5.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word,
#     bs = "fs", m = 1)
#
# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word,
#     bs = "fs", m = 1)
#
# Model m6.ml preferred: lower ML score (12683.931), and equal df (0.000).
# -----
#   Model Score Edf Difference    Df
# 1 m5.ml  9393  10
# 2 m6.ml -3291  10 -12683.931 0.000
#
# AIC difference: 24357.55, model m6.ml has lower AIC.


## Distinguishing the two speaker groups

tt$WordGroup <- interaction(tt$Word, tt$Group)
m7 <- bam(TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant, by = Word, bs = "fs", m = 1),
    data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T, nthreads = 2)
summary(m7)

# Parametric coefficients:
#                   Estimate Std. Error t value Pr(>|t|)
# (Intercept)        -0.0955      0.119   -0.80    0.423
# WordGrouptenth.EN   0.7452      0.174    4.27 1.94e-05 ***
# WordGrouptent.NL    0.3731      0.174    2.14    0.032 *
# WordGrouptenth.NL   0.8047      0.182    4.42    1e-05 ***
#
# Approximate significance of smooth terms:
#                                  edf Ref.df     F  p-value
# s(Time):WordGrouptent.EN        3.86   4.56  2.30    0.043 *
# s(Time):WordGrouptenth.EN       7.94   8.36 16.08   <2e-16 ***
# s(Time):WordGrouptent.NL        7.46   8.13 10.86 9.74e-16 ***
# s(Time):WordGrouptenth.NL       7.74   8.21 12.04   <2e-16 ***
# s(Time,Participant):Wordtent  217.47 377.00  2.65   <2e-16 ***
# s(Time,Participant):Wordtenth 254.62 369.00  3.45   <2e-16 ***


## Visualizing the patterns

plot_smooth(m7, view = "Time", rug = F, plot_all = "WordGroup", main = "m7", rm.ranef = T, ylim = c(-1, 2))
plot_diff(m7, view = "Time", rm.ranef = T, comp = list(WordGroup = c("tenth.EN", "tent.EN")), main = "EN difference",
ylim = c(-1, 2))
plot_diff(m7, view = "Time", rm.ranef = T, comp = list(WordGroup = c("tenth.NL", "tent.NL")), main = "NL difference",
ylim = c(-1, 2))


## Including language is necessary

compareML(m6.ml, m7.ml)  # m7.ml: m7 refitted with method = "ML"

# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word,
#     bs = "fs", m = 1)
#
# m7.ml: TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant,
#     by = Word, bs = "fs", m = 1)
#
# Chi-square test of ML scores
# -----
#   Model Score Edf Difference    Df   p.value Sig.
# 1 m6.ml -3291  10
# 2 m7.ml -3306  16     14.985 6.000 3.982e-05  ***
#
# AIC difference: 7.61, model m7.ml has lower AIC.


## The full tongue tip model: all words

#### (Word is now a random-effect factor, Sound distinguishes T from TH words)

full$SoundGroup <- interaction(full$Sound, full$Group)
full <- start_event(full, event = c("Participant", "Trial"))

# duration discrete=F: 105 sec., discrete=T: 54 s. (1 thr.) / 37 s. (2 thr.)
system.time(model <- bam(TTfront ~ s(Time, by = SoundGroup) + SoundGroup +
    s(Time, Participant, by = Sound, bs = "fs", m = 1) +
    s(Time, Word, by = Group, bs = "fs", m = 1),
    data = full, rho = rhoval, AR.start = full$start.event, discrete = T, nthreads = 4))

#    user  system elapsed
#   164.5   266.7    24.2


## The full tongue tip model: results

summary(model, re.test = FALSE)  # exclude random effects from summary (much faster)

# Parametric coefficients:
#                 Estimate Std. Error t value Pr(>|t|)
# (Intercept)      -0.0602     0.0969   -0.62    0.535
# SoundGroupTH.EN   0.6397     0.1581    4.05  5.2e-05 ***
# SoundGroupT.NL    0.1052     0.1461    0.72    0.472
# SoundGroupTH.NL   0.4340     0.1642    2.64    0.008 **
#
# Approximate significance of smooth terms:
#                          edf Ref.df    F  p-value
# s(Time):SoundGroupT.EN  5.16   5.49 2.96    0.014 *
# s(Time):SoundGroupTH.EN 6.61   6.86 6.54 1.39e-07 ***
# s(Time):SoundGroupT.NL  7.25   7.44 4.38 2.68e-05 ***
# s(Time):SoundGroupTH.NL 6.90   7.12 5.88  1.9e-06 ***
#
# Deviance explained = 53.5%


## L1-based differences

plot_diff(model, view = 'Time', rm.ranef = T, ylim=c(-0.5, 1.5),
comp = list(SoundGroup = c('TH.EN', 'T.EN')))
plot_diff(model, view = 'Time', rm.ranef = T, ylim=c(-0.5, 1.5),
comp = list(SoundGroup = c('TH.NL', 'T.NL')))


## Discussion

• Native English speakers appear to make the /t/-/θ/ distinction, whereas Dutch L2 speakers of English generally do not
• But this does not mean the difference between these distinctions is significant (see this tutorial)

## Recap

• We have applied GAMs to articulography data and learned how to:
• use s(Time) to model a non-linearity over time
• use the k parameter to control the number of basis functions
• use the plotting functions plot_smooth and plot_diff (rm.ranef=T!)
• use the parameter setting bs="re" to add random intercepts and slopes
• add non-linear random effects using s(Time,...,bs="fs",m=1)
• use the by-parameter to obtain separate non-linearities
• Note that this factorial predictor also needs to be included in fixed effects!
• use compareML to compare models (comparing fixed effects: method='ML')
• Associated lab session: