Abstract
Required libraries and functions
Dataset
- Data description
Word-specific models
- Visualizing word-specific patterns
- Assessing signficant trajectory differences: “taarten”
  - Visualization
  - Formal difference assessment
Aggregate model: dialect words
- Visualization
- Formal difference assessment
Aggregate model: CVC sequences
- Visualization
- Formal difference assessment
Comparison with Linear Discriminant Analysis
- LDA for dialect words
- LDA for CVC sequences
Comparison with formants
- Formant comparison for dialect words
- Formant comparison for CVC sequences
Replication

Abstract

The present study introduces electromagnetic articulography, the measurement of the position of tongue and lips during speech, as a promising method for the study of dialect variation. By using generalized additive modeling to analyze the articulatory trajectories, we are able to reliably detect aggregate group differences, while simultaneously taking into account the individual variation of dozens of speakers. Our results show that two Dutch dialects show clear differences in their articulatory settings, with a more anterior tongue position in the dialect from Ubbergen in the southern half of the Netherlands than in the dialect of Ter Apel in the northern half of the Netherlands. This clearly demonstrates that articulography is able to reveal interesting structural differences between dialects which are not visible when only focusing on the acoustic signal.

Journal: Submitted (September 7, 2015) to Journal of Phonetics

Preprint: http://www.martijnwieling.nl/files/WielingEtAl-art.pdf

All source data: http://www.let.rug.nl/wieling/DiaArt/SourceData/

Keywords: Articulography; Dialectology; Generalized additive modeling; Articulatory setting

## Generated on: September 04, 2015 - 11:59:00

Required libraries and functions

# install packages if not yet installed
packages <- c("mgcv","itsadug","lme4","parallel","MASS","reshape2")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

# load required packages
library(mgcv)
library(itsadug)
library(lme4)
library(parallel)
library(MASS)
library(reshape2)

# custom plotting functions
if (!file.exists('plotArt2D.R')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/plotArt2D.R','plotArt2D.R')
}
source('plotArt2D.R')

# version information
R.version.string

## [1] "R version 3.2.2 (2015-08-14)"

packageVersion('mgcv')

## [1] '1.8.7'

packageVersion('itsadug')

## [1] '1.0.1'

packageVersion('lme4')

## [1] '1.1.8'

packageVersion('parallel')

## [1] '3.2.2'

packageVersion('MASS')

## [1] '7.3.29'

packageVersion('reshape2')

## [1] '1.4.1'

Dataset

if (!file.exists('dia.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/dia.rda','dia.rda') # 23 MB
}

if (!file.exists('cvc.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/cvc.rda','cvc.rda') # 7 MB
}

load('dia.rda')
load('cvc.rda')

dim(dia)

## [1] 1541378      33

dim(cvc)

## [1] 515459     33

str(dia)

## 'data.frame':    1541378 obs. of  33 variables:
##  $ Speaker          : Factor w/ 40 levels "terapel_1","terapel_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Group            : Factor w/ 2 levels "TerApel","Ubbergen": 1 1 1 1 1 1 1 1 1 1 ...
##  $ IsTerApel        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender           : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ YearBirth        : int  1939 1939 1939 1939 1939 1939 1939 1939 1939 1939 ...
##  $ PlaceBirth       : Factor w/ 18 levels "Arnhem","Bemmel",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Word             : Factor w/ 70 levels "bal","ballen",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WordNr           : num  12 12 12 12 12 12 12 12 12 12 ...
##  $ Segment          : Factor w/ 57 levels "?","@","2","5",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ SegmentNr        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sensor           : Factor w/ 3 levels "T3","T2","T1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Axis             : Factor w/ 2 levels "P","H": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SensorAxis       : Factor w/ 6 levels "T3.P","T2.P",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GroupSensorAxis  : Factor w/ 12 levels "TerApel.T3.P",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SpeakerSensorAxis: Factor w/ 240 levels "terapel_1.T3.P",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WordSensorAxis   : Factor w/ 420 levels "bal.T3.P","ballen.T3.P",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ IsTA.T1.P        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsTA.T1.H        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsTA.T2.P        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsTA.T2.H        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsTA.T3.P        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ IsTA.T3.H        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Word.start       : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ Segment.start    : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ RecBlock         : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ TimeInRecBlock   : num  22.1 22.1 22.1 22.1 22.1 ...
##  $ Time.normWord    : num  0.00251 0.01843 0.03434 0.05025 0.06617 ...
##  $ Time.normSegment : num  0.0132 0.0968 0.1804 0.2639 0.3475 ...
##  $ Position.norm    : num  0.782 0.773 0.777 0.793 0.793 ...
##  $ F1               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ F2               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ F1.norm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ F2.norm          : num  NA NA NA NA NA NA NA NA NA NA ...

str(cvc)

## 'data.frame':    515459 obs. of  33 variables:
##  $ Speaker          : Factor w/ 40 levels "terapel_1","terapel_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Group            : Factor w/ 2 levels "TerApel","Ubbergen": 1 1 1 1 1 1 1 1 1 1 ...
##  $ IsTerApel        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender           : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ YearBirth        : int  1939 1939 1939 1939 1939 1939 1939 1939 1939 1939 ...
##  $ PlaceBirth       : Factor w/ 18 levels "Arnhem","Bemmel",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Word             : Factor w/ 27 levels "kaak","kaap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WordNr           : num  151 151 151 151 151 151 151 151 151 151 ...
##  $ Segment          : Factor w/ 18 levels "a","b","C","d",..: 9 9 9 9 1 1 1 1 1 1 ...
##  $ SegmentNr        : num  1 1 1 1 2 2 2 2 2 2 ...
##  $ Sensor           : Factor w/ 3 levels "T3","T2","T1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Axis             : Factor w/ 2 levels "P","H": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SensorAxis       : Factor w/ 6 levels "T3.P","T2.P",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GroupSensorAxis  : Factor w/ 12 levels "TerApel.T3.P",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SpeakerSensorAxis: Factor w/ 240 levels "terapel_1.T3.P",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WordSensorAxis   : Factor w/ 162 levels "kaak.T3.P","kaap.T3.P",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ IsTA.T1.P        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsTA.T1.H        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsTA.T2.P        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsTA.T2.H        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsTA.T3.P        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ IsTA.T3.H        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Word.start       : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ Segment.start    : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
##  $ RecBlock         : int  13 13 13 13 13 13 13 13 13 13 ...
##  $ TimeInRecBlock   : num  12.3 12.3 12.3 12.4 12.4 ...
##  $ Time.normWord    : num  0.000892 0.027183 0.052954 0.078985 0.105016 ...
##  $ Time.normSegment : num  0.00904 0.27528 0.53625 0.79986 0.01078 ...
##  $ Position.norm    : num  0.567 0.571 0.599 0.612 0.635 ...
##  $ F1               : num  NA NA NA NA 417 ...
##  $ F2               : num  NA NA NA NA 1602 ...
##  $ F1.norm          : num  NA NA NA NA -0.183 ...
##  $ F2.norm          : num  NA NA NA NA 0.421 0.359 0.667 0.306 0.488 NA ...

colnames(dia)

##  [1] "Speaker"           "Group"             "IsTerApel"        
##  [4] "Gender"            "YearBirth"         "PlaceBirth"       
##  [7] "Word"              "WordNr"            "Segment"          
## [10] "SegmentNr"         "Sensor"            "Axis"             
## [13] "SensorAxis"        "GroupSensorAxis"   "SpeakerSensorAxis"
## [16] "WordSensorAxis"    "IsTA.T1.P"         "IsTA.T1.H"        
## [19] "IsTA.T2.P"         "IsTA.T2.H"         "IsTA.T3.P"        
## [22] "IsTA.T3.H"         "Word.start"        "Segment.start"    
## [25] "RecBlock"          "TimeInRecBlock"    "Time.normWord"    
## [28] "Time.normSegment"  "Position.norm"     "F1"               
## [31] "F2"                "F1.norm"           "F2.norm"

colnames(cvc) # equal column names as dia

##  [1] "Speaker"           "Group"             "IsTerApel"        
##  [4] "Gender"            "YearBirth"         "PlaceBirth"       
##  [7] "Word"              "WordNr"            "Segment"          
## [10] "SegmentNr"         "Sensor"            "Axis"             
## [13] "SensorAxis"        "GroupSensorAxis"   "SpeakerSensorAxis"
## [16] "WordSensorAxis"    "IsTA.T1.P"         "IsTA.T1.H"        
## [19] "IsTA.T2.P"         "IsTA.T2.H"         "IsTA.T3.P"        
## [22] "IsTA.T3.H"         "Word.start"        "Segment.start"    
## [25] "RecBlock"          "TimeInRecBlock"    "Time.normWord"    
## [28] "Time.normSegment"  "Position.norm"     "F1"               
## [31] "F2"                "F1.norm"           "F2.norm"

Data description

Speaker – ID of the speaker
Group – Group of the speaker (Ter Apel in the North, or Ubbergen, further south)
IsTerApel – Binary value equal to 1 for speakers of Ter Apel, and 0 for speakers from Ubbergen
Gender – Gender of the speaker (F: Female, M: Male)
YearBirth – Year of birth of the speaker (in years)
PlaceBirth – Place of birth of the speaker
Word – CVC sequence
WordNr – Word number, i.e. the position in the experimen when the word was presented
Segment – Phonetic segment within the CVC sequence (annotated in XSampa)
SegmentNr - Segment number, i.e. the position of the segment in the word
Sensor - Tongue sensor, i.e. T1 (tongue tip), T2 (behind tongue tip), T3 (behind T2)
Axis - Sensor axis, i.e. P: posterior position, or H: height
SensorAxis - Combination of Sensor and Axis
GroupSensorAxis - Combination of Group, Sensor and Axis
SpeakerSensorAxis - Combination of Speaker, Sensor and Axis
WordSensorAxis - Combination of Word, Sensor and Axis
IsTA.T1.P - Binary predictor equal to 1 if GroupSensorAxis = “TerApel.T1.P”, 0 otherwise
IsTA.T1.H - Binary predictor equal to 1 if GroupSensorAxis = “TerApel.T1.H”, 0 otherwise
IsTA.T2.P - Binary predictor equal to 1 if GroupSensorAxis = “TerApel.T2.P”, 0 otherwise
IsTA.T2.H - Binary predictor equal to 1 if GroupSensorAxis = “TerApel.T2.H”, 0 otherwise
IsTA.T3.P - Binary predictor equal to 1 if GroupSensorAxis = “TerApel.T3.P”, 0 otherwise
IsTA.T3.H - Binary predictor equal to 1 if GroupSensorAxis = “TerApel.T3.H”, 0 otherwise
Word.start - Logical predictor equal to TRUE at the start of each new articulatory trajectory (per individual word pronunciation per sensor and axis separately), FALSE otheriwse
Segment.start - Logical predictor equal to TRUE at the start of each new articulatory trajectory (per individual segment pronunciation per sensor and axis separately), FALSE otheriwse
RecBlock - Sequence number of the recording block
TimeInRecBlock - Timestamp in the recording block
Time.normWord - Time points associated with the articulatory measurements, normalized per word pronunciation between 0 (start of the pronunciation) and 1 (end of the pronunciation)
Time.normSegment - Time points associated with the articulatory measurements, normalized per segment pronunciation between 0 (start of the pronunciation) and 1 (end of the pronunciation)
Position.norm - Normalized (between 0 and 1) position of the Sensor in the direction Axis
F1 - First formant (only for vowels)
F2 - Second formant (only for vowels)
F1.norm - First formant (only for vowels), normalized with Lobanov’s (1971) method
F2.norm - Second formant (only for vowels), normalized with Lobanov’s (1971) method

Word-specific models

In the following four models are fitted, two for dialect words (taarten and boor) and two for CVC sequences (taat: [tat] and poop [pop]). For each word two models are fitted, the first to determine the rho value (for correcting autocorrelation in the residuals), the second model which does corrects for the autocorrelation using the predetermined rho value.

# this code is not evaluated, as each model takes about 1 hour to fit
words = c('taarten','boor','taat','poop')

for (word in words) { 
    if (word %in% c('taarten','boor')) { 
        subdat = droplevels(dia[dia$Word == word,])
    } else { 
        subdat = droplevels(cvc[cvc$Word == word,])
    }

    # fit first model to determine autocorrelation in residuals
    model0 <- bam(Position.norm ~ s(Time.normWord,by=GroupSensorAxis,k=20) + 
                  GroupSensorAxis + s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1), 
                  data=subdat, method='fREML', gc.level=2)

    # assess autocorrelation in residuals
    model0acf = acf(resid(model0))
    rhoval = as.vector(model0acf[1]$acf)

    # fit model which corrects for autocorrelation in residuals
    model <- bam(Position.norm ~ s(Time.normWord,by=GroupSensorAxis, k=20) + 
                 GroupSensorAxis + s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1), 
                 data=subdat, method='fREML', gc.level=2, AR.start=Word.start, rho=rhoval)

    smry <- summary(model)
    save(model,smry,file=paste(word,'-group.rda',sep=''))
}

Visualizing word-specific patterns

This visualization corresponds to Figure 5 in the manuscript.

if (!file.exists('taarten-group.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/taarten-group.rda',
                  'taarten-group.rda') # 253 MB
}

if (!file.exists('boor-group.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/boor-group.rda',
                  'boor-group.rda') # 251 MB
}

if (!file.exists('taat-group.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/taat-group.rda',
                  'taat-group.rda') # 254 MB
}

if (!file.exists('poop-group.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/poop-group.rda',
                  'poop-group.rda') # 255 MB
}

par(mfrow=c(2,2))
load('taarten-group.rda')
taartenmodel = model # store for later use
plotArt2D(model, catvar='GroupSensorAxis',
          catlevels.x=c('TerApel.T3.P','Ubbergen.T3.P','TerApel.T2.P',
                        'Ubbergen.T2.P','TerApel.T1.P','Ubbergen.T1.P'), 
          catlevels.y=c('TerApel.T3.H','Ubbergen.T3.H','TerApel.T2.H',
                        'Ubbergen.T2.H','TerApel.T1.H','Ubbergen.T1.H'), 
          timevar='Time.normWord', collabels=c('Ter Apel', 'Ubbergen'),
          main='Position of the three tongue sensors: "taarten"',
          xlab='Posterior position', ylab='Height', showPoints=T,
          cols=list(c("cadetblue3","cadetblue1"),c("tomato4","tomato")),alpha=15)

load('boor-group.rda')
plotArt2D(model, catvar='GroupSensorAxis',
          catlevels.x=c('TerApel.T3.P','Ubbergen.T3.P','TerApel.T2.P',
                        'Ubbergen.T2.P','TerApel.T1.P','Ubbergen.T1.P'), 
          catlevels.y=c('TerApel.T3.H','Ubbergen.T3.H','TerApel.T2.H',
                        'Ubbergen.T2.H','TerApel.T1.H','Ubbergen.T1.H'), 
          timevar='Time.normWord', collabels=c('Ter Apel', 'Ubbergen'),
          main='Position of the three tongue sensors: "boor"',
          xlab='Posterior position', ylab='Height', showPoints=T,
          cols=list(c("cadetblue3","cadetblue1"),c("tomato4","tomato")),alpha=15)

load('taat-group.rda')
plotArt2D(model, catvar='GroupSensorAxis',
          catlevels.x=c('TerApel.T3.P','Ubbergen.T3.P','TerApel.T2.P',
                        'Ubbergen.T2.P','TerApel.T1.P','Ubbergen.T1.P'), 
          catlevels.y=c('TerApel.T3.H','Ubbergen.T3.H','TerApel.T2.H',
                        'Ubbergen.T2.H','TerApel.T1.H','Ubbergen.T1.H'), 
          timevar='Time.normWord', collabels=c('Ter Apel', 'Ubbergen'),
          main='Position of the three tongue sensors: "taat"',
          xlab='Posterior position', ylab='Height', showPoints=T,
          cols=list(c("cadetblue3","cadetblue1"),c("tomato4","tomato")),alpha=15)

load('poop-group.rda')
plotArt2D(model, catvar='GroupSensorAxis',
          catlevels.x=c('TerApel.T3.P','Ubbergen.T3.P','TerApel.T2.P',
                        'Ubbergen.T2.P','TerApel.T1.P','Ubbergen.T1.P'), 
          catlevels.y=c('TerApel.T3.H','Ubbergen.T3.H','TerApel.T2.H',
                        'Ubbergen.T2.H','TerApel.T1.H','Ubbergen.T1.H'), 
          timevar='Time.normWord', collabels=c('Ter Apel', 'Ubbergen'),
          main='Position of the three tongue sensors: "poop"',
          xlab='Posterior position', ylab='Height', showPoints=T,
          cols=list(c("cadetblue3","cadetblue1"),c("tomato4","tomato")),alpha=15)

Assessing signficant trajectory differences: “taarten”

Visualization

The following visualization corresponds to Figure 6 in the manuscript.

par(mfrow=c(2,2))
plotSmooths(taartenmodel,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T1.P','Ubbergen.T1.P'),
            dropRanef='SpeakerSensorAxis',ylim=c(0.05,1),legendPos='topleft',
            main='T1 posterior position: "taarten"',xlab='Time (normalized)',
            ylab='Posterior position',legendlabels=c('Ter Apel','Ubbergen'),
            legendtitle='Group', colors=c('cadetblue3','tomato4'),cexPoints=1,
            alphaPoints=0.04,showPoints=T)

plotSmooths(taartenmodel,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T1.H','Ubbergen.T1.H'),
            dropRanef='SpeakerSensorAxis',ylim=c(0.05,1),legendPos='topleft',
            main='T1 height: "taarten"',xlab='Time (normalized)',
            ylab='Height',legendlabels=c('Ter Apel','Ubbergen'),
            legendtitle='Group',colors=c('cadetblue3','tomato4'),cexPoints=1,
            alphaPoints=0.04,showPoints=T)

plotDiff(taartenmodel, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T1.P','TerApel.T1.P'), 
         dropRanef='SpeakerSensorAxis',ylim=c(-0.325,0.325),
         main='T1 posterior position: Ter Apel vs. Ubbergen', 
         ylab='Posterior position difference',xlab='Time (normalized)')

plotDiff(taartenmodel, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T1.H','TerApel.T1.H'), 
         dropRanef='SpeakerSensorAxis',ylim=c(-0.325,0.325),
         main='T1 height: Ter Apel vs. Ubbergen', 
         ylab='Height difference',xlab='Time (normalized)')

Formal difference assessment

To formally assess the difference, a model is specified with difference curves (see explanation in the manuscript).

# this code is not evaluated, as each model takes about 1 to 2 hours to fit
subdat = droplevels(dia[dia$Word == 'taarten',])

# fit first model to determine autocorrelation in residuals
model0 <- bam(Position.norm ~ s(Time.normWord, by=SensorAxis,k=20) + SensorAxis + 
              s(Time.normWord, by=IsTA.T1.P,k=20) + s(Time.normWord, by=IsTA.T1.H,k=20) + 
              s(Time.normWord, by=IsTA.T2.P,k=20) + s(Time.normWord, by=IsTA.T2.H,k=20) + 
              s(Time.normWord, by=IsTA.T3.P,k=20) + s(Time.normWord, by=IsTA.T3.H,k=20)  + 
              s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1), data=subdat, 
              method='fREML', gc.level=2)

# assess autocorrelation in residuals
model0acf = acf(resid(model0))
rhoval = as.vector(model0acf[1]$acf)

# fit model which corrects for autocorrelation in residuals
modelbin <- bam(Position.norm ~ s(Time.normWord, by=SensorAxis,k=20) + SensorAxis + 
                s(Time.normWord, by=IsTA.T1.P,k=20) + s(Time.normWord, by=IsTA.T1.H,k=20) + 
                s(Time.normWord, by=IsTA.T2.P,k=20) + s(Time.normWord, by=IsTA.T2.H,k=20) + 
                s(Time.normWord, by=IsTA.T3.P,k=20) + s(Time.normWord, by=IsTA.T3.H,k=20) + 
                s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1), data=subdat, method='fREML', 
                gc.level=2, AR.start=Word.start, rho=rhoval)

smrybin <- summary(modelbin)
save(modelbin,smrybin,file='taarten-binary.rda')

if (!file.exists('taarten-binary.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/taarten-binary.rda',
                  'taarten-binary.rda') # 261 MB
}

load('taarten-binary.rda')
smrybin # show summary

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Position.norm ~ s(Time.normWord, by = SensorAxis, k = 20) + SensorAxis + 
##     s(Time.normWord, by = IsTA.T1.P, k = 20) + s(Time.normWord, 
##     by = IsTA.T1.H, k = 20) + s(Time.normWord, by = IsTA.T2.P, 
##     k = 20) + s(Time.normWord, by = IsTA.T2.H, k = 20) + s(Time.normWord, 
##     by = IsTA.T3.P, k = 20) + s(Time.normWord, by = IsTA.T3.H, 
##     k = 20) + s(Time.normWord, SpeakerSensorAxis, bs = "fs", 
##     m = 1)
## 
## Parametric coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.68071    0.01779  38.255   <2e-16 ***
## SensorAxisT2.P -0.24752    0.02518  -9.832   <2e-16 ***
## SensorAxisT1.P -0.48324    0.02519 -19.181   <2e-16 ***
## SensorAxisT3.H  0.01578    0.02512   0.628   0.5300    
## SensorAxisT2.H -0.06143    0.02519  -2.439   0.0147 *  
## SensorAxisT1.H -0.21961    0.02521  -8.710   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf  Ref.df      F  p-value    
## s(Time.normWord):SensorAxisT3.P      15.21   17.00 11.191  < 2e-16 ***
## s(Time.normWord):SensorAxisT2.P      15.62   17.29 13.492  < 2e-16 ***
## s(Time.normWord):SensorAxisT1.P      16.48   17.85 17.593  < 2e-16 ***
## s(Time.normWord):SensorAxisT3.H      12.44   14.73  6.274 3.64e-13 ***
## s(Time.normWord):SensorAxisT2.H      15.71   17.38 18.599  < 2e-16 ***
## s(Time.normWord):SensorAxisT1.H      17.37   18.39 32.779  < 2e-16 ***
## s(Time.normWord):IsTA.T1.P           14.94   17.18  7.597  < 2e-16 ***
## s(Time.normWord):IsTA.T1.H           16.86   18.63 14.829  < 2e-16 ***
## s(Time.normWord):IsTA.T2.P           13.92   16.26  7.203  < 2e-16 ***
## s(Time.normWord):IsTA.T2.H           15.38   17.48  9.385  < 2e-16 ***
## s(Time.normWord):IsTA.T3.P           12.93   15.32  5.442 2.72e-11 ***
## s(Time.normWord):IsTA.T3.H           11.70   14.00  3.641 4.28e-06 ***
## s(Time.normWord,SpeakerSensorAxis) 1670.45 2148.00  6.985  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.964   Deviance explained = 96.7%
## fREML = -56955  Scale est. = 0.0011372  n = 21750

The summary above shows that all difference smooths are significant (i.e. the p-values associated with the binary, IsTA, smooths.

Aggregate model: dialect words

In the following the aggregate model for dialect words is fitted. Three models are fitted: the first model is used to determine the rho value (for correcting autocorrelation in the residuals), the second model fits separate trajectories for each level (this model is used for visualization), and the third model fits the model using difference curves, thereby yielding p-values assessing if the difference between the two groups is significant or not.

# this code is not evaluated, as each model takes about 5 hours to fit (using 36 CPUs)
cl = makeCluster(36) # 36 CPUs to calculate the model fit

# fit first model to determine autocorrelation in residuals
modelNoRho <- bam(Position.norm ~ s(Time.normWord,by=GroupSensorAxis, k=20) + 
                  GroupSensorAxis + s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1) + 
                  s(Time.normWord,WordSensorAxis,bs='fs',m=1), data=dia, 
                  method='fREML', cluster=cl)
save(modelNoRho,file='modelD-group-norho-1.8.7.rda')

# assess autocorrelation in residuals
modelACF = acf(resid(modelNoRho)) 
(rhoval = as.vector(modelACF[1]$acf)) # 0.9669873

# fit model which corrects for autocorrelation in residuals
model <- bam(Position.norm ~ s(Time.normWord,by=GroupSensorAxis, k=20) + 
             GroupSensorAxis + s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1) + 
             s(Time.normWord,WordSensorAxis,bs='fs',m=1), data=dia, 
             method='fREML', AR.start=Word.start, rho=rhoval, cluster=cl) 

smry <- summary(model)
save(model,smry,file='modelD-group-1.8.7.rda') # mgcv 1.8.7 was used

# fit model with difference curve
modelbin <- bam(Position.norm ~ s(Time.normWord, by=SensorAxis,k=20) + SensorAxis + 
                s(Time.normWord, by=IsTA.T1.P,k=20) + s(Time.normWord, by=IsTA.T1.H,k=20) + 
                s(Time.normWord, by=IsTA.T2.P,k=20) + s(Time.normWord, by=IsTA.T2.H,k=20) + 
                s(Time.normWord, by=IsTA.T3.P,k=20) + s(Time.normWord, by=IsTA.T3.H,k=20) + 
                s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1) + 
                s(Time.normWord,WordSensorAxis,bs='fs',m=1), data=dia, method='fREML', 
                AR.start=Word.start, rho=rhoval, cluster=cl)

smrybin <- summary(modelbin)
save(modelbin,smrybin,file='modelD-binary-1.8.7.rda')

Visualization

Individual variation

The following graph visualizes the individual variation and is equal to Figure 3 in the manuscript.

if (!file.exists('modelD-group-1.8.7.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/modelD-group-1.8.7.rda',
                  'modelD-group-1.8.7.rda') # 1773 MB
}

load('modelD-group-1.8.7.rda')
plot(model,select=13,ylab='Position (normalized)', xlab='Time (normalized)')

Autocorrelation

The following graph visualizes the autocorrelation and is equal to Figure 4 in the manuscript.

if (!file.exists('modelD-group-norho-1.8.7.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/modelD-group-norho-1.8.7.rda',
                  'modelD-group-norho-1.8.7.rda') # 1147 MB
}

load('modelD-group-norho-1.8.7.rda')

par(mfrow=c(1,2))
acf_resid(modelNoRho, main='Original autocorrelation in residuals',ylab='ACF',max_lag=30)
acf_resid(model, main='Corrected autocorrelation in residuals',ylab='ACF',max_lag=30)

Tongue trajectories in two dimensions

The following graph corresponds to Figure 7 in the manuscript and visualizes the tongue trajectories in two dimensions.

plotArt2D(model, catvar='GroupSensorAxis', 
          catlevels.x=c('TerApel.T1.P','Ubbergen.T1.P','TerApel.T2.P',
                        'Ubbergen.T2.P','TerApel.T3.P','Ubbergen.T3.P'), 
          catlevels.y=c('TerApel.T1.H','Ubbergen.T1.H','TerApel.T2.H',
                        'Ubbergen.T2.H','TerApel.T3.H','Ubbergen.T3.H'), 
          timevar='Time.normWord', collabels=c('Ter Apel', 'Ubbergen'), 
          xlab='Posterior position', ylab='Height', showPoints=T, 
          main='Position of the three tongue sensors: dialect words',
          cols=list(c("cadetblue3","cadetblue1"),c("tomato4","tomato")), 
          alpha=2, cexPoints=0.4)

Tongue trajectory differences

The following graph corresponds to Figure 8 in the manuscript and visualizes the tongue trajectory differences.

par(mfrow=c(4,3))
plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T1.P','Ubbergen.T1.P'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='topleft',main='T1 posterior position',
            xlab='Time (normalized)',ylab='Posterior position',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T2.P','Ubbergen.T2.P'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='topleft',main='T2 posterior position',
            xlab='Time (normalized)',ylab='Posterior position',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T3.P','Ubbergen.T3.P'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='bottomleft',main='T3 posterior position',
            xlab='Time (normalized)',ylab='Posterior position',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T1.P','TerApel.T1.P'), 
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T1 posterior position Ter Apel vs. Ubbergen)', 
         ylab='Posterior position difference',xlab='Time (normalized)')

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T2.P','TerApel.T2.P'), 
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T2 posterior position Ter Apel vs. Ubbergen)', 
         ylab='Posterior position difference',xlab='Time (normalized)')

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T3.P','TerApel.T3.P'), 
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T3 posterior position Ter Apel vs. Ubbergen)', 
         ylab='Posterior position difference',xlab='Time (normalized)')

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T1.H','Ubbergen.T1.H'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='topleft',main='T1 height',
            xlab='Time (normalized)',ylab='Height',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T2.H','Ubbergen.T2.H'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='topleft',main='T2 height',
            xlab='Time (normalized)',ylab='Height',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T3.H','Ubbergen.T3.H'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='bottomleft',main='T3 height',
            xlab='Time (normalized)',ylab='Height',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T1.H','TerApel.T1.H'), 
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T1 height Ter Apel vs. Ubbergen)', 
         ylab='Height difference',xlab='Time (normalized)')

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T2.H','TerApel.T2.H'),
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T2 height Ter Apel vs. Ubbergen)', 
         ylab='Height difference',xlab='Time (normalized)')

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T3.H','TerApel.T3.H'),
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T3 height Ter Apel vs. Ubbergen)', 
         ylab='Height difference',xlab='Time (normalized)')

Formal difference assessment

if (!file.exists('modelD-binary-1.8.7.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/modelD-binary-1.8.7.rda',
                  'modelD-binary-1.8.7.rda') # 1773 MB
}

load('modelD-binary-1.8.7.rda')
smrybin # show summary assessing significant differences

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Position.norm ~ s(Time.normWord, by = SensorAxis, k = 20) + SensorAxis + 
##     s(Time.normWord, by = IsTA.T1.P, k = 20) + s(Time.normWord, 
##     by = IsTA.T1.H, k = 20) + s(Time.normWord, by = IsTA.T2.P, 
##     k = 20) + s(Time.normWord, by = IsTA.T2.H, k = 20) + s(Time.normWord, 
##     by = IsTA.T3.P, k = 20) + s(Time.normWord, by = IsTA.T3.H, 
##     k = 20) + s(Time.normWord, SpeakerSensorAxis, bs = "fs", 
##     m = 1) + s(Time.normWord, WordSensorAxis, bs = "fs", m = 1)
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.695607   0.014362  48.435  < 2e-16 ***
## SensorAxisT2.P -0.241335   0.020376 -11.844  < 2e-16 ***
## SensorAxisT1.P -0.469898   0.020435 -22.995  < 2e-16 ***
## SensorAxisT3.H  0.002574   0.020409   0.126      0.9    
## SensorAxisT2.H -0.121326   0.020463  -5.929 3.05e-09 ***
## SensorAxisT1.H -0.277807   0.020535 -13.529  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                         edf   Ref.df       F  p-value    
## s(Time.normWord):SensorAxisT3.P       4.101    5.710   1.694  0.13026    
## s(Time.normWord):SensorAxisT2.P       7.943   10.154   4.286 4.47e-06 ***
## s(Time.normWord):SensorAxisT1.P      10.919   12.744   6.432 3.32e-12 ***
## s(Time.normWord):SensorAxisT3.H       8.855   10.840   7.680 2.88e-13 ***
## s(Time.normWord):SensorAxisT2.H      11.455   13.199  14.083  < 2e-16 ***
## s(Time.normWord):SensorAxisT1.H      16.164   16.914  14.512  < 2e-16 ***
## s(Time.normWord):IsTA.T1.P            5.657    7.013   2.005  0.05051 .  
## s(Time.normWord):IsTA.T1.H           17.133   18.404   9.898  < 2e-16 ***
## s(Time.normWord):IsTA.T2.P            4.072    4.928   2.937  0.01283 *  
## s(Time.normWord):IsTA.T2.H           13.831   15.783   3.514 2.72e-06 ***
## s(Time.normWord):IsTA.T3.P            2.018    2.028   1.513  0.21935    
## s(Time.normWord):IsTA.T3.H           10.102   12.219   2.466  0.00273 ** 
## s(Time.normWord,SpeakerSensorAxis) 1949.512 2148.000  39.358  < 2e-16 ***
## s(Time.normWord,WordSensorAxis)    3722.378 3774.000 121.938  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.861   Deviance explained = 86.2%
## fREML = -4.3865e+06  Scale est. = 0.0028218  n = 1541344

Aggregate model: CVC sequences

In the following the aggregate model for CVC sequences is fitted. Three models are fitted: the first model is used to determine the rho value (for correcting autocorrelation in the residuals), the second model fits separate trajectories for each level (this model is used for visualization), and the third model fits the model using difference curves, thereby yielding p-values assessing if the difference between the two groups is significant or not.

# this code is not evaluated, as each model takes about 5 hours to fit (using 36 CPUs)
cl = makeCluster(36) # 36 CPUs to calculate the model fit

# fit first model to determine autocorrelation in residuals
modelNoRho <- bam(Position.norm ~ s(Time.normWord,by=GroupSensorAxis, k=20) + 
                  GroupSensorAxis + s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1) + 
                  s(Time.normWord,WordSensorAxis,bs='fs',m=1), data=cvc, 
                  method='fREML', cluster=cl)
save(modelNoRho,file='modelCVC-group-norho-1.8.7.rda')

# assess autocorrelation in residuals
modelACF = acf(resid(modelNoRho)) 
(rhoval = as.vector(modelACF[1]$acf)) # 0.9572511

# fit model which corrects for autocorrelation in residuals
model <- bam(Position.norm ~ s(Time.normWord,by=GroupSensorAxis, k=20) + 
             GroupSensorAxis + s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1) + 
             s(Time.normWord,WordSensorAxis,bs='fs',m=1), data=cvc, 
             method='fREML', AR.start=Word.start, rho=rhoval, cluster=cl) 

smry <- summary(model)
save(model,smry,file='modelCVC-group-1.8.7.rda') # mgcv 1.8.7 was used

# fit model with difference curve
modelbin <- bam(Position.norm ~ s(Time.normWord, by=SensorAxis,k=20) + SensorAxis + 
                s(Time.normWord, by=IsTA.T1.P,k=20) + s(Time.normWord, by=IsTA.T1.H,k=20) + 
                s(Time.normWord, by=IsTA.T2.P,k=20) + s(Time.normWord, by=IsTA.T2.H,k=20) + 
                s(Time.normWord, by=IsTA.T3.P,k=20) + s(Time.normWord, by=IsTA.T3.H,k=20) + 
                s(Time.normWord,SpeakerSensorAxis,bs='fs',m=1) + 
                s(Time.normWord,WordSensorAxis,bs='fs',m=1), data=cvc, method='fREML', 
                AR.start=Word.start, rho=rhoval, cluster=cl)

smrybin <- summary(modelbin)
save(modelbin,smrybin,file='modelCVC-binary-1.8.7.rda')

Visualization

Individual variation

The following graph visualizes the individual variation.

if (!file.exists('modelCVC-group-1.8.7.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/modelCVC-group-1.8.7.rda',
                  'modelCVC-group-1.8.7.rda') # 690 MB
}

load('modelCVC-group-1.8.7.rda')
plot(model,select=13,ylab='Position (normalized)', xlab='Time (normalized)')

Autocorrelation

The following graph visualizes the autocorrelation.

if (!file.exists('modelCVC-group-norho-1.8.7.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/modelCVC-group-norho-1.8.7.rda',
                  'modelCVC-group-norho-1.8.7.rda') # 448 MB
}

load('modelCVC-group-norho-1.8.7.rda')

par(mfrow=c(1,2))
acf_resid(modelNoRho, main='Original autocorrelation in residuals',ylab='ACF',max_lag=30)
acf_resid(model, main='Corrected autocorrelation in residuals',ylab='ACF',max_lag=30)

Tongue trajectories in two dimensions

The following graph corresponds to Figure 9 in the manuscript and visualizes the tongue trajectories in two dimensions.

plotArt2D(model, catvar='GroupSensorAxis', 
          catlevels.x=c('TerApel.T1.P','Ubbergen.T1.P','TerApel.T2.P',
                        'Ubbergen.T2.P','TerApel.T3.P','Ubbergen.T3.P'), 
          catlevels.y=c('TerApel.T1.H','Ubbergen.T1.H','TerApel.T2.H',
                        'Ubbergen.T2.H','TerApel.T3.H','Ubbergen.T3.H'), 
          timevar='Time.normWord', collabels=c('Ter Apel', 'Ubbergen'), 
          xlab='Posterior position', ylab='Height', showPoints=T, 
          main='Position of the three tongue sensors: Dutch CVC sequences',
          cols=list(c("cadetblue3","cadetblue1"),c("tomato4","tomato")), 
          alpha=3,cexPoints=0.45)

Tongue trajectory differences

The following graph corresponds to Figure 10 in the manuscript and visualizes the tongue trajectory differences.

par(mfrow=c(4,3))
plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T1.P','Ubbergen.T1.P'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='topleft',main='T1 posterior position',
            xlab='Time (normalized)',ylab='Posterior position',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T2.P','Ubbergen.T2.P'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='topleft',main='T2 posterior position',
            xlab='Time (normalized)',ylab='Posterior position',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T3.P','Ubbergen.T3.P'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='bottomleft',main='T3 posterior position',
            xlab='Time (normalized)',ylab='Posterior position',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T1.P','TerApel.T1.P'), 
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T1 posterior position Ter Apel vs. Ubbergen)', 
         ylab='Posterior position difference',xlab='Time (normalized)')

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T2.P','TerApel.T2.P'), 
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T2 posterior position Ter Apel vs. Ubbergen)', 
         ylab='Posterior position difference',xlab='Time (normalized)')

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T3.P','TerApel.T3.P'), 
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T3 posterior position Ter Apel vs. Ubbergen)', 
         ylab='Posterior position difference',xlab='Time (normalized)')

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T1.H','Ubbergen.T1.H'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='topleft',main='T1 height',
            xlab='Time (normalized)',ylab='Height',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T2.H','Ubbergen.T2.H'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='topleft',main='T2 height',
            xlab='Time (normalized)',ylab='Height',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotSmooths(model,xvar='Time.normWord',catvar='GroupSensorAxis',
            catlevels=c('TerApel.T3.H','Ubbergen.T3.H'),
            dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
            ylim=c(0.05,1),legendPos='bottomleft',main='T3 height',
            xlab='Time (normalized)',ylab='Height',
            legendlabels=c('Ter Apel','Ubbergen'),legendtitle='Group',
            colors=c('cadetblue3','tomato4'),cexPoints=0.5,showPoints=F)

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T1.H','TerApel.T1.H'), 
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T1 height Ter Apel vs. Ubbergen)', 
         ylab='Height difference',xlab='Time (normalized)')

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T2.H','TerApel.T2.H'),
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T2 height Ter Apel vs. Ubbergen)', 
         ylab='Height difference',xlab='Time (normalized)')

plotDiff(model, xvar="Time.normWord", catvar='GroupSensorAxis',
         catlevels=c('Ubbergen.T3.H','TerApel.T3.H'),
         dropRanef=c('SpeakerSensorAxis','WordSensorAxis'),
         ylim=c(-0.13,0.13),main='T3 height Ter Apel vs. Ubbergen)', 
         ylab='Height difference',xlab='Time (normalized)')

Formal difference assessment

if (!file.exists('modelCVC-binary-1.8.7.rda')) { 
    download.file('http://www.let.rug.nl/wieling/DiaArt/modelCVC-binary-1.8.7.rda',
                  'modelCVC-binary-1.8.7.rda') # 690 MB
}

load('modelCVC-binary-1.8.7.rda')
smrybin # show summary assessing significant differences

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Position.norm ~ s(Time.normWord, by = SensorAxis, k = 20) + SensorAxis + 
##     s(Time.normWord, by = IsTA.T1.P, k = 20) + s(Time.normWord, 
##     by = IsTA.T1.H, k = 20) + s(Time.normWord, by = IsTA.T2.P, 
##     k = 20) + s(Time.normWord, by = IsTA.T2.H, k = 20) + s(Time.normWord, 
##     by = IsTA.T3.P, k = 20) + s(Time.normWord, by = IsTA.T3.H, 
##     k = 20) + s(Time.normWord, SpeakerSensorAxis, bs = "fs", 
##     m = 1) + s(Time.normWord, WordSensorAxis, bs = "fs", m = 1)
## 
## Parametric coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.66551    0.01785  37.291  < 2e-16 ***
## SensorAxisT2.P -0.23699    0.02525  -9.387  < 2e-16 ***
## SensorAxisT1.P -0.45210    0.02527 -17.889  < 2e-16 ***
## SensorAxisT3.H  0.07949    0.02523   3.150 0.001633 ** 
## SensorAxisT2.H -0.08800    0.02530  -3.479 0.000504 ***
## SensorAxisT1.H -0.28318    0.02531 -11.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                         edf  Ref.df       F  p-value    
## s(Time.normWord):SensorAxisT3.P      13.871   14.99   4.000 2.58e-07 ***
## s(Time.normWord):SensorAxisT2.P      14.147   15.22   4.727 3.96e-09 ***
## s(Time.normWord):SensorAxisT1.P      15.057   15.96   6.040 1.89e-13 ***
## s(Time.normWord):SensorAxisT3.H      15.267   16.08  12.028  < 2e-16 ***
## s(Time.normWord):SensorAxisT2.H      16.462   17.01  24.052  < 2e-16 ***
## s(Time.normWord):SensorAxisT1.H      17.154   17.57  22.274  < 2e-16 ***
## s(Time.normWord):IsTA.T1.P           13.439   15.70   3.770 6.52e-07 ***
## s(Time.normWord):IsTA.T1.H           13.785   16.08   2.827 0.000128 ***
## s(Time.normWord):IsTA.T2.P           11.212   13.56   3.474 1.48e-05 ***
## s(Time.normWord):IsTA.T2.H           11.795   14.24   3.833 1.27e-06 ***
## s(Time.normWord):IsTA.T3.P            9.999   12.28   2.603 0.001625 ** 
## s(Time.normWord):IsTA.T3.H            2.000    2.00   0.928 0.395547    
## s(Time.normWord,SpeakerSensorAxis) 1910.877 2148.00  28.236  < 2e-16 ***
## s(Time.normWord,WordSensorAxis)    1408.806 1452.00 144.696  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.922   Deviance explained = 92.3%
## fREML = -1.4992e+06  Scale est. = 0.0019074  n = 515454

Comparison with Linear Discriminant Analysis

In this section, the results are determined on the basis of LDA, for comparison with the results on the basis of generalized additive modeling.

LDA for dialect words

# extract relevant columns from dialect data set
ld = dia[!is.na(dia$Position.norm),c("Speaker","Group","Word","WordNr","Segment",
                                     "SensorAxis","Time.normWord","Position.norm")]

# convert to wide format, with separate columns for each sensor/axis
wide = dcast(ld, Speaker + Group + Word + WordNr + Segment + Time.normWord ~ SensorAxis, 
             value.var = "Position.norm")

sgmts = c('a','i','o','k','t')

for (sgm in sgmts) { 
    writeLines(paste("###### SEGMENT:",sgm))
    
    cat('\n### LDA output\n\n')
    # Focus on consistent target segment (rows with missing values are excluded)
    seg = droplevels(na.omit(subset(wide,Segment==sgm)))

    # LDA: Note that the group means generally show directional distances consistent 
    # with the GAMs results: TerApel more back (higher posteriority) and lower than Ubbergen
    print(seg.lda <- lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data=seg, 
                         na.action=na.omit))
    
    cat('\n\n### Test if group means are different\n\n')
    # Test if group means are different
    seg.m = manova(cbind(seg$T1.P, seg$T2.P, seg$T3.P, seg$T1.H, seg$T2.H, seg$T3.H) ~
                   seg$Group)
    print(summary(seg.m))
   
    # Test whether the predicted classes are selected above chance
    seg.p = predict(seg.lda,seg[,7:12])
    k = (seg.p$class==seg$Group)
    cat('\n\n### Confusion matrix\n')
    print(table(seg.p$class,seg$Group))
    
    cat('\n\n### Classification performance\n')
    print(binom.test(sum(k,na.rm=T),length(k)))
    cat('\n')
}

## ###### SEGMENT: a
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##   TerApel  Ubbergen 
## 0.6766776 0.3233224 
## 
## Group means:
##               T1.P      T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.2891709 0.5514214 0.7695527 0.2797721 0.4279887 0.5479407
## Ubbergen 0.2954047 0.5252838 0.7533199 0.2742079 0.4712480 0.6198329
## 
## Coefficients of linear discriminants:
##             LD1
## T1.P  11.193650
## T2.P -19.852718
## T3.P   7.671764
## T1.H  -2.154486
## T2.H   0.672566
## T3.H   4.005326
## 
## 
## ### Test if group means are different
## 
##             Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group    1 0.12102   182.45      6   7951 < 2.2e-16 ***
## Residuals 7956                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     4791     1919
##   Ubbergen     594      654
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 5445, number of trials = 7958, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6738756 0.6944236
## sample estimates:
## probability of success 
##              0.6842171 
## 
## 
## ###### SEGMENT: i
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##   TerApel  Ubbergen 
## 0.4709135 0.5290865 
## 
## Group means:
##               T1.P      T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.1943743 0.4246195 0.6464775 0.4328305 0.6848339 0.8068518
## Ubbergen 0.1729227 0.3770877 0.6174808 0.4786159 0.7157095 0.8830820
## 
## Coefficients of linear discriminants:
##             LD1
## T1.P   3.541838
## T2.P -13.432849
## T3.P   8.315477
## T1.H   7.420194
## T2.H  -8.367080
## T3.H   9.457053
## 
## 
## ### Test if group means are different
## 
##             Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group    1 0.23936   435.99      6   8313 < 2.2e-16 ***
## Residuals 8318                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     2456      947
##   Ubbergen    1462     3455
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 5911, number of trials = 8320, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.7005784 0.7201874
## sample estimates:
## probability of success 
##              0.7104567 
## 
## 
## ###### SEGMENT: o
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##   TerApel  Ubbergen 
## 0.3710556 0.6289444 
## 
## Group means:
##               T1.P      T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.3582128 0.6136622 0.8225727 0.1937995 0.3920918 0.5667555
## Ubbergen 0.3653927 0.5874748 0.8125403 0.2585737 0.4651559 0.6676979
## 
## Coefficients of linear discriminants:
##             LD1
## T1.P  11.185945
## T2.P -15.845352
## T3.P   4.833566
## T1.H   5.429878
## T2.H  -5.174782
## T3.H   7.415979
## 
## 
## ### Test if group means are different
## 
##             Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group    1 0.24185   419.15      6   7884 < 2.2e-16 ***
## Residuals 7889                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     1515      725
##   Ubbergen    1413     4238
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 5753, number of trials = 7891, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.7191054 0.7388420
## sample estimates:
## probability of success 
##              0.7290584 
## 
## 
## ###### SEGMENT: k
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##   TerApel  Ubbergen 
## 0.4399054 0.5600946 
## 
## Group means:
##               T1.P      T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.2389757 0.4551985 0.6770988 0.3345936 0.6122836 0.8075722
## Ubbergen 0.2044630 0.3963662 0.6234869 0.3794513 0.6246470 0.8710793
## 
## Coefficients of linear discriminants:
##             LD1
## T1.P -0.3277397
## T2.P -0.3929488
## T3.P -3.0346310
## T1.H  7.3502104
## T2.H -9.3643360
## T3.H 12.4287216
## 
## 
## ### Test if group means are different
## 
##             Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group    1 0.26206   374.84      6   6333 < 2.2e-16 ***
## Residuals 6338                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     1673      604
##   Ubbergen    1116     2947
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 4620, number of trials = 6340, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.7175796 0.7396229
## sample estimates:
## probability of success 
##              0.7287066 
## 
## 
## ###### SEGMENT: t
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##   TerApel  Ubbergen 
## 0.4636455 0.5363545 
## 
## Group means:
##               T1.P      T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.2352396 0.4957411 0.7062446 0.5302483 0.6366531 0.6788357
## Ubbergen 0.1336021 0.3765153 0.6251408 0.5443852 0.6692314 0.7242568
## 
## Coefficients of linear discriminants:
##             LD1
## T1.P  -2.898990
## T2.P -14.115635
## T3.P   2.563233
## T1.H   3.975205
## T2.H  -1.244822
## T3.H   1.334026
## 
## 
## ### Test if group means are different
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group     1 0.45873   1834.8      6  12990 < 2.2e-16 ***
## Residuals 12995                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     4658      792
##   Ubbergen    1368     6179
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 10837, number of trials = 12997, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.8272950 0.8401708
## sample estimates:
## probability of success 
##              0.8338078

LDA for CVC sequences

# extract relevant columns from dialect data set
ld = cvc[!is.na(cvc$Position.norm),c("Speaker","Group","Word","WordNr","Segment",
                                     "SensorAxis","Time.normWord","Position.norm")]

# convert to wide format, with separate columns for each sensor/axis
wide = dcast(ld, Speaker + Group + Word + WordNr + Segment + Time.normWord ~ SensorAxis, 
             value.var = "Position.norm")

sgmts = c('i','a','o','k','t')

for (sgm in sgmts) { 
    writeLines(paste("###### SEGMENT:",sgm))
    
    cat('\n### LDA output\n\n')
    # Focus on consistent target segment (rows with missing values are excluded)
    seg = droplevels(na.omit(subset(wide,Segment==sgm)))

    # LDA: Note that the group means generally show directional distances consistent 
    # with the GAMs results: TerApel more back (higher posteriority) and lower than Ubbergen
    print(seg.lda <- lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data=seg, 
                         na.action=na.omit))
    
    cat('\n\n### Test if group means are different\n\n')
    # Test if group means are different
    seg.m = manova(cbind(seg$T1.P, seg$T2.P, seg$T3.P, seg$T1.H, seg$T2.H, seg$T3.H) ~
                   seg$Group)
    print(summary(seg.m))
   
    # Test whether the predicted classes are selected above chance
    seg.p = predict(seg.lda,seg[,7:12])
    k = (seg.p$class==seg$Group)
    cat('\n\n### Confusion matrix\n')
    print(table(seg.p$class,seg$Group))
    
    cat('\n\n### Classification performance\n')
    print(binom.test(sum(k,na.rm=T),length(k)))
    cat('\n')
}

## ###### SEGMENT: i
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##   TerApel  Ubbergen 
## 0.4507191 0.5492809 
## 
## Group means:
##               T1.P      T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.1839142 0.3964286 0.6282997 0.4243656 0.7254252 0.8612279
## Ubbergen 0.1469226 0.3375954 0.5879307 0.4483011 0.7097348 0.9161968
## 
## Coefficients of linear discriminants:
##            LD1
## T1.P -2.580868
## T2.P -8.799849
## T3.P  4.289847
## T1.H  7.263122
## T2.H -9.097211
## T3.H 10.249383
## 
## 
## ### Test if group means are different
## 
##             Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group    1 0.30142   544.53      6   7572 < 2.2e-16 ***
## Residuals 7577                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     2142      901
##   Ubbergen    1274     3262
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 5404, number of trials = 7579, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.7026923 0.7231893
## sample estimates:
## probability of success 
##              0.7130228 
## 
## 
## ###### SEGMENT: a
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##   TerApel  Ubbergen 
## 0.5013864 0.4986136 
## 
## Group means:
##               T1.P      T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.2671566 0.5022219 0.7182539 0.2818187 0.5024803 0.6200569
## Ubbergen 0.2438843 0.4696303 0.7092268 0.2807898 0.4323438 0.5824709
## 
## Coefficients of linear discriminants:
##              LD1
## T1.P   0.1357679
## T2.P -14.0853354
## T3.P   9.4084456
## T1.H   7.4292286
## T2.H -13.9236000
## T3.H   4.8903032
## 
## 
## ### Test if group means are different
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group     1 0.25047   782.97      6  14058 < 2.2e-16 ***
## Residuals 14063                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     5209     2223
##   Ubbergen    1843     4790
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 9999, number of trials = 14065, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.7033429 0.7183970
## sample estimates:
## probability of success 
##              0.7109136 
## 
## 
## ###### SEGMENT: o
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##   TerApel  Ubbergen 
## 0.5786739 0.4213261 
## 
## Group means:
##               T1.P      T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.3836173 0.6127728 0.8218935 0.2471225 0.5014741 0.6700617
## Ubbergen 0.3509027 0.5603653 0.7790775 0.2853263 0.5209739 0.7160739
## 
## Coefficients of linear discriminants:
##              LD1
## T1.P   8.4741358
## T2.P -16.7488646
## T3.P   0.1832691
## T1.H   3.7233562
## T2.H  -5.6484760
## T3.H   5.3834986
## 
## 
## ### Test if group means are different
## 
##             Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group    1 0.15658    273.7      6   8846 < 2.2e-16 ***
## Residuals 8851                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     4074     1871
##   Ubbergen    1049     1859
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 5933, number of trials = 8853, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6602635 0.6799610
## sample estimates:
## probability of success 
##              0.6701683 
## 
## 
## ###### SEGMENT: k
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##   TerApel  Ubbergen 
## 0.5167247 0.4832753 
## 
## Group means:
##               T1.P      T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.2434248 0.4484329 0.6751671 0.3166901 0.6208556 0.8212812
## Ubbergen 0.2109022 0.4009014 0.6302337 0.3703093 0.6244048 0.8613941
## 
## Coefficients of linear discriminants:
##             LD1
## T1.P  0.5104503
## T2.P -1.9469977
## T3.P -1.8610251
## T1.H  9.6625347
## T2.H -9.6531470
## T3.H  8.4316152
## 
## 
## ### Test if group means are different
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group     1 0.19397   644.83      6  16077 < 2.2e-16 ***
## Residuals 16082                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     5918     2656
##   Ubbergen    2393     5117
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 11035, number of trials = 16084, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6788492 0.6932546
## sample estimates:
## probability of success 
##              0.6860856 
## 
## 
## ###### SEGMENT: t
## 
## ### LDA output
## 
## Call:
## lda(Group ~ T1.P + T2.P + T3.P + T1.H + T2.H + T3.H, data = seg, 
##     na.action = na.omit)
## 
## Prior probabilities of groups:
##  TerApel Ubbergen 
## 0.533519 0.466481 
## 
## Group means:
##               T1.P     T2.P      T3.P      T1.H      T2.H      T3.H
## TerApel  0.2288785 0.486989 0.7036794 0.5242628 0.6647319 0.7172740
## Ubbergen 0.1277324 0.368123 0.6161714 0.5246833 0.6693391 0.7410292
## 
## Coefficients of linear discriminants:
##              LD1
## T1.P  -2.8559369
## T2.P -12.5088756
## T3.P   0.3429155
## T1.H   4.0766302
## T2.H  -4.2952272
## T3.H   0.8676794
## 
## 
## ### Test if group means are different
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## seg$Group     1 0.43207   2092.9      6  16506 < 2.2e-16 ***
## Residuals 16511                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ### Confusion matrix
##           
##            TerApel Ubbergen
##   TerApel     7122     1482
##   Ubbergen    1688     6221
## 
## 
## ### Classification performance
## 
##  Exact binomial test
## 
## data:  sum(k, na.rm = T) and length(k)
## number of successes = 13343, number of trials = 16513, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.8019386 0.8140128
## sample estimates:
## probability of success 
##                0.80803

Comparison with formants

In this section, the articulatory results are compared to the acoustic results.

Formant comparison for dialect words

Correlations

# correlations of F1 with height
(t1h = with(subset(dia,SensorAxis=='T1.H'),cor(Position.norm,F1.norm,use='pair')))

## [1] -0.1816255

(t2h = with(subset(dia,SensorAxis=='T2.H'),cor(Position.norm,F1.norm,use='pair')))

## [1] -0.282886

(t3h = with(subset(dia,SensorAxis=='T3.H'),cor(Position.norm,F1.norm,use='pair')))

## [1] -0.3364694

mean(c(t1h,t2h,t3h))

## [1] -0.2669937

# correlations of F2 with posterior position
(t1p = with(subset(dia,SensorAxis=='T1.P'),cor(Position.norm,F2.norm,use='pair')))

## [1] -0.4720476

(t2p = with(subset(dia,SensorAxis=='T2.P'),cor(Position.norm,F2.norm,use='pair')))

## [1] -0.5302865

(t3p = with(subset(dia,SensorAxis=='T3.P'),cor(Position.norm,F2.norm,use='pair')))

## [1] -0.5415716

mean(c(t1p,t2p,t3p))

## [1] -0.5146352

# all correlations are significant, example for weakest correlation:
summary(lm(Position.norm ~ F1.norm, data=dia[dia$SensorAxis=='T1.H',]))

## 
## Call:
## lm(formula = Position.norm ~ F1.norm, data = dia[dia$SensorAxis == 
##     "T1.H", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38135 -0.08493  0.00566  0.08828  0.53081 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.3541624  0.0004041  876.52   <2e-16 ***
## F1.norm     -0.0259020  0.0004320  -59.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1311 on 105372 degrees of freedom
##   (152850 observations deleted due to missingness)
## Multiple R-squared:  0.03299,    Adjusted R-squared:  0.03298 
## F-statistic:  3595 on 1 and 105372 DF,  p-value: < 2.2e-16

Visualization

The following visualization corresponds to Figure 11 in the manuscript.

sensoraxis = c('T1.H','T2.H','T3.H','T1.P','T2.P','T3.P')
par(mfrow=c(2,3))

for (sa in sensoraxis) {
    # select sensor and axis
    tmp = dia[dia$SensorAxis==sa,]

    # labels, etc. dependent on sensoraxis
    if (sa %in% c('T1.P','T2.P','T3.P')) { 
        xl = 'Posterior position (normalized)'
        yl = 'F2 (normalized)'
        tmp$F.norm = tmp$F2.norm # F2 corresponds with posterior position
    } else { 
        reg1 <- lm(tmp$F1.norm ~ tmp$Position.norm)
        xl = 'Height (normalized)'
        yl = 'F1 (normalized)'
        tmp$F.norm = tmp$F1.norm # F1 corresponds with height
    }
    
    # obtain regression line
    reg1 <- lm(tmp$F.norm ~ tmp$Position.norm)

    # scatterplot of position and F1 valuess
    plot(tmp$Position.norm, tmp$F.norm, xlab=xl, ylab=yl,
         main=paste('Dialect words:',substr(sa,1,2)), col='grey', cex.lab=2,
         cex.main=2,cex.lab=1.5,xlim=c(0,1),ylim=c(-5,5))
    abline(reg1,lty=2)
}

Group differences

# select formants (not separated per axis and sensor)
formantsDia = unique(dia[,c("Speaker","Group","Word","WordNr","Segment",
                            "SegmentNr","Time.normWord","F1.norm","F2.norm")])

# compare to Ubbergen: set as reference level
formantsDia$Group = relevel(formantsDia$Group,'Ubbergen') 

# mixed model shows lower F1 in Ter Apel for dialect words
summary(lmer(F1.norm ~ Group + (1|Speaker),data=formantsDia))

## Linear mixed model fit by REML ['lmerMod']
## Formula: F1.norm ~ Group + (1 | Speaker)
##    Data: formantsDia
## 
## REML criterion at convergence: 285654.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.4984 -0.6981 -0.1521  0.6524  7.1421 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  Speaker  (Intercept) 0.0007906 0.02812 
##  Residual             0.8716251 0.93361 
## Number of obs: 105756, groups:  Speaker, 40
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept)   0.002596   0.007573   0.343
## GroupTerApel -0.077862   0.010616  -7.335
## 
## Correlation of Fixed Effects:
##             (Intr)
## GroupTerApl -0.713

# mixed model shows lower F2 in Ter Apel for dialect words
summary(lmer(F2.norm ~ Group + (1|Speaker),data=formantsDia))

## Linear mixed model fit by REML ['lmerMod']
## Formula: F2.norm ~ Group + (1 | Speaker)
##    Data: formantsDia
## 
## REML criterion at convergence: 301038.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4810 -0.8317  0.0594  0.6663  4.4686 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  Speaker  (Intercept) 0.0009088 0.03015 
##  Residual             0.9958372 0.99792 
## Number of obs: 106213, groups:  Speaker, 40
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept)   0.046534   0.008098   5.747
## GroupTerApel -0.063604   0.011365  -5.596
## 
## Correlation of Fixed Effects:
##             (Intr)
## GroupTerApl -0.712

Formant comparison for CVC sequences

Correlations

# correlations of F1 with height
(t1h = with(subset(cvc,SensorAxis=='T1.H'),cor(Position.norm,F1.norm,use='pair')))

## [1] -0.3380322

(t2h = with(subset(cvc,SensorAxis=='T2.H'),cor(Position.norm,F1.norm,use='pair')))

## [1] -0.504336

(t3h = with(subset(cvc,SensorAxis=='T3.H'),cor(Position.norm,F1.norm,use='pair')))

## [1] -0.6216022

mean(c(t1h,t2h,t3h))

## [1] -0.4879901

# correlations of F2 with posterior position
(t1p = with(subset(cvc,SensorAxis=='T1.P'),cor(Position.norm,F2.norm,use='pair')))

## [1] -0.640532

(t2p = with(subset(cvc,SensorAxis=='T2.P'),cor(Position.norm,F2.norm,use='pair')))

## [1] -0.6926906

(t3p = with(subset(cvc,SensorAxis=='T3.P'),cor(Position.norm,F2.norm,use='pair')))

## [1] -0.678066

mean(c(t1p,t2p,t3p))

## [1] -0.6704295

# all correlations are significant, example for weakest correlation:
summary(lm(Position.norm ~ F1.norm, data=cvc[cvc$SensorAxis=='T1.H',]))

## 
## Call:
## lm(formula = Position.norm ~ F1.norm, data = cvc[cvc$SensorAxis == 
##     "T1.H", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34468 -0.09077 -0.00201  0.08314  0.51842 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.3083326  0.0007122  432.95   <2e-16 ***
## F1.norm     -0.0388728  0.0006012  -64.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1276 on 32404 degrees of freedom
##   (53656 observations deleted due to missingness)
## Multiple R-squared:  0.1143, Adjusted R-squared:  0.1142 
## F-statistic:  4180 on 1 and 32404 DF,  p-value: < 2.2e-16

Visualization

The following visualization corresponds to Figure 12 in the manuscript.

sensoraxis = c('T1.H','T2.H','T3.H','T1.P','T2.P','T3.P')
par(mfrow=c(2,3))

for (sa in sensoraxis) {
    # select sensor and axis
    tmp = cvc[cvc$SensorAxis==sa,]

    # labels, etc. dependent on sensoraxis
    if (sa %in% c('T1.P','T2.P','T3.P')) { 
        xl = 'Posterior position (normalized)'
        yl = 'F2 (normalized)'
        tmp$F.norm = tmp$F2.norm # F2 corresponds with posterior position
    } else { 
        reg1 <- lm(tmp$F1.norm ~ tmp$Position.norm)
        xl = 'Height (normalized)'
        yl = 'F1 (normalized)'
        tmp$F.norm = tmp$F1.norm # F1 corresponds with height
    }
    
    # obtain regression line
    reg1 <- lm(tmp$F.norm ~ tmp$Position.norm)
    
    # scatterplot of position and F1 valuess
    plot(tmp$Position.norm, tmp$F.norm, xlab=xl, ylab=yl,
         main=paste('CVC sequences:',substr(sa,1,2)), col='grey', cex.lab=2,
         cex.main=2,cex.lab=1.5,xlim=c(0,1),ylim=c(-5,5))
    abline(reg1,lty=2)
}

Group differences

# select formants (not separated per axis and sensor)
formantsCVC = unique(cvc[,c("Speaker","Group","Word","WordNr","Segment",
                            "SegmentNr","Time.normWord","F1.norm","F2.norm")])

# compare to Ubbergen: set as reference level
formantsCVC$Group = relevel(formantsCVC$Group,'Ubbergen') 

# mixed model shows higher F1 in Ter Apel for dialect words
summary(lmer(F1.norm ~ Group + (1|Speaker),data=formantsCVC))

## Linear mixed model fit by REML ['lmerMod']
## Formula: F1.norm ~ Group + (1 | Speaker)
##    Data: formantsCVC
## 
## REML criterion at convergence: 102914
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6729 -0.8326 -0.1990  0.9309  3.4232 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Speaker  (Intercept) 0.009558 0.09777 
##  Residual             1.369114 1.17009 
## Number of obs: 32623, groups:  Speaker, 40
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept)  -0.007803   0.024227  -0.322
## GroupTerApel  0.239976   0.033595   7.143
## 
## Correlation of Fixed Effects:
##             (Intr)
## GroupTerApl -0.721

# mixed model shows higher F2 in Ter Apel for dialect words
summary(lmer(F2.norm ~ Group + (1|Speaker),data=formantsCVC))

## Linear mixed model fit by REML ['lmerMod']
## Formula: F2.norm ~ Group + (1 | Speaker)
##    Data: formantsCVC
## 
## REML criterion at convergence: 89202.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.5068 -0.8843 -0.0824  0.5032  3.4622 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Speaker  (Intercept) 0.01335  0.1156  
##  Residual             0.97861  0.9892  
## Number of obs: 31636, groups:  Speaker, 40
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  -0.16878    0.02766  -6.102
## GroupTerApel  0.22049    0.03827   5.762
## 
## Correlation of Fixed Effects:
##             (Intr)
## GroupTerApl -0.723

Replication

To replicate the analysis presented above, you can just copy the following lines to the most recent version of R. Please note that you first need to install Pandoc. Warning: when replicating this analysis, several data sets and fitted models will need to be downloaded, amounting to about 8 GB (i.e. 8000 MB).

download.file('http://www.let.rug.nl/wieling/DiaArt/analysis.Rmd', 'analysis.Rmd')
library(rmarkdown)
render('analysis.Rmd') # generates html file with results
browseURL(paste('file://', file.path(getwd(),'analysis.html'), sep='')) # shows result

Data, documentation and analysis scripts for Investigating dialectal differences using articulography

Abstract

Required libraries and functions

Dataset

Data description

Word-specific models

Visualizing word-specific patterns

Assessing signficant trajectory differences: “taarten”

Visualization

Formal difference assessment

Aggregate model: dialect words

Visualization

Individual variation

Autocorrelation

Tongue trajectories in two dimensions

Tongue trajectory differences

Formal difference assessment

Aggregate model: CVC sequences

Visualization

Individual variation

Autocorrelation

Tongue trajectories in two dimensions

Tongue trajectory differences

Formal difference assessment

Comparison with Linear Discriminant Analysis

LDA for dialect words

LDA for CVC sequences

Comparison with formants

Formant comparison for dialect words

Correlations

Visualization

Group differences

Formant comparison for CVC sequences

Correlations

Visualization

Group differences

Replication