UH/UM in German

Martijn Wieling (University of Groningen, the Netherlands)

Data and scripts for the Language Log guest post (German) by Martijn Wieling

To run the complete analysis yourself, please refer to the bottom of this page.


Preparation

The following lines load the required libraries, download the required files, and load the data.

# load required packages
library(lme4)
library(ggplot2)

# version information
R.version.string
## [1] "R version 3.1.1 (2014-07-10)"
packageVersion('lme4')
## [1] '1.1.7'
packageVersion('ggplot2')
## [1] '1.0.0'
# download required files and scripts
download.file('http://www.let.rug.nl/wieling/ll/multiplot.R', 'multiplot.R')
download.file('http://www.let.rug.nl/wieling/ll/German-UH-UM.txt', 'German-UH-UM.txt')

# load custom plotting function
source('multiplot.R') # custom plotting function

# read data
spk = read.table('German-UH-UM.txt', sep='\t', header=TRUE, encoding='UTF-8')


Results

The column ‘UM’ is TRUE if the hesitation marker was “ühm” or “öhm”, and FALSE if it was “üh” or “öh”.

dat = aggregate(spk$UM,by=list(spk$AgeGroup,spk$Gender),FUN=mean)
colnames(dat) = c('BirthYear','Gender','RelFreqUM')
dat
##   BirthYear Gender RelFreqUM
## 1 1930-1964      F    0.1390
## 2 1965-1981      F    0.4199
## 3 1982-1986      F    0.5431
## 4 1987-2006      F    0.6258
## 5 1930-1964      M    0.2040
## 6 1965-1981      M    0.3331
## 7 1982-1986      M    0.4633
## 8 1987-2006      M    0.4950
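
Because ‘UM’ is a logical column, its mean within a group is the proportion of TRUE values, which is why FUN=mean yields relative frequencies. A minimal sketch on made-up data (the toy data frame and its values are assumptions for illustration, not taken from the corpus):

```r
# toy data: hypothetical observations, not from German-UH-UM.txt
toy <- data.frame(
  UM       = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
  AgeGroup = c('old', 'old', 'young', 'young', 'young', 'old'),
  Gender   = c('F', 'F', 'F', 'F', 'M', 'M')
)

# mean() of a logical vector treats TRUE as 1 and FALSE as 0,
# so the group mean is the share of 'um' tokens in that group
toy_freq <- aggregate(toy$UM, by = list(toy$AgeGroup, toy$Gender), FUN = mean)
colnames(toy_freq) <- c('AgeGroup', 'Gender', 'RelFreqUM')
toy_freq
```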

To generate the graph, we first calculate the 95% confidence interval around the average value of each group (age group by gender). The interval extends 1.96 standard errors on either side of the group mean, where the standard error is the standard deviation divided by the square root of the number of observations in the group. The calculation can be done as follows:

# calculation of standard deviation
tmp = aggregate(spk$UM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.sd')
dat = merge(dat,tmp,by=c('BirthYear','Gender'))

# calculation of the number of observations per group
spk$ones = 1
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.N')
dat = merge(dat,tmp,by=c('BirthYear','Gender'))

# storing the 95% confidence bands (1.96 standard errors above and below the mean)
dat$RelFreqUM.lower = dat$RelFreqUM - (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
dat$RelFreqUM.upper = dat$RelFreqUM + (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
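
The three steps above (standard deviation, group size, interval bounds) can also be wrapped in a single helper. The following is only a sketch of the same normal-approximation interval; the function name ci95 and the example vector are hypothetical and not part of the original scripts:

```r
# mean and normal-approximation 95% confidence interval for a logical vector
ci95 <- function(x) {
  n  <- length(x)
  m  <- mean(x)
  se <- sd(x) / sqrt(n)  # standard error of the mean
  c(mean = m, lower = m - 1.96 * se, upper = m + 1.96 * se, n = n)
}

# hypothetical group: 30 'um' tokens out of 100 hesitation markers
x <- rep(c(TRUE, FALSE), times = c(30, 70))
ci95(x)
```

Applied per group (for instance via aggregate or tapply), such a helper would give the same bounds as the merge-based calculation, assuming no missing values in ‘UM’.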

The following command draws the graph, with the confidence intervals shown as error bars:

ggplot(data = dat, aes(x = BirthYear, y = RelFreqUM, colour = Gender)) + 
geom_line(aes(group = Gender)) + theme_bw() + xlab('Year of birth') + ylab(" ") + 
geom_errorbar(aes(ymin=RelFreqUM.lower, ymax=RelFreqUM.upper, width=.1)) +
ggtitle("Relative frequency of 'um' (238 speakers)")

[Figure: relative frequency of ‘um’ by year of birth and gender, with 95% confidence intervals]

The plot shows that younger speakers use ‘um’ relatively more often than older speakers, and that women do so more often than men in all but the oldest age group. The mixed-effects logistic regression model supports this pattern:

# BirthYear was z-transformed to prevent a warning during fitting of the model
model1 = glmer(UM ~ BirthYear.z + Gender + 
                    (1|Speaker) + (0+BirthYear.z|Interview) + 
                    (1+Gender|Interview), family='binomial', data=spk) 
summary(model1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: UM ~ BirthYear.z + Gender + (1 | Speaker) + (0 + BirthYear.z |  
##     Interview) + (1 + Gender | Interview)
##    Data: spk
## 
##      AIC      BIC   logLik deviance df.resid 
##    17299    17361    -8642    17283    16213 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.693 -0.649 -0.273  0.696  6.183 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev. Corr 
##  Speaker     (Intercept) 0.9259   0.962         
##  Interview   BirthYear.z 0.0207   0.144         
##  Interview.1 (Intercept) 0.0708   0.266         
##              GenderM     0.4193   0.648    -0.80
## Number of obs: 16221, groups:  Speaker, 238; Interview, 234
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.125      0.108    1.15    0.249    
## BirthYear.z    0.943      0.082   11.50   <2e-16 ***
## GenderM       -0.434      0.161   -2.70    0.007 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) BrthY.
## BirthYear.z -0.047       
## GenderM     -0.688  0.083
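
The comment in the model code notes that year of birth was z-transformed. The data file already contains the transformed column BirthYear.z, so the following is only an illustration of what such a transformation does; the birth years used here are hypothetical:

```r
# hypothetical birth years; the real file already provides BirthYear.z
birth_year <- c(1950, 1965, 1980, 1995, 2010)

# subtract the mean and divide by the standard deviation
birth_year_z <- (birth_year - mean(birth_year)) / sd(birth_year)

# scale() performs the same transformation (it returns a one-column matrix)
all.equal(birth_year_z, as.numeric(scale(birth_year)))
```

After the transformation the variable has mean 0 and standard deviation 1, so a value of 0 corresponds to the average year of birth and the model estimate expresses the change per standard deviation.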

The random-effects structure of the model (determined via AIC-based model comparison) is included to guard against p-values that are too low. The first line of the fixed-effects table, the part we focus on here, shows the intercept. Its estimate is positive but does not differ significantly from 0, which indicates that female speakers born in 1980 (the average year of birth, where BirthYear.z equals 0) do not use ‘um’ significantly more often than ‘uh’. The positive estimate for year of birth indicates that younger speakers (those with a later year of birth) are significantly more likely to use ‘um’ rather than ‘uh’. Similarly, the negative estimate for male gender indicates that men are significantly less likely than women to use ‘um’ rather than ‘uh’.
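
The estimates are on the log-odds scale, which can be hard to read directly. As a rough aid, the sketch below converts the fixed-effect estimates printed in the summary of model1 to probabilities and odds ratios. It assumes all random effects are set to zero (a "typical" speaker and interview), so the values are not population averages:

```r
# fixed-effect estimates copied from the summary of model1 (log-odds scale)
b_intercept <- 0.125
b_birthyear <- 0.943
b_male      <- -0.434

# predicted probability of 'um' at BirthYear.z = 0 (random effects set to zero)
p_female <- plogis(b_intercept)
p_male   <- plogis(b_intercept + b_male)

# odds ratios: multiplicative change in the odds of 'um'
or_birthyear <- exp(b_birthyear)  # per standard deviation of birth year
or_male      <- exp(b_male)       # men relative to women

round(c(p_female = p_female, p_male = p_male,
        or_birthyear = or_birthyear, or_male = or_male), 2)
```

Under these assumptions the predicted probability of ‘um’ at the average year of birth is roughly 0.53 for women and 0.42 for men.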

Although the graph above suggests an interaction between year of birth and gender, this interaction is not significant (p = 0.33), as the following model shows.

model2 = glmer(UM ~ BirthYear.z * Gender + 
                    (1|Speaker) + (0+BirthYear.z|Interview) + 
                    (1+Gender|Interview), family='binomial', data=spk) 
summary(model2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: UM ~ BirthYear.z * Gender + (1 | Speaker) + (0 + BirthYear.z |  
##     Interview) + (1 + Gender | Interview)
##    Data: spk
## 
##      AIC      BIC   logLik deviance df.resid 
##    17300    17369    -8641    17282    16212 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.686 -0.649 -0.270  0.696  5.967 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev. Corr 
##  Speaker     (Intercept) 0.9151   0.957         
##  Interview   BirthYear.z 0.0212   0.145         
##  Interview.1 (Intercept) 0.0716   0.268         
##              GenderM     0.4235   0.651    -0.80
## Number of obs: 16221, groups:  Speaker, 238; Interview, 234
## 
## Fixed effects:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.119      0.108    1.10   0.2719    
## BirthYear.z            1.028      0.120    8.59   <2e-16 ***
## GenderM               -0.437      0.160   -2.72   0.0065 ** 
## BirthYear.z:GenderM   -0.160      0.163   -0.98   0.3256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) BrthY. GendrM
## BirthYear.z -0.073              
## GenderM     -0.687  0.044       
## BrthYr.z:GM  0.056 -0.731  0.020
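
The two models can also be compared with a likelihood-ratio test. The sketch below uses only the deviances printed in the two summaries, which are rounded to whole numbers, so the result is approximate:

```r
# deviances as printed (rounded) in the two summaries above
dev_model1 <- 17283  # UM ~ BirthYear.z + Gender
dev_model2 <- 17282  # UM ~ BirthYear.z * Gender

# likelihood-ratio statistic; model2 has one extra parameter (the interaction)
lr_stat <- dev_model1 - dev_model2
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
p_value

# with the fitted objects available, the equivalent call would be:
# anova(model1, model2)
```

The resulting p-value is in line with the Wald test for the interaction term, and the slightly higher AIC of model2 (17300 versus 17299) points the same way: adding the interaction does not improve the model.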


Acknowledgements

I thank Thomas Schmidt of the IDS Mannheim for helping me to obtain the data from the Datenbank für Gesprochenes Deutsch.


Replication of the analysis

To replicate the analysis presented above, copy the following lines into a recent version of R. You need the packages ‘lme4’, ‘ggplot2’ and ‘rmarkdown’. If any of them is not installed (the library commands will then throw an error), uncomment the first three lines (i.e. remove the leading hash sign) to install them.

#install.packages('lme4',repos='http://cran.us.r-project.org')
#install.packages('ggplot2',repos='http://cran.us.r-project.org')
#install.packages('rmarkdown',repos='http://cran.us.r-project.org')
download.file('http://www.let.rug.nl/wieling/ll/analysis-German.Rmd', 'analysis-German.Rmd')
library(rmarkdown)
render('analysis-German.Rmd') # generates html file with results
browseURL(paste('file://', file.path(getwd(),'analysis-German.html'), sep='')) # shows result
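
Instead of waiting for a library command to fail, the required packages can be checked up front. This is a minimal sketch; the helper name missing_packages is hypothetical and not part of the original scripts:

```r
# return the packages from 'pkgs' that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c('lme4', 'ggplot2', 'rmarkdown')
to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message('Missing packages: ', paste(to_install, collapse = ', '))
  # install.packages(to_install)  # uncomment to install them
}
```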