Martijn Wieling (University of Groningen, the Netherlands)
Data and scripts for the Language Log guest post (German) of Martijn Wieling
To run the complete analysis yourself, please refer to the bottom of this page.
Preparation
The following lines load the required packages, download the required files, and load the data.
# load required packages
library(lme4)
library(ggplot2)
# version information
R.version.string
## [1] "R version 3.1.1 (2014-07-10)"
packageVersion('lme4')
## [1] '1.1.7'
packageVersion('ggplot2')
## [1] '1.0.0'
# download required files and scripts
download.file('http://www.let.rug.nl/wieling/ll/multiplot.R', 'multiplot.R')
download.file('http://www.let.rug.nl/wieling/ll/German-UH-UM.txt', 'German-UH-UM.txt')
# load custom plotting function
source('multiplot.R')
# read data (tab-separated, with a header row)
spk = read.table('German-UH-UM.txt', sep='\t', header=TRUE, encoding='UTF-8')
Results
The column ‘UM’ contains TRUE if the hesitation marker was “ühm” or “öhm”, and FALSE if it was “üh” or “öh”. Averaging this column per age group and gender therefore gives the relative frequency of ‘um’ in each group.
dat = aggregate(spk$UM,by=list(spk$AgeGroup,spk$Gender),FUN=mean)
colnames(dat) = c('BirthYear','Gender','RelFreqUM')
dat
##   BirthYear Gender RelFreqUM
## 1 1930-1964      F    0.1390
## 2 1965-1981      F    0.4199
## 3 1982-1986      F    0.5431
## 4 1987-2006      F    0.6258
## 5 1930-1964      M    0.2040
## 6 1965-1981      M    0.3331
## 7 1982-1986      M    0.4633
## 8 1987-2006      M    0.4950
To generate the graph, we first have to calculate the 95% confidence intervals around the average values (per age group and gender) that we want to visualize. The 95% confidence interval extends 1.96 standard errors above and below the average of each group. The standard error is calculated by dividing the standard deviation by the square root of the number of observations in the group. The calculation can thus be done as follows:
# calculation of standard deviation
tmp = aggregate(spk$UM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.sd')
dat = merge(dat,tmp,by=c('BirthYear','Gender'))
# calculation of the number of observations per group
spk$ones = 1
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.N')
dat = merge(dat,tmp,by=c('BirthYear','Gender'))
# storing the 95% confidence bands (1.96 standard errors above and below the mean)
dat$RelFreqUM.lower = dat$RelFreqUM - (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
dat$RelFreqUM.upper = dat$RelFreqUM + (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
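The same calculation can be wrapped in a small helper. The following is only a sketch: the function name `mean_ci` and the toy vector are made up for illustration and are not part of the original scripts or the corpus data.

```r
# Mean and normal-approximation 95% confidence interval for a TRUE/FALSE vector.
# 'mean_ci' is a hypothetical helper name, not part of the original analysis.
mean_ci <- function(x, z = 1.96) {
  x  <- as.numeric(x)        # TRUE/FALSE become 1/0
  n  <- length(x)            # number of observations
  m  <- mean(x)              # relative frequency of TRUE
  se <- sd(x) / sqrt(n)      # standard error of the mean
  c(mean = m, lower = m - z * se, upper = m + z * se, n = n)
}

# Toy data (invented for illustration, NOT taken from the corpus)
toy <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
res <- mean_ci(toy)
print(round(res, 3))
```

With only ten observations the interval is very wide. The groups in the actual data are much larger, which is why the error bars in the graph are comparatively narrow.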
The following command draws the graph, with the confidence intervals shown as error bars:
ggplot(data = dat, aes(x = BirthYear, y = RelFreqUM, colour = Gender)) +
  geom_line(aes(group = Gender)) + theme_bw() + xlab('Year of birth') + ylab(" ") +
  geom_errorbar(aes(ymin = RelFreqUM.lower, ymax = RelFreqUM.upper), width = .1) +
  ggtitle("Relative frequency of 'um' (238 speakers)")
The plot shows that younger speakers use ‘um’ relatively more often than older speakers, and that women do so more often than men in every age group except the oldest. The mixed-effects logistic regression model supports this pattern:
# BirthYear was z-transformed to prevent a warning during fitting of the model
model1 = glmer(UM ~ BirthYear.z + Gender +
(1|Speaker) + (0+BirthYear.z|Interview) +
(1+Gender|Interview), family='binomial', data=spk)
summary(model1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial ( logit )
## Formula: UM ~ BirthYear.z + Gender + (1 | Speaker) + (0 + BirthYear.z |
##     Interview) + (1 + Gender | Interview)
##    Data: spk
##
##      AIC      BIC   logLik deviance df.resid
##    17299    17361    -8642    17283    16213
##
## Scaled residuals:
##    Min     1Q Median     3Q    Max
## -3.693 -0.649 -0.273  0.696  6.183
##
## Random effects:
##  Groups      Name        Variance Std.Dev. Corr
##  Speaker     (Intercept) 0.9259   0.962
##  Interview   BirthYear.z 0.0207   0.144
##  Interview.1 (Intercept) 0.0708   0.266
##              GenderM     0.4193   0.648    -0.80
## Number of obs: 16221, groups: Speaker, 238; Interview, 234
##
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    0.125      0.108    1.15    0.249
## BirthYear.z    0.943      0.082   11.50   <2e-16 ***
## GenderM       -0.434      0.161   -2.70    0.007 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
##             (Intr) BrthY.
## BirthYear.z -0.047
## GenderM     -0.688  0.083
The random-effects structure of the model (determined via AIC-based model comparison) is included to guard against p-values that are too low. The first line of the fixed-effects description (the part we are focusing on here) shows the intercept. As this positive estimate does not differ significantly from 0, female speakers born in 1980 (the average year of birth, where BirthYear.z equals 0) do not use ‘um’ significantly more frequently than ‘uh’. The positive estimate for year of birth indicates that younger speakers (those with a later year of birth) are significantly more likely to use ‘um’ as opposed to ‘uh’. Similarly, the negative estimate for the male gender indicates that men are significantly less likely than women to use ‘um’ as opposed to ‘uh’.
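Because the model is a logistic regression, the estimates are on the log-odds scale. As a rough illustration, the sketch below converts the rounded estimates printed in the summary into probabilities and odds ratios. The variable names are made up for this example, and the probabilities hold for a "typical" speaker and interview, i.e. they assume all random effects are zero.

```r
# Fixed-effect estimates copied (rounded) from the model1 summary; log-odds scale
b_intercept <- 0.125    # women, average year of birth
b_birthyear <- 0.943    # per 1 standard deviation of year of birth
b_male      <- -0.434   # men relative to women

# plogis() is the inverse logit: it turns log-odds into probabilities
p_female <- plogis(b_intercept)            # P('um') for women at BirthYear.z = 0
p_male   <- plogis(b_intercept + b_male)   # P('um') for men at BirthYear.z = 0

# exp() turns a log-odds difference into an odds ratio
or_birthyear <- exp(b_birthyear)
or_male      <- exp(b_male)

print(round(c(p_female = p_female, p_male = p_male,
              or_birthyear = or_birthyear, or_male = or_male), 2))
```

Under these assumptions, a woman with the average year of birth uses ‘um’ with a probability of about 0.53, and a man with about 0.42. Each standard deviation increase in year of birth multiplies the odds of ‘um’ by roughly 2.6.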
Whereas the graph above suggests an interaction between year of birth and gender, this interaction is not significant (p = 0.33), as the following model illustrates.
model2 = glmer(UM ~ BirthYear.z * Gender +
(1|Speaker) + (0+BirthYear.z|Interview) +
(1+Gender|Interview), family='binomial', data=spk)
summary(model2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial ( logit )
## Formula: UM ~ BirthYear.z * Gender + (1 | Speaker) + (0 + BirthYear.z |
##     Interview) + (1 + Gender | Interview)
##    Data: spk
##
##      AIC      BIC   logLik deviance df.resid
##    17300    17369    -8641    17282    16212
##
## Scaled residuals:
##    Min     1Q Median     3Q    Max
## -3.686 -0.649 -0.270  0.696  5.967
##
## Random effects:
##  Groups      Name        Variance Std.Dev. Corr
##  Speaker     (Intercept) 0.9151   0.957
##  Interview   BirthYear.z 0.0212   0.145
##  Interview.1 (Intercept) 0.0716   0.268
##              GenderM     0.4235   0.651    -0.80
## Number of obs: 16221, groups: Speaker, 238; Interview, 234
##
## Fixed effects:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)            0.119      0.108    1.10   0.2719
## BirthYear.z            1.028      0.120    8.59   <2e-16 ***
## GenderM               -0.437      0.160   -2.72   0.0065 **
## BirthYear.z:GenderM   -0.160      0.163   -0.98   0.3256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
##             (Intr) BrthY. GendrM
## BirthYear.z -0.073
## GenderM     -0.687  0.044
## BrthYr.z:GM  0.056 -0.731  0.020
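The two models can also be compared directly. With both fitted models available, `anova(model1, model2)` performs a likelihood-ratio test. The sketch below only approximates that test from the rounded deviances printed in the two summaries, so the result is indicative at best and was not part of the original analysis.

```r
# Deviances copied from the two summaries (rounded to whole numbers there,
# so the difference below is only approximate)
dev_model1 <- 17283   # model without the interaction
dev_model2 <- 17282   # model with the interaction (one extra parameter)

# Likelihood-ratio statistic and its p-value from a chi-squared distribution
lr_stat <- dev_model1 - dev_model2
p_lrt   <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
print(round(p_lrt, 2))
```

The approximate p-value of about 0.32 is in line with the non-significant Wald test for the interaction term, and the slightly lower AIC of the simpler model (17299 versus 17300) points in the same direction.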
Acknowledgements
I thank Thomas Schmidt of the IDS Mannheim for helping me to obtain the data from the Datenbank für Gesprochenes Deutsch.
Replication of the analysis
To replicate the analysis presented above, you can copy the following lines into a recent version of R. You need the packages ‘lme4’, ‘ggplot2’ and ‘rmarkdown’. If these are not installed (in which case the library commands will throw an error), you can uncomment the first three lines (i.e. remove the leading hash sign) to install them.
#install.packages('lme4',repos='http://cran.us.r-project.org')
#install.packages('ggplot2',repos='http://cran.us.r-project.org')
#install.packages('rmarkdown',repos='http://cran.us.r-project.org')
download.file('http://www.let.rug.nl/wieling/ll/analysis-German.Rmd', 'analysis-German.Rmd')
library(rmarkdown)
render('analysis-German.Rmd') # generates html file with results
browseURL(paste('file://', file.path(getwd(),'analysis-German.html'), sep='')) # shows result
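As an alternative to uncommenting the install lines by hand, missing packages can be detected first. This is a minimal sketch; the helper name `missing_packages` is made up for this example and is not part of the original scripts.

```r
# Return the names of those packages in 'pkgs' that are not installed.
# 'missing_packages' is a hypothetical helper, not part of the original scripts.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

to_install <- missing_packages(c('lme4', 'ggplot2', 'rmarkdown'))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) {
#   install.packages(to_install, repos = 'http://cran.us.r-project.org')
# }
```

`requireNamespace` only checks whether a package can be loaded. It does not attach the package, so the `library` calls are still needed afterwards.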