Martijn Wieling (University of Groningen, the Netherlands)
Data and scripts for the Language Log guest post (Norwegian) by Martijn Wieling
## Generated on: October 08, 2014 - 11:38:59
To run the complete analysis yourself, please refer to the bottom of this page.
Preparation
The following lines load the required packages, download the required files, and load the data.
# load required packages
library(lme4)
library(ggplot2)
# version information
R.version.string
## [1] "R version 3.1.1 (2014-07-10)"
packageVersion('lme4')
## [1] '1.1.7'
packageVersion('ggplot2')
## [1] '1.0.0'
# download required files and scripts
download.file('http://www.let.rug.nl/wieling/ll/multiplot.R', 'multiplot.R')
download.file('http://www.let.rug.nl/wieling/ll/Norwegian-UH-UM.txt', 'Norwegian-UH-UM.txt')
# load custom plotting function
source('multiplot.R') # custom plotting function
# read data
spk = read.table('Norwegian-UH-UM.txt', sep='\t', header=TRUE, encoding='UTF-8')
Results
The column ‘UM’ contains a 1 if the hesitation marker contains an ‘m’ (generally ‘em’ or ‘m’) and a 0 if it does not (generally ‘e’).
table(spk$Form,spk$UM)
##
##           0     1
##   e   42285     0
##   E       5     0
##   em      0  2508
##   EM      0    26
##   h-e     1     0
##   m       0  3294
##   M       0     4
##   m-m     0    19
##   m_m     0    28
In the following, I will refer to the inclusion of ‘m’ in the hesitation marker as the ‘um’ form, and the absence of ‘m’ from the hesitation marker as the ‘uh’ form. The following table shows the distribution of ‘um’ versus ‘uh’ separated by age and gender:
dat = aggregate(spk$UM,by=list(spk$AgeGroup,spk$Gender),FUN=mean)
colnames(dat) = c('AgeGroup','Gender','RelFreqUM')
dat
##   AgeGroup Gender RelFreqUM
## 1      Old Female   0.10855
## 2    Young Female   0.19934
## 3      Old   Male   0.09572
## 4    Young   Male   0.14425
To generate the associated graph, we first have to calculate the 95% confidence intervals around the average values (per age group and gender) we want to visualize. The 95% confidence interval extends 1.96 standard errors above and below the average of each group. The standard error is calculated by dividing the standard deviation by the square root of the number of observations in the group. The calculation can thus be done as follows:
# calculation of standard deviation
tmp = aggregate(spk$UM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('AgeGroup','Gender','RelFreqUM.sd')
dat = merge(dat,tmp,by=c('AgeGroup','Gender'))
# calculation of the number of observations per group
spk$ones = 1
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('AgeGroup','Gender','RelFreqUM.N')
dat = merge(dat,tmp,by=c('AgeGroup','Gender'))
# storing the 95% confidence bands (1.96 standard errors above and below the mean)
dat$RelFreqUM.lower = dat$RelFreqUM - (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
dat$RelFreqUM.upper = dat$RelFreqUM + (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
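The steps above can be condensed into a single helper. The following is a minimal sketch that is not part of the original script: the function name `ci95` and the toy vector `um` are made up for illustration.

```r
# Hypothetical helper (not part of the original script): mean and 95%
# confidence interval of a numeric vector, using the normal approximation
ci95 = function(x) {
  n  = length(x)
  m  = mean(x)
  se = sd(x) / sqrt(n)   # standard error of the mean
  c(mean = m, lower = m - 1.96 * se, upper = m + 1.96 * se, n = n)
}

# Toy data (made up): 100 hesitation markers, 20 of which contain an 'm'
um = c(rep(1, 20), rep(0, 80))
res = ci95(um)
print(round(res, 3))
```

Note that intervals computed this way treat all observations as independent. They do not take into account that many observations come from the same speaker, which is what the random intercept in the models below is for.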
The following command draws the graph, including the confidence intervals:
ggplot(data = dat, aes(x = AgeGroup, y = RelFreqUM, colour = Gender)) +
  geom_line(aes(group = Gender)) + theme_bw() + xlab('Age group') + ylab(" ") +
  # width is a fixed setting rather than a data mapping, so it goes outside aes()
  geom_errorbar(aes(ymin = RelFreqUM.lower, ymax = RelFreqUM.upper), width = .1) +
  ggtitle("Relative frequency of 'um' (556 speakers)")
The plot shows that women and younger speakers use ‘um’ relatively more frequently than men and older speakers. The mixed-effects logistic regression model supports this pattern:
model1 = glmer(UM ~ AgeGroup + Gender + (1|Speaker), family='binomial', data=spk)
summary(model1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: UM ~ AgeGroup + Gender + (1 | Speaker)
## Data: spk
##
## AIC BIC logLik deviance df.resid
## 32530 32565 -16261 32522 48166
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.070 -0.384 -0.287 -0.186 7.948
##
## Random effects:
## Groups Name Variance Std.Dev.
## Speaker (Intercept) 0.866 0.931
## Number of obs: 48170, groups: Speaker, 556
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4238 0.0781 -31.02 < 2e-16 ***
## AgeGroupYoung 0.8454 0.0916 9.22 < 2e-16 ***
## GenderMale -0.2978 0.0883 -3.37 0.00075 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) AgGrpY
## AgeGroupYng -0.543
## GenderMale -0.634 0.065
The first line of the fixed effects description (the part we are focusing on here) shows the intercept, which represents the reference group: old female speakers. As its negative estimate differs significantly from 0, old female speakers use ‘uh’ more frequently than ‘um’. The positive estimate for the young age group indicates that younger speakers are significantly more likely than older speakers to use ‘um’ as opposed to ‘uh’. Similarly, the negative estimate for the male gender indicates that men are significantly less likely than women to use ‘um’ as opposed to ‘uh’.
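The estimates are on the log-odds scale, which is hard to read directly. As a minimal sketch (not part of the original script), the values from the summary of model1 can be converted to odds ratios with `exp()` and to probabilities with `plogis()`. The vector `b` and the other variable names are chosen here for illustration.

```r
# Fixed-effect estimates copied from the summary of model1 (log-odds scale)
b = c(intercept = -2.4238, young = 0.8454, male = -0.2978)

# Odds ratios: multiplicative change in the odds of 'um'
odds_ratios = exp(b[c('young', 'male')])

# Probability of 'um' for a speaker with a random intercept of 0
p_old_female   = plogis(b[['intercept']])
p_young_female = plogis(b[['intercept']] + b[['young']])
p_old_male     = plogis(b[['intercept']] + b[['male']])

print(round(odds_ratios, 2))
print(round(c(old_female = p_old_female,
              young_female = p_young_female,
              old_male = p_old_male), 3))
```

These probabilities describe a typical speaker (random intercept of 0), so they need not match the raw group proportions in the table of relative frequencies.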
Since the dataset also contains the recording year of the speaker, we can test if this influences the results as well. Note that the age group of the speaker is relative to the recording year (e.g., a person in the young age group recorded in 1960 is an old person now). Most people were recorded recently, however:
table(spk$RecordingYear)
##
##  1951  1956  1958  1959  1960  1962  1963  1964  1965  1967  1968  1969
##   138   263   159   162   128    29   119     8    88   452  1374   987
##  1970  1971  1972  1973  1974  1975  1976  1978  1979  1980  1984  2006
##   220   268   185   799   509   141   194   198  1234    22    21   951
##  2007  2008  2009  2010  2011  2012
##  5143 15555 10245  7580   825   156
hist(spk$RecordingYear, main='', xlab = 'Year of recording')
The following model assesses the linear influence of year of recording (in addition to the effects of age and gender):
# RecordingYear was z-transformed to prevent a warning during fitting of the model
model2 = glmer(UM ~ AgeGroup + Gender + RecordingYear.z + (1|Speaker), family='binomial', data=spk)
summary(model2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: UM ~ AgeGroup + Gender + RecordingYear.z + (1 | Speaker)
## Data: spk
##
## AIC BIC logLik deviance df.resid
## 32494 32538 -16242 32484 48148
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.066 -0.385 -0.285 -0.180 8.197
##
## Random effects:
## Groups Name Variance Std.Dev.
## Speaker (Intercept) 0.828 0.91
## Number of obs: 48153, groups: Speaker, 555
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4382 0.0773 -31.52 < 2e-16 ***
## AgeGroupYoung 0.6498 0.0952 6.83 8.7e-12 ***
## GenderMale -0.2197 0.0883 -2.49 0.013 *
## RecordingYear.z 0.3479 0.0586 5.94 2.9e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) AgGrpY GndrMl
## AgeGroupYng -0.489
## GenderMale -0.638 0.012
## RecrdngYr.z -0.081 -0.318 0.154
The results are similar to those of the previous model, with the addition of a significant positive effect of the year of recording. This effect indicates that people who were recorded later (regardless of age and gender) show a greater preference for ‘um’ than those recorded earlier. Note, however, that ‘uh’ is still the dominant form for all groups.
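The column `RecordingYear.z` is used in the model without its construction being shown; presumably it is already present in the data file. The following is a minimal sketch of how a z-transformed column is commonly derived, using a made-up vector `years` rather than the actual data.

```r
# Toy recording years (made up for illustration), including one missing value
years = c(1951, 1968, 1979, 2006, 2008, 2008, 2010, NA)

# z-transformation: subtract the mean and divide by the standard deviation;
# missing values are ignored in the computation and stay NA
years.z = (years - mean(years, na.rm = TRUE)) / sd(years, na.rm = TRUE)

# scale() yields the same values, returned as a one-column matrix
years.z2 = as.numeric(scale(years))

print(round(years.z, 2))
```

If the column was computed this way, the estimate for `RecordingYear.z` is the change in log-odds per standard deviation of recording year. Rows with a missing recording year are left out when the model is fitted: the counts in the table of recording years sum to 48,153, which matches the number of observations reported for this model (17 fewer than in the first model).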
When the data are analyzed by treating the recording year as a factor (recorded before 1985 vs. recorded after 2005), the results show the same pattern (the people recorded recently use ‘um’ relatively more frequently than those recorded many years ago):
# dichotomize the recording year: TRUE if recorded after 2005, FALSE otherwise
spk$RecentRecording = (spk$RecordingYear > 2005)
model3 = glmer(UM ~ AgeGroup + Gender + RecentRecording + (1|Speaker), family='binomial', data=spk)
summary(model3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: UM ~ AgeGroup + Gender + RecentRecording + (1 | Speaker)
## Data: spk
##
## AIC BIC logLik deviance df.resid
## 32494 32538 -16242 32484 48148
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.067 -0.386 -0.285 -0.180 7.898
##
## Random effects:
## Groups Name Variance Std.Dev.
## Speaker (Intercept) 0.825 0.908
## Number of obs: 48153, groups: Speaker, 555
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0711 0.1360 -22.59 < 2e-16 ***
## AgeGroupYoung 0.6462 0.0952 6.79 1.1e-11 ***
## GenderMale -0.2201 0.0881 -2.50 0.012 *
## RecentRecordingTRUE 0.8161 0.1369 5.96 2.5e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) AgGrpY GndrMl
## AgeGroupYng -0.023
## GenderMale -0.480 0.011
## RcntRcrTRUE -0.825 -0.325 0.152
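To see what this model implies on the probability scale, the fixed-effect estimates can be combined for every group. The following is a minimal sketch that is not part of the original script; the names `b` and `grid` are chosen here for illustration, and the predictions apply to a speaker with a random intercept of 0.

```r
# Fixed-effect estimates copied from the summary of model3 (log-odds scale)
b = c(intercept = -3.0711, young = 0.6462, male = -0.2201, recent = 0.8161)

# All eight combinations of the three binary predictors (1 = yes, 0 = no)
grid = expand.grid(young = c(0, 1), male = c(0, 1), recent = c(0, 1))

# Linear predictor and the corresponding probability of 'um'
grid$logit = b[['intercept']] + b[['young']] * grid$young +
  b[['male']] * grid$male + b[['recent']] * grid$recent
grid$p_um = plogis(grid$logit)

print(round(grid, 3))
```

The highest predicted probability (young women recorded after 2005) stays well below 0.5, in line with ‘uh’ being the dominant form in every group.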
Original data source
The original data was obtained from the Nordic Dialect Corpus and Syntax Database.
Replication of the analysis
To replicate the analysis presented above, copy the following lines into the most recent version of R. You need the packages ‘lme4’, ‘ggplot2’ and ‘rmarkdown’. If these are not installed (in which case the library commands will throw an error), uncomment the first three lines (i.e. remove the leading hash sign) to install them.
#install.packages('lme4',repos='http://cran.us.r-project.org')
#install.packages('ggplot2',repos='http://cran.us.r-project.org')
#install.packages('rmarkdown',repos='http://cran.us.r-project.org')
download.file('http://www.let.rug.nl/wieling/ll/analysis-Norwegian.Rmd', 'analysis-Norwegian.Rmd')
library(rmarkdown)
render('analysis-Norwegian.Rmd') # generates html file with results
browseURL(paste('file://', file.path(getwd(),'analysis-Norwegian.html'), sep='')) # shows result
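As an alternative to uncommenting the install lines by hand, one can first check which packages are missing. The following is a minimal sketch; the helper `missing_packages` is made up for illustration and is not part of the original script.

```r
# Hypothetical helper (not part of the original script): returns the names of
# the packages in 'pkgs' that are not installed
missing_packages = function(pkgs) {
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed = c('lme4', 'ggplot2', 'rmarkdown')
to_install = missing_packages(needed)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```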