UH/UM in Norwegian

Martijn Wieling (University of Groningen, the Netherlands)

Data and scripts for the Language Log guest post (Norwegian) by Martijn Wieling

## Generated on: October 08, 2014 - 11:38:59

To run the complete analysis yourself, please refer to the bottom of this page.


Preparation

The following lines load the required packages, download the required files, and read the data.

# load required packages
library(lme4)
library(ggplot2)

# version information
R.version.string
## [1] "R version 3.1.1 (2014-07-10)"
packageVersion('lme4')
## [1] '1.1.7'
packageVersion('ggplot2')
## [1] '1.0.0'
# download required files and scripts
download.file('http://www.let.rug.nl/wieling/ll/multiplot.R', 'multiplot.R')
download.file('http://www.let.rug.nl/wieling/ll/Norwegian-UH-UM.txt', 'Norwegian-UH-UM.txt')

# load custom plotting function
source('multiplot.R') # custom plotting function

# read data
spk = read.table('Norwegian-UH-UM.txt', sep='\t', header=TRUE, encoding='UTF-8')


Results

The column ‘UM’ contains a 1 if the hesitation marker contains an ‘m’ (generally ‘em’ or ‘m’) and a 0 if it does not (generally ‘e’).

table(spk$Form,spk$UM)
##      
##           0     1
##   e   42285     0
##   E       5     0
##   em      0  2508
##   EM      0    26
##   h-e     1     0
##   m       0  3294
##   M       0     4
##   m-m     0    19
##   m_m     0    28
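
The coding rule described above can be sketched in a few lines. This is only an illustration on a hypothetical vector of forms (`forms`), assuming that the ‘UM’ column was derived by checking for the letter ‘m’ regardless of case. The data file already contains the column, so this step is not needed to reproduce the analysis.

```r
# hypothetical example: the nine forms that occur in the table above
forms = c('e', 'E', 'em', 'EM', 'h-e', 'm', 'M', 'm-m', 'm_m')

# 1 if the form contains an 'm' (case-insensitive), 0 otherwise
um = as.integer(grepl('m', forms, ignore.case = TRUE))

# show the forms and their coding side by side
data.frame(Form = forms, UM = um)
```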

In the following, I will refer to hesitation markers that contain an ‘m’ as the ‘um’ form, and to those without an ‘m’ as the ‘uh’ form. The following table shows the relative frequency of ‘um’ by age group and gender:

dat = aggregate(spk$UM,by=list(spk$AgeGroup,spk$Gender),FUN=mean)
colnames(dat) = c('AgeGroup','Gender','RelFreqUM')
dat
##   AgeGroup Gender RelFreqUM
## 1      Old Female   0.10855
## 2    Young Female   0.19934
## 3      Old   Male   0.09572
## 4    Young   Male   0.14425

To generate the associated graph, we first have to calculate the 95% confidence intervals around the average values (per age group and gender) that we want to visualize. The 95% confidence interval extends 1.96 standard errors above and below the average of each group. The standard error is calculated by dividing the standard deviation by the square root of the number of observations in the group. The calculation can thus be done as follows:

# calculation of standard deviation
tmp = aggregate(spk$UM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('AgeGroup','Gender','RelFreqUM.sd')
dat = merge(dat,tmp,by=c('AgeGroup','Gender'))

# calculation of the number of observations per group
spk$ones = 1
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('AgeGroup','Gender','RelFreqUM.N')
dat = merge(dat,tmp,by=c('AgeGroup','Gender'))

# storing the 95% confidence bands (1.96 standard errors above and below the mean)
dat$RelFreqUM.lower = dat$RelFreqUM - (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
dat$RelFreqUM.upper = dat$RelFreqUM + (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
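
As a cross-check, the same interval can be computed with a small helper function. The sketch below uses a made-up 0/1 vector rather than the real data, and the function name `ci95` is hypothetical. It also shows where the value 1.96 comes from.

```r
# hypothetical helper: mean and normal-approximation 95% confidence interval
ci95 = function(x) {
  m  = mean(x)
  se = sd(x) / sqrt(length(x))   # standard error of the mean
  z  = qnorm(0.975)              # approximately 1.96
  c(mean = m, lower = m - z * se, upper = m + z * se)
}

# made-up binary observations (1 = 'um', 0 = 'uh'), for illustration only
x = c(rep(1, 20), rep(0, 80))
ci95(x)
```

Note that this calculation treats all observations as independent. Since each speaker contributes many observations, the resulting intervals may be somewhat too narrow; the mixed-effects model further down takes the speaker into account.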

The following command draws the graph, including the confidence intervals:

ggplot(data = dat, aes(x = AgeGroup, y = RelFreqUM, colour = Gender)) + 
geom_line(aes(group = Gender)) + theme_bw() + xlab('Age group') + ylab(" ") + 
geom_errorbar(aes(ymin=RelFreqUM.lower, ymax=RelFreqUM.upper), width=.1) +
ggtitle("Relative frequency of 'um' (556 speakers)")

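[Figure: Relative frequency of 'um' (556 speakers) by age group, with one line per gender and 95% confidence intervals]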

The plot shows that women and younger speakers use ‘um’ relatively more frequently than men and older speakers. A mixed-effects logistic regression model with a random intercept for speaker supports this pattern:

model1 = glmer(UM ~ AgeGroup + Gender + (1|Speaker), family='binomial', data=spk) 
summary(model1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: UM ~ AgeGroup + Gender + (1 | Speaker)
##    Data: spk
## 
##      AIC      BIC   logLik deviance df.resid 
##    32530    32565   -16261    32522    48166 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -1.070 -0.384 -0.287 -0.186  7.948 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Speaker (Intercept) 0.866    0.931   
## Number of obs: 48170, groups:  Speaker, 556
## 
## Fixed effects:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.4238     0.0781  -31.02  < 2e-16 ***
## AgeGroupYoung   0.8454     0.0916    9.22  < 2e-16 ***
## GenderMale     -0.2978     0.0883   -3.37  0.00075 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) AgGrpY
## AgeGroupYng -0.543       
## GenderMale  -0.634  0.065

The first line of the fixed effects table (the part we focus on here) shows the intercept, which represents the reference group: old female speakers. Because its estimate is negative and differs significantly from 0, old female speakers use ‘uh’ more frequently than ‘um’. The positive estimate for the young age group indicates that younger speakers are significantly more likely than older speakers to use ‘um’ rather than ‘uh’. Similarly, the negative estimate for the male gender indicates that men are significantly less likely than women to use ‘um’ rather than ‘uh’.
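
Because the estimates are on the log-odds (logit) scale, it can help to convert them to probabilities and odds ratios. The sketch below does this with the rounded estimates from the summary above, typed in by hand; the variable names are hypothetical. The resulting probabilities apply to a speaker with a random intercept of 0 (a ‘typical’ speaker) and are therefore not identical to the raw group proportions shown earlier.

```r
# fixed-effect estimates copied from the model1 summary above (log-odds scale)
b0      = -2.4238   # intercept: old female speakers
b.young =  0.8454   # AgeGroupYoung
b.male  = -0.2978   # GenderMale

# predicted probability of 'um' per group (for a speaker with random intercept 0)
probs = c(OldFemale   = plogis(b0),
          YoungFemale = plogis(b0 + b.young),
          OldMale     = plogis(b0 + b.male),
          YoungMale   = plogis(b0 + b.young + b.male))
round(probs, 3)

# odds ratios: multiplicative change in the odds of 'um'
round(exp(c(Young = b.young, Male = b.male)), 2)
```

An odds ratio above 1 means higher odds of ‘um’ relative to the reference group; an odds ratio below 1 means lower odds.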

Since the dataset also contains the year in which each speaker was recorded, we can test whether this influences the results as well. Note that the age group of a speaker is defined relative to the recording year (e.g., a person in the young age group recorded in 1960 is an old person now). As the following table of observations per recording year shows, however, most of the data come from recent recordings:

table(spk$RecordingYear)
## 
##  1951  1956  1958  1959  1960  1962  1963  1964  1965  1967  1968  1969 
##   138   263   159   162   128    29   119     8    88   452  1374   987 
##  1970  1971  1972  1973  1974  1975  1976  1978  1979  1980  1984  2006 
##   220   268   185   799   509   141   194   198  1234    22    21   951 
##  2007  2008  2009  2010  2011  2012 
##  5143 15555 10245  7580   825   156
hist(spk$RecordingYear, main='', xlab = 'Year of recording')

[Figure: histogram of the number of observations per year of recording]

The following model assesses the linear influence of year of recording (in addition to the effects of age and gender):

# RecordingYear was z-transformed to prevent a warning during fitting of the model
model2 = glmer(UM ~ AgeGroup + Gender + RecordingYear.z + (1|Speaker), family='binomial', data=spk) 
summary(model2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: UM ~ AgeGroup + Gender + RecordingYear.z + (1 | Speaker)
##    Data: spk
## 
##      AIC      BIC   logLik deviance df.resid 
##    32494    32538   -16242    32484    48148 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -1.066 -0.385 -0.285 -0.180  8.197 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Speaker (Intercept) 0.828    0.91    
## Number of obs: 48153, groups:  Speaker, 555
## 
## Fixed effects:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.4382     0.0773  -31.52  < 2e-16 ***
## AgeGroupYoung     0.6498     0.0952    6.83  8.7e-12 ***
## GenderMale       -0.2197     0.0883   -2.49    0.013 *  
## RecordingYear.z   0.3479     0.0586    5.94  2.9e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) AgGrpY GndrMl
## AgeGroupYng -0.489              
## GenderMale  -0.638  0.012       
## RecrdngYr.z -0.081 -0.318  0.154

The results are similar to those of the previous model, with the addition of a significant positive effect of the year of recording. This effect indicates that people who were recorded later (regardless of age and gender) show a greater preference for ‘um’ than those recorded earlier. Note, however, that ‘uh’ is still the dominant form for all groups.
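
The code that created `RecordingYear.z` is not shown on this page, and the column may already be part of the data file. A common way to z-transform a variable in R is `scale()`, so the sketch below assumes that something along these lines was used; the example years are made up.

```r
# made-up recording years, for illustration only
year = c(1951, 1968, 1979, 2007, 2008, 2009, 2010)

# z-transformation: subtract the mean and divide by the standard deviation
year.z = as.numeric(scale(year))

# equivalent manual computation
year.z.manual = (year - mean(year)) / sd(year)

round(year.z, 2)
```

After this transformation, the coefficient of the predictor expresses the change in log-odds per standard deviation of the recording year rather than per calendar year.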

When the data are analyzed by treating the recording year as a factor (recorded before 1985 vs. recorded after 2005), the results show the same pattern (the people recorded recently use ‘um’ relatively more frequently than those recorded many years ago):

# dichotomize the recording year: TRUE if recorded after 2005, FALSE otherwise
spk$RecentRecording = (spk$RecordingYear > 2005)
model3 = glmer(UM ~ AgeGroup + Gender + RecentRecording + (1|Speaker), family='binomial', data=spk) 
summary(model3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: UM ~ AgeGroup + Gender + RecentRecording + (1 | Speaker)
##    Data: spk
## 
##      AIC      BIC   logLik deviance df.resid 
##    32494    32538   -16242    32484    48148 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -1.067 -0.386 -0.285 -0.180  7.898 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Speaker (Intercept) 0.825    0.908   
## Number of obs: 48153, groups:  Speaker, 555
## 
## Fixed effects:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.0711     0.1360  -22.59  < 2e-16 ***
## AgeGroupYoung         0.6462     0.0952    6.79  1.1e-11 ***
## GenderMale           -0.2201     0.0881   -2.50    0.012 *  
## RecentRecordingTRUE   0.8161     0.1369    5.96  2.5e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) AgGrpY GndrMl
## AgeGroupYng -0.023              
## GenderMale  -0.480  0.011       
## RcntRcrTRUE -0.825 -0.325  0.152


Original data source

The original data was obtained from the Nordic Dialect Corpus and Syntax Database.


Replication of the analysis

To replicate the analysis presented above, copy the following lines into a recent version of R. You need the packages ‘lme4’, ‘ggplot2’ and ‘rmarkdown’. If these are not installed (in which case the library commands will throw an error), uncomment the first three lines (i.e. remove the hash sign) to install them.

#install.packages('lme4',repos='http://cran.us.r-project.org')
#install.packages('ggplot2',repos='http://cran.us.r-project.org')
#install.packages('rmarkdown',repos='http://cran.us.r-project.org')
download.file('http://www.let.rug.nl/wieling/ll/analysis-Norwegian.Rmd', 'analysis-Norwegian.Rmd')
library(rmarkdown)
render('analysis-Norwegian.Rmd') # generates html file with results
browseURL(paste('file://', file.path(getwd(),'analysis-Norwegian.html'), sep='')) # shows result
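
Instead of uncommenting the install lines by hand, missing packages can also be detected programmatically. A minimal sketch, assuming the same three packages and the same repository as above; the function name `missing_packages` is hypothetical, and the actual installation call is left commented out so that nothing is installed unintentionally.

```r
# hypothetical helper: return those packages that are not yet installed
missing_packages = function(pkgs) {
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# packages required for the replication that still need to be installed
needed = missing_packages(c('lme4', 'ggplot2', 'rmarkdown'))
needed

# uncomment to install whatever is missing
# if (length(needed) > 0) install.packages(needed, repos = 'http://cran.us.r-project.org')
```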