Generalized additive modeling and dialectology

Lecture 5 of advanced regression for linguists

Martijn Wieling
Department of Information Science

This lecture

Introduction
- Logistic regression
- Standard Italian and Tuscan dialects
Material: Standard Italian and Tuscan dialects
Methods: R code
Results
Discussion

Logistic regression

Dependent variable is binary (1: success, 0: failure), not continuous
Transform to continuous variable via log odds: \(\log(\frac{p}{1-p})\) = logit\((p)\)
- Done automatically in regression by setting family="binomial"
- Transformation of dependent variable: generalized regression model
Interpret coefficients as logits (log odds of success); in R, plogis(x) converts a logit x back to a probability
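The logit transformation and its inverse can be sketched in base R as follows. The probabilities are arbitrary illustrative values, not taken from the Tuscan data; qlogis() and plogis() are the base R logit and inverse-logit functions.

```r
# Illustrative probabilities (assumed values, not from the Tuscan data)
p <- c(0.1, 0.5, 0.9)

# Log odds computed by hand and with the built-in qlogis()
logit_manual <- log(p / (1 - p))
logit_builtin <- qlogis(p)

# plogis() is the inverse: it maps logits back to probabilities
p_back <- plogis(logit_builtin)

print(round(logit_manual, 3))
print(all.equal(logit_manual, logit_builtin))
print(all.equal(p_back, p))
```

A logit of 0 corresponds to a probability of 0.5; negative logits correspond to probabilities below 0.5.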

Standard Italian and Tuscan dialects

Standard Italian originated in the 14th century as a written language
It originated from the prestigious Florentine variety
The spoken standard Italian language was adopted in the 20th century
- People used to speak in their local dialect
In this study, we investigate the relationship between standard Italian and Tuscan dialects
- We focus on lexical variation
- We attempt to identify which social, geographical and lexical variables influence this relationship

Material: lexical data

We used lexical data from the Atlante Lessicale Toscano (ALT)
- We focus on 2060 speakers from 213 locations and 170 concepts
- Total number of cases: 384454
- For every case, we identified if the lexical form was different from standard Italian (1) or the same (0)
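The binary coding of the dependent variable might look roughly like the sketch below. The data frame, its column names (Concept, Form, StandardForm) and the form labels are hypothetical placeholders, not ALT data; only the name NotStd matches the variable used in the models later on.

```r
# Hypothetical toy records: the form labels are placeholders, not ALT data
cases <- data.frame(
  Concept      = c("c1", "c1", "c2", "c2"),
  Form         = c("formA", "formB", "formC", "formC"),
  StandardForm = c("formA", "formA", "formC", "formC"),
  stringsAsFactors = FALSE
)

# 1 = differs from standard Italian, 0 = same as standard Italian
cases$NotStd <- as.integer(cases$Form != cases$StandardForm)
print(cases)
```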

Geographic distribution of locations

Material: additional data

In addition, we obtained the following information:
- Speaker age
- Speaker gender
- Speaker education level
- Speaker employment history
- Number of inhabitants in each location
- Average income in each location
- Average age in each location
- Frequency of each concept

Modeling geography's influence with a GAM

(R version 3.2.3 (2015-12-10), `mgcv` version 1.8.11)

library(mgcv)  # provides bam() and vis.gam()
geo <- bam(NotStd ~ s(Longitude, Latitude), data = tuscan, family = "binomial", discrete = TRUE)
vis.gam(geo, view = c("Longitude", "Latitude"), plot.type = "contour", color = "terrain", too.far = 0.045, 
    main = "")

Interpreting logit coefficients

summary(geo)$p.table

#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   -0.247     0.0033   -75.1        0

plogis(coef(geo)["(Intercept)"])

# (Intercept) 
#       0.438

On average, there is a 43.8% chance of observing a non-standard form
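Why a back-transformed intercept reads as an average probability can be illustrated with a plain intercept-only logistic regression on simulated data. This is a simplified sketch using glm() rather than the lecture's bam() model; the simulated outcome y and the rate of 0.438 used to generate it are assumptions for illustration only.

```r
set.seed(1)
# Simulated binary outcome (assumption: true rate chosen to echo the slide)
y <- rbinom(5000, size = 1, prob = 0.438)

# Intercept-only logistic regression
fit <- glm(y ~ 1, family = "binomial")

# The intercept is a logit; plogis() converts it to a probability
p_hat <- plogis(coef(fit)[["(Intercept)"]])
print(c(intercept_probability = p_hat, observed_proportion = mean(y)))
```

In an intercept-only model, the back-transformed intercept coincides with the observed proportion of successes.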

Varying geography's influence based on concept freq.

Wieling, Nerbonne and Baayen (2011, PLOS ONE) showed that the effect of word frequency varied depending on geography
Here we explicitly include this in the GAM with te(), which can model an \(N\)-way non-linear interaction:
te(Longitude, Latitude, ConceptFreq, d=c(2,1))
As this pattern may be presumed to differ depending on speaker age, we can integrate this in the model as well:
te(Longitude, Latitude, ConceptFreq, YearBirth, d=c(2,1,1))
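How the d argument groups variables into marginal smooths can be sketched on simulated data. This sketch assumes the mgcv package is installed; the data frame sim, its simulated columns, the explicit bs choice and the small basis sizes in k are invented for illustration and kept small for speed, and gam() stands in for the lecture's bam(). With d = c(2, 1), the first two variables share one two-dimensional marginal and the third gets its own one-dimensional marginal.

```r
library(mgcv)

set.seed(42)
n <- 1000
# Simulated stand-ins for the real predictors (not the Tuscan data)
sim <- data.frame(
  Longitude   = runif(n, 10, 12),
  Latitude    = runif(n, 42, 44),
  ConceptFreq = rnorm(n)
)
# Assumed true pattern: the frequency effect changes with longitude
eta <- -0.3 + (sim$Longitude - 11) * sim$ConceptFreq
sim$NotStd <- rbinom(n, size = 1, prob = plogis(eta))

# d = c(2, 1): (Longitude, Latitude) form one 2-D marginal, ConceptFreq a 1-D one
m_te <- gam(NotStd ~ te(Longitude, Latitude, ConceptFreq, d = c(2, 1),
                        bs = c("tp", "cr"), k = c(10, 4)),
            data = sim, family = "binomial", method = "REML")

print(length(m_te$smooth[[1]]$margin))  # number of marginal smooths
print(m_te$smooth[[1]]$dim)             # number of covariates in the term
```

Treating longitude and latitude as one marginal reflects that they are on the same scale, whereas frequency is measured in different units and therefore gets a separate marginal.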

Full model specification

system.time(m <- bam(NotStd ~ te(Longitude, Latitude, ConceptFreq.log.z, SpeakerBirthYear.z, d = c(2, 1, 1)) +
    CommunitySize.log.z + SpeakerJob_Farmer + SpeakerEduLevel.log.z + SpeakerIsMale +
    s(Speaker, bs = "re") + s(Location, bs = "re") + s(Concept, bs = "re") +
    s(Concept, CommunityRecordingYear.z, bs = "re") +
    s(Concept, CommunitySize.log.z, bs = "re") +
    s(Concept, CommunityAvgIncome.log.z, bs = "re") +
    s(Concept, CommunityAvgAge.log.z, bs = "re") +
    s(Concept, SpeakerJob_Farmer, bs = "re") +
    s(Concept, SpeakerJob_Executive_AuxiliaryWorker, bs = "re") +
    s(Concept, SpeakerEduLevel.log.z, bs = "re") +
    s(Concept, SpeakerIsMale, bs = "re"),
    data = tuscan, family = "binomial", discrete = TRUE, nthreads = 4))

#    user  system elapsed 
#    5120      35    1815

system.time(smry <- summary(m))

#    user  system elapsed 
# 5640.78    6.13 5664.91

The results will be discussed next... (Wieling et al., 2014, Language)

Results: fixed effects and smooths

smry$p.table

#                       Estimate Std. Error z value Pr(>|z|)
# (Intercept)            -0.4372     0.1265   -3.46 5.48e-04
# CommunitySize.log.z    -0.0635     0.0225   -2.82 4.80e-03
# SpeakerJob_Farmer       0.0448     0.0168    2.66 7.78e-03
# SpeakerEduLevel.log.z  -0.0675     0.0126   -5.37 8.04e-08
# SpeakerIsMale           0.0380     0.0128    2.97 2.98e-03

head(smry$s.table, 1)

#                                                             edf Ref.df Chi.sq p-value
# te(SpeakerBirthYear.z,ConceptFreq.log.z,Longitude,Latitude) 225    270   3286       0

Interpreting logit coefficients II

# chance for a male farmer in a very small village (z-scored population
# size = -2) for which the location is unknown, with a very low education
# level (z-score = -2), to use a non-standard lexical form
(logit = coef(m)["(Intercept)"] + 
    coef(m)["SpeakerIsMale"] + 
    coef(m)["SpeakerJob_Farmer"] + 
    -2 * coef(m)["CommunitySize.log.z"] + 
    -2 * coef(m)["SpeakerEduLevel.log.z"])

# (Intercept) 
#     -0.0923

plogis(logit)  # was: 0.438 (43.8%)

# (Intercept) 
#       0.477
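The same calculation can be checked by hand without the fitted model, using only the rounded estimates from the fixed-effects table. The vectors est and x below are helper names introduced for this sketch; because the estimates are rounded, the result matches the model-based values only up to rounding.

```r
# Rounded estimates copied from the fixed-effects table (smry$p.table)
est <- c(
  "(Intercept)"           = -0.4372,
  "CommunitySize.log.z"   = -0.0635,
  "SpeakerJob_Farmer"     =  0.0448,
  "SpeakerEduLevel.log.z" = -0.0675,
  "SpeakerIsMale"         =  0.0380
)

# Predictor values for the profile: male farmer, z-scores of -2
x <- c(
  "(Intercept)"           =  1,
  "CommunitySize.log.z"   = -2,
  "SpeakerJob_Farmer"     =  1,
  "SpeakerEduLevel.log.z" = -2,
  "SpeakerIsMale"         =  1
)

logit_hat <- sum(est * x[names(est)])  # linear predictor on the logit scale
prob_hat  <- plogis(logit_hat)         # back-transform to a probability
print(round(c(logit = logit_hat, probability = prob_hat), 3))
```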

A complex geographical pattern

Animation: increasing frequency for older speakers

Animation: increasing frequency for younger speakers

Results: random effects

tail(smry$s.table, 11)

#                                                   edf Ref.df   Chi.sq   p-value
# s(Speaker)                                       90.2   2005     97.7  1.45e-02
# s(Location)                                     175.2    209   5675.4  1.91e-93
# s(Concept)                                      167.0    168 437792.1  0.00e+00
# s(CommunityRecordingYear.z,Concept)             158.9    170 157471.2 4.46e-184
# s(CommunitySize.log.z,Concept)                  149.9    169  29933.0 1.70e-111
# s(CommunityAvgIncome.log.z,Concept)             158.1    170 143338.9 2.12e-160
# s(CommunityAvgAge.log.z,Concept)                154.4    170 110802.6 5.39e-196
# s(SpeakerJob_Farmer,Concept)                     85.9    169  26191.6  1.75e-07
# s(SpeakerJob_Executive_AuxiliaryWorker,Concept)  53.3    170   3325.5  6.77e-04
# s(SpeakerEduLevel.log.z,Concept)                139.1    169   9421.6  4.67e-49
# s(SpeakerIsMale,Concept)                         85.4    169 111227.6  1.00e-10

By-concept random slopes for community size

By-concept random slopes for speaker education level

Discussion

Using a generalized additive mixed-effects regression model (GAM) to investigate lexical differences between standard Italian and Tuscan dialects revealed interesting dialectal patterns
- GAMs are very suitable to model the non-linear influence of geography
- The regression approach allowed for the simultaneous identification of important social, geographical and lexical predictors
- By including many concepts, results are less subjective than traditional analyses focusing on only a few pre-selected concepts
- The mixed-effects regression approach still allows a focus on individual concepts
- More interested in Tuscan data and analysis? Paper package with all data and analyses available via http://www.martijnwieling.nl

Recap

We have applied GAMs to dialectometry data and learned how to:
- use s() to model two-dimensional interactions on the same scale
- model complex non-linear interactions using te()
- use GAMs to conduct logistic regression (family="binomial")
After the break:
- http://www.let.rug.nl/wieling/statscourse/lecture5/lab
  - We use a subset of Dutch dialect data in the lab (faster: no logistic regression)
  - Similar underlying idea: investigate the effect of geography, word frequency, and location characteristics on pronunciation distances from standard Dutch
Finally: please evaluate the course:
http://www.let.rug.nl/wieling/statscourse/evaluation

Questions?

Thank you for your attention!

http://www.martijnwieling.nl
wieling@gmail.com