Generalized additive modeling and dialectology

Lecture 5 of advanced regression for linguists

Martijn Wieling
Computational Linguistics Research Group

This lecture

Introduction
- Logistic regression (recap)
- Standard Italian and Tuscan dialects
Material: Standard Italian and Tuscan dialects
Methods: R code
Results
Discussion

Question 1

Logistic regression (recap)

Dependent variable is binary (1: success, 0: failure), not continuous
Transform to continuous variable via log odds: \(\log\left(\frac{p}{1-p}\right) = \operatorname{logit}(p)\)
- Done automatically in regression by setting family="binomial"
Generalized linear model: specific link function and error distribution
Coefficients are interpreted as logits (log odds of success); in R, plogis(x) converts a logit x back to a probability
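
As a small illustration of the logit transformation, here is a sketch in base R. The probabilities are made-up values, not taken from the Tuscan data. qlogis() maps probabilities to log odds and plogis() maps them back:

```r
# Hypothetical success probabilities (illustrative values only)
p <- c(0.1, 0.5, 0.9)

# logit: log odds of success, log(p / (1 - p))
logits <- qlogis(p)
manual <- log(p / (1 - p))
all.equal(logits, manual)   # TRUE

# inverse logit: back from log odds to probabilities
plogis(logits)              # 0.1 0.5 0.9

# a logit of 0 corresponds to a probability of 0.5
plogis(0)                   # 0.5
```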

Standard Italian and Tuscan dialects

Standard Italian originated in the 14th century as a written language
It originated from the prestigious Florentine variety
The spoken standard Italian language was adopted in the 20th century
- People used to speak in their local dialect
In this study, we investigate the relationship between standard Italian and Tuscan dialects
- We focus on lexical variation
- We assess which social, geographical and lexical variables influence this relationship

Material: lexical data

We used lexical data from the Atlante Lessicale Toscano (ALT)
- We focus on 2060 speakers from 213 locations and 170 concepts
- Total number of cases: 384,454
- Dependent variable
  - 1: lexical form was different from standard Italian
  - 0: lexical form was identical to standard Italian

Geographic distribution of locations

Material: additional data

Speaker age
Speaker gender
Speaker education level
Speaker employment history
Number of inhabitants in each location
Average income in each location
Average age in each location
Frequency of each concept

Modeling geography's influence with a GAM

(R version 3.3.2 (2016-10-31), `mgcv` version 1.8.16)

library(mgcv)  # provides bam()
geo <- bam(NotStd ~ s(Lon, Lat, k = 30), data = tuscan, family = "binomial", discrete = TRUE)
summary(geo)

# 
# Family: binomial 
# Link function: logit 
# 
# Formula:
# NotStd ~ s(Lon, Lat, k = 30)
# 
# Parametric coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)  -0.2474     0.0033   -75.1   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Approximate significance of smooth terms:
#             edf Ref.df Chi.sq p-value    
# s(Lon,Lat) 28.2     29   1591  <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# R-sq.(adj) =  0.0042   Deviance explained = 0.312%
# fREML = 6.1609e+05  Scale est. = 1         n = 384454

First 15 two-dimensional basis functions

(Figure: first 15 two-dimensional basis functions)

First 15 two-dimensional basis functions

(Figure: first 15 two-dimensional basis functions)

Fitted surface

library(itsadug)  # provides fvisgam()
fvisgam(geo, view = c("Lon", "Lat"), too.far = 0.045, main = "", rm.ranef = TRUE)

(Figure: fitted surface of the model geo)

Thin plate regression spline: scale-dependent

geo2 <- bam(NotStd ~ s(km.e, Lat), data = tuscan, family = "binomial", discrete = TRUE)
fvisgam(geo2, view = c("km.e", "Lat"), too.far = 0.045, main = "", rm.ranef = TRUE)

(Figure: fitted surface of the model geo2)

Question 2

Solution: tensor product spline

geo3 <- bam(NotStd ~ te(km.e, Lat, k = c(6, 6)), data = tuscan, family = "binomial", discrete = TRUE)
fvisgam(geo3, view = c("km.e", "Lat"), too.far = 0.045, main = "", rm.ranef = TRUE)

(Figure: fitted surface of the model geo3)
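
The contrast between s() and te() can be sketched on simulated data. All variable names and the simulated pattern below are assumptions made for illustration; this is not the Tuscan dataset. The same covariate is supplied once on a 0-1 scale and once multiplied by 100, and the fitted probabilities are compared:

```r
library(mgcv)

set.seed(42)
n <- 2000
# Hypothetical covariates on a common 0-1 scale (not the Tuscan coordinates)
sim <- data.frame(x = runif(n), z = runif(n))
sim$y <- rbinom(n, 1, plogis(2 * sin(2 * pi * sim$x) * cos(2 * pi * sim$z)))
sim$x100 <- sim$x * 100  # the same covariate expressed in different units

fit <- function(f) gam(f, data = sim, family = "binomial", method = "REML")

# Isotropic thin plate regression splines: rescaling one axis changes the fit
s.a <- fit(y ~ s(x, z, k = 20))
s.b <- fit(y ~ s(x100, z, k = 20))

# Tensor product smooths: separate marginal bases and penalties per covariate
te.a <- fit(y ~ te(x, z, k = c(5, 5)))
te.b <- fit(y ~ te(x100, z, k = c(5, 5)))

# Largest change in fitted probabilities caused by the rescaling
diff.s  <- max(abs(fitted(s.a) - fitted(s.b)))
diff.te <- max(abs(fitted(te.a) - fitted(te.b)))
c(s = diff.s, te = diff.te)
```

Under these assumptions the te() difference should be close to zero (up to numerical tolerance), whereas the two s() fits should differ noticeably.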

Varying geography's influence based on concept freq.

Wieling, Nerbonne and Baayen (2011) showed that the effect of word frequency varied depending on geography
Here we explicitly include this in the GAM with te(), which can model an \(N\)-way non-linear interaction:
te(Lon, Lat, ConceptFreq, d=c(2,1))
As this pattern may be presumed to differ depending on speaker age, we can integrate this in the model as well:
te(Lon, Lat, ConceptFreq, YearBirth, d=c(2,1,1))
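
A minimal sketch of such a te() term on simulated data may help to see what d does. The data frame sim, the model m.sim, the choice of k and the simulated pattern are illustrative assumptions, not part of the original analysis:

```r
library(mgcv)

set.seed(1)
n <- 2000
# Hypothetical stand-ins for longitude, latitude and (z-scored) concept frequency
sim <- data.frame(Lon = runif(n), Lat = runif(n), Freq = rnorm(n))

# Simulated truth: the geographical pattern changes with frequency
eta <- sin(2 * pi * sim$Lon) * cos(2 * pi * sim$Lat) * sim$Freq
sim$NotStd <- rbinom(n, 1, plogis(eta))

# d = c(2, 1): the first two covariates (Lon, Lat) share one isotropic
# two-dimensional marginal basis, the third (Freq) gets its own
# one-dimensional marginal basis; k gives the size of each marginal basis
m.sim <- gam(NotStd ~ te(Lon, Lat, Freq, d = c(2, 1), k = c(10, 4)),
             data = sim, family = "binomial", method = "REML")

summary(m.sim)$s.table
```

Adding a further covariate with its own marginal basis extends d by one element, as in d=c(2,1,1) above.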

Question 3

Full model specification

system.time(
  m <- bam(NotStd ~ te(Lon, Lat, ConceptFreq.log.z, SpeakerBirthYear.z, d=c(2,1,1)) +
    CommunitySize.log.z + SpeakerJob_Farmer + SpeakerEduLevel.log.z + SpeakerIsMale +
    s(Speaker,bs="re") + s(Location,bs="re") + s(Concept,bs="re") + 
    s(Concept,CommunityRecordingYear.z,bs="re") + s(Concept,CommunitySize.log.z,bs="re") +
    s(Concept,CommunityAvgIncome.log.z,bs="re") + s(Concept,CommunityAvgAge.log.z,bs="re") +
    s(Concept,SpeakerJob_Farmer,bs="re") + s(Concept,SpeakerJob_Executive_AuxiliaryWorker,bs="re") +
    s(Concept,SpeakerEduLevel.log.z,bs="re") + s(Concept,SpeakerIsMale,bs="re"), 
  data=tuscan, family="binomial", discrete=TRUE, nthreads=4)
)

#    user  system elapsed 
#  2322.5    21.1   701.3

smry <- summary(m) # takes 10 minutes to calculate

The results will be discussed next... (Wieling et al., 2014, Language)

Results: fixed effects and smooths

smry$p.table

#                       Estimate Std. Error z value Pr(>|z|)
# (Intercept)            -0.4282     0.1264   -3.39 7.08e-04
# CommunitySize.log.z    -0.0629     0.0223   -2.82 4.87e-03
# SpeakerJob_Farmer       0.0449     0.0169    2.66 7.81e-03
# SpeakerEduLevel.log.z  -0.0678     0.0126   -5.38 7.29e-08
# SpeakerIsMale           0.0378     0.0128    2.95 3.18e-03

head(smry$s.table, 1)

#                                                  edf Ref.df Chi.sq p-value
# te(SpeakerBirthYear.z,ConceptFreq.log.z,Lon,Lat) 221    265   3270       0

Interpreting logit coefficients (recap)

# chance for a male farmer in a
# very small village (z-scored
# population size = -2) for
# which the location is unknown
# with a very low education
# level (z-score = -2) to use a
# non-standard lexical form
(logit <- coef(m)["(Intercept)"] + 
    coef(m)["SpeakerIsMale"] + 
    coef(m)["SpeakerJob_Farmer"] + 
    -2 * coef(m)["CommunitySize.log.z"] + 
    -2 * coef(m)["SpeakerEduLevel.log.z"])

# (Intercept) 
#     -0.0841

plogis(logit)  # was: 0.438 (43.8%)

# (Intercept) 
#       0.479
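
The same calculation written out by hand, using the rounded estimates from the fixed-effects table above (rounding may affect the last digit):

\[
\begin{aligned}
\text{logit} &= -0.4282 + 0.0378 + 0.0449 - 2(-0.0629) - 2(-0.0678) = -0.0841 \\
p &= \frac{1}{1 + e^{0.0841}} \approx 0.479
\end{aligned}
\]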


A complex geographical pattern

(Figure: the geographical pattern)

Animation: increasing frequency for older speakers

Animation: increasing frequency for younger speakers

Results: random effects

tail(smry$s.table, 11)  # the last 11 smooths are random effects

#                                                   edf Ref.df Chi.sq   p-value
# s(Speaker)                                       97.1   2005    106  9.40e-03
# s(Location)                                     175.0    209   5642  1.33e-96
# s(Concept)                                      167.0    168 436864  0.00e+00
# s(CommunityRecordingYear.z,Concept)             158.9    170 155893 4.88e-181
# s(CommunitySize.log.z,Concept)                  149.9    169  29991 2.41e-111
# s(CommunityAvgIncome.log.z,Concept)             158.0    170 143207 1.75e-160
# s(CommunityAvgAge.log.z,Concept)                154.4    170 110722 5.80e-195
# s(SpeakerJob_Farmer,Concept)                     86.1    169  26572  1.27e-07
# s(SpeakerJob_Executive_AuxiliaryWorker,Concept)  53.3    170   3319  8.07e-04
# s(SpeakerEduLevel.log.z,Concept)                139.1    169   9377  8.05e-49
# s(SpeakerIsMale,Concept)                         85.4    169 112400  6.55e-11

By-concept random slopes for community size

(Figure: by-concept random slopes for community size)

By-concept random slopes for speaker education level

(Figure: by-concept random slopes for speaker education level)

Discussion

Comparing Tuscan dialects to standard Italian revealed interesting dialectal patterns
GAMs are well suited to modeling the non-linear influence of geography
The regression approach allowed for the simultaneous identification of important social, geographical and lexical predictors
By including many concepts, results are less subjective than traditional analyses focusing on only a few pre-selected concepts
The mixed-effects regression approach still allows a focus on individual concepts
More interested in Tuscan data and analysis? Paper package with all data and analyses available via http://www.martijnwieling.nl

Recap

We have applied GAMs to dialectometry data and learned how to:
- use s() to model two-dimensional non-linear interactions between predictors on the same scale
- model complex non-linear interactions using te()
- use GAMs to conduct logistic regression (family="binomial")
After the break:
- http://www.let.rug.nl/wieling/statscourse/lecture5/lab
  - We use a subset of Dutch dialect data in the lab (faster: no logistic regression)
  - Similar underlying idea: investigate the effect of geography, word frequency, and location characteristics on pronunciation distances from standard Dutch
Finally: please fill in the evaluation form of the course:
http://www.let.rug.nl/wieling/statscourse/evaluation

Evaluation

Questions?

Thank you for your attention!

http://www.martijnwieling.nl
wieling@gmail.com
