Generalized additive modeling for dialectology

Martijn Wieling (University of Groningen)

This lecture

  • Introduction
    • Logistic regression
    • Standard Italian and Tuscan dialects
  • Material: Standard Italian and Tuscan dialects
  • Methods: R code
  • Results
  • Discussion

Question 1

Logistic regression

  • Dependent variable is binary (1: success, 0: failure), not continuous
  • Transform to a continuous variable via the log odds: \(\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)\)
    • Done automatically in a GAM by setting family="binomial"
    • The link function transforms the expected value of the dependent variable: this makes it a generalized additive model
  • Interpret coefficients as logits (log odds of success); convert a logit x to a probability in R with plogis(x)
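
As a minimal sketch of the transformation above (base R only; the probabilities are made-up values for illustration), qlogis() maps probabilities to logits and plogis() maps them back:

```r
# Illustrative probabilities of "success" (made-up values)
p <- c(0.10, 0.50, 0.90)

# Logit (log odds) computed by hand and with the built-in qlogis()
logit_manual  <- log(p / (1 - p))
logit_builtin <- qlogis(p)

# plogis() is the inverse: it converts logits back to probabilities
p_back <- plogis(logit_builtin)

print(round(logit_builtin, 3))
print(p_back)
```

A logit of 0 corresponds to a probability of 0.5; negative logits correspond to probabilities below 0.5 and positive logits to probabilities above 0.5.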

Standard Italian and Tuscan dialects

  • Standard Italian originated in the 14th century as a written language
  • It originated from the prestigious Florentine variety
  • The spoken standard Italian language was adopted in the 20th century
    • People used to speak in their local dialect
  • We investigate the relationship between standard Italian and Tuscan dialects
    • We focus on lexical variation
    • We use social, geographical and lexical variables

Material: lexical data

  • We use lexical data from the Atlante Lessicale Toscano (ALT)
  • We focus on 2060 speakers from 213 locations and 170 concepts
  • Total number of cases: 384,454
    • Binary dependent variable:
      • 1: lexical form was different from standard Italian
      • 0: lexical form was identical to standard Italian

Geographic distribution of locations

Material: predictors

  • Speaker age
  • Speaker sex
  • Speaker education level
  • Speaker employment history
  • Number of inhabitants in each location
  • Average income in each location
  • Average age in each location
  • Frequency of each concept

Modeling geography’s influence with a GAM

(R version 4.4.2 (2024-10-31 ucrt), mgcv version 1.9.3, itsadug version 2.4.1)

library(mgcv)
library(itsadug)
geo <- bam(NotStd ~ s(Lon,Lat,k=30), data=tuscan, family="binomial", discrete=TRUE)
summary(geo) # slides only show the relevant part of the summary
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.247     0.0033   -75.1   <2e-16 ***

Approximate significance of smooth terms:
            edf Ref.df Chi.sq p-value    
s(Lon,Lat) 28.2     29   1591  <2e-16 ***

First 15 two-dimensional basis functions

Fitted surface

fvisgam(geo, view=c("Lon","Lat"), too.far=0.045, main="")

Effect of the number of basis functions
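
The following is a small simulated sketch (not the Tuscan data; the data frame sim, the variables x and y, and the two values of k are assumptions chosen for illustration) of how k limits the flexibility of a smooth. A smooth with k basis functions can use at most k-1 effective degrees of freedom, because one is lost to the identifiability constraint. In the geography model above, the edf of 28.2 is close to its upper limit of 29.

```r
library(mgcv)

set.seed(1)
n   <- 500
sim <- data.frame(x = runif(n))
# Wiggly simulated signal (three full sine periods) plus noise
sim$y <- sin(2 * pi * 3 * sim$x) + rnorm(n, sd = 0.3)

# Same model, different upper limits on the number of basis functions
m_small <- gam(y ~ s(x, k = 5),  data = sim, method = "REML")
m_large <- gam(y ~ s(x, k = 20), data = sim, method = "REML")

# Effective degrees of freedom of the smooth term in each model
edf_small <- summary(m_small)$s.table[1, "edf"]
edf_large <- summary(m_large)$s.table[1, "edf"]

print(c(k5 = edf_small, k20 = edf_large))
```

If the edf of a smooth is very close to k-1, it may be worth refitting with a larger k to see whether the fitted pattern changes.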

Thin plate regression spline: scale-dependent

geo2 <- bam(NotStd ~ s(km.e,Lat), data=tuscan, family="binomial", discrete=TRUE)
fvisgam(geo2, view=c("km.e","Lat"), too.far=0.045, main="", add.color.legend=FALSE)

Question 2

Solution: tensor product spline

geo3 <- bam(NotStd ~ te(km.e,Lat,k=c(6,6)), data=tuscan, family="binomial", discrete=TRUE)
fvisgam(geo3, view=c("km.e","Lat"), too.far=0.045, main="", add.color.legend=FALSE)
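
A hedged sketch of why the two smooths behave differently, written for two generic covariates \(x\) and \(y\) (here km.e and Lat) and assuming second-order penalties. The thin plate regression spline in s() has a single smoothing parameter \(\lambda\) and penalizes wiggliness equally in all directions:

\[ J_{tp}(f) = \lambda \iint \left[ \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right] dx\,dy \]

The tensor product in te() has one smoothing parameter per marginal, roughly:

\[ J_{te}(f) = \lambda_x \iint \left(\frac{\partial^2 f}{\partial x^2}\right)^2 dx\,dy + \lambda_y \iint \left(\frac{\partial^2 f}{\partial y^2}\right)^2 dx\,dy \]

Rescaling \(x\) (for example from degrees to kilometers) changes the relative weight of the three terms in \(J_{tp}\), so the fitted surface changes. In \(J_{te}\) the rescaling can be absorbed into \(\lambda_x\), so the fit does not depend on the units of either covariate.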

Varying geography’s influence based on concept freq.

  • Wieling, Nerbonne and Baayen (2011) showed that the effect of word frequency varied depending on geography
  • Here we explicitly include this in the GAM with te(), which can model an \(N\)-way non-linear interaction:
    te(Lon, Lat, ConceptFreq, d=c(2,1))
  • As this pattern may be presumed to differ depending on speaker age, we can integrate this in the model as well:
    te(Lon, Lat, ConceptFreq, YearBirth, d=c(2,1,1))
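
A minimal simulated sketch of the d argument (not the Tuscan data; the data frame sim, its columns lon, lat and freq, the simulated effect and the values of k are all assumptions for illustration). With d=c(2,1) the first two variables form one two-dimensional isotropic marginal and the third variable forms a separate one-dimensional marginal, so geography and frequency may be on different scales:

```r
library(mgcv)

set.seed(42)
n   <- 2000
sim <- data.frame(lon = runif(n), lat = runif(n), freq = rnorm(n))

# Simulated truth: a geographical bump whose height depends on freq
bump    <- exp(-((sim$lon - 0.5)^2 + (sim$lat - 0.5)^2) / 0.05)
eta     <- -0.5 + (1 + 0.8 * sim$freq) * bump
sim$y   <- rbinom(n, size = 1, prob = plogis(eta))

# d = c(2, 1): (lon, lat) is a 2D thin plate marginal, freq a 1D cubic marginal
m_te <- gam(y ~ te(lon, lat, freq, d = c(2, 1), bs = c("tp", "cr"), k = c(10, 4)),
            data = sim, family = "binomial", method = "REML")

print(summary(m_te)$s.table)
```

The marginal bases are given explicitly here to make the structure visible; adding a fourth variable with d=c(2,1,1) extends the same idea by one more one-dimensional marginal.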

Question 3

Full model specification

system.time(
  m <- bam(NotStd ~ te(Lon, Lat, ConceptFreq.log.z, SpeakerBirthYear.z, d=c(2,1,1)) +
    CommunitySize.log.z + SpeakerJob_Farmer + SpeakerEduLevel.log.z + SpeakerIsMale +
    s(Speaker,bs="re") + s(Location,bs="re") + s(Concept,bs="re") + 
    s(Concept,CommunityRecordingYear.z,bs="re") + s(Concept,CommunitySize.log.z,bs="re") +
    s(Concept,CommunityAvgIncome.log.z,bs="re") + s(Concept,CommunityAvgAge.log.z,bs="re") +
    s(Concept,SpeakerJob_Farmer,bs="re") + s(Concept,SpeakerJob_Executive_AuxiliaryWorker,bs="re") +
    s(Concept,SpeakerEduLevel.log.z,bs="re") + s(Concept,SpeakerIsMale,bs="re"), 
  data=tuscan, family="binomial", discrete=TRUE, nthreads=2)
)
   user  system elapsed 
  632.4    12.2   361.3 

Results: fixed effects and tensor

summary(m, re.test=FALSE) 
Parametric coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -0.4249     0.1265   -3.36 0.000781 ***
CommunitySize.log.z    -0.0641     0.0223   -2.87    0.004 ** 
SpeakerJob_Farmer       0.0447     0.0168    2.66    0.008 ** 
SpeakerEduLevel.log.z  -0.0669     0.0126   -5.32 1.06e-07 ***
SpeakerIsMale           0.0378     0.0128    2.95    0.003 ** 

Approximate significance of smooth terms:
                                                 edf Ref.df Chi.sq p-value    
te(SpeakerBirthYear.z,ConceptFreq.log.z,Lon,Lat) 224    268   3289  <2e-16 ***

Interpreting logit coefficients

# probability that a male farmer 
# with a very low education level 
# (z-score = -2), living in a very 
# small village (z-scored population 
# size = -2), uses a non-standard 
# lexical form; the location is 
# unknown, so geography is left out
(logit <- coef(m)["(Intercept)"] + 
    coef(m)["SpeakerIsMale"] + 
    coef(m)["SpeakerJob_Farmer"] + 
    -2 * coef(m)["CommunitySize.log.z"] + 
    -2 * coef(m)["SpeakerEduLevel.log.z"])
(Intercept) 
    -0.0803 
plogis(logit)
(Intercept) 
       0.48 
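
The same calculation can be sketched without the fitted model, using the rounded coefficients from the summary output above (base R only). The reference speaker x_ref is a hypothetical contrast added for illustration; as in the calculation above, geography and random effects are left out.

```r
# Coefficients copied (rounded) from the summary output above
b <- c("(Intercept)"           = -0.4249,
       "CommunitySize.log.z"   = -0.0641,
       "SpeakerJob_Farmer"     =  0.0447,
       "SpeakerEduLevel.log.z" = -0.0669,
       "SpeakerIsMale"         =  0.0378)

# Male farmer, community size and education level both 2 SD below the mean
x_farmer <- c(1, -2, 1, -2, 1)
# Hypothetical contrast: female non-farmer, average community size and education
x_ref    <- c(1,  0, 0,  0, 0)

# Sum the logit contributions, then convert to a probability
p_farmer <- plogis(sum(b * x_farmer))
p_ref    <- plogis(sum(b * x_ref))

# exp() of a logit coefficient is an odds ratio per unit of the predictor
odds_ratios <- exp(b[-1])

print(round(c(farmer = p_farmer, reference = p_ref), 2))
print(round(odds_ratios, 3))
```

Odds ratios below 1 (community size, education level) mean that higher values of the predictor make a non-standard form less likely; odds ratios above 1 (farmer, male) mean the opposite.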

Geographical results: complex!

Sequence: increasing frequency for older speakers

Sequence: increasing frequency for younger speakers

Results: random effects

system.time(smry <- summary(m)) # takes a long time to compute
   user  system elapsed 
 1759.9    17.7  1808.4 
tail( smry$s.table, 11 ) # last 11 smooths are random effects
                                                  edf Ref.df   Chi.sq  p-value
s(Speaker)                                       83.2   2005     89.5 2.25e-02
s(Location)                                     175.2    209   5314.3 0.00e+00
s(Concept)                                      166.9    168 437444.8 0.00e+00
s(CommunityRecordingYear.z,Concept)             158.9    170 156924.5 0.00e+00
s(CommunitySize.log.z,Concept)                  149.9    169  30138.8 0.00e+00
s(CommunityAvgIncome.log.z,Concept)             158.1    170 143131.3 0.00e+00
s(CommunityAvgAge.log.z,Concept)                154.4    170 110864.3 0.00e+00
s(SpeakerJob_Farmer,Concept)                     86.0    169  26203.2 6.46e-07
s(SpeakerJob_Executive_AuxiliaryWorker,Concept)  53.3    170   3315.0 7.56e-04
s(SpeakerEduLevel.log.z,Concept)                139.1    169   9347.8 0.00e+00
s(SpeakerIsMale,Concept)                         85.5    169 111596.1 0.00e+00
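
A small simulated sketch of how bs="re" terms work (not the Tuscan data; the data frame sim, the grouping factor g, the predictor x and all simulated values are assumptions for illustration). s(g,bs="re") estimates a random intercept per level of g, and s(g,x,bs="re") a random slope for x per level of g, analogous to the by-concept random slopes above:

```r
library(mgcv)

set.seed(7)
n_groups <- 20
n_per    <- 30
sim <- data.frame(
  g = factor(rep(seq_len(n_groups), each = n_per)),  # stands in for Concept
  x = rnorm(n_groups * n_per)                        # stands in for a z-scored predictor
)

int_g   <- rnorm(n_groups, sd = 1.0)   # simulated true random intercepts
slope_g <- rnorm(n_groups, sd = 0.5)   # simulated true random slopes
idx     <- as.integer(sim$g)
sim$y   <- 1 + int_g[idx] + (0.3 + slope_g[idx]) * sim$x +
           rnorm(nrow(sim), sd = 0.5)

# Random intercepts and random slopes specified as "re" smooths
m_re <- gam(y ~ x + s(g, bs = "re") + s(g, x, bs = "re"),
            data = sim, method = "REML")

# One estimated slope adjustment per level of g
co         <- coef(m_re)
est_slopes <- co[grep("^s\\(g,x\\)", names(co))]

print(round(cor(est_slopes, slope_g), 2))
```

The estimated adjustments are deviations from the fixed-effect slope of x, so a group's total slope is the fixed slope plus its adjustment.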

By-concept random slopes for community size

By-concept random slopes for speaker education level

Discussion

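  • Comparing Tuscan dialects to standard Italian revealed clear social and geographical patterns: non-standard forms were more likely for men, farmers, less-educated speakers and speakers from smaller communities
  • GAMs are well suited to modeling the non-linear influence of geography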
  • The regression approach allowed for the simultaneous identification of important social, geographical and lexical predictors
  • By including many concepts, results are less subjective than traditional analyses focusing on only a few pre-selected concepts
  • The mixed-effects regression approach still allows a focus on individual concepts
  • Analyses can be made reproducible by publishing a paper package containing the data and code

Recap

  • We have applied GAMs to dialect data and learned how to:
    • use s() to model two-dimensional interactions on the same scale
    • model complex non-linear interactions using te()
    • use GAMs to conduct logistic regression (family="binomial")
  • Associated lab session:

Evaluation

Questions?

Thank you for your attention!

 

https://www.martijnwieling.nl

m.b.wieling@rug.nl