Generalized additive modeling for dialectology

Martijn Wieling (University of Groningen)

This lecture

Introduction
- Logistic regression
- Standard Italian and Tuscan dialects
Material: Standard Italian and Tuscan dialects
Methods: R code
Results
Discussion

Question 1

Logistic regression

Dependent variable is binary (1: success, 0: failure), not continuous
Transform to a continuous variable via the log odds: \(\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)\)
- Done automatically in a GAM by setting family="binomial"
- Because the dependent variable is transformed, the model is a generalized additive model
Interpret coefficients as logits (log odds of success); convert them to probabilities in R with plogis(x)
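
The logit transformation and its inverse can be tried out directly in base R. The following is a minimal sketch; the probabilities in `p` are arbitrary illustration values, not data from the study.

```r
# Arbitrary example probabilities (illustration only)
p <- c(0.1, 0.25, 0.5, 0.9)

# Log odds computed by hand and with the built-in quantile function
logit_manual  <- log(p / (1 - p))
logit_builtin <- qlogis(p)

# plogis() is the inverse: it maps logits back to probabilities
p_back <- plogis(logit_manual)

print(cbind(p, logit_manual, logit_builtin, p_back))
```

A logit of 0 corresponds to a probability of 0.5, negative logits to probabilities below 0.5, and positive logits to probabilities above 0.5.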

Standard Italian and Tuscan dialects

Standard Italian originated in the 14th century as a written language
It originated from the prestigious Florentine variety
Spoken standard Italian was adopted only in the 20th century
- Before that, people spoke their local dialect
We investigate the relationship between standard Italian and Tuscan dialects
- We focus on lexical variation
- We use social, geographical and lexical variables

Material: lexical data

We use lexical data from the Atlante Lessicale Toscano (ALT)
We focus on 2060 speakers from 213 locations and 170 concepts
Total number of cases: 384,454
- Binary dependent variable:
  - 1: lexical form was different from standard Italian
  - 0: lexical form was identical to standard Italian

Geographic distribution of locations

Material: predictors

Speaker age
Speaker sex
Speaker education level
Speaker employment history
Number of inhabitants in each location
Average income in each location
Average age in each location
Frequency of each concept

Modeling geography’s influence with a GAM

(R version 4.5.0 (2025-04-11 ucrt), `mgcv` version 1.9.3, `itsadug` version 2.4.1)

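library(mgcv)    # bam() and the smooth constructors s() and te()
library(itsadug) # fvisgam() for plotting fitted surfaces
# Logistic GAM with a two-dimensional thin plate regression spline over geography
geo <- bam(NotStd ~ s(Lon,Lat,k=30), data=tuscan, family="binomial", discrete=TRUE)
summary(geo) # slides only show the relevant part of the summary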

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.247     0.0033   -75.1   <2e-16 ***

Approximate significance of smooth terms:
            edf Ref.df Chi.sq p-value    
s(Lon,Lat) 28.2     29   1591  <2e-16 ***

First 15 two-dimensional basis functions

Fitted surface

fvisgam(geo, view=c("Lon","Lat"), too.far=0.045, main="")

Effect of the number of basis functions
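
The role of `k` can be illustrated on simulated data. This is a sketch under stated assumptions: the data frame `d`, the surface used to generate `z`, and the Gaussian response are all hypothetical and chosen for brevity; they are not the Tuscan data or the binomial model from the slides.

```r
library(mgcv)

set.seed(42)
n <- 2000
# Hypothetical data: a wiggly surface over the unit square plus noise
d <- data.frame(x = runif(n), y = runif(n))
d$z <- sin(4 * pi * d$x) * cos(3 * pi * d$y) + rnorm(n, sd = 0.3)

# Same smooth, two different upper limits on the number of basis functions
fit_small <- gam(z ~ s(x, y, k = 10), data = d, method = "REML")
fit_large <- gam(z ~ s(x, y, k = 60), data = d, method = "REML")

# k caps the effective degrees of freedom (edf) at k - 1
edf_small <- summary(fit_small)$s.table[1, "edf"]
edf_large <- summary(fit_large)$s.table[1, "edf"]
print(c(edf_small = edf_small, edf_large = edf_large))
print(AIC(fit_small, fit_large))
```

When the edf ends up close to `k - 1`, the basis may be too small for the pattern in the data, and a larger `k` is worth trying.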

Thin plate regression spline: scale-dependent

# Same smooth with km.e in place of Lon: the two covariates are now on different scales
geo2 <- bam(NotStd ~ s(km.e,Lat,k=30), data=tuscan, family="binomial", discrete=TRUE)
fvisgam(geo2, view=c("km.e","Lat"), too.far=0.045, main="",add.color.legend=FALSE)

Question 2

Solution: tensor product spline

# Tensor product smooth: each covariate gets its own marginal basis and penalty
geo3 <- bam(NotStd ~ te(km.e,Lat,k=c(6,6)), data=tuscan, family="binomial", discrete=TRUE)
fvisgam(geo3, view=c("km.e","Lat"), too.far=0.045, main="",add.color.legend=FALSE)
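
The difference between the two smooth types can be checked on simulated data. This is a minimal sketch under stated assumptions: `d`, `x100` and the Gaussian response are hypothetical, and multiplying `x` by 100 merely stands in for re-expressing a coordinate in a different unit.

```r
library(mgcv)

set.seed(1)
n <- 1500
# Hypothetical data with structure along both covariates
d <- data.frame(x = runif(n), y = runif(n))
d$z <- sin(2 * pi * d$x) + cos(2 * pi * d$y) + rnorm(n, sd = 0.3)
d$x100 <- 100 * d$x  # same information, different unit

# Isotropic thin plate regression splines: one penalty for both directions
tp_a <- gam(z ~ s(x, y, k = 30), data = d, method = "REML")
tp_b <- gam(z ~ s(x100, y, k = 30), data = d, method = "REML")

# Tensor product smooths: separate marginal bases and penalties
te_a <- gam(z ~ te(x, y, k = c(6, 6)), data = d, method = "REML")
te_b <- gam(z ~ te(x100, y, k = c(6, 6)), data = d, method = "REML")

# Largest change in fitted values caused purely by the change of unit
diff_tp <- max(abs(fitted(tp_a) - fitted(tp_b)))
diff_te <- max(abs(fitted(te_a) - fitted(te_b)))
print(c(diff_tp = diff_tp, diff_te = diff_te))
```

In this sketch the tensor product fit should be essentially unaffected by the rescaling, whereas the thin plate fit changes, which is why `te()` is the safer choice when covariates are not measured in the same unit.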

Varying geography’s influence based on concept freq.

Wieling, Nerbonne and Baayen (2011) showed that the effect of word frequency varied depending on geography
Here we explicitly include this in the GAM with te(), which can model an \(N\)-way non-linear interaction:
te(Lon, Lat, ConceptFreq, d=c(2,1))
As this pattern may be presumed to differ depending on speaker age, we can integrate this in the model as well:
te(Lon, Lat, ConceptFreq, YearBirth, d=c(2,1,1))
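
The `d` argument tells `te()` how many covariates belong to each marginal smooth. The grouping can be inspected without fitting a model, because `te()` only builds a specification object. The sketch below relies on the `margin`, `term` and `dim` components of that object in `mgcv`; the variable names are the ones from the formulas above.

```r
library(mgcv)

# d = c(2, 1): Lon and Lat share one 2-D margin, ConceptFreq gets a 1-D margin
spec3 <- te(Lon, Lat, ConceptFreq, d = c(2, 1))
# d = c(2, 1, 1): YearBirth is added as a further 1-D margin
spec4 <- te(Lon, Lat, ConceptFreq, YearBirth, d = c(2, 1, 1))

# Number of covariates per margin
margin_dims3 <- sapply(spec3$margin, function(m) m$dim)
margin_dims4 <- sapply(spec4$margin, function(m) m$dim)

# Covariates assigned to each margin
margin_terms4 <- lapply(spec4$margin, function(m) m$term)

print(margin_dims3)
print(margin_dims4)
print(margin_terms4)
```

Because the covariates of a multi-dimensional margin must be adjacent and in the order given by `d`, `Lon` and `Lat` come first in both calls.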

Question 3

Full model specification

system.time(
  m <- bam(NotStd ~ te(Lon, Lat, ConceptFreq.log.z, SpeakerBirthYear.z, d=c(2,1,1)) +
    CommunitySize.log.z + SpeakerJob_Farmer + SpeakerEduLevel.log.z + SpeakerIsMale +
    s(Speaker,bs="re") + s(Location,bs="re") + s(Concept,bs="re") + 
    s(Concept,CommunityRecordingYear.z,bs="re") + s(Concept,CommunitySize.log.z,bs="re") +
    s(Concept,CommunityAvgIncome.log.z,bs="re") + s(Concept,CommunityAvgAge.log.z,bs="re") +
    s(Concept,SpeakerJob_Farmer,bs="re") + s(Concept,SpeakerJob_Executive_AuxiliaryWorker,bs="re") +
    s(Concept,SpeakerEduLevel.log.z,bs="re") + s(Concept,SpeakerIsMale,bs="re"), 
  data=tuscan, family="binomial", discrete=TRUE, nthreads=2)
)

   user  system elapsed 
 1476.5    10.2   792.7

The results will be discussed next… (Wieling et al., 2014)

Results: fixed effects and tensor

summary(m, re.test=FALSE)

Parametric coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -0.4249     0.1265   -3.36 0.000781 ***
CommunitySize.log.z    -0.0641     0.0223   -2.87    0.004 ** 
SpeakerJob_Farmer       0.0447     0.0168    2.66    0.008 ** 
SpeakerEduLevel.log.z  -0.0669     0.0126   -5.32 1.06e-07 ***
SpeakerIsMale           0.0378     0.0128    2.95    0.003 ** 

Approximate significance of smooth terms:
                                                 edf Ref.df Chi.sq p-value    
te(SpeakerBirthYear.z,ConceptFreq.log.z,Lon,Lat) 224    268   3289  <2e-16 ***

Interpreting logit coefficients

# Probability that a male farmer with a
# very low education level (z-score = -2)
# from a very small village (z-scored
# population size = -2) uses a
# non-standard lexical form, leaving out
# the geographical smooth (location
# unknown) and the random effects
(logit <- coef(m)["(Intercept)"] + 
    coef(m)["SpeakerIsMale"] + 
    coef(m)["SpeakerJob_Farmer"] + 
    -2 * coef(m)["CommunitySize.log.z"] + 
    -2 * coef(m)["SpeakerEduLevel.log.z"])

(Intercept) 
    -0.0803

plogis(logit)

(Intercept) 
       0.48
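
The same arithmetic can be reproduced without the fitted model by typing in the rounded estimates from the summary table. In this sketch the vector `b` and the helper `prob_nonstandard()` are hypothetical names introduced for illustration; only the numbers come from the output above.

```r
# Rounded parametric estimates copied from the summary table
b <- c(intercept = -0.4249, size = -0.0641, farmer = 0.0447,
       edu = -0.0669, male = 0.0378)

# Hypothetical helper: probability of a non-standard form for a male farmer,
# ignoring the geographical smooth and the random effects
prob_nonstandard <- function(size_z, edu_z) {
  logit <- b[["intercept"]] + b[["male"]] + b[["farmer"]] +
    size_z * b[["size"]] + edu_z * b[["edu"]]
  plogis(logit)
}

p_low_edu  <- prob_nonstandard(size_z = -2, edu_z = -2)  # very low education
p_high_edu <- prob_nonstandard(size_z = -2, edu_z =  2)  # very high education
print(round(c(p_low_edu = p_low_edu, p_high_edu = p_high_edu), 2))
```

Comparing the two probabilities shows the direction of the education effect on the probability scale: the negative coefficient means that more highly educated speakers are less likely to use a non-standard form.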

Geographical results: complex!

Sequence: increasing frequency for older speakers

Sequence: increasing frequency for younger speakers

Results: random effects

system.time(smry <- summary(m)) # takes a long time to compute

   user  system elapsed 
3676.09    9.59 3694.66

tail( smry$s.table, 11 ) # last 11 smooths are random effects

                                                  edf Ref.df   Chi.sq  p-value
s(Speaker)                                       83.2   2005     89.5 2.25e-02
s(Location)                                     175.2    209   5314.3 0.00e+00
s(Concept)                                      166.9    168 437444.8 0.00e+00
s(CommunityRecordingYear.z,Concept)             158.9    170 156924.5 0.00e+00
s(CommunitySize.log.z,Concept)                  149.9    169  30138.8 0.00e+00
s(CommunityAvgIncome.log.z,Concept)             158.1    170 143131.3 0.00e+00
s(CommunityAvgAge.log.z,Concept)                154.4    170 110864.3 0.00e+00
s(SpeakerJob_Farmer,Concept)                     86.0    169  26203.2 6.46e-07
s(SpeakerJob_Executive_AuxiliaryWorker,Concept)  53.3    170   3315.0 7.56e-04
s(SpeakerEduLevel.log.z,Concept)                139.1    169   9347.8 0.00e+00
s(SpeakerIsMale,Concept)                         85.5    169 111596.1 0.00e+00
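
How random intercepts and random slopes are specified with `bs="re"`, and how the estimated by-group slopes can be pulled out of the coefficient vector, is sketched below on simulated data. Everything in this example is hypothetical: the grouping factor `g`, the predictor `x`, and the simulated effects are stand-ins for concepts and their predictors.

```r
library(mgcv)

set.seed(123)
n_groups <- 20
n <- 2000
# Hypothetical grouping factor and numeric predictor
g_id <- sample(n_groups, n, replace = TRUE)
x <- rnorm(n)

# Simulated by-group intercept and slope adjustments
true_int   <- rnorm(n_groups, sd = 1)
true_slope <- rnorm(n_groups, sd = 0.5)
eta <- -0.3 + 0.2 * x + true_int[g_id] + true_slope[g_id] * x
d <- data.frame(y = rbinom(n, 1, plogis(eta)), x = x, g = factor(g_id))

# s(g, bs="re"): random intercepts; s(g, x, bs="re"): by-group random slopes for x
fit <- bam(y ~ x + s(g, bs = "re") + s(g, x, bs = "re"),
           data = d, family = "binomial", discrete = TRUE)

# The random slopes are stored as ordinary coefficients named "s(g,x).1", ...
slopes <- coef(fit)[grep("^s\\(g,x\\)\\.", names(coef(fit)))]
print(round(slopes, 2))
print(cor(slopes, true_slope))
```

Adding the fixed-effect slope for `x` to each of these adjustments gives the group-specific slope, which is the kind of quantity shown per concept in the random-slope plots.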

By-concept random slopes for community size

By-concept random slopes for speaker education level

Discussion

Comparing Tuscan dialects to standard Italian revealed interesting patterns
GAMs are well suited to modeling the non-linear influence of geography
The regression approach allowed for the simultaneous identification of important social, geographical and lexical predictors
By including many concepts, results are less subjective than traditional analyses focusing on only a few pre-selected concepts
The mixed-effects regression approach still allows a focus on individual concepts
Analyses can be made reproducible via a paper package containing the data and code

Recap

We have applied GAMs to dialect data and learned how to:
- use s() to model two-dimensional interactions between predictors on the same scale
- model complex non-linear interactions using te()
- use GAMs to conduct logistic regression (family="binomial")
Associated lab session:
- https://www.let.rug.nl/wieling/Statistics/GAM-Dialectology/lab

Evaluation

Questions?

Thank you for your attention!

https://www.martijnwieling.nl

m.b.wieling@rug.nl
