# Generalized additive modeling and dialectology

## Lecture 5 of advanced regression for linguists

Martijn Wieling
Computational Linguistics Research Group

## This lecture

• Introduction
• Logistic regression (recap)
• Standard Italian and Tuscan dialects
• Material: Standard Italian and Tuscan dialects
• Methods: R code
• Results
• Discussion

## Logistic regression (recap)

• Dependent variable is binary (1: success, 0: failure), not continuous
• Transform to continuous variable via log odds: $\log\left(\frac{p}{1-p}\right) = \text{logit}(p)$
• Done automatically in regression by setting family="binomial"
• Generalized linear model: specific link function and error distribution
• Interpret coefficients as logits (log odds of success); convert a logit x to a probability in R with plogis(x)
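The recap above can be tried out with a minimal base-R sketch. The data here are simulated purely for illustration (they are not the Tuscan data), and the variable names `x`, `y` and `fit` are assumptions of this example.

```r
# Log odds (logit) and its inverse, using base R only
p <- c(0.1, 0.5, 0.9)
logit_p <- qlogis(p)      # same as log(p / (1 - p))
back <- plogis(logit_p)   # inverse logit: exp(x) / (1 + exp(x))

# Simulated binary outcome (hypothetical data, not the Tuscan data)
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Logistic regression: binomial family with the (default) logit link
fit <- glm(y ~ x, family = "binomial")
coef(fit)                            # estimates are on the logit scale
plogis(coef(fit)[["(Intercept)"]])   # estimated probability of success at x = 0
```

The coefficients returned by `coef(fit)` are logits, so they only become probabilities after passing the full linear predictor through `plogis()`.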

## Standard Italian and Tuscan dialects

• Standard Italian originated in the 14th century as a written language
• It originated from the prestigious Florentine variety
• Standard Italian was adopted as a spoken language only in the 20th century
• Before that, people spoke their local dialect
• In this study, we investigate the relationship between standard Italian and Tuscan dialects
• We focus on lexical variation
• We assess which social, geographical and lexical variables influence this relationship

## Material: lexical data

• We used lexical data from the Atlante Lessicale Toscano (ALT)
• We focus on 2060 speakers from 213 locations and 170 concepts
• Total number of cases: 384,454
• Dependent variable:
  • 1: lexical form was different from standard Italian
  • 0: lexical form was identical to standard Italian
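As an illustration of this coding, the sketch below derives a binary `NotStd` column from a toy table. The rows and the column names `Form` and `Standard` are made up for this example and are not taken from the ALT data.

```r
# Hypothetical toy data: one row per speaker-concept response
responses <- data.frame(
  Concept  = c("cheese", "cheese", "apron", "apron"),
  Form     = c("formaggio", "cacio", "grembiule", "grembiule"),
  Standard = c("formaggio", "formaggio", "grembiule", "grembiule"),
  stringsAsFactors = FALSE
)

# 1 = form differs from standard Italian, 0 = form is identical
responses$NotStd <- as.integer(responses$Form != responses$Standard)

# Proportion of non-standard responses in the toy data
mean(responses$NotStd)
```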

## Geographic distribution of locations

(Figure: map of the 213 locations)

## Material: additional variables

• Speaker age
• Speaker gender
• Speaker education level
• Speaker employment history
• Number of inhabitants in each location
• Average income in each location
• Average age in each location
• Frequency of each concept
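The predictor names in the full model further on (for example `CommunitySize.log.z`) suggest that numeric variables were log-transformed and then z-scored. A minimal sketch of such a transformation follows; the input values are invented and the exact preprocessing used for the lecture is an assumption here.

```r
# Made-up numbers of inhabitants for five hypothetical locations
community_size <- c(350, 1200, 8000, 45000, 370000)

# Log-transform to reduce skew, then z-score (mean 0, standard deviation 1)
size_log <- log(community_size)
size_log_z <- as.numeric(scale(size_log))

round(size_log_z, 2)
```

With z-scored predictors, a value of -2 corresponds to two standard deviations below the mean, which is how such values are used when interpreting the coefficients later in the lecture.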

## Modeling geography's influence with a GAM

#### (R version 3.3.2 (2016-10-31), mgcv version 1.8.16)

library(mgcv)  # provides bam()
geo <- bam(NotStd ~ s(Lon, Lat, k = 30), data = tuscan, family = "binomial", discrete = TRUE)
summary(geo)

#
# Family: binomial
#
# Formula:
# NotStd ~ s(Lon, Lat, k = 30)
#
# Parametric coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -0.2474     0.0033   -75.1   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Approximate significance of smooth terms:
#             edf Ref.df Chi.sq p-value
# s(Lon,Lat) 28.2     29   1591  <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# R-sq.(adj) =  0.0042   Deviance explained = 0.312%
# fREML = 6.1609e+05  Scale est. = 1         n = 384454
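The intercept in the summary above is on the logit scale. As a quick check that uses only the reported estimate (-0.2474), the inverse logit gives the probability of a non-standard form where the geographic smooth contributes zero, roughly 44%:

```r
# Reported intercept of the geography-only model (logit scale)
intercept_logit <- -0.2474

# Convert to a probability with the inverse logit
p_nonstandard <- plogis(intercept_logit)
round(p_nonstandard, 3)  # 0.438
```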


## Fitted surface

library(itsadug)  # provides fvisgam()
fvisgam(geo, view = c("Lon", "Lat"), too.far = 0.045, main = "", rm.ranef = TRUE)


## Thin plate regression spline: scale-dependent

geo2 <- bam(NotStd ~ s(km.e, Lat), data = tuscan, family = "binomial", discrete = TRUE)
fvisgam(geo2, view = c("km.e", "Lat"), too.far = 0.045, main = "", rm.ranef = TRUE)


## Solution: tensor product spline

geo3 <- bam(NotStd ~ te(km.e, Lat, k = c(6, 6)), data = tuscan, family = "binomial", discrete = TRUE)
fvisgam(geo3, view = c("km.e", "Lat"), too.far = 0.045, main = "", rm.ranef = TRUE)
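The contrast between the two smooth types can be checked on simulated data. The sketch below is an illustration under stated assumptions: the data, the variable names (`x`, `x_km`, `z`) and the use of `gam()` with REML instead of `bam()` are choices made for this example only. It fits each smooth twice, once with a predictor in its original unit and once with the same predictor multiplied by 100, and compares the fitted values.

```r
library(mgcv)

set.seed(42)
n <- 400
x <- runif(n)
z <- runif(n)
y <- sin(2 * pi * x) + 4 * (z - 0.5)^2 + rnorm(n, sd = 0.3)
d <- data.frame(y = y, x = x, z = z, x_km = x * 100)  # same predictor, other unit

# Isotropic thin plate smooths: one penalty shared by both directions,
# so the fit depends on the units of the inputs
tp_a <- gam(y ~ s(x, z), data = d, method = "REML")
tp_b <- gam(y ~ s(x_km, z), data = d, method = "REML")

# Tensor product smooths: a separate penalty per margin
te_a <- gam(y ~ te(x, z), data = d, method = "REML")
te_b <- gam(y ~ te(x_km, z), data = d, method = "REML")

# Largest change in fitted values caused by rescaling x
diff_tp <- max(abs(fitted(tp_a) - fitted(tp_b)))
diff_te <- max(abs(fitted(te_a) - fitted(te_b)))
c(thin_plate = diff_tp, tensor = diff_te)
```

The rescaling should barely affect the tensor product fit, while the thin plate fit changes noticeably. This is why `te()` is preferred when the predictors are not on the same scale.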


## Varying geography's influence based on concept freq.

• Wieling, Nerbonne and Baayen (2011) showed that the effect of word frequency varied depending on geography
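• Here we include this explicitly in the GAM with te(), which can model an $N$-way non-linear interaction; d=c(2,1) treats Lon and Lat as a single two-dimensional marginal smooth and ConceptFreq as a one-dimensional one: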
te(Lon, Lat, ConceptFreq, d=c(2,1))
• Since this pattern may in turn differ with speaker age, we can add year of birth to the interaction as well:
te(Lon, Lat, ConceptFreq, YearBirth, d=c(2,1,1))
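A minimal sketch of such a three-way tensor product on simulated data is shown below. Everything in it is hypothetical: the data, the lower-case variable names, and the explicit `bs` and `k` settings (chosen to keep the example small and fast) are assumptions of this example rather than the settings used in the lecture, and `gam()` is used instead of `bam()`.

```r
library(mgcv)

set.seed(7)
n <- 1000
d <- data.frame(lon = runif(n), lat = runif(n), freq = rnorm(n))

# Hypothetical truth: the east-west gradient reverses with frequency
eta <- -0.3 + 2 * d$freq * (d$lon - 0.5) + sin(pi * d$lat)
d$y <- rbinom(n, size = 1, prob = plogis(eta))

# d = c(2, 1): (lon, lat) form one 2-D thin plate margin,
# freq forms a separate 1-D cubic regression spline margin
m_te <- gam(y ~ te(lon, lat, freq, d = c(2, 1), bs = c("tp", "cr"), k = c(10, 5)),
            data = d, family = "binomial", method = "REML")

summary(m_te)$s.table
```

Grouping longitude and latitude in one margin keeps geography isotropic, while the frequency margin gets its own smoothing penalty.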

## Full model specification

system.time(
  m <- bam(NotStd ~ te(Lon, Lat, ConceptFreq.log.z, SpeakerBirthYear.z, d = c(2, 1, 1)) +
             CommunitySize.log.z + SpeakerJob_Farmer + SpeakerEduLevel.log.z + SpeakerIsMale +
             s(Speaker, bs = "re") + s(Location, bs = "re") + s(Concept, bs = "re") +
             s(Concept, CommunityRecordingYear.z, bs = "re") + s(Concept, CommunitySize.log.z, bs = "re") +
             s(Concept, CommunityAvgIncome.log.z, bs = "re") + s(Concept, CommunityAvgAge.log.z, bs = "re") +
             s(Concept, SpeakerJob_Farmer, bs = "re") + s(Concept, SpeakerJob_Executive_AuxiliaryWorker, bs = "re") +
             s(Concept, SpeakerEduLevel.log.z, bs = "re") + s(Concept, SpeakerIsMale, bs = "re"),
           # the remaining arguments are missing from the source; assumed to match the earlier models
           data = tuscan, family = "binomial", discrete = TRUE)
)

#    user  system elapsed
#  2322.5    21.1   701.3

smry <- summary(m) # takes 10 minutes to calculate


## Results: fixed effects and smooths

smry$p.table

#                       Estimate Std. Error z value Pr(>|z|)
# (Intercept)            -0.4282     0.1264   -3.39 7.08e-04
# CommunitySize.log.z    -0.0629     0.0223   -2.82 4.87e-03
# SpeakerJob_Farmer       0.0449     0.0169    2.66 7.81e-03
# SpeakerEduLevel.log.z  -0.0678     0.0126   -5.38 7.29e-08
# SpeakerIsMale           0.0378     0.0128    2.95 3.18e-03

head(smry$s.table, 1)

#                                                  edf Ref.df Chi.sq p-value
# te(SpeakerBirthYear.z,ConceptFreq.log.z,Lon,Lat) 221    265   3270       0


## Interpreting logit coefficients (recap)

# Probability that a male farmer with a very low education level
# (z-score = -2), living in a very small village (z-scored
# population size = -2) whose location is unknown, uses a
# non-standard lexical form
(logit <- coef(m)["(Intercept)"] +
coef(m)["SpeakerIsMale"] +
coef(m)["SpeakerJob_Farmer"] +
-2 * coef(m)["CommunitySize.log.z"] +
-2 * coef(m)["SpeakerEduLevel.log.z"])

# (Intercept)
#     -0.0841

plogis(logit)  # was: 0.438 (43.8%)

# (Intercept)
#       0.479
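The same calculation can be reproduced without the fitted model by plugging in the estimates reported in the fixed-effects table. The sketch below does this for the profile above and, as a contrast, for a hypothetical second profile (a female non-farmer with a high education level in a large community, both z-scores set to +2). As above, the geographic smooth and the random effects are left out.

```r
# Estimates copied from the fixed-effects table (logit scale)
b <- c("(Intercept)"           = -0.4282,
       "CommunitySize.log.z"   = -0.0629,
       "SpeakerJob_Farmer"     =  0.0449,
       "SpeakerEduLevel.log.z" = -0.0678,
       "SpeakerIsMale"         =  0.0378)

# Male farmer, very small village (-2), very low education level (-2)
logit_a <- b[["(Intercept)"]] + b[["SpeakerIsMale"]] + b[["SpeakerJob_Farmer"]] -
  2 * b[["CommunitySize.log.z"]] - 2 * b[["SpeakerEduLevel.log.z"]]

# Hypothetical contrast: female non-farmer, large community (+2), high education (+2)
logit_b <- b[["(Intercept)"]] +
  2 * b[["CommunitySize.log.z"]] + 2 * b[["SpeakerEduLevel.log.z"]]

round(c(logit_a = logit_a, logit_b = logit_b), 4)
round(plogis(c(profile_a = logit_a, profile_b = logit_b)), 3)
```

The first profile gives a logit of -0.0841 and a probability of about 0.479, matching the output above; the second profile gives a clearly lower probability of using a non-standard form.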


## Results: random effects

tail(smry$s.table, 11)  # last 11 smooths are ranefs

#                                                   edf Ref.df Chi.sq   p-value
# s(Speaker)                                       97.1   2005    106  9.40e-03
# s(Location)                                     175.0    209   5642  1.33e-96
# s(Concept)                                      167.0    168 436864  0.00e+00
# s(CommunityRecordingYear.z,Concept)             158.9    170 155893 4.88e-181
# s(CommunitySize.log.z,Concept)                  149.9    169  29991 2.41e-111
# s(CommunityAvgIncome.log.z,Concept)             158.0    170 143207 1.75e-160
# s(CommunityAvgAge.log.z,Concept)                154.4    170 110722 5.80e-195
# s(SpeakerJob_Farmer,Concept)                     86.1    169  26572  1.27e-07
# s(SpeakerJob_Executive_AuxiliaryWorker,Concept)  53.3    170   3319  8.07e-04
# s(SpeakerEduLevel.log.z,Concept)                139.1    169   9377  8.05e-49
# s(SpeakerIsMale,Concept)                         85.4    169 112400  6.55e-11
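The random-effect smooths listed above come in two forms: `s(G, bs = "re")` is a random intercept for grouping factor `G`, and `s(G, x, bs = "re")` is a by-`G` random slope for predictor `x`. The sketch below fits both on simulated data; the data and the names `concept` and `edu_z` are hypothetical and chosen for this example only.

```r
library(mgcv)

set.seed(3)
n_concepts <- 20
n <- 2000
d <- data.frame(
  concept = factor(sample(n_concepts, n, replace = TRUE)),
  edu_z   = rnorm(n)
)

# Simulated by-concept random intercepts and random slopes for education
b0 <- rnorm(n_concepts, sd = 1)
b1 <- rnorm(n_concepts, sd = 0.5)
idx <- as.integer(d$concept)
eta <- -0.4 + b0[idx] + (-0.1 + b1[idx]) * d$edu_z
d$y <- rbinom(n, size = 1, prob = plogis(eta))

# Random intercept per concept plus by-concept random slope for edu_z
m_re <- bam(y ~ edu_z + s(concept, bs = "re") + s(concept, edu_z, bs = "re"),
            data = d, family = "binomial", discrete = TRUE)

summary(m_re)$s.table
```

An edf close to the number of levels indicates substantial between-concept variation, whereas an edf near zero means that the random effect is shrunk away almost entirely.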


## Discussion

• Comparing Tuscan dialects to standard Italian revealed interesting dialectal patterns
• GAMs are well suited to modeling the non-linear influence of geography
• The regression approach allowed for the simultaneous identification of important social, geographical and lexical predictors
• By including many concepts, results are less subjective than traditional analyses focusing on only a few pre-selected concepts
• The mixed-effects regression approach still allows a focus on individual concepts
• Interested in more detail on the Tuscan data and analysis? A paper package with all data and analyses is available via http://www.martijnwieling.nl

## Recap

• We have applied GAMs to dialectometry data and learned how to:
  • use s() to model two-dimensional interactions on the same scale
  • model complex non-linear interactions using te()
  • use GAMs to conduct logistic regression (family="binomial")
• After the break:
  • http://www.let.rug.nl/wieling/statscourse/lecture5/lab
  • We use a subset of Dutch dialect data in the lab (faster: no logistic regression)
  • Similar underlying idea: investigate the effect of geography, word frequency, and location characteristics on pronunciation distances from standard Dutch
• Finally: please fill in the course evaluation form:
  http://www.let.rug.nl/wieling/statscourse/evaluation