Generalized additive modeling and dialectology

Lecture 5 of advanced regression for linguists

Martijn Wieling
Department of Information Science

This lecture

  • Introduction
    • Logistic regression
    • Standard Italian and Tuscan dialects
  • Material: Standard Italian and Tuscan dialects
  • Methods: R code
  • Results
  • Discussion

Logistic regression

  • Dependent variable is binary (1: success, 0: failure), not continuous
  • Transform to continuous variable via log odds: \(\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\)
    • Done automatically in regression by setting family="binomial"
    • Transformation of dependent variable: generalized regression model
  • Interpret coefficients w.r.t. success as logits; in R, plogis(x) converts a logit x back to a probability
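
The link between probabilities and logits can be checked directly in base R. The following is a minimal sketch; the probability values are arbitrary illustrations, not data from the study.

```r
# Probabilities chosen only for illustration (not from the Tuscan data)
p <- c(0.1, 0.25, 0.5, 0.75, 0.9)

# logit(p) = log(p / (1 - p)); qlogis() computes the same quantity
logit_manual <- log(p / (1 - p))
logit_builtin <- qlogis(p)

# plogis() is the inverse: it maps a logit back to a probability
p_back <- plogis(logit_builtin)

print(cbind(p, logit = logit_builtin, p_back))
```

A logit of 0 corresponds to a probability of 0.5; negative logits correspond to probabilities below 0.5, and positive logits to probabilities above 0.5.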

Standard Italian and Tuscan dialects

  • Standard Italian originated in the 14th century as a written language
  • It originated from the prestigious Florentine variety
  • The spoken standard Italian language was adopted in the 20th century
    • People used to speak in their local dialect
  • In this study, we investigate the relationship between standard Italian and Tuscan dialects
    • We focus on lexical variation
    • We attempt to identify which social, geographical and lexical variables influence this relationship

Material: lexical data

  • We used lexical data from the Atlante Lessicale Toscano (ALT)
    • We focus on 2060 speakers from 213 locations and 170 concepts
    • Total number of cases: 384454
    • For every case, we recorded whether the lexical form differed from standard Italian (1) or was the same (0)

Geographic distribution of locations

Material: additional data

  • In addition, we obtained the following information:
    • Speaker age
    • Speaker gender
    • Speaker education level
    • Speaker employment history
    • Number of inhabitants in each location
    • Average income in each location
    • Average age in each location
    • Frequency of each concept

Modeling geography's influence with a GAM

(R version 3.2.3 (2015-12-10), mgcv version 1.8.11)

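library(mgcv)  # provides bam() and vis.gam()
# Logistic GAM with a two-dimensional smooth over the geographical coordinates
geo <- bam(NotStd ~ s(Longitude, Latitude), data = tuscan, family = "binomial", discrete = TRUE)
# Contour plot of the fitted surface (on the logit scale)
vis.gam(geo, view = c("Longitude", "Latitude"), plot.type = "contour", color = "terrain", too.far = 0.045, 
    main = "")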


Interpreting logit coefficients

summary(geo)$p.table
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   -0.247     0.0033   -75.1        0
plogis(coef(geo)["(Intercept)"])
# (Intercept) 
#       0.438
  • On average, there is a 43.8% chance of observing a non-standard form
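
The conversion behind this percentage is the inverse logit applied to the intercept. As a worked step, using the rounded estimate from the coefficient table (so the last digit may differ slightly from R's own output):

```latex
\[
p = \frac{1}{1 + e^{-(-0.247)}} = \frac{1}{1 + e^{0.247}} \approx \frac{1}{1 + 1.280} \approx 0.44
\]
```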


Varying geography's influence based on concept freq.

  • Wieling, Nerbonne and Baayen (2011, PLOS ONE) showed that the effect of word frequency varied depending on geography
  • Here we explicitly include this in the GAM with te(), which can model an \(N\)-way non-linear interaction:
    te(Longitude, Latitude, ConceptFreq, d=c(2,1))
  • As this pattern may be presumed to differ depending on speaker age, we can integrate this in the model as well:
    te(Longitude, Latitude, ConceptFreq, YearBirth, d=c(2,1,1))
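
To make the role of the d argument concrete, here is a minimal sketch on simulated data. It is not the lecture's analysis: the data frame sim, the simulated effect and the small basis sizes in k are assumptions chosen only so that the example runs quickly; the column names merely mimic those used above. With d = c(2, 1), the first two variables (the coordinates, which share a scale) form one two-dimensional marginal smooth, and the third variable gets its own one-dimensional marginal smooth.

```r
library(mgcv)

set.seed(1)
n <- 2000

# Simulated stand-in for the real data (assumption: values are invented)
sim <- data.frame(
  Longitude   = runif(n, 10, 12),
  Latitude    = runif(n, 42.5, 44.5),
  ConceptFreq = rnorm(n)
)

# Invented geographical pattern whose strength depends on frequency
eta <- sin(3 * sim$Longitude) * cos(3 * sim$Latitude) *
  (1 + 0.5 * sim$ConceptFreq) - 0.2
sim$NotStd <- rbinom(n, 1, plogis(eta))

# d = c(2, 1): a 2-D margin for (Longitude, Latitude), a 1-D margin for ConceptFreq
# k = c(10, 4): small basis dimensions per margin, chosen here only for speed
toy <- bam(NotStd ~ te(Longitude, Latitude, ConceptFreq, d = c(2, 1), k = c(10, 4)),
           data = sim, family = "binomial")

summary(toy)$s.table
```

Adding a fourth variable, as in the second te() call above, extends d by one further entry of 1 for the additional one-dimensional margin.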

Full model specification

system.time(m <- bam(
    NotStd ~ te(Longitude, Latitude, ConceptFreq.log.z, SpeakerBirthYear.z, d = c(2, 1, 1)) +
        CommunitySize.log.z + SpeakerJob_Farmer + SpeakerEduLevel.log.z + SpeakerIsMale +
        s(Speaker, bs = "re") + s(Location, bs = "re") + s(Concept, bs = "re") +
        s(Concept, CommunityRecordingYear.z, bs = "re") +
        s(Concept, CommunitySize.log.z, bs = "re") +
        s(Concept, CommunityAvgIncome.log.z, bs = "re") +
        s(Concept, CommunityAvgAge.log.z, bs = "re") +
        s(Concept, SpeakerJob_Farmer, bs = "re") +
        s(Concept, SpeakerJob_Executive_AuxiliaryWorker, bs = "re") +
        s(Concept, SpeakerEduLevel.log.z, bs = "re") +
        s(Concept, SpeakerIsMale, bs = "re"),
    data = tuscan, family = "binomial", discrete = TRUE, nthreads = 4))
#    user  system elapsed 
#    5120      35    1815
system.time(smry <- summary(m))
#    user  system elapsed 
# 5640.78    6.13 5664.91
  • The results will be discussed next... (Wieling et al., 2014, Language)

Results: fixed effects and smooths

smry$p.table
#                       Estimate Std. Error z value Pr(>|z|)
# (Intercept)            -0.4372     0.1265   -3.46 5.48e-04
# CommunitySize.log.z    -0.0635     0.0225   -2.82 4.80e-03
# SpeakerJob_Farmer       0.0448     0.0168    2.66 7.78e-03
# SpeakerEduLevel.log.z  -0.0675     0.0126   -5.37 8.04e-08
# SpeakerIsMale           0.0380     0.0128    2.97 2.98e-03
head(smry$s.table, 1)
#                                                             edf Ref.df Chi.sq p-value
# te(SpeakerBirthYear.z,ConceptFreq.log.z,Longitude,Latitude) 225    270   3286       0

Interpreting logit coefficients II

# Probability that a male farmer with a very low education level
# (z-score = -2), living in a very small village (z-scored
# population size = -2) whose location is unknown, uses a
# non-standard lexical form
(logit = coef(m)["(Intercept)"] + 
    coef(m)["SpeakerIsMale"] + 
    coef(m)["SpeakerJob_Farmer"] + 
    -2 * coef(m)["CommunitySize.log.z"] + 
    -2 * coef(m)["SpeakerEduLevel.log.z"])
# (Intercept) 
#     -0.0923
plogis(logit)  # was: 0.438 (43.8%)
# (Intercept) 
#       0.477
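
The same calculation can be repeated for other speaker profiles without access to the fitted model. The sketch below copies the rounded fixed-effect estimates from the coefficient table; predict_prob is a hypothetical helper written for this illustration (it is not part of mgcv), and like the calculation above it ignores the geographical smooth and all random effects.

```r
# Fixed-effect estimates copied (rounded) from the coefficient table
b <- c(
  "(Intercept)"         = -0.4372,
  CommunitySize.log.z   = -0.0635,
  SpeakerJob_Farmer     =  0.0448,
  SpeakerEduLevel.log.z = -0.0675,
  SpeakerIsMale         =  0.0380
)

# Hypothetical helper: sum the relevant coefficients on the logit scale,
# then convert the result to a probability with plogis()
predict_prob <- function(size_z, farmer, edu_z, male) {
  x <- c(1, size_z, farmer, edu_z, male)  # same order as b
  unname(plogis(sum(b * x)))
}

p_slide     <- predict_prob(size_z = -2, farmer = 1, edu_z = -2, male = 1)  # about 0.477
p_reference <- predict_prob(size_z = 0, farmer = 0, edu_z = 0, male = 0)    # about 0.392
p_urban     <- predict_prob(size_z = 2, farmer = 0, edu_z = 2, male = 0)

print(c(slide = p_slide, reference = p_reference, urban = p_urban))
```

The reference profile (a female non-farmer with average education in a community of average size) reduces to the intercept alone; larger communities and higher education levels both lower the probability of a non-standard form.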


A complex geographical pattern


Animation: increasing frequency for older speakers

Animation: increasing frequency for younger speakers

Results: random effects

tail(smry$s.table, 11)
#                                                   edf Ref.df   Chi.sq   p-value
# s(Speaker)                                       90.2   2005     97.7  1.45e-02
# s(Location)                                     175.2    209   5675.4  1.91e-93
# s(Concept)                                      167.0    168 437792.1  0.00e+00
# s(CommunityRecordingYear.z,Concept)             158.9    170 157471.2 4.46e-184
# s(CommunitySize.log.z,Concept)                  149.9    169  29933.0 1.70e-111
# s(CommunityAvgIncome.log.z,Concept)             158.1    170 143338.9 2.12e-160
# s(CommunityAvgAge.log.z,Concept)                154.4    170 110802.6 5.39e-196
# s(SpeakerJob_Farmer,Concept)                     85.9    169  26191.6  1.75e-07
# s(SpeakerJob_Executive_AuxiliaryWorker,Concept)  53.3    170   3325.5  6.77e-04
# s(SpeakerEduLevel.log.z,Concept)                139.1    169   9421.6  4.67e-49
# s(SpeakerIsMale,Concept)                         85.4    169 111227.6  1.00e-10
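
Each s(Concept, x, bs = "re") term estimates one slope adjustment per concept. The following is a minimal sketch of how such by-concept slopes could be extracted, using simulated data rather than the Tuscan data: the data frame sim, the 20 concept labels, the predictor EduLevel.z and all effect sizes are assumptions made for illustration. The coefficients are located via the first.para and last.para entries of the fitted smooth, which avoids relying on coefficient names.

```r
library(mgcv)

set.seed(42)
n_concepts <- 20
n <- 4000
concept_levels <- paste0("c", seq_len(n_concepts))

# Simulated stand-in data (assumption: all values are invented)
sim <- data.frame(
  Concept    = factor(sample(concept_levels, n, replace = TRUE),
                      levels = concept_levels),
  EduLevel.z = rnorm(n)
)

# Invented by-concept intercept and slope adjustments
true_intercepts <- rnorm(n_concepts, sd = 1)
true_slopes     <- rnorm(n_concepts, sd = 0.5)
idx <- as.integer(sim$Concept)
eta <- -0.3 + true_intercepts[idx] + (-0.1 + true_slopes[idx]) * sim$EduLevel.z
sim$NotStd <- rbinom(n, 1, plogis(eta))

# Random intercepts and random slopes per concept
toy <- bam(NotStd ~ EduLevel.z + s(Concept, bs = "re") +
             s(Concept, EduLevel.z, bs = "re"),
           data = sim, family = "binomial")

# The second smooth in the formula is the random-slope term
slope_smooth <- toy$smooth[[2]]
est_slopes <- coef(toy)[slope_smooth$first.para:slope_smooth$last.para]
names(est_slopes) <- levels(sim$Concept)

# Concepts with the most negative slope adjustments
head(sort(est_slopes))
```

Adding the fixed-effect slope to each adjustment gives the total effect of the predictor for that concept, which is the kind of quantity a by-concept slope plot displays.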

By-concept random slopes for community size


By-concept random slopes for speaker education level


Discussion

  • Using a generalized additive mixed-effects regression model (GAM) to investigate lexical differences between standard Italian and Tuscan dialects revealed interesting dialectal patterns
    • GAMs are very suitable to model the non-linear influence of geography
    • The regression approach allowed for the simultaneous identification of important social, geographical and lexical predictors
    • By including many concepts, results are less subjective than traditional analyses focusing on only a few pre-selected concepts
    • The mixed-effects regression approach still allows a focus on individual concepts
    • More interested in Tuscan data and analysis? Paper package with all data and analyses available via http://www.martijnwieling.nl

Recap

  • We have applied GAMs to dialectometry data and learned how to:
    • use s() to model two-dimensional interactions on the same scale
    • model complex non-linear interactions using te()
    • use GAMs to conduct logistic regression (family="binomial")
  • After the break:
    • http://www.let.rug.nl/wieling/statscourse/lecture5/lab
      • We use a subset of Dutch dialect data in the lab (faster: no logistic regression)
      • Similar underlying idea: investigate the effect of geography, word frequency, and location characteristics on pronunciation distances from standard Dutch
  • Finally: please evaluate the course:
    http://www.let.rug.nl/wieling/statscourse/evaluation

Questions?

Thank you for your attention!

http://www.martijnwieling.nl
wieling@gmail.com