Martijn Wieling

Computational Linguistics Research Group

- Introduction
- Logistic regression (recap)
- Standard Italian and Tuscan dialects
  - Material: Standard Italian and Tuscan dialects
  - Methods: `R` code
  - Results
  - Discussion
- Dependent variable is binary (1: success, 0: failure), not continuous
- Transform to continuous variable via log odds: \(\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\)
  - Done automatically in regression by setting `family="binomial"`
- Generalized linear model: specific link function and error distribution
- Interpret coefficients w.r.t. success as logits; convert a logit back to a probability in `R` with `plogis(x)`
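As a minimal sketch of the logit transformation in base `R` (the probabilities below are made-up illustration values, not data from the study): `qlogis()` computes the log odds and `plogis()` maps them back to probabilities.

```
# Made-up probabilities of "success", for illustration only
p <- c(0.1, 0.5, 0.9)

# Log odds computed by hand ...
logit <- log(p / (1 - p))

# ... agree with R's built-in logit function
all.equal(logit, qlogis(p))

# plogis() is the inverse: it maps logits back to probabilities
plogis(logit)
```

A probability of 0.5 corresponds to a logit of 0, so negative logits mean that success is less likely than failure.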

- Standard Italian originated in the 14th century as a written language
- It originated from the prestigious Florentine variety
- The spoken standard Italian language was adopted in the 20th century
- People used to speak in their local dialect

- In this study, we investigate the relationship between standard Italian and Tuscan dialects
- We focus on lexical variation
- We assess which social, geographical and lexical variables influence this relationship

- We used lexical data from the Atlante Lessicale Toscano (ALT)
- We focus on 2060 speakers from 213 locations and 170 concepts
- Total number of cases: 384,454
- Dependent variable
  - 1: lexical form was different from standard Italian
  - 0: lexical form was identical to standard Italian

- Speaker age
- Speaker gender
- Speaker education level
- Speaker employment history
- Number of inhabitants in each location
- Average income in each location
- Average age in each location
- Frequency of each concept
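Several predictor names in the model code further on carry a `.log.z` suffix (for example `CommunitySize.log.z`). A plausible reading, which is an assumption here and not something the slides state, is that such predictors were log-transformed and then z-scored. A minimal sketch with made-up population sizes:

```
# Hypothetical numbers of inhabitants for five locations (not from the ALT data)
community_size <- c(350, 1200, 5400, 23000, 380000)

# Log-transform to reduce the skew of the raw counts
log_size <- log(community_size)

# z-score: subtract the mean and divide by the standard deviation
community_size_log_z <- as.numeric(scale(log_size))

round(community_size_log_z, 2)
```

After z-scoring, a value of -2 means two standard deviations below the mean, which is how the worked probability example later in these slides uses it.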

`mgcv` version 1.8.16:

```
library(mgcv)  # provides bam()
geo <- bam(NotStd ~ s(Lon, Lat, k = 30), data = tuscan, family = "binomial", discrete = TRUE)
summary(geo)
```

```
#
# Family: binomial
# Link function: logit
#
# Formula:
# NotStd ~ s(Lon, Lat, k = 30)
#
# Parametric coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.2474 0.0033 -75.1 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Approximate significance of smooth terms:
# edf Ref.df Chi.sq p-value
# s(Lon,Lat) 28.2 29 1591 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# R-sq.(adj) = 0.0042 Deviance explained = 0.312%
# fREML = 6.1609e+05 Scale est. = 1 n = 384454
```

```
library(itsadug)  # provides fvisgam()
fvisgam(geo, view = c("Lon", "Lat"), too.far = 0.045, main = "", rm.ranef = TRUE)
```

```
geo2 <- bam(NotStd ~ s(km.e, Lat), data = tuscan, family = "binomial", discrete = TRUE)
fvisgam(geo2, view = c("km.e", "Lat"), too.far = 0.045, main = "", rm.ranef = TRUE)
```

```
geo3 <- bam(NotStd ~ te(km.e, Lat, k = c(6, 6)), data = tuscan, family = "binomial", discrete = TRUE)
fvisgam(geo3, view = c("km.e", "Lat"), too.far = 0.045, main = "", rm.ranef = TRUE)
```
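The three geography models above differ only in how the two-dimensional smooth is specified: `s()` fits a single isotropic smooth, whereas `te()` builds a tensor product from one marginal basis per variable. The sketch below fits both forms to simulated data, so that the comparison can be run without the `tuscan` data set. The data frame `toy`, its columns and the simulated effect are all made up for illustration.

```
library(mgcv)  # bam(), s(), te()

set.seed(42)
n <- 3000

# Simulated coordinates and a made-up non-linear spatial effect on the logit scale
toy <- data.frame(x = runif(n), y = runif(n))
eta <- -0.25 + sin(2 * pi * toy$x) * cos(pi * toy$y)
toy$NotStd <- rbinom(n, 1, plogis(eta))

# Isotropic smooth: one smoothing parameter shared by both dimensions
m_s <- bam(NotStd ~ s(x, y, k = 30), data = toy,
           family = "binomial", discrete = TRUE)

# Tensor product: separate marginal bases (and penalties) for x and y
m_te <- bam(NotStd ~ te(x, y, k = c(6, 6)), data = toy,
            family = "binomial", discrete = TRUE)

# Compare the two fits
AIC(m_s, m_te)
```

Which form fits better depends on the data; the tensor product is the safer choice when the two variables are not measured on the same scale.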

- Wieling, Nerbonne and Baayen (2011) showed that the effect of word frequency varied depending on geography
- Here we explicitly include this in the GAM with `te()`, which can model an \(N\)-way non-linear interaction:

  `te(Lon, Lat, ConceptFreq, d=c(2,1))`

- As this pattern may be presumed to differ depending on speaker age, we can integrate this in the model as well:

  `te(Lon, Lat, ConceptFreq, YearBirth, d=c(2,1,1))`
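In `te()`, the `d` argument gives the dimension of each marginal basis: `d=c(2,1)` combines one two-dimensional marginal (here longitude and latitude together) with a one-dimensional marginal for the third variable. The following sketch fits such a three-way interaction to simulated data. The data frame `toy`, the column `Freq.z` and the simulated effect are invented for illustration and do not reproduce the Tuscan analysis.

```
library(mgcv)  # bam(), te()

set.seed(7)
n <- 4000

# Simulated coordinates and a z-scored "frequency" covariate
toy <- data.frame(Lon = runif(n), Lat = runif(n), Freq.z = rnorm(n))

# Made-up effect: the geographical pattern changes with frequency
eta <- -0.2 + 0.8 * toy$Freq.z * sin(2 * pi * toy$Lon) * cos(pi * toy$Lat)
toy$NotStd <- rbinom(n, 1, plogis(eta))

# d = c(2, 1): Lon and Lat share one 2-d marginal, Freq.z gets a 1-d marginal
m_toy <- bam(NotStd ~ te(Lon, Lat, Freq.z, d = c(2, 1)),
             data = toy, family = "binomial", discrete = TRUE)

# One row for the single three-way tensor product smooth
summary(m_toy)$s.table
```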

```
system.time(
  m <- bam(NotStd ~
             # four-way interaction: geography x concept frequency x speaker birth year
             te(Lon, Lat, ConceptFreq.log.z, SpeakerBirthYear.z, d = c(2, 1, 1)) +
             CommunitySize.log.z + SpeakerJob_Farmer + SpeakerEduLevel.log.z + SpeakerIsMale +
             # random intercepts
             s(Speaker, bs = "re") + s(Location, bs = "re") + s(Concept, bs = "re") +
             # by-concept random slopes
             s(Concept, CommunityRecordingYear.z, bs = "re") + s(Concept, CommunitySize.log.z, bs = "re") +
             s(Concept, CommunityAvgIncome.log.z, bs = "re") + s(Concept, CommunityAvgAge.log.z, bs = "re") +
             s(Concept, SpeakerJob_Farmer, bs = "re") + s(Concept, SpeakerJob_Executive_AuxiliaryWorker, bs = "re") +
             s(Concept, SpeakerEduLevel.log.z, bs = "re") + s(Concept, SpeakerIsMale, bs = "re"),
           data = tuscan, family = "binomial", discrete = TRUE, nthreads = 4)
)
```

```
# user system elapsed
# 2322.5 21.1 701.3
```

```
smry <- summary(m) # takes 10 minutes to calculate
```

- The results will be discussed next... (Wieling et al., 2014, *Language*)

```
smry$p.table
```

```
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.4282 0.1264 -3.39 7.08e-04
# CommunitySize.log.z -0.0629 0.0223 -2.82 4.87e-03
# SpeakerJob_Farmer 0.0449 0.0169 2.66 7.81e-03
# SpeakerEduLevel.log.z -0.0678 0.0126 -5.38 7.29e-08
# SpeakerIsMale 0.0378 0.0128 2.95 3.18e-03
```
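The estimates in this table are on the logit scale. One common way to read them, sketched here with the rounded values copied from the table above, is to exponentiate them into odds ratios: values above 1 indicate higher odds of a non-standard form, values below 1 lower odds. The vector name `est` is introduced only for this illustration.

```
# Rounded estimates copied from smry$p.table (intercept omitted)
est <- c(CommunitySize.log.z   = -0.0629,
         SpeakerJob_Farmer     =  0.0449,
         SpeakerEduLevel.log.z = -0.0678,
         SpeakerIsMale         =  0.0378)

# Exponentiating a logit coefficient gives the multiplicative change in the odds
odds_ratio <- exp(est)

round(odds_ratio, 3)
```

For the z-scored predictors the odds ratio refers to an increase of one standard deviation; for the binary predictors it compares the two groups.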

```
head(smry$s.table, 1)
```

```
# edf Ref.df Chi.sq p-value
# te(SpeakerBirthYear.z,ConceptFreq.log.z,Lon,Lat) 221 265 3270 0
```

```
# chance for a male farmer in a
# very small village (z-scored
# population size = -2) for
# which the location is unknown
# with a very low education
# level (z-score = -2) to use a
# non-standard lexical form
(logit <- coef(m)["(Intercept)"] +
coef(m)["SpeakerIsMale"] +
coef(m)["SpeakerJob_Farmer"] +
-2 * coef(m)["CommunitySize.log.z"] +
-2 * coef(m)["SpeakerEduLevel.log.z"])
```

```
# (Intercept)
# -0.0841
```

```
plogis(logit) # was: 0.438 (43.8%)
```

```
# (Intercept)
# 0.479
```
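The two outputs above can be checked by hand with the rounded estimates from `smry$p.table`. The geographical smooth is left out because the location is taken to be unknown.

\[
\begin{aligned}
\text{logit} &= -0.4282 + 0.0378 + 0.0449 + (-2)(-0.0629) + (-2)(-0.0678) \\
             &= -0.4282 + 0.0378 + 0.0449 + 0.1258 + 0.1356 \\
             &= -0.0841 \\[4pt]
p &= \frac{1}{1 + e^{-\text{logit}}} = \frac{1}{1 + e^{0.0841}} \approx 0.479
\end{aligned}
\]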

```
tail(smry$s.table, 11) # last 11 smooths are ranefs
```

```
# edf Ref.df Chi.sq p-value
# s(Speaker) 97.1 2005 106 9.40e-03
# s(Location) 175.0 209 5642 1.33e-96
# s(Concept) 167.0 168 436864 0.00e+00
# s(CommunityRecordingYear.z,Concept) 158.9 170 155893 4.88e-181
# s(CommunitySize.log.z,Concept) 149.9 169 29991 2.41e-111
# s(CommunityAvgIncome.log.z,Concept) 158.0 170 143207 1.75e-160
# s(CommunityAvgAge.log.z,Concept) 154.4 170 110722 5.80e-195
# s(SpeakerJob_Farmer,Concept) 86.1 169 26572 1.27e-07
# s(SpeakerJob_Executive_AuxiliaryWorker,Concept) 53.3 170 3319 8.07e-04
# s(SpeakerEduLevel.log.z,Concept) 139.1 169 9377 8.05e-49
# s(SpeakerIsMale,Concept) 85.4 169 112400 6.55e-11
```
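In `mgcv`, `s(Concept, bs="re")` specifies a random intercept per concept and `s(Concept, x, bs="re")` a by-concept random slope for `x`. The sketch below simulates data with both kinds of variation and recovers them. The data frame `d`, the column `Edu.z` and all effect sizes are made up for illustration.

```
library(mgcv)  # bam(), s(..., bs = "re")

set.seed(1)
n_concepts <- 20
n_per_concept <- 100

# The grouping variable must be a factor for bs = "re"
d <- data.frame(
  Concept = factor(rep(seq_len(n_concepts), each = n_per_concept)),
  Edu.z   = rnorm(n_concepts * n_per_concept)
)

# Made-up concept-specific intercepts and slopes on the logit scale
b0 <- rnorm(n_concepts, sd = 1)
b1 <- rnorm(n_concepts, sd = 0.5)
idx <- as.integer(d$Concept)
eta <- -0.3 + b0[idx] + (-0.1 + b1[idx]) * d$Edu.z
d$NotStd <- rbinom(nrow(d), 1, plogis(eta))

m_re <- bam(NotStd ~ Edu.z +
              s(Concept, bs = "re") +         # random intercept per concept
              s(Concept, Edu.z, bs = "re"),   # by-concept random slope for Edu.z
            data = d, family = "binomial", discrete = TRUE)

# One row per random-effect term, as in the table above
summary(m_re)$s.table
```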

- Comparing Tuscan dialects to standard Italian revealed interesting dialectal patterns
- GAMs are well suited to modeling the non-linear influence of geography
- The regression approach allowed for the simultaneous identification of important social, geographical and lexical predictors
- By including many concepts, results are less subjective than traditional analyses focusing on only a few pre-selected concepts
- The mixed-effects regression approach still allows a focus on individual concepts
- More interested in Tuscan data and analysis? Paper package with all data and analyses available via http://www.martijnwieling.nl

- We have applied GAMs to dialectometry data and learned how to:
  - use `s()` to model two-dimensional interactions on the same scale
  - model complex non-linear interactions using `te()`
  - use GAMs to conduct logistic regression (`family="binomial"`)
- After the break:
  - http://www.let.rug.nl/wieling/statscourse/lecture5/lab
  - We use a subset of Dutch dialect data in the lab (faster: no logistic regression)
  - Similar underlying idea: investigate the effect of geography, word frequency, and location characteristics on pronunciation distances from standard Dutch
- Finally: please fill in the evaluation form of the course: http://www.let.rug.nl/wieling/statscourse/evaluation

Thank you for your attention!