1 Introduction

In this lab session, we will look at Dutch dialect data. Make sure you have the most recent version of R 64 bits installed.

You can download the R markdown file here: lab.Rmd, add your answers to this file, and generate the HTML report (see also the final section).

2 Initialization

Load the required packages and download the required files

# test if the necessary packages are installed if not, install them
packages <- c("mgcv", "itsadug", "car")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
    install.packages(setdiff(packages, rownames(installed.packages())))
}

# load the packages
library(mgcv)
library(itsadug)
library(car)

# download and load the following files containing the dataset and some functions:
download.file("http://www.let.rug.nl/wieling/Statistics/GAM-Dialectology/lab/functions.R",
    "functions.R")
download.file("http://www.let.rug.nl/wieling/Statistics/GAM-Dialectology/lab/dialectdata.rda",
    "dialectdata.rda")

source("functions.R")  # custom plotting functions
load("dialectdata.rda")  # Dutch dialect data, subset of 48 words
# display version information
R.version.string
packageVersion("mgcv")
packageVersion("itsadug")
packageVersion("car")

3 Data investigation

Look at the data to discover its structure. Note that predictors were standardized (z-transformed) and log-transformed (if appropriate). The dependent variable (LD) was centered. To z-transform a variable, you can use scale(dat$predictor). To center a variable, you can use scale(dat$predictor,scale=F). To log-transform you can use log(dat$predictor). You can check if log-transorming is necessary by plotting the density of the predictor with plot(density(dat$predictor)).

3.1 Exercise 1

Investigate the data yourself.

# your code goes here

4 Generalized additive model

We first create a simple generalized additive model in which we assess the influence of geography (i.e. the interaction between Longitude and Latitude). As Longitude and Latitude are on the same scale, we can use s().

model1 <- bam(LD ~ s(Longitude, Latitude), data = dialectdata)
summary(model1)

4.1 Visualization

To visualize the partial effect of geography (i.e. excluding all other effect in the model summary: in this case only the non-significant intercept), we can use pvisgam.

pvisgam(model1, view=c('Longitude','Latitude'), select=1, too.far=0.05, add.color.legend=F)

To visualize the complete effect of geography, we can use vis.gam of the mgcv package, or fvisgam of the itsadug package (the advantage of the latter is that it allows one to exclude the random effects.

par(mfrow=c(1,2))
vis.gam(model1, plot.type="contour", too.far=0.05, view=c("Longitude","Latitude"), 
        main="vis.gam")
fvisgam(model1, plot.type="contour", too.far=0.05, view=c("Longitude","Latitude"), 
        main="fvisgam",add.color.legend=F)

4.2 A note about basis functions

In the course we have talked a bit about basis functions. When we use s() to model non-linearities, the underlying (default) basis function is a thin plate regression spline (i.e. bs='tp'). This is generally suitable, but there are different basis functions, which are frequently quicker to compute (e.g., a cubic spline basis: bs='cr'). For a tensor product spline te(), the default basis is bs='cr'. See https://www.rdocumentation.org/packages/mgcv/versions/1.8-33/topics/smooth.terms for more information. Also note that the default maximum complexity of the smooth differs between te() (where it is 5 per dimension) and s() (where it is about 10 per dimension).

5 Three-dimensional GAM

We now try to model the interaction between longitude, latitude and word frequency.

model2 <- bam(LD ~ te(Longitude,Latitude,WordFreq,d=c(2,1)), data=dialectdata, discrete=T)
summary(model2)

We can assess if model1 is better than model2 by refitting using ML and using compareML.

model1.ml <- bam(LD ~ s(Longitude,Latitude), data=dialectdata, method='ML')
model2.ml <- bam(LD ~ te(Longitude,Latitude,WordFreq,d=c(2,1)), data=dialectdata, method='ML')
compareML(model1.ml,model2.ml)

5.1 Visualization

We use fvisgam to visualize the geographical pattern for low frequency words (WordFreq = -2), mean frequency words (WordFreq = 0), and high frequency words (WordFreq = 2).

par(mfrow = c(1, 3))
fvisgam(model2, too.far = 0.05, cond = list(WordFreq = -2), view = c("Longitude", "Latitude"),
    main = "Effect of Geography for LF words", add.color.legend = F)
fvisgam(model2, too.far = 0.05, cond = list(WordFreq = 0), view = c("Longitude", "Latitude"),
    main = "Effect of Geography for MF words", add.color.legend = F)
fvisgam(model2, too.far = 0.05, cond = list(WordFreq = +2), view = c("Longitude", "Latitude"),
    main = "Effect of Geography for HF words", add.color.legend = F)

5.2 Exercise 2

Create a model (model2b) to investigate the varying effect of geography only depending on population age (PopAge). Assess if the model is better than model1. Also visualize the results of the model.

# your code goes here

6 Random-effects structure

We first add a random intercept for word to model2.

model3 <- bam(LD ~ te(Longitude,Latitude,WordFreq,d=c(2,1)) + s(Word,bs="re"), 
              data=dialectdata, discrete=T)
summary(model3)

The summary immediately shows that the random intercept for word is necessary, as it is significant.

6.1 Random intercept visualization

We can visualize the random intercepts as follows.

plotIntercept.gam(model3, dialectdata, "Word")

6.2 Adding a random slope

We now add a fixed-effect predictor population count (PopCnt) and assess if it needs to be included as a by-word random slope as well.

model4 <- bam(LD ~ te(Longitude, Latitude, WordFreq, d = c(2, 1)) + PopCnt + s(Word, bs = "re") +
    s(Word, PopCnt, bs = "re"), data = dialectdata, discrete = T)
summary(model4)

The summary shows that the predictor is needed as a fixed-effect predictor, but not as a by-word random slope.

6.3 Random slope visualization

We can visualize the random slopes (in this case there is relatively little variability, as can be seen from the scale) as follows.

plotSlope.gam(model4, dialectdata, "Word", "PopCnt")

6.4 Exercise 3

Add a random intercept for location and and visualize the results. Subsequently add a by-location random slope for WordFrequency. Is this random slope significant? Why do you think this is the case?

# your code goes here

Answer: …

7 Final remarks

  • Note that the (geographical) results are different than reported in the paper. This is due to the relatively small dataset (containing less than 10% of the locations).
  • The original dataset and all analyses can be found in the paper package
  • For the Tuscan data presented during the lecture, these can be found in another paper package

8 Replication

To generate the output of the analysis presented above, first install Pandoc. Then copy the following lines to the most recent version of R. (Or use the Knit HTML button when viewing the file lab.Rmd in RStudio.)

# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
    install.packages("rmarkdown")
}
library(rmarkdown)  # load rmarkdown package

# download original file if not already exists (to prevent overwriting)
if (!file.exists("lab.Rmd")) {
    download.file("http://www.let.rug.nl/wieling/Statistics/GAM-Dialectology/lab/lab.Rmd",
        "lab.Rmd")
}

# generate output
render("lab.Rmd")  # generates html file with results

# view output in browser
browseURL(paste("file://", file.path(getwd(), "lab.html"), sep = ""))  # shows result