In this lab session, we will look at Dutch dialect data. Make sure you have the most recent version of R
64 bits installed.
You can download the R markdown file here: lab5.Rmd, add your answers to this file, and generate the HTML report (see also the final section).
Load the required packages and download the required files
# test if the necessary packages are installed if not, install them
packages <- c("mgcv", "itsadug", "car")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
install.packages(setdiff(packages, rownames(installed.packages())), repos="http://cran.us.r-project.org")
}
# load the packages
library(mgcv)
library(itsadug)
library(car)
# download and load the following files containing the dataset and some functions:
download.file("http://www.let.rug.nl/wieling/statscourse/lecture5/lab/functions.R", "functions.R")
download.file("http://www.let.rug.nl/wieling/statscourse/lecture5/lab/dialectdata.rda", "dialectdata.rda")
source("functions.R") # custom plotting functions
load("dialectdata.rda") # Dutch dialect data, subset of 48 words
# display version information
R.version.string
packageVersion("mgcv")
packageVersion("itsadug")
packageVersion("car")
Look at the data to discover its structure. Note that predictors were standardized (z-transformed) and log-transformed (if appropriate). The dependent variable (LD
) was centered. To z-transform a variable, you can use scale(dat$predictor)
. To center a variable, you can use scale(dat$predictor,scale=F)
. To log-transform you can use log(dat$predictor)
. You can check if log-transorming is necessary by plotting the density of the predictor with plot(density(dat$predictor))
.
Investigate the data yourself.
# your code goes here
We first create a simple generalized additive model in which we assess the influence of geography (i.e. the interaction between Longitude
and Latitude
). As Longitude
and Latitude
are on the same scale, we can use s()
.
model1 <- bam(LD ~ s(Longitude, Latitude), data = dialectdata)
summary(model1)
To visualize the partial effect of geography (i.e. excluding all other effect in the model summary: in this case only the non-significant intercept), we can use pvisgam
pvisgam(model1, view = c("Longitude", "Latitude"), select = 1, color = "topo", too.far = 0.05)
To visualize the complete effect of geography, we can use vis.gam
of the mgcv
package, or fvisgam
of the itsadug
package (the advantage of the latter is that it allows one to exclude the random effects.
par(mfrow = c(1, 2))
vis.gam(model1, plot.type = "contour", color = "topo", too.far = 0.05, view = c("Longitude",
"Latitude"), main = "vis.gam")
fvisgam(model1, plot.type = "contour", color = "topo", too.far = 0.05, view = c("Longitude",
"Latitude"), main = "fvisgam")
In the course, we have not talked about basis functions. When we use s()
to model non-linearities, the underlying (default) basis function is a thin plate regression spline (i.e. bs='tp'
). This is generally suitable, but there are different basis functions, which are frequently quicker to compute (e.g., a cubic spline basis: bs='cr'
). For a tensor product spline te()
, the default basis is bs='cr'
. See http://www.inside-r.org/r-doc/mgcv/smooth.terms for more information. Also note that the default maximum complexity of the smooth differs between te()
(where it is 5 per dimension) and s()
(where it is about 10 per dimension).
We now try to model the interaction between longitude, latitude and word frequency.
model2 <- bam(LD ~ te(Longitude, Latitude, WordFreq, d = c(2, 1)), data = dialectdata, discrete = T)
summary(model2)
We can assess if model1
is better than model2
by refitting using ML
and using compareML
.
model1.ml <- bam(LD ~ s(Longitude, Latitude), data = dialectdata, method = "ML")
model2.ml <- bam(LD ~ te(Longitude, Latitude, WordFreq, d = c(2, 1)), data = dialectdata, method = "ML")
compareML(model1.ml, model2.ml)
We use fvisgam to visualize the geographical pattern for low frequency words (WordFreq = -2
), mean frequency words (WordFreq = 0
), and high frequency words (WordFreq = 2
).
par(mfrow = c(1, 3))
fvisgam(model2, color = "topo", too.far = 0.05, cond = list(WordFreq = -2), view = c("Longitude",
"Latitude"), main = "Effect of Geography for LF words")
fvisgam(model2, color = "topo", too.far = 0.05, cond = list(WordFreq = 0), view = c("Longitude",
"Latitude"), main = "Effect of Geography for MF words")
fvisgam(model2, color = "topo", too.far = 0.05, cond = list(WordFreq = +2), view = c("Longitude",
"Latitude"), main = "Effect of Geography for HF words")
Create a model (model2b
) to investigate the varying effect of geography only depending on population age (PopAge
). Assess if the model is better than model1
. Also visualize the results of the model.
# your code goes here
We first add a random intercept for word to model2
.
model3 <- bam(LD ~ te(Longitude, Latitude, WordFreq, d = c(2, 1)) + s(Word, bs = "re"), data = dialectdata,
discrete = T)
summary(model3)
The summary immediately shows that the random intercept for word is necessary, as it is significant.
We can visualize the random intercepts as follows.
plotIntercept.gam(model3, dialectdata, "Word")
We now add a fixed-effect predictor population count (PopCnt
) and assess if it needs to be included as a by-word random slope as well.
model4 <- bam(LD ~ te(Longitude, Latitude, WordFreq, d = c(2, 1)) + PopCnt + s(Word, bs = "re") +
s(Word, PopCnt, bs = "re"), data = dialectdata, discrete = T)
summary(model4)
The summary shows that the predictor is needed as a fixed-effect predictor as well as a by-word random slope.
We can visualize the random slopes as follows.
plotSlope.gam(model4, dialectdata, "Word", "PopCnt")
Add a random intercept for location and and visualize the results. Subsequently add a by-location random slope for WordFrequency. Is this random slope significant? Why do you think this is the case?
# your code goes here
Answer: …
To generate the output of the analysis presented above, first install Pandoc. Then copy the following lines to the most recent version of R. (Or use the Knit HTML
button when viewing the file lab5.Rmd
in RStudio.)
# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
install.packages("rmarkdown", repos="http://cran.us.r-project.org")
}
library(rmarkdown) # load rmarkdown package
# download original file if not already exists (to prevent overwriting)
if (!file.exists("lab5.Rmd")) {
download.file("http://www.let.rug.nl/wieling/statscourse/lecture5/lab/lab5.Rmd", "lab5.Rmd")
}
# generate output
render("lab5.Rmd") # generates html file with results
# view output in browser
browseURL(paste("file://", file.path(getwd(), "lab5.html"), sep = "")) # shows result