1 Introduction

During the lecture we have looked at the effect of Correctness and the effect of Age of Arrival. In this lab session we will look at another additional factor: Structure.

Research question: Is there an effect of the distance between determiner and noun in the violation?

We compare the DN (determiner-noun; e.g., der Garten/*das Garten) structure with the DAN (determiner-adjective-noun; e.g., das frische Gras/*der frische Gras) structure and only use a subset of speakers to enable a relatively fast computation.

You can download the R markdown file here: lab.Rmd, add your answers to this file, and generate the HTML report (see also the final section).

2 Initialization

Load the required packages and download the required files.

# test if the required packages are installed
# if not, install them
packages <- c("mgcv","car","itsadug")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())),
                   repos='http://cran.us.r-project.org')  
}

# load the packages
library(mgcv) 
library(itsadug) 
library(car) 

# download and load the following file containing the dataset:
download.file('http://www.let.rug.nl/wieling/Statistics/GAM-EEG/lab/dat.rda', 'dat.rda')
load('dat.rda') # This dataset contains a subset of 15 subjects
# display version information
R.version.string
packageVersion("mgcv")
packageVersion("car")
packageVersion("itsadug")

3 Data investigation

Investigate the data.

head(dat)
str(dat)
summary(dat)
dim(dat)

Note that the data needs to be sorted to be able to correct for autocorrelation. In this case, the data is already sorted, and a column start.event is present.

4 Effect of structure

We would like to know whether the P600 effect is dependent on the structure of the sentence (Determiner-Noun: DN vs. Determiner-Adjective-Noun: DAN). Remember that the size of the P600 is determined by the difference between incorrect and correct. Consequently, we are interested in assessing if that difference varies between the levels of Structure, and we will make a model where we look at the interaction between Correctness and Structure. Note that to prevent lengthy computations during this lab session, we will not take into account factor smooths per word.

dat$CorStruct <- interaction(dat$Correctness, dat$Structure)  # create a new 4-level variable
levels(dat$CorStruct)  # show levels
rhoval <- 0.93  # rho set to fixed value for this lab session

system.time(m1 <- bam(uV ~ s(Time, by = CorStruct) + CorStruct + s(Time, Subject,
    by = CorStruct, bs = "fs", m = 1), data = dat, rho = rhoval, AR.start = dat$start.event,
    discrete = T, nthreads = 2))
summary(m1)

We first visualize the smooths.

# plot the separate smooths
plot_smooth(m1, "Time", plot_all = "CorStruct", rug = F)

And subsequently visualize the two difference curves. Note that these are overly conservative, however.

# plot the (overly conservative) differences between correct and incorrect for the two cases 
# (DN and DAN)
par(mfrow=c(1,2))
plot_diff(m1, "Time", comp=list(CorStruct=c("incor.DAN","cor.DAN")), ylim=c(-5,10))
plot_diff(m1, "Time", comp=list(CorStruct=c("incor.DN","cor.DN")), ylim=c(-5,10))

5 Modeling differences

To model the difference curves directly (and not overly conservative), we can follow two approaches. The first one uses two binary curves (since we’d like to know the difference between correct and incorrect for the DN case and for the DAN case), the second approach is to use two ordered factors.

5.1 Binary difference curves

For binary curves, we first create two binary predictors to mark the incorrect cases as 1.

dat$IsIncorrectDN <- (dat$Correctness == "incor" & dat$Structure == "DN") *
    1
dat$IsIncorrectDAN <- (dat$Correctness == "incor" & dat$Structure == "DAN") *
    1

Now we can create the binary curve model as follows:

m1b <- bam(uV ~ s(Time, by = Structure) + Structure + s(Time, by = IsIncorrectDN) +
    s(Time, by = IsIncorrectDAN) + s(Time, Subject, by = Structure, bs = "fs",
    m = 1) + s(Time, Subject, by = IsIncorrectDN, bs = "fs", m = 1) + s(Time,
    Subject, by = IsIncorrectDAN, bs = "fs", m = 1), data = dat, rho = rhoval,
    AR.start = dat$start.event, discrete = T, nthreads = 2)
summary(m1b)

The binary curves will be 0 for the correct cases, so s(Time, by=Structure) models the two non-linear patterns for correct DN and correct DAN. The binary curves represent the two differences for DN and DAN, respectively.

To interpret the results, visualization is necessary. It is clear these differences are not centered and therefore contain the intercept difference. Also note that these difference curves are less conservative (i.e. have smaller confidence bands) than those of model m1. While Sóskuthy (2021) uses ordered factor difference curves (dubbed random reference/difference smooths) instead of binary difference curves in the random effects, effectively these are the same (see also 5.2.1, below), as the random-effect ordered factor smooths are not centered.

par(mfrow=c(1,2))
plot(m1b, select = 4, shade = T, rug = F, main = "Binary difference curve DAN",ylim=c(-5,10))
abline(h=0)
plot(m1b, select = 3, shade = T, rug = F, main = "Binary difference curve DN", ylim=c(-5,10))
abline(h=0)

5.2 Ordered factors

To separate the intercept differences from the non-linear differences, we can use ordered factors. If we would make CorStruct into an ordered factor, we would only have a single reference level, to which all other three levels would be compared against (similar as a factor included in the parametric part of the model summary). Given that we want to compare the incorrect cases to the correct cases, we therefore convert the binary predictors to two ordered factors.

dat$IsIncorrectDNO <- as.ordered(dat$IsIncorrectDN)
contrasts(dat$IsIncorrectDNO) <- "contr.treatment"
dat$IsIncorrectDANO <- as.ordered(dat$IsIncorrectDAN)
contrasts(dat$IsIncorrectDANO) <- "contr.treatment"

Subsequently, we create the ordered factor model as follows (similar to the binary curve model, only now including the separate intercept differences):

m1o <- bam(uV ~ s(Time, by = Structure) + Structure + s(Time, by = IsIncorrectDNO) +
    IsIncorrectDNO + s(Time, by = IsIncorrectDANO) + IsIncorrectDANO + s(Time,
    Subject, by = Structure, bs = "fs", m = 1) + s(Time, Subject, by = IsIncorrectDNO,
    bs = "fs", m = 1) + s(Time, Subject, by = IsIncorrectDANO, bs = "fs", m = 1),
    data = dat, rho = rhoval, AR.start = dat$start.event, discrete = T, nthreads = 2)
summary(m1o)

To interpret the results, we visualize the difference curves. These ordered factor difference smooths are always centered, and the summary contains the fixed-effect intercept differences (compared to a single reference level).

par(mfrow=c(1,2))
plot(m1o, select = 4, shade = T, rug = F, main = "Binary difference curve DAN", ylim=c(-5,10))
abline(h=0)
plot(m1o, select = 3, shade = T, rug = F, main = "Binary difference curve DN", ylim=c(-5,10))
abline(h=0)

5.2.1 Note on random effects

Earlier we indicated that using binary difference curves in the random effects is equal to using ordered factors. We can validate this by replacing the binary difference curves in the random effects by ordered factor difference curves (exclusively) in the random effects.

m1x <- bam(uV ~ s(Time, by = Structure) + Structure + s(Time, by = IsIncorrectDN) +
    s(Time, by = IsIncorrectDAN) + s(Time, Subject, by = Structure, bs = "fs",
    m = 1) + s(Time, Subject, by = IsIncorrectDNO, bs = "fs", m = 1) + s(Time,
    Subject, by = IsIncorrectDANO, bs = "fs", m = 1), data = dat, rho = rhoval,
    AR.start = dat$start.event, discrete = T, nthreads = 2)
summary(m1x)
AIC(m1b)
AIC(m1x)  # identical

par(mfrow = c(2, 2))
plot(m1b, select = 4, shade = T, rug = F, main = "Diff. curve DAN (binary random effects)",
    ylim = c(-5, 10))
abline(h = 0)
plot(m1b, select = 3, shade = T, rug = F, main = "Diff. curve DN (binary random effects)",
    ylim = c(-5, 10))
abline(h = 0)
plot(m1x, select = 4, shade = T, rug = F, main = "Diff. curve DAN (ord. fac. random effects)",
    ylim = c(-5, 10))
abline(h = 0)
plot(m1x, select = 3, shade = T, rug = F, main = "Diff. curve DN (ord. fac. random effects)",
    ylim = c(-5, 10))
abline(h = 0)

5.3 Exercise 1

For this exercise, please assess if the distinction between DAN and DN is necessary. Rather than comparing two models, use a binary curve model to assess if the difference between cor and incor is different for structure DAN than for structure DN. The idea is to modify model m1b in such a way that it contains one binary variable equal to 1 for all incorrect cases, and another binary variable (which already exists) which is only equal to 1 for the incorrect DAN cases. By including this latter variable, the first binary difference curve will represent the difference between correct and incorrect for DN and the binary difference curve including the latter variable represents how the first binary difference curve needs to change to represent the difference between correct and incorrect for DAN. If this binary difference curve is significant, the differences between correct and incorrect for DAN vs DN differ. If it is not significant, a single difference curve can be used for both DN and DAN. Hint: s(Time, by = Structure) + Structure + s(Time, by = IsIncorrect) + s(Time, by = IsIncorrectDAN) is part of the model specification. Extend this with the random-effects structure and determine how to compute the IsIncorrect predictor.

The reason we are using this binary curve approach, rather than comparing two models, is that fitting an ERP model with method = "ML" takes a very long time (due to the option discrete=T not being available), and the \(p\)-values of the model summary are more reliable than model comparison. While we could fit the models with select=T the above option is more precise. After fitting the model, visualize the difference curves.

# your code goes here

Answer: …

6 Age of arrival effect

In the following, we will investigate how the DN vs. DAN patterns vary depending on the AoArr of the participants. Note that we only use a subset of the data here (N = 15), so there are only a few different values of AoArr.

m2 <- bam(uV ~ te(Time, AoArr, by = CorStruct) + CorStruct + s(Time, Subject,
    by = CorStruct, bs = "fs", m = 1), data = dat, rho = rhoval, AR.start = dat$start.event,
    discrete = T, nthreads = 2)
summary(m2)

And we plot the (overly conservative) differences between correct and incorrect for the two cases:

par(mfrow = c(1, 2))
plot_diff2(m2, view = c("Time", "AoArr"), comp = list(CorStruct = c("incor.DAN",
    "cor.DAN")), add.color.legend = F)
fadeRug(dat$Time, dat$AoArr)  # make areas lighter where there is no data
plot_diff2(m2, view = c("Time", "AoArr"), comp = list(CorStruct = c("incor.DN",
    "cor.DN")), add.color.legend = F)
fadeRug(dat$Time, dat$AoArr)  # make areas lighter where there is no data

6.1 Binary curves

Using binary curves, we can directly assess if the difference between cor and incor is necessary for both structures.

m2b <- bam(uV ~ te(Time, AoArr, by = Structure) + Structure + te(Time, AoArr,
    by = IsIncorrectDAN) + te(Time, AoArr, by = IsIncorrectDN) + s(Time, Subject,
    by = Structure, bs = "fs", m = 1) + s(Time, Subject, by = IsIncorrectDAN,
    bs = "fs", m = 1) + s(Time, Subject, by = IsIncorrectDN, bs = "fs", m = 1),
    data = dat, rho = rhoval, AR.start = dat$start.event, discrete = T, nthreads = 2)
summary(m2b)  # both difference surfaces are necessary

6.2 Exercise 2

For this exercise, please assess if the differences are constant (i.e. only an intercept difference) or non-linear. Hint: use ordered factors instead of binary curves (don’t forget the intercept differences).

# your code goes here

Answer: …

6.3 Exercise 3 (optional)

For this optional (difficult) exercise, decompose the te() (each te() can be decomposed into two s()’s and one ti()) and assess which separate parts are necessary. As before, we do not use model comparison, but decompose the model using difference curves (see Exercise 1) and drop insignificant terms. Hint: as binary predictors may only occur in a model once (otherwise the model does not know in which term to put the constant intercept difference) and the decomposition yields three terms, you need to use ordered factors (see Exercise 2).

# your code goes here

7 Replication

To generate the output of the analysis presented above, first install Pandoc. Then copy the following lines to the most recent version of R. (Or use the Knit HTML button when viewing the file lab.Rmd in RStudio.)

# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
    install.packages("rmarkdown", repos = "http://cran.us.r-project.org")
}
library(rmarkdown)  # load rmarkdown package

# download original file if not already exists (to prevent overwriting)
if (!file.exists("lab.Rmd")) {
    download.file("http://www.let.rug.nl/wieling/Statistics/GAM-EEG/lab/lab.Rmd",
        "lab.Rmd")
}

# generate output
render("lab.Rmd")  # generates html file with results

# view output in browser
browseURL(paste("file://", file.path(getwd(), "lab.html"), sep = ""))  # shows result