Introduction
Initialization
Data investigation
Assessing the effect of structure
- Exercise 1
Assessing the effect of age of arrival
- Exercise 2
Model criticism
Scaled-t analysis
Other options for GAMs not covered in the course
Examples of using GAMs in publications
Replication

Introduction

During the lecture we have looked at the effect of Correctness and the effect of Age of Arrival. In this lab session we will look at another additional factor: Structure. Research question: Is there an effect of the distance between determiner and noun in the violation? We compare the DN (determiner-noun, e.g. der Garten/*das Garten) structure with the DAN (determiner-adjective-noun, e.g. das frische Gras/*der frische Gras) structure.

Initialization

Load the required packages and download the required files.

# test if the packages (mgcv, car and itsadug) are installed
# if not, install them
packages <- c("mgcv","car","itsadug")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

# load the packages
library(mgcv) 
library(itsadug) 
library(car) 

# display version information
R.version.string

## [1] "R version 3.2.1 (2015-06-18)"

packageVersion('mgcv')

## [1] '1.8.6'

packageVersion('car')

## [1] '2.0.25'

packageVersion('itsadug')

## [1] '1.0.1'

# download and load the following files containing the dataset and some functions:
download.file('http://www.let.rug.nl/wieling/LSA/eeg.rda', 'dat.rda')
load('dat.rda') # This dataset is a subset of 15 subjects

Data investigation

Look at the data and sort it (necessary for correcting for autocorrelation).

head(dat)

##       Time Subject    Group Word TrialNr      Roi Structure Correctness L1
## 44947  505   GL102 GenEarly Wald       2 post.mid        DN       incor PL
## 28121  515   GL102 GenEarly Wald       2 post.mid        DN       incor PL
## 40909  525   GL102 GenEarly Wald       2 post.mid        DN       incor PL
## 1294   535   GL102 GenEarly Wald       2 post.mid        DN       incor PL
## 9446   545   GL102 GenEarly Wald       2 post.mid        DN       incor PL
## 81324  555   GL102 GenEarly Wald       2 post.mid        DN       incor PL
##       AoArr LoR Age Edu SeqStart  SubjectCor       uV
## 44947     8  24  32   3     TRUE GL102.incor  8.94400
## 28121     8  24  32   3    FALSE GL102.incor 15.56267
## 40909     8  24  32   3    FALSE GL102.incor 21.30807
## 1294      8  24  32   3    FALSE GL102.incor 13.31573
## 9446      8  24  32   3    FALSE GL102.incor 19.10700
## 81324     8  24  32   3    FALSE GL102.incor 17.95607

str(dat)

## 'data.frame':    102960 obs. of  16 variables:
##  $ Time       : num  505 515 525 535 545 555 565 575 585 595 ...
##  $ Subject    : Factor w/ 15 levels "GL102","GL103",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Group      : Factor w/ 2 levels "GenEarly","GenLate": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Word       : Factor w/ 96 levels "Auge","Bart",..: 56 56 56 56 56 56 56 56 56 56 ...
##  $ TrialNr    : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ Roi        : Factor w/ 1 level "post.mid": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Structure  : Factor w/ 2 levels "DAN","DN": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Correctness: Factor w/ 2 levels "cor","incor": 2 2 2 2 2 2 2 2 2 2 ...
##  $ L1         : Factor w/ 2 levels "PL","RU": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AoArr      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ LoR        : int  24 24 24 24 24 24 24 24 24 24 ...
##  $ Age        : int  32 32 32 32 32 32 32 32 32 32 ...
##  $ Edu        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ SeqStart   : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ SubjectCor : Factor w/ 30 levels "GL102.cor","GL103.cor",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ uV         : num  8.94 15.56 21.31 13.32 19.11 ...

summary(dat)

##       Time           Subject           Group             Word      
##  Min.   : 505.0   GL153  : 7680   GenEarly:60160   Auge    : 1200  
##  1st Qu.: 702.5   GL154  : 7680   GenLate :42800   Bart    : 1200  
##  Median : 900.0   GL205  : 7680                    Beginn  : 1200  
##  Mean   : 900.0   GL207  : 7680                    Bein    : 1200  
##  3rd Qu.:1097.5   GL323  : 7680                    Beispiel: 1200  
##  Max.   :1295.0   GL335  : 7680                    Berg    : 1200  
##                   (Other):56880                    (Other) :95760  
##     TrialNr             Roi         Structure   Correctness    L1       
##  Min.   :  1.00   post.mid:102960   DAN:51600   cor  :51360   PL:12400  
##  1st Qu.: 34.00                     DN :51360   incor:51600   RU:90560  
##  Median : 68.00                                                         
##  Mean   : 68.76                                                         
##  3rd Qu.:102.00                                                         
##  Max.   :144.00                                                         
##                                                                         
##      AoArr            LoR             Age             Edu       
##  Min.   : 7.00   Min.   : 7.00   Min.   :20.00   Min.   :3.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:23.00   1st Qu.:3.000  
##  Median :13.00   Median :13.00   Median :28.00   Median :3.000  
##  Mean   :15.41   Mean   :13.17   Mean   :28.58   Mean   :3.981  
##  3rd Qu.:21.00   3rd Qu.:16.00   3rd Qu.:32.00   3rd Qu.:5.000  
##  Max.   :26.00   Max.   :24.00   Max.   :44.00   Max.   :5.000  
##                                                                 
##   SeqStart           SubjectCor          uV           
##  Mode :logical   GL110.cor: 3840   Min.   :-132.3745  
##  FALSE:101673    GL153.cor: 3840   1st Qu.:  -8.0165  
##  TRUE :1287      GL154.cor: 3840   Median :   0.4563  
##  NA's :0         GL205.cor: 3840   Mean   :   0.5424  
##                  GL207.cor: 3840   3rd Qu.:   9.0951  
##                  GL323.cor: 3840   Max.   : 135.9779  
##                  (Other)  :79920

dim(dat)

## [1] 102960     16

dat = start_event(dat, event=c("Subject","TrialNr","Roi"))
head(dat)

##   Time Subject    Group Word TrialNr      Roi Structure Correctness L1
## 1  505   GL103 GenEarly Brot       1 post.mid       DAN         cor RU
## 2  515   GL103 GenEarly Brot       1 post.mid       DAN         cor RU
## 3  525   GL103 GenEarly Brot       1 post.mid       DAN         cor RU
## 4  535   GL103 GenEarly Brot       1 post.mid       DAN         cor RU
## 5  545   GL103 GenEarly Brot       1 post.mid       DAN         cor RU
## 6  555   GL103 GenEarly Brot       1 post.mid       DAN         cor RU
##   AoArr LoR Age Edu SeqStart SubjectCor       uV start.event
## 1     8  15  23   3     TRUE  GL103.cor 37.61127        TRUE
## 2     8  15  23   3    FALSE  GL103.cor 38.12013       FALSE
## 3     8  15  23   3    FALSE  GL103.cor 39.86873       FALSE
## 4     8  15  23   3    FALSE  GL103.cor 28.43433       FALSE
## 5     8  15  23   3    FALSE  GL103.cor 34.82213       FALSE
## 6     8  15  23   3    FALSE  GL103.cor 42.42453       FALSE

Assessing the effect of structure

We would like to know whether the P600 effect is dependent on the structure of the sentence (Determiner-Noun: DN vs. Determiner-Adjective-Noun: DAN). Remember that the size of the P600 is determined by the difference between incorrect and correct. Consequently, we are interested in assessing if that difference varies between the levels of Structure, and we will make a model where we look at the interaction between Correctness and Structure. Note that to prevent lengthy computations during this lab session, we will not take into account factor smooths per word.

dat$CorStruct = interaction(dat$Correctness,dat$Structure) # create a new variable with 4 levels
levels(dat$CorStruct) # show levels

## [1] "cor.DAN"   "incor.DAN" "cor.DN"    "incor.DN"

rhoval = 0.93 # rho set to fixed value for this lab session
dat$SubjectCorStruct = interaction(dat$SubjectCor,dat$Structure)
m1 <- bam(uV ~ s(Time,by=CorStruct) + CorStruct + s(Time,SubjectCorStruct,bs='fs',m=1), 
          data=dat, rho = rhoval, AR.start=start.event)
summary(m1)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## uV ~ s(Time, by = CorStruct) + CorStruct + s(Time, SubjectCorStruct, 
##     bs = "fs", m = 1)
## 
## Parametric coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         -0.5338     0.9337  -0.572   0.5675  
## CorStructincor.DAN   2.6871     1.3274   2.024   0.0429 *
## CorStructcor.DN     -1.8222     1.3214  -1.379   0.1679  
## CorStructincor.DN    3.1240     1.3216   2.364   0.0181 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                               edf  Ref.df      F  p-value    
## s(Time):CorStructcor.DAN    1.000   1.001  0.027 0.868817    
## s(Time):CorStructincor.DAN  3.229   4.230  4.471 0.001082 ** 
## s(Time):CorStructcor.DN     1.001   1.001  0.017 0.897219    
## s(Time):CorStructincor.DN   1.273   1.505 11.765 0.000114 ***
## s(Time,SubjectCorStruct)   43.411 536.000  0.339  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0572   Deviance explained = 5.77%
## fREML = 3.2555e+05  Scale est. = 235.52    n = 102960

# plot the separate smooths
par(mfrow=c(1,3))
plot_smooth(m1,"Time",plot_all="CorStruct",rm.ranef=T,eegAxis=T,rug=F)

## Summary:
##  * CorStruct : factor; set to the value(s): cor.DAN, cor.DN, incor.DAN, incor.DN. 
##  * Time : numeric predictor; with 30 values ranging from 505.000000 to 1295.000000. 
##  * SubjectCorStruct : factor; set to the value(s): GL106.cor.DAN. (Might be canceled as random effect, check below.) 
##  * NOTE : The following random effects columns are canceled: s(Time,SubjectCorStruct)
##

# plot the differences between correct and incorrect for the two cases (DN and DAN)
plot_diff(m1, "Time", comp=list(CorStruct=c("incor.DAN","cor.DAN")), rm.ranef=T, eegAxis=T)

## Summary:
##  * Time : numeric predictor; with 100 values ranging from 505.000000 to 1295.000000. 
##  * SubjectCorStruct : factor; set to the value(s): GL106.cor.DAN. (Might be canceled as random effect, check below.) 
##  * NOTE : The following random effects columns are canceled: s(Time,SubjectCorStruct)
##

plot_diff(m1, "Time", comp=list(CorStruct=c("incor.DN","cor.DN")), rm.ranef=T, eegAxis=T)

## Summary:
##  * Time : numeric predictor; with 100 values ranging from 505.000000 to 1295.000000. 
##  * SubjectCorStruct : factor; set to the value(s): GL106.cor.DAN. (Might be canceled as random effect, check below.) 
##  * NOTE : The following random effects columns are canceled: s(Time,SubjectCorStruct)
##

Exercise 1

For this exercise, please assess if the distinction between DAN and DN is necessary. Put your code in the following code block, generate the output html file (see below for instructions), and put your conclusion below.

m1b <- bam(uV ~ s(Time,by=Correctness) + Correctness + s(Time,SubjectCorStruct,bs='fs',m=1), 
          data=dat, rho = rhoval, AR.start=start.event)
compareML(m1,m1b)

## m1: uV ~ s(Time, by = CorStruct) + CorStruct + s(Time, SubjectCorStruct, 
##      bs = "fs", m = 1)
## 
## m1b: uV ~ s(Time, by = Correctness) + Correctness + s(Time, SubjectCorStruct, 
##      bs = "fs", m = 1)
## 
## Chi-square test of fREML scores
## -----
##   Model    Score Edf Chisq    Df p.value Sig.
## 1   m1b 325552.6   8                         
## 2    m1 325545.4  14 7.103 6.000   0.027  *

## Warning in compareML(m1, m1b): AIC is not reliable, because an AR1 model is
## included (rho1 = 0.930000, rho2 = 0.930000).

Yes, the distinction between DAN and DN is necessary: compareML shows m1 is a significant improvement over m1b (which is the simpler model).

Assessing the effect of age of arrival

In the following, we will investigate how the DN vs. DAN patterns vary depending on the AoArr of the participants. Note that we only use a subset of the data here (N = 15), so there are not so many different values of AoArr.

m2 = bam(uV ~ te(Time,AoArr,by=CorStruct) + CorStruct + s(Time,SubjectCorStruct,bs='fs',m=1), data=dat, 
         rho=rhoval, AR.start=start.event)
summary(m2)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## uV ~ te(Time, AoArr, by = CorStruct) + CorStruct + s(Time, SubjectCorStruct, 
##     bs = "fs", m = 1)
## 
## Parametric coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         -0.5346     0.9509  -0.562    0.574  
## CorStructincor.DAN   2.4984     1.3491   1.852    0.064 .
## CorStructcor.DN     -1.8116     1.3457  -1.346    0.178  
## CorStructincor.DN    3.1024     1.3445   2.308    0.021 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                      edf  Ref.df     F  p-value    
## te(Time,AoArr):CorStructcor.DAN    3.003   3.006 0.902 0.439333    
## te(Time,AoArr):CorStructincor.DAN  8.835  11.679 1.339 0.189962    
## te(Time,AoArr):CorStructcor.DN     3.002   3.005 0.260 0.854563    
## te(Time,AoArr):CorStructincor.DN   3.004   3.007 5.561 0.000815 ***
## s(Time,SubjectCorStruct)          40.513 532.000 0.328  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.058   Deviance explained = 5.85%
## fREML = 3.2552e+05  Scale est. = 235.53    n = 102960

# plot the differences between correct and incorrect for the two cases (DN and DAN)
par(mfrow=c(1,2))
plot_diff2(m2, view=c("Time","AoArr"), comp=list(CorStruct=c("incor.DAN","cor.DAN")), rm.ranef=T)

## Summary:
##  * Time : numeric predictor; with 30 values ranging from 505.000000 to 1295.000000. 
##  * AoArr : numeric predictor; with 30 values ranging from 7.000000 to 26.000000. 
##  * SubjectCorStruct : factor; set to the value(s): GL106.cor.DAN. (Might be canceled as random effect, check below.) 
##  * NOTE : The following random effects columns are canceled: s(Time,SubjectCorStruct)
##

fadeRug(dat$Time,dat$AoArr) # make areas lighter where there is no data

plot_diff2(m2, view=c("Time","AoArr"), comp=list(CorStruct=c("incor.DN","cor.DN")), rm.ranef=T)

## Summary:
##  * Time : numeric predictor; with 30 values ranging from 505.000000 to 1295.000000. 
##  * AoArr : numeric predictor; with 30 values ranging from 7.000000 to 26.000000. 
##  * SubjectCorStruct : factor; set to the value(s): GL106.cor.DAN. (Might be canceled as random effect, check below.) 
##  * NOTE : The following random effects columns are canceled: s(Time,SubjectCorStruct)
##

fadeRug(dat$Time,dat$AoArr) # make areas lighter where there is no data

Exercise 2

For this exercise, please assess if a simpler model with either a smooth over Time or a smooth over AoArr (or both) would be a better choice than staying with m2. Note that each model will take up to 5 minutes to run.

m2a = bam(uV ~ s(Time,by=CorStruct) + CorStruct + s(Time,SubjectCorStruct,bs='fs',m=1), data=dat, 
          rho=rhoval, AR.start=start.event)
compareML(m2a,m2) # is s(Time,by=CorStruct) enough?

## m2a: uV ~ s(Time, by = CorStruct) + CorStruct + s(Time, SubjectCorStruct, 
##      bs = "fs", m = 1)
## 
## m2: uV ~ te(Time, AoArr, by = CorStruct) + CorStruct + s(Time, SubjectCorStruct, 
##      bs = "fs", m = 1)
## 
## Chi-square test of fREML scores
## -----
##   Model    Score Edf  Chisq     Df   p.value Sig.
## 1   m2a 325545.4  14                             
## 2    m2 325522.6  26 22.839 12.000 7.881e-06  ***

## Warning in compareML(m2a, m2): AIC is not reliable, because an AR1 model is
## included (rho1 = 0.930000, rho2 = 0.930000).

m2b = bam(uV ~ s(AoArr,by=CorStruct) + CorStruct + s(Time,SubjectCorStruct,bs='fs',m=1), data=dat, 
          rho=rhoval, AR.start=start.event)
compareML(m2b,m2) # is s(AoArr,by=CorStruct) enough?

## m2b: uV ~ s(AoArr, by = CorStruct) + CorStruct + s(Time, SubjectCorStruct, 
##      bs = "fs", m = 1)
## 
## m2: uV ~ te(Time, AoArr, by = CorStruct) + CorStruct + s(Time, SubjectCorStruct, 
##      bs = "fs", m = 1)
## 
## Chi-square test of fREML scores
## -----
##   Model    Score Edf  Chisq     Df   p.value Sig.
## 1   m2b 325556.0  14                             
## 2    m2 325522.6  26 33.424 12.000 1.239e-09  ***

## Warning in compareML(m2b, m2): AIC is not reliable, because an AR1 model is
## included (rho1 = 0.930000, rho2 = 0.930000).

m2c = bam(uV ~ s(Time,by=CorStruct) + s(AoArr,by=CorStruct) + CorStruct + s(Time,SubjectCorStruct,bs='fs',m=1), data=dat, 
          rho=rhoval, AR.start=start.event)
compareML(m2c,m2) # is the pure interaction necessary?

## m2c: uV ~ s(Time, by = CorStruct) + s(AoArr, by = CorStruct) + CorStruct + 
##      s(Time, SubjectCorStruct, bs = "fs", m = 1)
## 
## m2: uV ~ te(Time, AoArr, by = CorStruct) + CorStruct + s(Time, SubjectCorStruct, 
##      bs = "fs", m = 1)
## 
## Chi-square test of fREML scores
## -----
##   Model    Score Edf  Chisq    Df   p.value Sig.
## 1   m2c 325541.2  22                            
## 2    m2 325522.6  26 18.562 4.000 1.698e-07  ***

## Warning in compareML(m2c, m2): AIC is not reliable, because an AR1 model is
## included (rho1 = 0.930000, rho2 = 0.930000).

In conclusion, m2 is the optimal model as it is always better than the corresponding simpler models.

Model criticism

Model criticism using model m2 shows the familiar pattern:

par(mfrow=c(1,2))
qqp(resid_gam(m2))
hist(resid_gam(m2))

Scaled-t analysis

The code to run the scaled-t analysis is shown below. However, as this analysis is very time consuming, we do not run it here, but only show the commands.

download.file('http://www.let.rug.nl/wieling/LSA/bam.art.fit.R', 'bam.art.fit.R')
source('bam.art.fit.R')
g0 <- gam(uV ~ 1, data=dat, method='REML', family=scat(rho=rhoval, AR.start=dat$SeqStart))
g1 <- gam(uV ~ te(Time,AoArr,by=CorStruct) + CorStruct, data=dat, method='REML',
          family=scat(rho=rhoval, AR.start=dat$SeqStart))

theta0 = g0$family$getTheta(TRUE)
theta1 = g1$family$getTheta(TRUE)

theta0 - theta1 # check if difference is small

m3 = bam(uV ~ te(Time,Aoarr,by=CorStruct) + CorStruct + s(Time,SubjectCorStruct,bs='fs',m=1), data=dat
         method='fREML', family=art(theta=theta1,rho=rhoval), AR.start=SeqStart)

Other options for GAMs not covered in the course

Given that the duration of this course is limited, we did not cover all aspects of GAMs. The most prominent being that we did not cover binary curves and ordered factors. In essence both approaches entail that rather than modeling two separate smooths (for example for both correct and incorrect), we model a single curve for the reference level (we could choose correct as the reference level) and then add another curve which models the difference between correct and incorrect. In this way, a p-value is associated with the difference curve and the summary immediately shows if the distinction in two levels is necessary (i.e. if the difference curve is not significant, a single smooth can be used to represent both cases). So model comparison is not necessary anymore in that case. The difference between a binary curve and an ordered factor is that a binary curve does not have to be centered and therefore also includes the (constant) intercept difference (including the uncertainty about the constant intercept difference), whereas the ordered factor separates the intercept and the nonlinear effect. Here is the example code for a comparable model, first modeled in the normal way, then with a binary curve, and then with an ordered factor.

m <- bam(uV ~ s(Time,by=Correctness) + Correctness + s(Time,Subject,bs='fs',m=1), 
          data=dat, rho = rhoval, AR.start=start.event)
summary(m)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## uV ~ s(Time, by = Correctness) + Correctness + s(Time, Subject, 
##     bs = "fs", m = 1)
## 
## Parametric coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.2682     0.6585  -1.926   0.0541 .  
## Correctnessincor   3.5347     0.4450   7.943 1.99e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                             edf  Ref.df     F  p-value    
## s(Time):Correctnesscor    1.001   1.002 0.001     0.98    
## s(Time):Correctnessincor  3.060   4.009 7.025 1.18e-05 ***
## s(Time,Subject)          12.212 134.000 0.676 9.19e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0321   Deviance explained = 3.23%
## fREML = 3.2557e+05  Scale est. = 235.76    n = 102960

dat$IsIncorrect = (dat$Correctness=='incor')*1 # 0 for correct, 1 for incorrect
m_binary <- bam(uV ~ s(Time) + s(Time,by=IsIncorrect) + s(Time,Subject,bs='fs',m=1), 
                data=dat, rho = rhoval, AR.start=start.event)

## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has
## repeated 1-d smooths of same variable.

summary(m_binary)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## uV ~ s(Time) + s(Time, by = IsIncorrect) + s(Time, Subject, bs = "fs", 
##     m = 1)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -1.2682     0.6585  -1.926   0.0541 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf  Ref.df      F  p-value    
## s(Time)              1.001   1.002  0.001     0.98    
## s(Time):IsIncorrect  4.060   5.009 15.017 8.78e-15 ***
## s(Time,Subject)     12.213 134.000  0.676 9.19e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0321   Deviance explained = 3.23%
## fREML = 3.2557e+05  Scale est. = 235.76    n = 102960

dat$CorrectnessO = as.ordered(dat$Correctness)
contrasts(dat$CorrectnessO) = 'contr.treatment' # contrast treatment
m_ordfac <- bam(uV ~ s(Time) + s(Time,by=CorrectnessO) + CorrectnessO + 
                s(Time,Subject,bs='fs',m=1), data=dat, rho = rhoval, AR.start=start.event)

## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has
## repeated 1-d smooths of same variable.

summary(m_ordfac)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## uV ~ s(Time) + s(Time, by = CorrectnessO) + CorrectnessO + s(Time, 
##     Subject, bs = "fs", m = 1)
## 
## Parametric coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.2681     0.6585  -1.926   0.0541 .  
## CorrectnessOincor   3.5346     0.4450   7.943    2e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                              edf  Ref.df     F p-value    
## s(Time)                    1.002   1.004 0.001 0.98080    
## s(Time):CorrectnessOincor  3.059   4.008 5.190 0.00035 ***
## s(Time,Subject)           12.215 134.000 0.676 9.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0321   Deviance explained = 3.23%
## fREML = 3.2557e+05  Scale est. = 235.76    n = 102960

par(mfrow=c(1,3))
plot_diff(m,view='Time',comp=list(Correctness=c('incor','cor')),rm.ranef=T,
          eegAxis=T,ylim=c(-6,6), main='m: calculated difference')

## Summary:
##  * Time : numeric predictor; with 100 values ranging from 505.000000 to 1295.000000. 
##  * Subject : factor; set to the value(s): GL153. (Might be canceled as random effect, check below.) 
##  * NOTE : The following random effects columns are canceled: s(Time,Subject)
##

plot(m_binary,select=2, rug=F, shade=T,ylim=c(6,-6), main='difference curve (binary: not centered)')
abline(h=0)
plot(m_ordfac,select=2, rug=F, shade=T,ylim=c(6,-6), main='difference curve (ordered factor: centered)')
abline(h=0)

Perhaps it helps to see the analogy of this approach with how contrasts between the levels of a factorial predictor are modeled in the fixed-effects structure. There you have the intercept which models the effect of the reference level of the factorial predictor. And the lines corresponding to the remaining levels show the difference with respect to the intercept. With a binary curve and ordered factors the s(Time) thus is the reference level (just like the intercept), and the other lines represent the contrast (i.e. the difference) with respect to this reference level. With the binary curve there is always one additional line (for whenever the binary value equals 1, when the binary value equals 0 the curve is ignored as it is equal to 0) besides the reference level, whereas the number of additional lines for the ordered factor is always one fewer than the total number of levels. So suppose a factorial predictor has three levels (A, B and C, with A being the reference level), you would need to define two binary curves (IsB and IsC) to model the difference between A and B and A and C. With the ordered factor you get these two difference curves automatically.

Finally, please note that you can not use the binary predictor multiple times in the model formula (e.g., this is not OK: s(Time) + s(Time,by=IsIncorrect) + s(AoArr) + s(AoArr,by=IsIncorrect)). Then the model does not know where to put the constant intercept difference between the correct and incorrect case and becomes non-deterministic. Instead you then must use ordered factors which account for the intercept difference separately: s(Time) + s(Time,by=IsIncorrectO) + s(AoArr) + s(AoArr,by=IsIncorrectO) + IsIncorrectO. More information about these issues can be found here (using the same data as discussed in the lecture here): http://www.let.rug.nl/~wieling/statscourse/lecture4/presentation.pdf.

In sum, binary curves and ordered factors allow you to model the data in a slightly different way. In most cases, however, fitting separate models and applying model comparison is adequate as well.

Examples of using GAMs in publications

Given that generalized additive models have only relatively recently been applied to analyzing time series data in linguistics, the method will be unfamiliar to many potential reviewers. Consequently, when using GAMs, you need to put some more effort in describing the method than if you would use a more traditional (but less adequate!) approach. Fortunately, more and more publications are coming out which have used GAMs in their analysis (and provide a detailed explanation of the method) to which you can refer. Some examples are given below:

EEG analysis: Nienke Meulman, Martijn Wieling, Simone Sprenger, Laurie Stowe and Monika Schmid. Age effects in L2 grammar processing as revealed by ERPs and how (not) to study them. Submitted (May 14, 2015) to PLOS ONE. Supplementary material showing the analysis (including scaled-t) and results.
EEG analysis: Cecile De Cat, Ekaterini Klepousniotou and R. Harald Baayen (2015). Representational deficit or processing effect? An electrophysiological study of noun-noun compound processing by very advanced L2 speakers of English. Frontiers in Psychology, 6, article 77.
Articulography analysis: Martijn Wieling, Fabian Tomaschek, Denis Arnold, Mark Tiede and R. Harald Baayen. Investigating dialectal differences using articulography. Accepted for publication (April 1, 2015) in Proceedings of ICPhS 2015, Glasgow, August 10-14.
Dialectometric analysis: Martijn Wieling, Simonetta Montemagni, John Nerbonne and R. Harald Baayen (2014). Lexical differences between Tuscan dialects and standard Italian: Accounting for geographic and socio-demographic variation using generalized additive mixed modeling. Language, 90(3), 669-692. Supplementary material including data, analysis and results.

Replication

To generate the output of the analysis presented above, first install Pandoc. Then copy the following lines to the most recent version of R.

# install rmarkdown package if not installed
if(!"rmarkdown" %in% rownames(installed.packages())) {
   install.packages("rmarkdown")
}
library(rmarkdown) # load rmarkdown package

# download original file if not already exists (to prevent overwriting)
if (!file.exists('analysisEEG.Rmd')) { 
   download.file('http://www.let.rug.nl/wieling/LSA/analysisEEG.Rmd', 'analysisEEG.Rmd')
} 

# generate output
render('analysisEEG.Rmd') # generates html file with results

# view output in browser
browseURL(paste('file://', file.path(getwd(),'analysisEEG.html'), sep='')) # shows result

LSA GAM course - EEG lab session

Martijn Wieling and Jacolien van Rij (with help of Nienke Meulman)

Generation date: juli 14, 2015 - 13:44:22