Random effects and autocorrelation (optional)

Download the source file nonlinear_0.1.tar.gz and install the packages by the following commands in R:

download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/packages/nonlinear_0.1.tar.gz", "nonlinear_0.1.tar.gz")
install.packages("nonlinear_0.1.tar.gz", type="source", repos=NULL)

If the package is installed succesfully, you could remove the source file (delete it).

Run the demo about autocorrelation as follows:

library(nonlinear)
demoAR()

Setting up R session

Load packages:

library(mgcv)
library(itsadug)

Check R version and package versions:

R.version.string

## [1] "R version 3.2.1 (2015-06-18)"

packageVersion("mgcv")

## [1] '1.8.6'

packageVersion("itsadug")

## [1] '1.0.1'

If these versions do not agree with the ones included in the Lab Assignment file, please let us know.

Loading data

The data discussed in this lab session is collected during two consequent courses (2014, 2015) at the CCP Spring Training in Experimental Psycholinguistics (STEP), in Edmonton, Canada. The data is collected in a noisy environment with participants who had already seen the experiment. The experiment is simplified version of an eye-tracking experiment of Vince Poretta (in prep; 2015).

Download the data to your working directory and load the data in R (see more info about loading data below)

download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/LSA2015/Data/gazedata.rda", "countdat.rda")

load('countdat.rda')

Inspection of the data

Inspect the data with the follwoing two commands. Pay attention to the column InterestAreaLabel.

head(countdat)
str(countdat)

Visualizing the data

In this assignment we are interested in the looks to the Target (e.g., “shop”) in comparison with the looks to the Onset Competitor (e.g., “shut”) and RhymeCompetitor (e.g., “drop”), and with the looks to a distractor (e.g., “fill”).

First, we plot the data. The data plots provide a sanity check for the GAMMs: if the pattern of the smooths are completely off (different range of values, or a completely different pattern), you may want to see what causes these differences.

Calculating the grand averages of the proportions of looks to the different areas of interest per timebin.

# IMPORTANT: note the spaces in the column names:
colnames(countdat)

# removing spaces from column names, because these will give trouble:
colnames(countdat) <- gsub("\\ ","", colnames(countdat))

# calculating proportions:
countdat$TotalCount  <- with(countdat,  InterestAreaLabel.Target_IA  
    + InterestAreaLabel.OnsetComp_IA 
    + InterestAreaLabel.RhymeComp_IA   
    + InterestAreaLabel.Distract_IA 
    + InterestAreaLabel.NA )

countdat$propTarget <- with(countdat, InterestAreaLabel.Target_IA  / TotalCount )
countdat$propOnset <- with(countdat, InterestAreaLabel.OnsetComp_IA  / TotalCount )
countdat$propRhyme <- with(countdat, InterestAreaLabel.RhymeComp_IA  / TotalCount )
countdat$propDistr <- with(countdat, InterestAreaLabel.Distract_IA  / TotalCount )

Plot the grand averages per timebin. Use your preferred plotting functions, or use the code below.

# Setup empty plot window:
emptyPlot(range(countdat$Timebin), c(0,1), 
    h=.5,
    main='Looks to interest areas', xlab='Time (ms)', ylab="Proportion")

# Distractor
avg <- aggregate(propDistr ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propDistr, col="black", lty=2)
# Onset competitor
avg <- aggregate(propOnset ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propOnset, col="indianred", lwd=2)
# Rhyme competitor
avg <- aggregate(propRhyme ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propRhyme, col="steelblue", lwd=2)
# Target
avg <- aggregate(propTarget ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propTarget, col="green", lwd=2)

As most people seem to have found the target before 2000 ms after word onset, we reduce the data by excluding all data points recorded after 2000 ms:

countdat <- countdat[countdat$Timebin < 2000,]

Analyzing looks to target versus looks to onset competitor

Create a new column that contains a matrix with looks to the target (success) and looks to the onset competitor (failures).

countdat$TargetLooks <- cbind(countdat$InterestAreaLabel.Target_IA, countdat$InterestAreaLabel.OnsetComp_IA)
head(countdat)
str(countdat)

cbind() data

If you are working in R Studio, the command View(countdat) will show something different than head(countdat) for the column TargetLooks.

The cause for these differences is the use of the function cbind: cbind combines vectors into different columns of a matrix. Try for example:

test1 <- 1:10
test2 <- 2:11

mat <- cbind(test1,test2)
class(mat)

We have stored a matrix with two columns in the data frame countdat under the name TargetLooks. So each cell of the column countdat$TargetLooks has two values:

head(countdat$TargetLooks)

The first column indicates the counts of successes (i.e., looks to the target), and the second column indicates the counts of failures (i.e., looks to the distractor).

Regression functions (e.g., lm, lmer, bam) will recognize this structure as binomial data: counts of successes and failures.

As there are more than 10.000 rows in the data, we will use the function bam():

nrow(countdat)

a. Trend over time

Run a GAM model that includes the following nonlinear effects:

a main effect of Timebin
a main effect of TrialIndex
a random smooth for subjects over Timebin (include k=5 to reduce the running time); the subjects are indicated by the column RecordingSessionLabel, but we change this first into Subject (saves typing).

Because logistic models take more time to run, you can leave out other random effects (it will take too long to complete this lab session).

# first order the data:
countdat <- start_event(countdat, column="Timebin", event="Event")

# make RecordingSessionLabel as factor, and store it ito a column with shorter name:
countdat$Subject <- as.factor(countdat$RecordingSessionLabel)

m1 <- bam(TargetLooks ~  ... , 
          data=countdat, family="binomial")

Inspect the summary of the model.

What does the intercept represent in this model? What does the intercept tell: do people look on average more to the target or to the onset competitor?
What does the smooth over Timebin represent?

Visualizing the trend over time:

par(mfrow=c(1,2))
plot(m1, select=1, scale=0, shade=TRUE)
abline(h=0)

plot_smooth(m1, view="Timebin", rm.ranef=TRUE)

Interpretation of the plots. - Why is the shape of the smooth different from the target looks as we saw in the data plots?

What do these plot tell about the looks to the target versus the looks to the onset competitor?

For interpretation of the results, we could tranform the summed effects back to proportions with the function plot_smooth:

par(mfrow=c(1,1))
plot_smooth(m1, view="Timebin", rm.ranef=TRUE, 
            transform=plogis, ylim=c(0,1))

When do people start to look significantly more to the target than to the onset competitor?

b. Nonlinear interactions

Run a new GAMM model that includes a nonlinear interaction between Timebin and TrialIndex, and a random smooth for subjects (Subject) over Timebin (include k=5 for the random effect to reduce the running time).

m2 <- bam(TargetLooks ~  ... , 
          data=countdat, family="binomial")

Inspect the summary of model m2.

What does the intercept represent in this model?
What does it mean that the smooth term te(Timebin, TrialIndex) is significant in the summary?

Compare this model with model m1 with the function compareML.

Could you conclude that the interaction between Timebin and TrialIndex is significant?

Visualize the interaction:

par(mfrow=c(1,2), cex=1.1)

# plot the partial effect:
plot(m2, rug=FALSE, main="Interaction", 
     cex.axis=2, cex.lab=2) # parameters for increasing font size
# Now with colors:
pvisgam(m2, select=1, view=c("Timebin", "TrialIndex"), zlim=c(-4,14), main="")

Which of the two predictor seems to have a stronger effect on the gaze pattern: Timebin or TrialIndex?
How would the summed effect plot look like? What would change?
How do the random effects change the interaction surface?

Plot the summed effects (note: pvisgam is for Partial effects, fvisgam for Fitted values):

fvisgam(m2, view=c("Timebin", "TrialIndex"), zlim=c(-2,12), main="Interaction", rm.ranef=TRUE)

What do the negative and the positive values in the surface represent?

To plot the summed effects in proportion space, add transform=plogis:

fvisgam(m2, view=c("Timebin", "TrialIndex"), zlim=c(0,1), 
        main="Interaction", rm.ranef=TRUE, transform=plogis)

c. Effect of native language

To test if native English participants differ from nonnative English participants, we would like to include the factor NativeEnglish.

Run a new GAMM model that splits the nonlinear interaction between Timebin and TrialIndex by NativeEnglish. Don’t forget to include an intercept term (parametric) for NativeEnglish. In addition add a random smooth for subjects (Subject) over Timebin (include k=5 for the random effect to reduce the running time).

countdat$NativeEnglish <- as.factor(countdat$NativeEnglish)
m3 <- bam(TargetLooks ~  ... , 
          data=countdat, family="binomial")

Inspect the summary of model m3.

What does the summary reveal about the interaction between Timebin and Trial for the two groups?
Can you conclude that the two surfaces are different from each other?

Run a model comparison using the function compareML for the models m2 and m3.

Which model is preferred?
What does this result indicate about the difference between native and nonnative English participants?

Plot the two surfaces and the difference between the two surfaces:

par(mfrow=c(1,3))
fvisgam(m3, view=c("Timebin","TrialIndex"),
        cond=list(NativeEnglish="Native"),
        main="Native", rm.ranef=TRUE, zlim=c(-3,20))
fvisgam(m3, view=c("Timebin","TrialIndex"),
        cond=list(NativeEnglish="Nonnative"),
        main="Nonnative", rm.ranef=TRUE, zlim=c(-3,20))
plot_diff2(m3, view=c("Timebin","TrialIndex"), plotCI=TRUE,
        comp=list(NativeEnglish=c("Native","Nonnative")),
        main="Difference", rm.ranef=TRUE, zlim=c(-2,8))

Do these plots support your earlier conclusion about the difference between native and nonnative English participants?

d. Decomposing nonlinear interactions * assignment *

Decompose the nonlinear interaction between Timebin and Time, as included in model m2 in two main effects and an interaction surface (hint: use ti()). Also include a random smooth for subjects (Subject) over Timebin (include k=5 for the random effect to reduce the running time).

m4 <- bam(TargetLooks ~  ... ,
          data=countdat, family="binomial")

Inspect the summary of model m4.

What does the interaction term represent?
What could you conclude on the basis of this summary with respect to the interaction between Timebin and TrialIndex?

Plot the partial effect of the interaction in model m4 and compare these with the plots of model m2.

par(mfrow=c(1,2))
pvisgam(m4, view=c("Timebin","TrialIndex"),
        select=3,
        main="model m4", 
        zlim=c(-2,2))
pvisgam(m2, view=c("Timebin","TrialIndex"),
        select=1,
        main="model m2", 
        zlim=c(-8,8))

Why are these two surfaces so different? Note that the zlimit ranges are different, but also the pattern.

Step by step add interactions with NativeEnglish to model m4 to see if the main effects and interaction are influenced by native language. Use model comparisons to determine the best fitting model.

Hand in your best-fitting model. Is there a difference in native and nonnative speakers with respect to looks to the target?
Plot the distribution and the ACF plot of the final model, as indicated below. Do you trust the model, or did you find something that needs to improve? (If so, what did you find?) Do you expect the residuals to follow a normal distribution?

par(mfrow=c(1,2))

# check distribution:
qqnorm(resid(m4))
qqline(resid(m4))

# check autocorrelation:
acf(resid(m4))

What is the difference between te() and ti()? Name an advantage and a disadvantage of using ti().
We would like to compare how the Target and the Rhyme competitor are different from each other in the following analysis. Which counts do we need to combine with cbind for analyzing this difference? (For this assignment you do not have to perform that analysis.)

Assignment

Hand in the R code, R output, the plots, and the answers on the questions in the section ‘d. Model criticism’.

Deadline: before next class.

Format: html document (R Markdown)

An easy way to include everything in a single document is by using R markdown.

Quick instructions for R Markdown in R Studio: www.sfs.uni-tuebingen.de/~jvanrij/Rinfo.html)

When using NOT using R Studio, try the following steps:

download.file('http://www.let.rug.nl/~wieling/statscourse/lecture4/lab/answers/lab-including-answers.Rmd', 'analysis.Rmd')
library(rmarkdown)
render('analysis.Rmd') # generates html file with results
browseURL(paste('file://', file.path(getwd(),'analysis.html'),
sep='')) # shows result

Alternatively you could hand in a pdf that contains all the information.

The assignments can be send by mail to vanrij.jacolien ‘AT’ gmail.com, with “LSA lab 2” in the subject line.

Three-way interaction (not part of assignment)

Run the demo about nonlinear three-way interactions ( te("Longitude", "Latitude", "WordFrequency") ) as follows:

library(nonlinear)
demoInteraction()

More info on loading data

Note that the data for this lab session is saved in a different format as the data of previous lab session. Here are some examples for loading frequently used data types in R:

.Rdata or .rda files contain one or more R objects with names.

# load some (not existing) file in R: 
load('file.rda')
# use ls() to see which objects are loaded in your workspace:
ls()

.rds files contain one R object without the name. Therefore, when read in R the object must be stored into a variable.

# load some (not existing) file in R: 
newdat <- readRDS('file.rds')
# use str() to see what object is loaded:
str(newdat)

text files

# for tab delimited text with header (non-existing file):
dat <- read.table('file.txt', header=TRUE, sep='\t', dec='.')
# for cvs file, exported from MS Excel:
dat <- read.table('file.cvs', header=TRUE, sep=',', dec='.')
# see help file for other options to set:
help(read.table)

LSA | Advanced Analysis of Time series data

Lab session 2: Gaze data

Jacolien van Rij and Martijn Wieling