Download the source file nonlinear_0.1.tar.gz
and install the packages by the following commands in R:
download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/packages/nonlinear_0.1.tar.gz", "nonlinear_0.1.tar.gz")
install.packages("nonlinear_0.1.tar.gz", type="source", repos=NULL)
If the package is installed succesfully, you could remove the source file (delete it).
Run the demo about autocorrelation as follows:
library(nonlinear)
demoAR()
Load packages:
library(mgcv)
library(itsadug)
Check R version and package versions:
R.version.string
## [1] "R version 3.2.1 (2015-06-18)"
packageVersion("mgcv")
## [1] '1.8.6'
packageVersion("itsadug")
## [1] '1.0.1'
If these versions do not agree with the ones included in the Lab Assignment file, please let us know.
The data discussed in this lab session is collected during two consequent courses (2014, 2015) at the CCP Spring Training in Experimental Psycholinguistics (STEP), in Edmonton, Canada. The data is collected in a noisy environment with participants who had already seen the experiment. The experiment is simplified version of an eye-tracking experiment of Vince Poretta (in prep; 2015).
Download the data to your working directory and load the data in R (see more info about loading data below)
download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/LSA2015/Data/gazedata.rda", "countdat.rda")
load('countdat.rda')
Inspect the data with the follwoing two commands. Pay attention to the column InterestAreaLabel
.
head(countdat)
str(countdat)
In this assignment we are interested in the looks to the Target (e.g., “shop”) in comparison with the looks to the Onset Competitor (e.g., “shut”) and RhymeCompetitor (e.g., “drop”), and with the looks to a distractor (e.g., “fill”).
First, we plot the data. The data plots provide a sanity check for the GAMMs: if the pattern of the smooths are completely off (different range of values, or a completely different pattern), you may want to see what causes these differences.
Calculating the grand averages of the proportions of looks to the different areas of interest per timebin.
# IMPORTANT: note the spaces in the column names:
colnames(countdat)
# removing spaces from column names, because these will give trouble:
colnames(countdat) <- gsub("\\ ","", colnames(countdat))
# calculating proportions:
countdat$TotalCount <- with(countdat, InterestAreaLabel.Target_IA
+ InterestAreaLabel.OnsetComp_IA
+ InterestAreaLabel.RhymeComp_IA
+ InterestAreaLabel.Distract_IA
+ InterestAreaLabel.NA )
countdat$propTarget <- with(countdat, InterestAreaLabel.Target_IA / TotalCount )
countdat$propOnset <- with(countdat, InterestAreaLabel.OnsetComp_IA / TotalCount )
countdat$propRhyme <- with(countdat, InterestAreaLabel.RhymeComp_IA / TotalCount )
countdat$propDistr <- with(countdat, InterestAreaLabel.Distract_IA / TotalCount )
Plot the grand averages per timebin. Use your preferred plotting functions, or use the code below.
# Setup empty plot window:
emptyPlot(range(countdat$Timebin), c(0,1),
h=.5,
main='Looks to interest areas', xlab='Time (ms)', ylab="Proportion")
# Distractor
avg <- aggregate(propDistr ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propDistr, col="black", lty=2)
# Onset competitor
avg <- aggregate(propOnset ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propOnset, col="indianred", lwd=2)
# Rhyme competitor
avg <- aggregate(propRhyme ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propRhyme, col="steelblue", lwd=2)
# Target
avg <- aggregate(propTarget ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propTarget, col="green", lwd=2)
As most people seem to have found the target before 2000 ms after word onset, we reduce the data by excluding all data points recorded after 2000 ms:
countdat <- countdat[countdat$Timebin < 2000,]
Create a new column that contains a matrix with looks to the target (success) and looks to the onset competitor (failures).
countdat$TargetLooks <- cbind(countdat$InterestAreaLabel.Target_IA, countdat$InterestAreaLabel.OnsetComp_IA)
head(countdat)
str(countdat)
cbind() data
If you are working in R Studio, the command View(countdat)
will show something different than head(countdat)
for the column TargetLooks
.
The cause for these differences is the use of the function cbind
: cbind
combines vectors into different columns of a matrix. Try for example:
test1 <- 1:10
test2 <- 2:11
mat <- cbind(test1,test2)
class(mat)
We have stored a matrix with two columns in the data frame countdat
under the name TargetLooks
. So each cell of the column countdat$TargetLooks
has two values:
head(countdat$TargetLooks)
The first column indicates the counts of successes (i.e., looks to the target), and the second column indicates the counts of failures (i.e., looks to the distractor).
Regression functions (e.g., lm
, lmer
, bam
) will recognize this structure as binomial data: counts of successes and failures.
As there are more than 10.000 rows in the data, we will use the function bam()
:
nrow(countdat)
Run a GAM model that includes the following nonlinear effects:
a main effect of Timebin
a main effect of TrialIndex
a random smooth for subjects over Timebin
(include k=5
to reduce the running time); the subjects are indicated by the column RecordingSessionLabel
, but we change this first into Subject
(saves typing).
Because logistic models take more time to run, you can leave out other random effects (it will take too long to complete this lab session).
# first order the data:
countdat <- start_event(countdat, column="Timebin", event="Event")
# make RecordingSessionLabel as factor, and store it ito a column with shorter name:
countdat$Subject <- as.factor(countdat$RecordingSessionLabel)
m1 <- bam(TargetLooks ~ ... ,
data=countdat, family="binomial")
Inspect the summary of the model.
What does the intercept represent in this model? What does the intercept tell: do people look on average more to the target or to the onset competitor?
What does the smooth over Timebin
represent?
Visualizing the trend over time:
par(mfrow=c(1,2))
plot(m1, select=1, scale=0, shade=TRUE)
abline(h=0)
plot_smooth(m1, view="Timebin", rm.ranef=TRUE)
Interpretation of the plots. - Why is the shape of the smooth different from the target looks as we saw in the data plots?
For interpretation of the results, we could tranform the summed effects back to proportions with the function plot_smooth
:
par(mfrow=c(1,1))
plot_smooth(m1, view="Timebin", rm.ranef=TRUE,
transform=plogis, ylim=c(0,1))
Run a new GAMM model that includes a nonlinear interaction between Timebin
and TrialIndex
, and a random smooth for subjects (Subject
) over Timebin
(include k=5
for the random effect to reduce the running time).
m2 <- bam(TargetLooks ~ ... ,
data=countdat, family="binomial")
Inspect the summary of model m2
.
What does the intercept represent in this model?
What does it mean that the smooth term te(Timebin, TrialIndex)
is significant in the summary?
Compare this model with model m1
with the function compareML
.
Timebin
and TrialIndex
is significant?Visualize the interaction:
par(mfrow=c(1,2), cex=1.1)
# plot the partial effect:
plot(m2, rug=FALSE, main="Interaction",
cex.axis=2, cex.lab=2) # parameters for increasing font size
# Now with colors:
pvisgam(m2, select=1, view=c("Timebin", "TrialIndex"), zlim=c(-4,14), main="")
Which of the two predictor seems to have a stronger effect on the gaze pattern: Timebin
or TrialIndex
?
How would the summed effect plot look like? What would change?
Plot the summed effects (note: pvisgam
is for Partial effects, fvisgam
for Fitted values):
fvisgam(m2, view=c("Timebin", "TrialIndex"), zlim=c(-2,12), main="Interaction", rm.ranef=TRUE)
To plot the summed effects in proportion space, add transform=plogis
:
fvisgam(m2, view=c("Timebin", "TrialIndex"), zlim=c(0,1),
main="Interaction", rm.ranef=TRUE, transform=plogis)
To test if native English participants differ from nonnative English participants, we would like to include the factor NativeEnglish
.
Run a new GAMM model that splits the nonlinear interaction between Timebin
and TrialIndex
by NativeEnglish
. Don’t forget to include an intercept term (parametric) for NativeEnglish
. In addition add a random smooth for subjects (Subject
) over Timebin
(include k=5
for the random effect to reduce the running time).
countdat$NativeEnglish <- as.factor(countdat$NativeEnglish)
m3 <- bam(TargetLooks ~ ... ,
data=countdat, family="binomial")
Inspect the summary of model m3
.
What does the summary reveal about the interaction between Timebin
and Trial
for the two groups?
Can you conclude that the two surfaces are different from each other?
Run a model comparison using the function compareML
for the models m2
and m3
.
Which model is preferred?
Plot the two surfaces and the difference between the two surfaces:
par(mfrow=c(1,3))
fvisgam(m3, view=c("Timebin","TrialIndex"),
cond=list(NativeEnglish="Native"),
main="Native", rm.ranef=TRUE, zlim=c(-3,20))
fvisgam(m3, view=c("Timebin","TrialIndex"),
cond=list(NativeEnglish="Nonnative"),
main="Nonnative", rm.ranef=TRUE, zlim=c(-3,20))
plot_diff2(m3, view=c("Timebin","TrialIndex"), plotCI=TRUE,
comp=list(NativeEnglish=c("Native","Nonnative")),
main="Difference", rm.ranef=TRUE, zlim=c(-2,8))
Do these plots support your earlier conclusion about the difference between native and nonnative English participants?
Decompose the nonlinear interaction between Timebin
and Time
, as included in model m2
in two main effects and an interaction surface (hint: use ti()
). Also include a random smooth for subjects (Subject
) over Timebin
(include k=5
for the random effect to reduce the running time).
m4 <- bam(TargetLooks ~ ... ,
data=countdat, family="binomial")
Inspect the summary of model m4
.
What does the interaction term represent?
Timebin
and TrialIndex
?Plot the partial effect of the interaction in model m4
and compare these with the plots of model m2
.
par(mfrow=c(1,2))
pvisgam(m4, view=c("Timebin","TrialIndex"),
select=3,
main="model m4",
zlim=c(-2,2))
pvisgam(m2, view=c("Timebin","TrialIndex"),
select=1,
main="model m2",
zlim=c(-8,8))
Step by step add interactions with NativeEnglish
to model m4
to see if the main effects and interaction are influenced by native language. Use model comparisons to determine the best fitting model.
Hand in your best-fitting model. Is there a difference in native and nonnative speakers with respect to looks to the target?
Plot the distribution and the ACF plot of the final model, as indicated below. Do you trust the model, or did you find something that needs to improve? (If so, what did you find?) Do you expect the residuals to follow a normal distribution?
par(mfrow=c(1,2))
# check distribution:
qqnorm(resid(m4))
qqline(resid(m4))
# check autocorrelation:
acf(resid(m4))
What is the difference between te()
and ti()
? Name an advantage and a disadvantage of using ti()
.
We would like to compare how the Target and the Rhyme competitor are different from each other in the following analysis. Which counts do we need to combine with cbind
for analyzing this difference? (For this assignment you do not have to perform that analysis.)
Hand in the R code, R output, the plots, and the answers on the questions in the section ‘d. Model criticism’.
Deadline: before next class.
Format: html document (R Markdown)
An easy way to include everything in a single document is by using R markdown.
Quick instructions for R Markdown in R Studio: www.sfs.uni-tuebingen.de/~jvanrij/Rinfo.html)
When using NOT using R Studio, try the following steps:
download.file('http://www.let.rug.nl/~wieling/statscourse/lecture4/lab/answers/lab-including-answers.Rmd', 'analysis.Rmd')
library(rmarkdown)
render('analysis.Rmd') # generates html file with results
browseURL(paste('file://', file.path(getwd(),'analysis.html'),
sep='')) # shows result
Alternatively you could hand in a pdf that contains all the information.
The assignments can be send by mail to vanrij.jacolien ‘AT’ gmail.com, with “LSA lab 2” in the subject line.Run the demo about nonlinear three-way interactions ( te("Longitude", "Latitude", "WordFrequency")
) as follows:
library(nonlinear)
demoInteraction()
Note that the data for this lab session is saved in a different format as the data of previous lab session. Here are some examples for loading frequently used data types in R:
.Rdata
or .rda
files contain one or more R objects with names.# load some (not existing) file in R:
load('file.rda')
# use ls() to see which objects are loaded in your workspace:
ls()
.rds
files contain one R object without the name. Therefore, when read in R the object must be stored into a variable.# load some (not existing) file in R:
newdat <- readRDS('file.rds')
# use str() to see what object is loaded:
str(newdat)
# for tab delimited text with header (non-existing file):
dat <- read.table('file.txt', header=TRUE, sep='\t', dec='.')
# for cvs file, exported from MS Excel:
dat <- read.table('file.cvs', header=TRUE, sep=',', dec='.')
# see help file for other options to set:
help(read.table)