Download the source file nonlinear_0.1.tar.gz
and install the packages by the following commands in R:
download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/packages/nonlinear_0.1.tar.gz", "nonlinear_0.1.tar.gz")
install.packages("nonlinear_0.1.tar.gz", type="source", repos=NULL)
If the package is installed succesfully, you could remove the source file (delete it).
Run the demo about autocorrelation as follows:
library(nonlinear)
demoAR()
Load packages:
library(mgcv)
library(itsadug)
Check R version and package versions:
R.version.string
## [1] "R version 3.2.1 (2015-06-18)"
packageVersion("mgcv")
## [1] '1.8.6'
packageVersion("itsadug")
## [1] '1.0.2'
If these versions do not agree with the ones included in the Lab Assignment file, please let us know.
The data discussed in this lab session is collected during two consequent courses (2014, 2015) at the CCP Spring Training in Experimental Psycholinguistics (STEP), in Edmonton, Canada. The data is collected in a noisy environment with participants who had already seen the experiment. The experiment is simplified version of an eye-tracking experiment of Vince Poretta (in prep; 2015).
Download the data to your working directory and load the data in R (see more info about loading data below)
download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/LSA2015/Data/gazedata.rda", "countdat.rda")
load('countdat.rda')
Inspect the data with the follwoing two commands. Pay attention to the column InterestAreaLabel
.
head(countdat)
str(countdat)
In this assignment we are interested in the looks to the Target (e.g., “shop”) in comparison with the looks to the Onset Competitor (e.g., “shut”) and RhymeCompetitor (e.g., “drop”), and with the looks to a distractor (e.g., “fill”).
First, we plot the data. The data plots provide a sanity check for the GAMMs: if the pattern of the smooths are completely off (different range of values, or a completely different pattern), you may want to see what causes these differences.
Calculating the grand averages of the proportions of looks to the different areas of interest per timebin.
# IMPORTANT: note the spaces in the column names:
tail(colnames(countdat), 15)
## [1] "RhymecompReg" "OnsetcompFreq"
## [3] "OnsetcompReg" "DistractorFreq"
## [5] "DistractorReg" "Vowel"
## [7] "Stimtype" "Datafile"
## [9] "NativeEnlish" "StartTarget"
## [11] "InterestAreaLabel.OnsetComp_IA " "InterestAreaLabel.Target_IA "
## [13] "InterestAreaLabel.RhymeComp_IA " "InterestAreaLabel.Distract_IA "
## [15] "InterestAreaLabel.NA"
# removing spaces from column names, because these will give trouble:
colnames(countdat) <- gsub("\\ ","", colnames(countdat))
# calculating proportions:
countdat$TotalCount <- with(countdat, InterestAreaLabel.Target_IA
+ InterestAreaLabel.OnsetComp_IA
+ InterestAreaLabel.RhymeComp_IA
+ InterestAreaLabel.Distract_IA
+ InterestAreaLabel.NA )
countdat$propTarget <- with(countdat, InterestAreaLabel.Target_IA / TotalCount )
countdat$propOnset <- with(countdat, InterestAreaLabel.OnsetComp_IA / TotalCount )
countdat$propRhyme <- with(countdat, InterestAreaLabel.RhymeComp_IA / TotalCount )
countdat$propDistr <- with(countdat, InterestAreaLabel.Distract_IA / TotalCount )
Plot the grand averages per timebin. Use your preferred plotting functions, or use the code below.
# Setup empty plot window:
emptyPlot(range(countdat$Timebin), c(0,1),
h=.5,
main='Looks to interest areas', xlab='Time (ms)', ylab="Proportion")
# Distractor
avg <- aggregate(propDistr ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propDistr, col="black", lty=2)
# Onset competitor
avg <- aggregate(propOnset ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propOnset, col="indianred", lwd=2)
# Rhyme competitor
avg <- aggregate(propRhyme ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propRhyme, col="steelblue", lwd=2)
# Target
avg <- aggregate(propTarget ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propTarget, col="green", lwd=2)
As most people seem to have found the target before 2000 ms after word onset, we reduce the data by excluding all data points recorded after 2000 ms:
countdat <- countdat[countdat$Timebin < 2000,]
Create a new column that contains a matrix with looks to the target (success) and looks to the onset competitor (failures).
countdat$TargetLooks <- cbind(countdat$InterestAreaLabel.Target_IA, countdat$InterestAreaLabel.OnsetComp_IA)
head(countdat)
str(countdat)
cbind() data
If you are working in R Studio, the command View(countdat)
will show something different than head(countdat)
for the column TargetLooks
.
The cause for these differences is the use of the function cbind
: cbind
combines vectors into different columns of a matrix. Try for example:
test1 <- 1:10
test2 <- 2:11
mat <- cbind(test1,test2)
class(mat)
## [1] "matrix"
We have stored a matrix with two columns in the data frame countdat
under the name TargetLooks
. So each cell of the column countdat$TargetLooks
has two values:
head(countdat$TargetLooks)
## [,1] [,2]
## [1,] 0 0
## [2,] 14 0
## [3,] 20 0
## [4,] 20 0
## [5,] 20 0
## [6,] 20 0
The first column indicates the counts of successes (i.e., looks to the target), and the second column indicates the counts of failures (i.e., looks to the distractor).
Regression functions (e.g., lm
, lmer
, bam
) will recognize this structure as binomial data: counts of successes and failures.
As there are more than 10.000 rows in the data, we will use the function bam()
:
nrow(countdat)
## [1] 20958
Run a GAM model that includes the following nonlinear effects:
a main effect of Timebin
a main effect of TrialIndex
a random smooth for subjects over Timebin
(include k=5
to reduce the running time); the subjects are indicated by the column RecordingSessionLabel
, but we change this first into Subject
(saves typing).
Because logistic models take more time to run, you can leave out other random effects (it will take too long to complete this lab session).
# order the data with function start_event:
countdat <- start_event(countdat, column="Timebin", event="Event")
# make RecordingSessionLabel as factor, and store it ito a column with shorter name:
countdat$Subject <- as.factor(countdat$RecordingSessionLabel)
m1 <- bam(TargetLooks ~ s(Timebin) + s(TrialIndex)
+ s(Timebin, Subject, bs='fs', m=1, k=5),
data=countdat, family="binomial")
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has
## repeated 1-d smooths of same variable.
Inspect the summary of the model.
summary(m1)
##
## Family: binomial
## Link function: logit
##
## Formula:
## TargetLooks ~ s(Timebin) + s(TrialIndex) + s(Timebin, Subject,
## bs = "fs", m = 1, k = 5)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.4412 0.5366 6.413 1.43e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Timebin) 7.455 7.981 217.7 <2e-16 ***
## s(TrialIndex) 6.201 7.355 318.2 <2e-16 ***
## s(Timebin,Subject) 112.074 124.000 6470.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.421 Deviance explained = 22.3%
## fREML = 1.1538e+05 Scale est. = 1 n = 20958
The intercept is the log odds ratio between looks to the target and looks to the onset competitor. As the intercept is far higher than 0, the intercept is higher than a proportion of 0.5, rather close to a proportion of 1. This means overall more looks to target than to onset competitor.
Timebin
represent?The partial effect of how the log odds for looks to the target versus looks to the onset competitor change over time.
More info on logits (binomial count data)
The regression model uses the counts of target and onset competitor looks to calculate a logit, which is the log odds ratio between looks to the target and looks to the onset competitor: \(log(\frac{N_{target.looks}}{N_{onsetcomp.looks}})\) (or following the logit definition: \(log(\frac{N_{target.looks}}{N - N_{target.looks}})\) with\(N=N_{target.looks}+N_{onsetcomp.looks}\)).
If \(N_{target.looks}\) equals \(N_{onsetcomp.looks}\), the ratio is 1, and the logit is 0 (log(1)=
0).
If \(N_{target.looks}\) is a larger count than the \(N_{onsetcomp.looks}\), the ratio is higher than 1, and the logit is a positive value, e.g. (log(2)=
0.6931472).
If \(N_{target.looks}\) is a smaller count than the \(N_{onsetcomp.looks}\), the ratio is lower than 1, and the logit is a negative value, e.g. (log(.5)=
-0.6931472).
Illustration of the relation between logits and proportions:
Code for plotting the logits:
total <- 500
successes <- 1:total
proportions <- successes / total
logits <- log( proportions / (1-proportions))
plot(proportions, logits)
Back transformation of logits to proportions:
plot(logits, plogis(logits))
Visualizing the trend over time:
par(mfrow=c(1,2))
plot(m1, select=1, scale=0, shade=TRUE)
abline(h=0)
plot_smooth(m1, view="Timebin", rm.ranef=TRUE)
Interpretation of the plots. - Why is the shape of the smooth different from the target looks as we saw in the data plots?
In the models we compare the ratio of target looks and onset competitor looks. The earlier plot visualized the proportions of looks to target, onset competitor, rhyme competitor and distractor looks.
out <- plot_smooth(m1, view="Timebin", rm.ranef=TRUE)
(diff <- find_difference(out$fv$fit, out$fv$CI, xVals=out$fv$Timebin))
## $start
## [1] 708.6207
##
## $end
## [1] 1950
##
## $xVals
## [1] TRUE
addInterval(0, lowVals=diff$start, highVals = diff$end, col='red', lwd=2)
abline(v=c(diff$start, diff$end), lty=3, col='red')
text(mean(c(diff$start, diff$end)), 1, "sign. more looks to target", col='red', font=3)
For interpretation of the results, we could tranform the summed effects back to proportions with the function plot_smooth
:
par(mfrow=c(1,1))
plot_smooth(m1, view="Timebin", rm.ranef=TRUE,
transform=plogis, ylim=c(0,1))
abline(h=.5, v=700, col='red', lty=2)
As we already derived from the logit plot, we see that around 700 ms after sound onset the proportion of taregt looks increases .5 . significantly.
Run a new GAMM model that includes a nonlinear interaction between Timebin
and TrialIndex
, and a random smooth for subjects (Subject
) over Timebin
(include k=5
for the random effect to reduce the running time).
m2 <- bam(TargetLooks ~ te(Timebin, TrialIndex)
+ s(Timebin, Subject, bs='fs', m=1, k=5),
data=countdat, family="binomial")
Inspect the summary of model m2
.
summary(m2)
##
## Family: binomial
## Link function: logit
##
## Formula:
## TargetLooks ~ te(Timebin, TrialIndex) + s(Timebin, Subject, bs = "fs",
## m = 1, k = 5)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.4212 0.5741 5.959 2.53e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## te(Timebin,TrialIndex) 21.82 22.62 1594 <2e-16 ***
## s(Timebin,Subject) 113.72 124.00 6511 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.415 Deviance explained = 21.1%
## fREML = 1.2794e+05 Scale est. = 1 n = 20958
The log odds ratio between looks to the target and looks to the onset competitor, and a constant value that needs to be added to the interaction surface and the random effects.
te(Timebin, TrialIndex)
is significant in the summary? The interaction surface is at some point significantly different from a flat plane through zero. However, this does not mean that the interaction itself is significant: it could be that the two main effects included in the interaction cause the surface to be significantly different from a flat zero surface.
Compare this model with model m1
with the function compareML
.
compareML(m1,m2)
## m1: TargetLooks ~ s(Timebin) + s(TrialIndex) + s(Timebin, Subject,
## bs = "fs", m = 1, k = 5)
##
## m2: TargetLooks ~ te(Timebin, TrialIndex) + s(Timebin, Subject, bs = "fs",
## m = 1, k = 5)
##
## Model m1 preferred: lower fREML score (12561.705), and lower df (1.000).
## -----
## Model Score Edf Difference Df
## 1 m2 127943.4 8
## 2 m1 115381.6 7 12561.705 -1.000
##
## AIC difference: -345.21, model m1 has lower AIC.
Timebin
and TrialIndex
is significant?Actually the model without interaction is significant, because it’s fREML score is much lower, and its also has less degrees of freedom to model the data. The model that includes only the two main effects is preferred, so the nonlinear trend over time is changing nonlinearly for different trials, but only with a constant value.
Visualize the interaction:
par(mfrow=c(1,2), cex=1.1)
# plot the partial effect:
plot(m2, select=1, rug=FALSE, main="Interaction",
cex.axis=2, cex.lab=2) # parameters for increasing font size
# Now with colors:
pvisgam(m2, select=1, view=c("Timebin", "TrialIndex"), zlim=c(-5,14), main="")
## [1] "Tensor(s) to be plotted: te(Timebin,TrialIndex)"
Timebin
or TrialIndex
?Timebin, because most lines in the surface are vertical. Vertical lines indicate change over time, but a relatively stable pattern over TrialIndex.
Illustration of the interpretation of contour plots
The code for the plot:
par(mfrow=c(2,2), cex=1.1)
# Partial effect of interaction:
pvisgam(m2, select=1, view=c("Timebin", "TrialIndex"), zlim=c(-5,14), main="")
# add a horizontal arrow at TrialIndex=20:
arrows(x0=getFigCoords('p')[1], x1=getFigCoords('f')[2], y0=20, y1=20, col="red", lwd=3, code=2, length=.2, xpd=TRUE)
# add a vertical arrow at Timebin=1250:
arrows(x0=1250, x1=1250, y0=getFigCoords('p')[4], y1=getFigCoords('f')[3], col="orange", lwd=3, code=2, length=.2, xpd=TRUE)
# second plot, TrialIndex=20:
fv <- get_modelterm(m2, select=1, cond=list(TrialIndex=20), as.data.frame=TRUE)
emptyPlot(range(fv$Timebin), c(-5,14), h0=0, main="TrialIndex=20", ylab="Partial effect", xlab="Timebin")
plot_error(fv$Timebin, fv$fit, fv$se.fit, shade=TRUE, col='red')
# third plot, Timebin = 1250:
fv <- get_modelterm(m2, select=1, cond=list(Timebin=1250), as.data.frame=TRUE)
emptyPlot(range(fv$TrialIndex), c(-5,14), h0=0, main="Timebin=1250", ylab="Partial effect", xlab="TrialIndex")
plot_error(fv$TrialIndex, fv$fit, fv$se.fit, shade=TRUE, col='orange')
How would the summed effect plot look like? What would change?
How do the random effects change the interaction surface?
The summed effects start with the value for the intercept (3.42, illustrated in the top-left panel in the plot below). The interaction surface (center-top panel in the plot below) modeling the interaction between Timebin
and TrialIndex
is added. In addition, one of the subjects is selected by the model (top-right panel in the plot below), and this random adjustment over time added to the surface and intercept (summed effect in center-bottom panel, and the tranformed summed effect in the bottom-right panel). It changes the time pattern (e.g., leveling off the second half of the time course), but not the effect of TrialIndex
.
Plot the summed effects (note: pvisgam
is for Partial effects, fvisgam
for Fitted values):
fvisgam(m2, view=c("Timebin", "TrialIndex"), zlim=c(-2,12), main="Interaction", rm.ranef=TRUE)
negative values: more looks to onset competitor;
positive values: more looks to target.
To plot the summed effects in proportion space, add transform=plogis
:
fvisgam(m2, view=c("Timebin", "TrialIndex"), zlim=c(0,1),
main="Interaction", rm.ranef=TRUE, transform=plogis, dec=1)
To test if native English participants differ from nonnative English participants, we would like to include the factor NativeEnglish
.
Run a new GAMM model that splits the nonlinear interaction between Timebin
and TrialIndex
by NativeEnglish
. Don’t forget to include an intercept term (parametric) for NativeEnglish
. In addition add a random smooth for subjects (Subject
) over Timebin
(include k=5
for the random effect to reduce the running time).
countdat$NativeEnglish <- as.factor(countdat$NativeEnlish)
m3 <- bam(TargetLooks ~ NativeEnglish + te(Timebin, TrialIndex, by=NativeEnglish)
+ s(Timebin, Subject, bs='fs', m=1, k=5),
data=countdat, family="binomial")
Inspect the summary of model m3
.
summary(m3)
##
## Family: binomial
## Link function: logit
##
## Formula:
## TargetLooks ~ NativeEnglish + te(Timebin, TrialIndex, by = NativeEnglish) +
## s(Timebin, Subject, bs = "fs", m = 1, k = 5)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.5358 0.8998 3.930 8.5e-05 ***
## NativeEnglishNonnative -0.2486 1.1931 -0.208 0.835
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## te(Timebin,TrialIndex):NativeEnglishNative 21.59 22.36 685.8 <2e-16
## te(Timebin,TrialIndex):NativeEnglishNonnative 21.80 22.46 1721.3 <2e-16
## s(Timebin,Subject) 111.61 123.00 4842.4 <2e-16
##
## te(Timebin,TrialIndex):NativeEnglishNative ***
## te(Timebin,TrialIndex):NativeEnglishNonnative ***
## s(Timebin,Subject) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.411 Deviance explained = 19.9%
## fREML = 1.3862e+06 Scale est. = 1 n = 20958
Timebin
and Trial
for the two groups?The two interaction surfaces are significantly different from a flat surface through zero. The edf of the the surfaces are very similar, so the surfaces are equally wiggly.
The surfaces do not differ significantly overall (the intercept difference NativeEnglishNonnative
is not significant, p>.1), but we cannot say whether the two surfaces differ from each other based on the summary. We need to use model comparisons or difference plots.
Run a model comparison using the function compareML
for the models m2
and m3
.
infoMessages("on")
compareML(m2, m3)
## m2: TargetLooks ~ te(Timebin, TrialIndex) + s(Timebin, Subject, bs = "fs",
## m = 1, k = 5)
##
## m3: TargetLooks ~ NativeEnglish + te(Timebin, TrialIndex, by = NativeEnglish) +
## s(Timebin, Subject, bs = "fs", m = 1, k = 5)
##
## Model m2 preferred: lower fREML score (1258242.166), and lower df (6.000).
## -----
## Model Score Edf Difference Df
## 1 m3 1386185.5 14
## 2 m2 127943.4 8 1258242.166 -6.000
##
## AIC difference: -355.06, model m2 has lower AIC.
Model m2
has a lower fREML score and less degrees of freedom, so this model is prefered. This implies that the interaction between NativeEnglish and Timebin and TrialIndex is not significantly explaining more variance in the data than the model without NativeEnglish
.
Plot the two surfaces and the difference between the two surfaces:
par(mfrow=c(1,3))
fvisgam(m3, view=c("Timebin","TrialIndex"),
cond=list(NativeEnglish="Native"),
main="Native", rm.ranef=TRUE, zlim=c(-3,20))
fvisgam(m3, view=c("Timebin","TrialIndex"),
cond=list(NativeEnglish="Nonnative"),
main="Nonnative", rm.ranef=TRUE, zlim=c(-3,20))
plot_diff2(m3, view=c("Timebin","TrialIndex"), plotCI=TRUE,
comp=list(NativeEnglish=c("Native","Nonnative")),
main="Difference", rm.ranef=TRUE, zlim=c(-2,8))
Do these plots support your earlier conclusion about the difference between native and nonnative English participants?
There seems to be some difference at the end of the timeseries (Timebin) and at the end of the experiment (TrialIndex, i.e., upper-right corner), where the nonnative participants less likely to look at the target than native participants. However, this difference is not found to be significant with model comparisons.
Decompose the nonlinear interaction between Timebin
and Time
, as included in model m2
in two main effects and an interaction surface (hint: use ti()
). Also include a random smooth for subjects (Subject
) over Timebin
(include k=5
for the random effect to reduce the running time).
m4 <- bam(TargetLooks ~ s(Timebin) + s(TrialIndex)
+ ti(Timebin, TrialIndex)
+ s(Timebin, Subject, bs='fs', m=1, k=5),
data=countdat, family="binomial")
Inspect the summary of model m4
.
summary(m4)
##
## Family: binomial
## Link function: logit
##
## Formula:
## TargetLooks ~ s(Timebin) + s(TrialIndex) + ti(Timebin, TrialIndex) +
## s(Timebin, Subject, bs = "fs", m = 1, k = 5)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.5086 0.5446 6.442 1.18e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Timebin) 7.562 8.061 228.8 <2e-16 ***
## s(TrialIndex) 7.312 8.354 329.8 <2e-16 ***
## ti(Timebin,TrialIndex) 14.938 15.815 1057.7 <2e-16 ***
## s(Timebin,Subject) 112.027 124.000 6479.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.414 Deviance explained = 21.2%
## fREML = 1.2594e+05 Scale est. = 1 n = 20958
The partial interaction surface, that modulates the two main effects s(Timebin)
and s(TrialIndex)
and the intercept.
Timebin
and TrialIndex
? The interaction surface is significantly different from a flat zero surface. As this is the partial interaction term, we can conclude that interaction is significantly adding to the two main effects. With the te()
interaction the two main effects are included in the interaction. Then we cannot conclude that the interaction is significant, or whether the two main effects render the total interaction significant. (Note that we cannot use the significance of a ti
term in the summary when is a categorical variable is included in the interaction.)
Plot the partial effect of the interaction in model m4
and compare these with the plots of model m2
.
par(mfrow=c(1,2))
pvisgam(m4, view=c("Timebin","TrialIndex"),
select=3,
main="model m4",
zlim=c(-2,2))
pvisgam(m2, view=c("Timebin","TrialIndex"),
select=1,
main="model m2",
zlim=c(-8,8))
Model m4 has four smooth terms to model the data, but model m2 only two. In addition, the smooth terms of model m4 allow for more wigglyness: the s
functions have k set to 10 by default (so these can use 9 basefunctions), whereas the k for te
is by default set to 5. Another difference is that the s
function uses a different, more precise, class of base function: “tp” instead of “cr”. So model m4 has more freedom and more precision to model the data.
Note, however, that the decompositional model sometimes provides more precise patterns than we can interpret.
Step by step add interactions with NativeEnglish
to model m4
to see if the main effects and interaction are influenced by native language. Use model comparisons to determine the best fitting model.
Different ways are possible. This would be my solution:
# full model:
m5 <- bam(TargetLooks ~ NativeEnglish
+ s(Timebin, by=NativeEnglish)
+ s(TrialIndex, by=NativeEnglish)
+ ti(Timebin, TrialIndex, by=NativeEnglish)
+ s(Timebin, Subject, bs='fs', m=1, k=5),
data=countdat, family="binomial")
compareML(m4, m5)
## m4: TargetLooks ~ s(Timebin) + s(TrialIndex) + ti(Timebin, TrialIndex) +
## s(Timebin, Subject, bs = "fs", m = 1, k = 5)
##
## m5: TargetLooks ~ NativeEnglish + s(Timebin, by = NativeEnglish) +
## s(TrialIndex, by = NativeEnglish) + ti(Timebin, TrialIndex,
## by = NativeEnglish) + s(Timebin, Subject, bs = "fs", m = 1,
## k = 5)
##
## Model m4 preferred: lower fREML score (1838320.678), and lower df (8.000).
## -----
## Model Score Edf Difference Df
## 1 m5 1964258.9 18
## 2 m4 125938.2 10 1838320.678 -8.000
##
## AIC difference: -686.24, model m4 has lower AIC.
# Does not seem very good... let's check one by one.
# remove most complex interaction:
m5a <- bam(TargetLooks ~ NativeEnglish
+ s(Timebin, by=NativeEnglish)
+ s(TrialIndex, by=NativeEnglish)
+ ti(Timebin, TrialIndex)
+ s(Timebin, Subject, bs='fs', m=1, k=5),
data=countdat, family="binomial")
# m5a is preferred, so we delete the interaction
compareML(m5, m5a)
## m5: TargetLooks ~ NativeEnglish + s(Timebin, by = NativeEnglish) +
## s(TrialIndex, by = NativeEnglish) + ti(Timebin, TrialIndex,
## by = NativeEnglish) + s(Timebin, Subject, bs = "fs", m = 1,
## k = 5)
##
## m5a: TargetLooks ~ NativeEnglish + s(Timebin, by = NativeEnglish) +
## s(TrialIndex, by = NativeEnglish) + ti(Timebin, TrialIndex) +
## s(Timebin, Subject, bs = "fs", m = 1, k = 5)
##
## Model m5a preferred: lower fREML score (1832594.201), and lower df (3.000).
## -----
## Model Score Edf Difference Df
## 1 m5 1964258.9 18
## 2 m5a 131664.7 15 -1832594.201 3.000
##
## AIC difference: 410.05, model m5a has lower AIC.
# remove next interaction:
m5b <- bam(TargetLooks ~ NativeEnglish
+ s(Timebin, by=NativeEnglish)
+ s(TrialIndex)
+ ti(Timebin, TrialIndex)
+ s(Timebin, Subject, bs='fs', m=1, k=5),
data=countdat, family="binomial")
# m5a is preferred, so we DON'T delete the interaction
compareML(m5a, m5b)
## m5a: TargetLooks ~ NativeEnglish + s(Timebin, by = NativeEnglish) +
## s(TrialIndex, by = NativeEnglish) + ti(Timebin, TrialIndex) +
## s(Timebin, Subject, bs = "fs", m = 1, k = 5)
##
## m5b: TargetLooks ~ NativeEnglish + s(Timebin, by = NativeEnglish) +
## s(TrialIndex) + ti(Timebin, TrialIndex) + s(Timebin, Subject,
## bs = "fs", m = 1, k = 5)
##
## Chi-square test of fREML scores
## -----
## Model Score Edf Chisq Df p.value Sig.
## 1 m5b 132748.3 13
## 2 m5a 131664.7 15 1083.639 2.000 < 2e-16 ***
##
## AIC difference: -65.32, model m5a has lower AIC.
# remove next interaction:
m5b <- bam(TargetLooks ~ NativeEnglish
+ s(Timebin)
+ s(TrialIndex, by=NativeEnglish)
+ ti(Timebin, TrialIndex)
+ s(Timebin, Subject, bs='fs', m=1, k=5),
data=countdat, family="binomial")
# m5b is preferred, so we delete the interaction
compareML(m5a, m5b)
## m5a: TargetLooks ~ NativeEnglish + s(Timebin, by = NativeEnglish) +
## s(TrialIndex, by = NativeEnglish) + ti(Timebin, TrialIndex) +
## s(Timebin, Subject, bs = "fs", m = 1, k = 5)
##
## m5b: TargetLooks ~ NativeEnglish + s(Timebin) + s(TrialIndex, by = NativeEnglish) +
## ti(Timebin, TrialIndex) + s(Timebin, Subject, bs = "fs",
## m = 1, k = 5)
##
## Model m5b preferred: lower fREML score (6271.126), and lower df (2.000).
## -----
## Model Score Edf Difference Df
## 1 m5a 131664.7 15
## 2 m5b 125393.6 13 -6271.126 2.000
##
## AIC difference: 404.18, model m5b has lower AIC.
# Now we test again the interaction:
m5c <- bam(TargetLooks ~ NativeEnglish
+ s(Timebin)
+ s(TrialIndex)
+ ti(Timebin, TrialIndex)
+ s(Timebin, Subject, bs='fs', m=1, k=5),
data=countdat, family="binomial")
# m5b is preferred, so we DON'T delete the interaction
compareML(m5b, m5c)
## m5b: TargetLooks ~ NativeEnglish + s(Timebin) + s(TrialIndex, by = NativeEnglish) +
## ti(Timebin, TrialIndex) + s(Timebin, Subject, bs = "fs",
## m = 1, k = 5)
##
## m5c: TargetLooks ~ NativeEnglish + s(Timebin) + s(TrialIndex) + ti(Timebin,
## TrialIndex) + s(Timebin, Subject, bs = "fs", m = 1, k = 5)
##
## Chi-square test of fREML scores
## -----
## Model Score Edf Chisq Df p.value Sig.
## 1 m5c 127257.4 11
## 2 m5b 125393.6 13 1863.831 2.000 < 2e-16 ***
##
## AIC difference: -67.83, model m5b has lower AIC.
# now we compare m4 and m5, to test whether trialindex by native english is significant:
# m5b is preferred:
compareML(m4, m5b)
## m4: TargetLooks ~ s(Timebin) + s(TrialIndex) + ti(Timebin, TrialIndex) +
## s(Timebin, Subject, bs = "fs", m = 1, k = 5)
##
## m5b: TargetLooks ~ NativeEnglish + s(Timebin) + s(TrialIndex, by = NativeEnglish) +
## ti(Timebin, TrialIndex) + s(Timebin, Subject, bs = "fs",
## m = 1, k = 5)
##
## Chi-square test of fREML scores
## -----
## Model Score Edf Chisq Df p.value Sig.
## 1 m4 125938.2 10
## 2 m5b 125393.6 13 544.649 3.000 < 2e-16 ***
##
## AIC difference: 127.99, model m5b has lower AIC.
Best-fitting model:
summary(m5b)
##
## Family: binomial
## Link function: logit
##
## Formula:
## TargetLooks ~ NativeEnglish + s(Timebin) + s(TrialIndex, by = NativeEnglish) +
## ti(Timebin, TrialIndex) + s(Timebin, Subject, bs = "fs",
## m = 1, k = 5)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.1026 0.7842 5.231 1.68e-07 ***
## NativeEnglishNonnative -1.0595 1.0104 -1.049 0.294
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Timebin) 7.571 8.068 230.6 <2e-16 ***
## s(TrialIndex):NativeEnglishNative 8.787 8.987 216.2 <2e-16 ***
## s(TrialIndex):NativeEnglishNonnative 8.763 8.983 445.4 <2e-16 ***
## ti(Timebin,TrialIndex) 14.951 15.825 983.3 <2e-16 ***
## s(Timebin,Subject) 111.082 123.000 5725.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.415 Deviance explained = 21.7%
## fREML = 1.2539e+05 Scale est. = 1 n = 20958
par(mfrow=c(1,2))
plot(m5b, select=2, shade=TRUE, scale=0,
main="Native", ylim=c(-1,.5))
abline(h=0)
plot(m5b, select=3, shade=TRUE, scale=0,
main="Nonnative", ylim=c(-1,.5))
abline(h=0)
Yes, there is a difference between Native and Nonnative English participants in how their looks change over the course over the experiment: an nonlinear interaction between TrialIndex and NativeEnglish. Although this effect is relatively small:
par(mfrow=c(1,2))
fvisgam(m5b, view=c("Timebin", "TrialIndex"),
main="Native English",
cond=list(NativeEnglish="Native"),
zlim=c(0,1), transform=plogis,
rm.ranef=TRUE, dec=1)
fvisgam(m5b, view=c("Timebin", "TrialIndex"),
main="Nonnative English",
cond=list(NativeEnglish="Nonnative"),
zlim=c(0,1), transform=plogis,
rm.ranef=TRUE, dec=1)
par(mfrow=c(1,2))
# check distribution:
qqnorm(resid(m5b))
qqline(resid(m5b))
# check autocorrelation:
acf(resid(m5b))
The distribution does not look normal, but that is no problem, because we model binomial data (which is not supposed to be normally distributed). The autocorrelation in the residuals is too heigh, but not extreme. However, it would be better to reduce it, because now the model may be too liberal.
Note however, that for binomial data we cannot include an AR1 model. So we either need to improve the model fit, or downsample the data.
te()
and ti()
? Name an advantage and a disadvantage of using ti()
. - te()
includes the main effects in the interaction, whereas ti()
models the interaction surface separately from the main effects. - Using ti()
has the advantage that we with continuous predictors could use the summary to estimate whether an interaction is significant. - Using ti()
gives the model more freedom to fit the data, so it is more precise. - However, often the ti()
models are difficult to interpret. All the terms need to be added to see the complete picture. - Often the decomposed models provide too much information: what does it mean that an interaction ti(A,B,C, by=G)
is significant, but ti(A,B, by=G)
not? Often te()
models are better interpretable.
cbind
for analyzing this difference? (For this assignment you do not have to perform that analysis.)
countdat$TargetLooks <- cbind(countdat$InterestAreaLabel.Target_IA, countdat$InterestAreaLabel.RhymeComp_IA)
Hand in the R code, R output, the plots, and the answers on the questions in the section ‘d. Model criticism’.
Deadline: before next class.
Format: html document (R Markdown)
An easy way to include everything in a single document is by using R markdown.
Quick instructions for R Markdown in R Studio: www.sfs.uni-tuebingen.de/~jvanrij/Rinfo.html)
When using NOT using R Studio, try the following steps:
download.file('http://www.let.rug.nl/~wieling/statscourse/lecture4/lab/answers/lab-including-answers.Rmd', 'analysis.Rmd')
library(rmarkdown)
render('analysis.Rmd') # generates html file with results
browseURL(paste('file://', file.path(getwd(),'analysis.html'),
sep='')) # shows result
Alternatively you could hand in a pdf that contains all the information.
The assignments can be send by mail to vanrij.jacolien ‘AT’ gmail.com, with “LSA lab 2” in the subject line.Run the demo about nonlinear three-way interactions ( te("Longitude", "Latitude", "WordFrequency")
) as follows:
library(nonlinear)
demoInteraction()
Note that the data for this lab session is saved in a different format as the data of previous lab session. Here are some examples for loading frequently used data types in R:
.Rdata
or .rda
files contain one or more R objects with names.# load some (not existing) file in R:
load('file.rda')
# use ls() to see which objects are loaded in your workspace:
ls()
.rds
files contain one R object without the name. Therefore, when read in R the object must be stored into a variable.# load some (not existing) file in R:
newdat <- readRDS('file.rds')
# use str() to see what object is loaded:
str(newdat)
# for tab delimited text with header (non-existing file):
dat <- read.table('file.txt', header=TRUE, sep='\t', dec='.')
# for cvs file, exported from MS Excel:
dat <- read.table('file.cvs', header=TRUE, sep=',', dec='.')
# see help file for other options to set:
help(read.table)