This file contains the answers to the lab session: http://www.let.rug.nl/wieling/BCN-Stats/lecture3/lab.
We will first download the data (which was saved as an R data frame in an .rda
file) and load it into R.
download.file("http://www.let.rug.nl/wieling/BCN-Stats/lecture3/lab/tv.rda", "tv.rda")
load("tv.rda") # an rda file can be loaded with the command load
Investigate the data with descriptive statistics and plots.
str(tv)
# 'data.frame': 80 obs. of 7 variables:
# $ subj : Factor w/ 80 levels "1","2","3","4",..: 34 23 69 73 77 15 58 60 76 79 ...
# $ mot.education: num 8 17 16 16 16 12 16 20 16 16 ...
# $ tv.hours : num 18 15 12 23 9 15 10 10 17 10 ...
# $ daycare.hours: num 0 0 0 0 0 0 0 0 0 0 ...
# $ book.reading : num 2 4 5 5 5 3 5 3 4 3 ...
# $ cdi : num 97 99 99 99 99 100 100 100 100 100 ...
# $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...
summary(tv)
# subj mot.education tv.hours daycare.hours book.reading cdi
# 1 : 1 Min. : 8.0 Min. : 2.0 Min. : 0.0 Min. :1.00 Min. : 97.0
# 2 : 1 1st Qu.:12.0 1st Qu.: 9.0 1st Qu.:10.0 1st Qu.:3.00 1st Qu.: 99.0
# 3 : 1 Median :16.0 Median :11.0 Median :20.0 Median :4.00 Median :100.0
# 4 : 1 Mean :14.8 Mean :11.7 Mean :20.4 Mean :3.94 Mean : 99.9
# 5 : 1 3rd Qu.:16.0 3rd Qu.:14.0 3rd Qu.:31.2 3rd Qu.:5.00 3rd Qu.:101.0
# 6 : 1 Max. :22.0 Max. :23.0 Max. :41.0 Max. :6.00 Max. :103.0
# (Other):74
# gender
# female:40
# male :40
head(tv)
# subj mot.education tv.hours daycare.hours book.reading cdi gender
# 34 34 8 18 0 2 97 male
# 23 23 17 15 0 4 99 male
# 69 69 16 12 0 5 99 male
# 73 73 16 23 0 5 99 male
# 77 77 16 9 0 5 99 male
# 15 15 12 15 0 3 100 female
table(tv$gender)
#
# female male
# 40 40
cor(tv[, c("tv.hours", "mot.education", "book.reading", "daycare.hours")])
# tv.hours mot.education book.reading daycare.hours
# tv.hours 1.000 -0.292 -0.1860 -0.1071
# mot.education -0.292 1.000 0.4066 -0.0910
# book.reading -0.186 0.407 1.0000 -0.0852
# daycare.hours -0.107 -0.091 -0.0852 1.0000
par(mfrow = c(1, 2))
hist(tv$tv.hours)
boxplot(tv$tv.hours)
hist(tv$mot.education)
boxplot(tv$mot.education)
hist(tv$book.reading)
boxplot(tv$book.reading)
hist(tv$daycare.hours)
boxplot(tv$daycare.hours)
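As an optional extra (not part of the original answers), a scatterplot matrix gives a compact overview of the pairwise relations between cdi and the numeric predictors.
par(mfrow = c(1, 1)) # reset the plotting layout
# optional: scatterplot matrix of cdi and the numeric predictors
pairs(tv[, c("cdi", "tv.hours", "mot.education", "book.reading", "daycare.hours")])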
In this section, we will fit the best model for the data, predicting the cdi
language development score on the basis of various predictors.
Fit the best model without taking into account interactions.
summary(m0 <- lm(cdi ~ tv.hours, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.6907 -0.6907 -0.0407 0.6343 3.1594
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 101.6400 0.3863 263.1 < 2e-16 ***
# tv.hours -0.1500 0.0312 -4.8 7.4e-06 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 1.13 on 78 degrees of freedom
# Multiple R-squared: 0.228, Adjusted R-squared: 0.218
# F-statistic: 23.1 on 1 and 78 DF, p-value: 7.44e-06
summary(m1 <- lm(cdi ~ tv.hours + gender, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + gender, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.9936 -0.5485 0.0416 0.5544 2.3397
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 101.5545 0.2937 345.83 < 2e-16 ***
# tv.hours -0.0745 0.0257 -2.90 0.0049 **
# gendermale -1.5921 0.2087 -7.63 5.2e-11 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.861 on 77 degrees of freedom
# Multiple R-squared: 0.56, Adjusted R-squared: 0.549
# F-statistic: 49.1 on 2 and 77 DF, p-value: 1.8e-14
AIC(m0) - AIC(m1)
# [1] 43
summary(m2 <- lm(cdi ~ tv.hours + gender + mot.education, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.7963 -0.4824 0.0672 0.5274 2.2577
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.7868 0.5347 186.62 < 2e-16 ***
# tv.hours -0.0534 0.0243 -2.20 0.03114 *
# gendermale -1.4936 0.1940 -7.70 4.1e-11 ***
# mot.education 0.0998 0.0260 3.83 0.00026 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.794 on 76 degrees of freedom
# Multiple R-squared: 0.632, Adjusted R-squared: 0.617
# F-statistic: 43.4 on 3 and 76 DF, p-value: <2e-16
AIC(m1) - AIC(m2)
# [1] 12.1
summary(m3 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading,
# data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.076 -0.457 0.070 0.453 2.175
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.4418 0.5440 182.80 < 2e-16 ***
# tv.hours -0.0542 0.0237 -2.29 0.0250 *
# gendermale -1.3820 0.1957 -7.06 7.1e-10 ***
# mot.education 0.0783 0.0271 2.89 0.0051 **
# book.reading 0.1561 0.0702 2.22 0.0292 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.774 on 75 degrees of freedom
# Multiple R-squared: 0.654, Adjusted R-squared: 0.636
# F-statistic: 35.5 on 4 and 75 DF, p-value: <2e-16
AIC(m2) - AIC(m3)
# [1] 3.1
summary(m4 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading + daycare.hours, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading +
# daycare.hours, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.028 -0.479 0.171 0.452 2.093
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.04027 0.58484 169.35 < 2e-16 ***
# tv.hours -0.04863 0.02362 -2.06 0.0430 *
# gendermale -1.38150 0.19315 -7.15 5.1e-10 ***
# mot.education 0.08281 0.02691 3.08 0.0029 **
# book.reading 0.16365 0.06945 2.36 0.0211 *
# daycare.hours 0.01182 0.00683 1.73 0.0876 .
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.764 on 74 degrees of freedom
# Multiple R-squared: 0.668, Adjusted R-squared: 0.645
# F-statistic: 29.8 on 5 and 74 DF, p-value: <2e-16
AIC(m3) - AIC(m4) # AIC difference smaller than 2, so we prefer the simpler model m3
# [1] 1.18
step(m4) # selects the full model, but I'm more conservative and would select the simpler model (< 2 AIC difference)
# Start: AIC=-37.4
# cdi ~ tv.hours + gender + mot.education + book.reading + daycare.hours
#
# Df Sum of Sq RSS AIC
# <none> 43.2 -37.3
# - daycare.hours 1 1.75 44.9 -36.2
# - tv.hours 1 2.47 45.6 -34.9
# - book.reading 1 3.24 46.4 -33.6
# - mot.education 1 5.52 48.7 -29.7
# - gender 1 29.84 73.0 2.7
#
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading +
# daycare.hours, data = tv)
#
# Coefficients:
# (Intercept) tv.hours gendermale mot.education book.reading daycare.hours
# 99.0403 -0.0486 -1.3815 0.0828 0.1636 0.0118
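As an optional cross-check on the AIC comparison (not part of the original answers), the nested models m3 and m4 can also be compared with an F-test.
# optional: F-test comparing m3 (without daycare.hours) to m4 (with daycare.hours)
anova(m3, m4)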
We now check the assumptions of the selected model m3.
library(car) # if not installed: install.packages('car')
# linearity
par(mfrow = c(1, 3))
crPlot(m3, var = "tv.hours") # linearity not perfect
crPlot(m3, var = "mot.education")
crPlot(m3, var = "book.reading")
par(mfrow = c(1, 1))
# homoscedasticity: OK
plot(m3, which = 1)
ncvTest(m3)
# Non-constant Variance Score Test
# Variance formula: ~ fitted.values
# Chisquare = 0.00474 Df = 1 p = 0.945
ncvTest(m3, ~tv.hours)
# Non-constant Variance Score Test
# Variance formula: ~ tv.hours
# Chisquare = 0.0104 Df = 1 p = 0.919
ncvTest(m3, ~mot.education)
# Non-constant Variance Score Test
# Variance formula: ~ mot.education
# Chisquare = 0.103 Df = 1 p = 0.748
ncvTest(m3, ~book.reading)
# Non-constant Variance Score Test
# Variance formula: ~ book.reading
# Chisquare = 0.142 Df = 1 p = 0.707
ncvTest(m3, ~gender)
# Non-constant Variance Score Test
# Variance formula: ~ gender
# Chisquare = 0.0217 Df = 1 p = 0.883
# multicollinearity: OK
vif(m3)
# tv.hours gendermale mot.education book.reading
# 1.24 1.28 1.27 1.29
# autocorrelation in residuals: OK
acf(resid(m3))
# distribution of residuals: OK
plot(m3, which = 2)
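As an optional addition to the Q-Q plot (not part of the original answers), a Shapiro-Wilk test provides a formal check of the normality of the residuals.
# optional: formal test of residual normality
shapiro.test(resid(m3))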
Fit the best model while taking into account interactions. For simplicity and speed, we’ll only investigate potential interactions with gender.
summary(m4 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading + daycare.hours * gender, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading +
# daycare.hours * gender, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.999 -0.527 0.154 0.400 2.132
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.07330 0.58197 170.24 < 2e-16 ***
# tv.hours -0.04857 0.02348 -2.07 0.0421 *
# gendermale -1.74356 0.32759 -5.32 1.1e-06 ***
# mot.education 0.08851 0.02708 3.27 0.0017 **
# book.reading 0.17872 0.06992 2.56 0.0127 *
# daycare.hours 0.00261 0.00957 0.27 0.7855
# gendermale:daycare.hours 0.01893 0.01388 1.36 0.1767
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.759 on 73 degrees of freedom
# Multiple R-squared: 0.676, Adjusted R-squared: 0.65
# F-statistic: 25.4 on 6 and 73 DF, p-value: 4.55e-16
AIC(m3) - AIC(m4) # again a difference smaller than 2, so m3 is still preferred
# [1] 1.19
summary(m4 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading * gender, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading *
# gender, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.0420 -0.4569 0.0724 0.4699 2.1632
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.3498 0.6241 159.19 <2e-16 ***
# tv.hours -0.0544 0.0239 -2.28 0.0255 *
# gendermale -1.2218 0.5583 -2.19 0.0318 *
# mot.education 0.0789 0.0274 2.88 0.0051 **
# book.reading 0.1752 0.0943 1.86 0.0671 .
# gendermale:book.reading -0.0406 0.1323 -0.31 0.7600
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.779 on 74 degrees of freedom
# Multiple R-squared: 0.655, Adjusted R-squared: 0.632
# F-statistic: 28.1 on 5 and 74 DF, p-value: 7.74e-16
AIC(m3) - AIC(m4) # m3 is still the best model
# [1] -1.9
summary(m4 <- lm(cdi ~ tv.hours + gender + mot.education * gender + book.reading, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education * gender +
# book.reading, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.021 -0.498 0.027 0.438 2.191
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.9439 0.6962 143.56 <2e-16 ***
# tv.hours -0.0551 0.0237 -2.33 0.0226 *
# gendermale -2.2277 0.7597 -2.93 0.0045 **
# mot.education 0.0475 0.0381 1.25 0.2165
# book.reading 0.1533 0.0701 2.19 0.0319 *
# gendermale:mot.education 0.0571 0.0496 1.15 0.2531
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.772 on 74 degrees of freedom
# Multiple R-squared: 0.661, Adjusted R-squared: 0.638
# F-statistic: 28.8 on 5 and 74 DF, p-value: 4.26e-16
AIC(m3) - AIC(m4) # m3 is still the best model
# [1] -0.578
summary(m4 <- lm(cdi ~ tv.hours * gender + gender + mot.education + book.reading, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours * gender + gender + mot.education +
# book.reading, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.0725 -0.5370 0.0557 0.4938 2.1296
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.1812 0.6169 160.77 <2e-16 ***
# tv.hours -0.0296 0.0363 -0.82 0.4175
# gendermale -0.9001 0.5704 -1.58 0.1188
# mot.education 0.0799 0.0272 2.93 0.0044 **
# book.reading 0.1531 0.0704 2.17 0.0329 *
# tv.hours:gendermale -0.0422 0.0469 -0.90 0.3714
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.775 on 74 degrees of freedom
# Multiple R-squared: 0.658, Adjusted R-squared: 0.635
# F-statistic: 28.5 on 5 and 74 DF, p-value: 5.47e-16
AIC(m3) - AIC(m4) # m3 remains the best model
# [1] -1.13
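As an optional alternative (not part of the original answers), all gender interactions can be added in a single model and compared to m3 in one step.
# optional: one model containing all interactions with gender
m.int <- lm(cdi ~ (tv.hours + mot.education + book.reading + daycare.hours) * gender, data = tv)
AIC(m3) - AIC(m.int) # if the difference is negative or smaller than 2, the simpler model m3 is preferred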
Apply model criticism by excluding observations with residuals outside 2.5 SD.
tv2 = tv[abs(scale(resid(m3))) < 2.5, ]
nrow(tv2) # 2 outliers excluded
# [1] 78
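As a small optional check (not part of the original answers), you can list which observations were excluded.
# optional: row indices of the observations with |standardized residual| >= 2.5
which(abs(scale(resid(m3))) >= 2.5)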
summary(m3.2 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading, data = tv2))
#
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading,
# data = tv2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.4693 -0.5174 0.0573 0.4393 1.5407
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.4963 0.4906 202.82 < 2e-16 ***
# tv.hours -0.0599 0.0214 -2.80 0.0066 **
# gendermale -1.2489 0.1791 -6.97 1.2e-09 ***
# mot.education 0.0686 0.0246 2.79 0.0068 **
# book.reading 0.1785 0.0643 2.78 0.0069 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.698 on 73 degrees of freedom
# Multiple R-squared: 0.683, Adjusted R-squared: 0.665
# F-statistic: 39.3 on 4 and 73 DF, p-value: <2e-16
par(mfrow = c(1, 2))
plot(m3, which = 2, main = "m3")
plot(m3.2, which = 2, main = "m3.2")
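As an optional check (not part of the original answers), the coefficients before and after model criticism can be compared side by side.
# optional: compare the coefficients of the original and trimmed models
round(cbind(m3 = coef(m3), m3.2 = coef(m3.2)), 3)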
Check for overfitting using the validate() function of the rms package.
library(rms) # if not installed: install.packages('rms')
tv.val <- ols(cdi ~ tv.hours + gender + mot.education + book.reading, data = tv2, x = TRUE, y = TRUE)
validate(tv.val, B = 1000) # the optimism of the slope and R-squared estimates is small, so there is no clear sign of overfitting
# index.orig training test optimism index.corrected n
# R-square 0.683 0.691 0.662 0.0289 0.654 1000
# MSE 0.456 0.431 0.485 -0.0544 0.510 1000
# g 1.151 1.141 1.142 -0.0002 1.152 1000
# Intercept 0.000 0.000 0.355 -0.3554 0.355 1000
# Slope 1.000 1.000 0.997 0.0035 0.997 1000
Conduct bootstrap sampling with 1000 repetitions.
m3.boot <- Boot(m3.2, R = 1000)
summary(m3.boot)
# R original bootBias bootSE bootMed
# (Intercept) 1000 99.4963 4.02e-03 0.4236 99.4962
# tv.hours 1000 -0.0599 5.06e-05 0.0197 -0.0605
# gendermale 1000 -1.2489 1.67e-02 0.1476 -1.2334
# mot.education 1000 0.0686 -1.41e-03 0.0227 0.0667
# book.reading 1000 0.1785 2.08e-03 0.0648 0.1794
confint(m3.boot, type = "norm")
# Bootstrap quantiles, type = normal
#
# 2.5 % 97.5 %
# (Intercept) 98.6619 100.3226
# tv.hours -0.0987 -0.0213
# gendermale -1.5549 -0.9763
# mot.education 0.0255 0.1146
# book.reading 0.0494 0.3035
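The same bootstrap object can also yield percentile intervals; this optional comparison assumes that type = "perc" is accepted by confint() for Boot objects, as listed in the car documentation.
# optional: percentile bootstrap intervals (assumes type = "perc" is supported)
confint(m3.boot, type = "perc")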
Obtain the effect sizes, both of the full model and of the individual predictors.
library(lmSupport) # if not installed: install.packages('lmSupport')
summary(m3.2)$adj.r.squared # adjusted R^2
# [1] 0.665
modelEffectSizes(m3.2) # partial eta-squared
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading,
# data = tv2)
#
# Coefficients
# SSR df pEta-sqr dR-sqr
# (Intercept) 20022.03 1 0.9982 NA
# tv.hours 3.81 1 0.0969 0.0340
# gender 23.67 1 0.3999 0.2115
# mot.education 3.78 1 0.0960 0.0337
# book.reading 3.76 1 0.0956 0.0336
#
# Sum of squared errors (SSE): 35.5
# Sum of squared total (SST): 112.0
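As an optional extra (not part of the original answers), Cohen's f-squared for the full model can be computed by hand from the adjusted R-squared.
# optional: Cohen's f^2 = R^2 / (1 - R^2), here based on the adjusted R^2
R2.adj <- summary(m3.2)$adj.r.squared
R2.adj / (1 - R2.adj)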
From within RStudio, you can simply download this file using the following commands:
# download the original file if it does not already exist (to prevent overwriting)
if (!file.exists("answers.Rmd")) {
download.file("http://www.let.rug.nl/wieling/BCN-Stats/lecture3/lab/answers/answers.Rmd", "answers.Rmd")
}
Subsequently, open it in the editor and use the Knit HTML button to generate the HTML file.
If you use plain R, you first have to install Pandoc. Then copy the following lines into a recent version of R.
# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
install.packages("rmarkdown")
}
library(rmarkdown) # load rmarkdown package
# download the original file if it does not already exist (to prevent overwriting)
if (!file.exists("answers.Rmd")) {
download.file("http://www.let.rug.nl/wieling/BCN-Stats/lecture3/lab/answers/answers.Rmd", "answers.Rmd")
}
# generate output
render("answers.Rmd") # generates html file with results
# view output in browser
browseURL(paste("file://", file.path(getwd(), "answers.html"), sep = "")) # shows result