Introduction

This file contains the answers of the lab session: http://www.let.rug.nl/wieling/BCN-Stats/lecture3/lab.

Importing the data

We will first download the data (which was saved as a dataframe in R: an .rda file), and load it into R.

download.file("http://www.let.rug.nl/wieling/BCN-Stats/lecture3/lab/tv.rda", "tv.rda")
load("tv.rda")  # an rda file can be loaded with the command load

Structure of the data

Investigate the data with descriptive statistics and plots.

str(tv)

# 'data.frame': 80 obs. of  7 variables:
#  $ subj         : Factor w/ 80 levels "1","2","3","4",..: 34 23 69 73 77 15 58 60 76 79 ...
#  $ mot.education: num  8 17 16 16 16 12 16 20 16 16 ...
#  $ tv.hours     : num  18 15 12 23 9 15 10 10 17 10 ...
#  $ daycare.hours: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ book.reading : num  2 4 5 5 5 3 5 3 4 3 ...
#  $ cdi          : num  97 99 99 99 99 100 100 100 100 100 ...
#  $ gender       : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...

summary(tv)

#       subj    mot.education     tv.hours    daycare.hours   book.reading       cdi       
#  1      : 1   Min.   : 8.0   Min.   : 2.0   Min.   : 0.0   Min.   :1.00   Min.   : 97.0  
#  2      : 1   1st Qu.:12.0   1st Qu.: 9.0   1st Qu.:10.0   1st Qu.:3.00   1st Qu.: 99.0  
#  3      : 1   Median :16.0   Median :11.0   Median :20.0   Median :4.00   Median :100.0  
#  4      : 1   Mean   :14.8   Mean   :11.7   Mean   :20.4   Mean   :3.94   Mean   : 99.9  
#  5      : 1   3rd Qu.:16.0   3rd Qu.:14.0   3rd Qu.:31.2   3rd Qu.:5.00   3rd Qu.:101.0  
#  6      : 1   Max.   :22.0   Max.   :23.0   Max.   :41.0   Max.   :6.00   Max.   :103.0  
#  (Other):74                                                                              
#     gender  
#  female:40  
#  male  :40  
#             
#             
#             
#             
#

head(tv)

#    subj mot.education tv.hours daycare.hours book.reading cdi gender
# 34   34             8       18             0            2  97   male
# 23   23            17       15             0            4  99   male
# 69   69            16       12             0            5  99   male
# 73   73            16       23             0            5  99   male
# 77   77            16        9             0            5  99   male
# 15   15            12       15             0            3 100 female

table(tv$gender)

# 
# female   male 
#     40     40

cor(tv[, c("tv.hours", "mot.education", "book.reading", "daycare.hours")])

#               tv.hours mot.education book.reading daycare.hours
# tv.hours         1.000        -0.292      -0.1860       -0.1071
# mot.education   -0.292         1.000       0.4066       -0.0910
# book.reading    -0.186         0.407       1.0000       -0.0852
# daycare.hours   -0.107        -0.091      -0.0852        1.0000

par(mfrow = c(1, 2))
hist(tv$tv.hours)
boxplot(tv$tv.hours)

hist(tv$mot.education)
boxplot(tv$mot.education)

hist(tv$book.reading)
boxplot(tv$book.reading)

hist(tv$daycare.hours)
boxplot(tv$daycare.hours)

Regression modeling

In this section, we will fit the best model for the data, predicting the cdi language development score on the basis of various predictors.

The best model without interactions

Fit the best model without taking into account interactions.

summary(m0 <- lm(cdi ~ tv.hours, data = tv))

# 
# Call:
# lm(formula = cdi ~ tv.hours, data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.6907 -0.6907 -0.0407  0.6343  3.1594 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 101.6400     0.3863   263.1  < 2e-16 ***
# tv.hours     -0.1500     0.0312    -4.8  7.4e-06 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 1.13 on 78 degrees of freedom
# Multiple R-squared:  0.228,   Adjusted R-squared:  0.218 
# F-statistic: 23.1 on 1 and 78 DF,  p-value: 7.44e-06

summary(m1 <- lm(cdi ~ tv.hours + gender, data = tv))

# 
# Call:
# lm(formula = cdi ~ tv.hours + gender, data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -1.9936 -0.5485  0.0416  0.5544  2.3397 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 101.5545     0.2937  345.83  < 2e-16 ***
# tv.hours     -0.0745     0.0257   -2.90   0.0049 ** 
# gendermale   -1.5921     0.2087   -7.63  5.2e-11 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.861 on 77 degrees of freedom
# Multiple R-squared:  0.56,    Adjusted R-squared:  0.549 
# F-statistic: 49.1 on 2 and 77 DF,  p-value: 1.8e-14

AIC(m0) - AIC(m1)

# [1] 43

summary(m2 <- lm(cdi ~ tv.hours + gender + mot.education, data = tv))

# 
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education, data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -1.7963 -0.4824  0.0672  0.5274  2.2577 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    99.7868     0.5347  186.62  < 2e-16 ***
# tv.hours       -0.0534     0.0243   -2.20  0.03114 *  
# gendermale     -1.4936     0.1940   -7.70  4.1e-11 ***
# mot.education   0.0998     0.0260    3.83  0.00026 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.794 on 76 degrees of freedom
# Multiple R-squared:  0.632,   Adjusted R-squared:  0.617 
# F-statistic: 43.4 on 3 and 76 DF,  p-value: <2e-16

AIC(m1) - AIC(m2)

# [1] 12.1

summary(m3 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading, data = tv))

# 
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading, 
#     data = tv)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -2.076 -0.457  0.070  0.453  2.175 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    99.4418     0.5440  182.80  < 2e-16 ***
# tv.hours       -0.0542     0.0237   -2.29   0.0250 *  
# gendermale     -1.3820     0.1957   -7.06  7.1e-10 ***
# mot.education   0.0783     0.0271    2.89   0.0051 ** 
# book.reading    0.1561     0.0702    2.22   0.0292 *  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.774 on 75 degrees of freedom
# Multiple R-squared:  0.654,   Adjusted R-squared:  0.636 
# F-statistic: 35.5 on 4 and 75 DF,  p-value: <2e-16

AIC(m2) - AIC(m3)

# [1] 3.1

summary(m4 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading + daycare.hours, data = tv))

# 
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading + 
#     daycare.hours, data = tv)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -2.028 -0.479  0.171  0.452  2.093 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   99.04027    0.58484  169.35  < 2e-16 ***
# tv.hours      -0.04863    0.02362   -2.06   0.0430 *  
# gendermale    -1.38150    0.19315   -7.15  5.1e-10 ***
# mot.education  0.08281    0.02691    3.08   0.0029 ** 
# book.reading   0.16365    0.06945    2.36   0.0211 *  
# daycare.hours  0.01182    0.00683    1.73   0.0876 .  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.764 on 74 degrees of freedom
# Multiple R-squared:  0.668,   Adjusted R-squared:  0.645 
# F-statistic: 29.8 on 5 and 74 DF,  p-value: <2e-16

AIC(m3) - AIC(m4)  # m3 is the best model

# [1] 1.18

step(m4)  # selects, the full model, but I'm more conservative and would select the simpler model (< 2 AIC difference)

# Start:  AIC=-37.4
# cdi ~ tv.hours + gender + mot.education + book.reading + daycare.hours
# 
#                 Df Sum of Sq  RSS   AIC
# <none>                       43.2 -37.3
# - daycare.hours  1      1.75 44.9 -36.2
# - tv.hours       1      2.47 45.6 -34.9
# - book.reading   1      3.24 46.4 -33.6
# - mot.education  1      5.52 48.7 -29.7
# - gender         1     29.84 73.0   2.7

# 
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading + 
#     daycare.hours, data = tv)
# 
# Coefficients:
#   (Intercept)       tv.hours     gendermale  mot.education   book.reading  daycare.hours  
#       99.0403        -0.0486        -1.3815         0.0828         0.1636         0.0118

Assess the regression assumptions

# linearity
library(car)  # if not installed: install.packages('car')
par(mfrow = c(1, 3))
crPlot(m3, var = "tv.hours")  # linearity not perfect
crPlot(m3, var = "mot.education")
crPlot(m3, var = "book.reading")

par(mfrow = c(1, 1))

# homoscedasticity: OK
plot(m3, which = 1)

ncvTest(m3)

# Non-constant Variance Score Test 
# Variance formula: ~ fitted.values 
# Chisquare = 0.00474    Df = 1     p = 0.945

ncvTest(m3, ~tv.hours)

# Non-constant Variance Score Test 
# Variance formula: ~ tv.hours 
# Chisquare = 0.0104    Df = 1     p = 0.919

ncvTest(m3, ~mot.education)

# Non-constant Variance Score Test 
# Variance formula: ~ mot.education 
# Chisquare = 0.103    Df = 1     p = 0.748

ncvTest(m3, ~book.reading)

# Non-constant Variance Score Test 
# Variance formula: ~ book.reading 
# Chisquare = 0.142    Df = 1     p = 0.707

ncvTest(m3, ~gender)

# Non-constant Variance Score Test 
# Variance formula: ~ gender 
# Chisquare = 0.0217    Df = 1     p = 0.883

# multicollinearity: OK
vif(m3)

#      tv.hours    gendermale mot.education  book.reading 
#          1.24          1.28          1.27          1.29

# autocorrelation in residuals: OK
acf(resid(m3))

# distribution of residuals: OK
plot(m3, which = 2)

Are interactions necessary?

Fit the best model while taking into account interactions. For simplicity and speed, we’ll only investigate potential interactions with gender.

summary(m4 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading + daycare.hours * gender, data = tv))

# 
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading + 
#     daycare.hours * gender, data = tv)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -1.999 -0.527  0.154  0.400  2.132 
# 
# Coefficients:
#                          Estimate Std. Error t value Pr(>|t|)    
# (Intercept)              99.07330    0.58197  170.24  < 2e-16 ***
# tv.hours                 -0.04857    0.02348   -2.07   0.0421 *  
# gendermale               -1.74356    0.32759   -5.32  1.1e-06 ***
# mot.education             0.08851    0.02708    3.27   0.0017 ** 
# book.reading              0.17872    0.06992    2.56   0.0127 *  
# daycare.hours             0.00261    0.00957    0.27   0.7855    
# gendermale:daycare.hours  0.01893    0.01388    1.36   0.1767    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.759 on 73 degrees of freedom
# Multiple R-squared:  0.676,   Adjusted R-squared:  0.65 
# F-statistic: 25.4 on 6 and 73 DF,  p-value: 4.55e-16

AIC(m3) - AIC(m4)  # m3 is still the best model

# [1] 1.19

summary(m4 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading * gender, data = tv))

# 
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading * 
#     gender, data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.0420 -0.4569  0.0724  0.4699  2.1632 
# 
# Coefficients:
#                         Estimate Std. Error t value Pr(>|t|)    
# (Intercept)              99.3498     0.6241  159.19   <2e-16 ***
# tv.hours                 -0.0544     0.0239   -2.28   0.0255 *  
# gendermale               -1.2218     0.5583   -2.19   0.0318 *  
# mot.education             0.0789     0.0274    2.88   0.0051 ** 
# book.reading              0.1752     0.0943    1.86   0.0671 .  
# gendermale:book.reading  -0.0406     0.1323   -0.31   0.7600    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.779 on 74 degrees of freedom
# Multiple R-squared:  0.655,   Adjusted R-squared:  0.632 
# F-statistic: 28.1 on 5 and 74 DF,  p-value: 7.74e-16

AIC(m3) - AIC(m4)  # m3 is still the best model

# [1] -1.9

summary(m4 <- lm(cdi ~ tv.hours + gender + mot.education * gender + book.reading, data = tv))

# 
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education * gender + 
#     book.reading, data = tv)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -2.021 -0.498  0.027  0.438  2.191 
# 
# Coefficients:
#                          Estimate Std. Error t value Pr(>|t|)    
# (Intercept)               99.9439     0.6962  143.56   <2e-16 ***
# tv.hours                  -0.0551     0.0237   -2.33   0.0226 *  
# gendermale                -2.2277     0.7597   -2.93   0.0045 ** 
# mot.education              0.0475     0.0381    1.25   0.2165    
# book.reading               0.1533     0.0701    2.19   0.0319 *  
# gendermale:mot.education   0.0571     0.0496    1.15   0.2531    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.772 on 74 degrees of freedom
# Multiple R-squared:  0.661,   Adjusted R-squared:  0.638 
# F-statistic: 28.8 on 5 and 74 DF,  p-value: 4.26e-16

AIC(m3) - AIC(m4)  # m3 is still the best model

# [1] -0.578

summary(m4 <- lm(cdi ~ tv.hours * gender + gender + mot.education + book.reading, data = tv))

# 
# Call:
# lm(formula = cdi ~ tv.hours * gender + gender + mot.education + 
#     book.reading, data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.0725 -0.5370  0.0557  0.4938  2.1296 
# 
# Coefficients:
#                     Estimate Std. Error t value Pr(>|t|)    
# (Intercept)          99.1812     0.6169  160.77   <2e-16 ***
# tv.hours             -0.0296     0.0363   -0.82   0.4175    
# gendermale           -0.9001     0.5704   -1.58   0.1188    
# mot.education         0.0799     0.0272    2.93   0.0044 ** 
# book.reading          0.1531     0.0704    2.17   0.0329 *  
# tv.hours:gendermale  -0.0422     0.0469   -0.90   0.3714    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.775 on 74 degrees of freedom
# Multiple R-squared:  0.658,   Adjusted R-squared:  0.635 
# F-statistic: 28.5 on 5 and 74 DF,  p-value: 5.47e-16

AIC(m3) - AIC(m4)  # m3 remains the best model

# [1] -1.13

Model criticism

Apply model criticism by excluding observations with residuals outside 2.5 SD.

tv2 = tv[abs(scale(resid(m3))) < 2.5, ]
nrow(tv2)  # 2 outliers excluded

# [1] 78

summary(m3.2 <- lm(cdi ~ tv.hours + gender + mot.education + book.reading, data = tv2))

# 
# Call:
# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading, 
#     data = tv2)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -1.4693 -0.5174  0.0573  0.4393  1.5407 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    99.4963     0.4906  202.82  < 2e-16 ***
# tv.hours       -0.0599     0.0214   -2.80   0.0066 ** 
# gendermale     -1.2489     0.1791   -6.97  1.2e-09 ***
# mot.education   0.0686     0.0246    2.79   0.0068 ** 
# book.reading    0.1785     0.0643    2.78   0.0069 ** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.698 on 73 degrees of freedom
# Multiple R-squared:  0.683,   Adjusted R-squared:  0.665 
# F-statistic: 39.3 on 4 and 73 DF,  p-value: <2e-16

par(mfrow = c(1, 2))
plot(m3, which = 2, main = "m3")
plot(m3.2, which = 2, main = "m3.2")

Overfitting

Check for overfitting using the validate() function of library(rms).

library(rms)  # if not installed: install.packages('rms')
tv.val <- ols(cdi ~ tv.hours + gender + mot.education + book.reading, data = tv2, x = TRUE, y = TRUE)
validate(tv.val, B = 1000)  # not too optimistic about the slopes

#           index.orig training  test optimism index.corrected    n
# R-square       0.683    0.691 0.662   0.0289           0.654 1000
# MSE            0.456    0.431 0.485  -0.0544           0.510 1000
# g              1.151    1.141 1.142  -0.0002           1.152 1000
# Intercept      0.000    0.000 0.355  -0.3554           0.355 1000
# Slope          1.000    1.000 0.997   0.0035           0.997 1000

Bootstrap sampling

Conduct bootstrap sampling with 1000 repetitions.

m3.boot <- Boot(m3.2, R = 1000)
summary(m3.boot)

#                  R original  bootBias bootSE bootMed
# (Intercept)   1000  99.4963  4.02e-03 0.4236 99.4962
# tv.hours      1000  -0.0599  5.06e-05 0.0197 -0.0605
# gendermale    1000  -1.2489  1.67e-02 0.1476 -1.2334
# mot.education 1000   0.0686 -1.41e-03 0.0227  0.0667
# book.reading  1000   0.1785  2.08e-03 0.0648  0.1794

confint(m3.boot, type = "norm")

# Bootstrap quantiles, type =  normal 
# 
#                 2.5 %   97.5 %
# (Intercept)   98.6619 100.3226
# tv.hours      -0.0987  -0.0213
# gendermale    -1.5549  -0.9763
# mot.education  0.0255   0.1146
# book.reading   0.0494   0.3035

Effect size

Obtain the effect size, both of the full model and the individual predictors

library(lmSupport)  # if not installed: install.packages('lmSupport')
summary(m3.2)$adj.r.squared  # adjusted R^2

# [1] 0.665

modelEffectSizes(m3.2)  # partial eta-squared

# lm(formula = cdi ~ tv.hours + gender + mot.education + book.reading, 
#     data = tv2)
# 
# Coefficients
#                    SSR df pEta-sqr dR-sqr
# (Intercept)   20022.03  1   0.9982     NA
# tv.hours          3.81  1   0.0969 0.0340
# gender           23.67  1   0.3999 0.2115
# mot.education     3.78  1   0.0960 0.0337
# book.reading      3.76  1   0.0956 0.0336
# 
# Sum of squared errors (SSE): 35.5
# Sum of squared total  (SST): 112.0

Replication

From within RStudio, you can simply download this file using the commands:

# download original file if not already exists (to prevent overwriting)
if (!file.exists("answers.Rmd")) {
    download.file("http://www.let.rug.nl/wieling/BCN-Stats/lecture3/lab/answers/answers.Rmd", "answers.Rmd")
}

Subsequently, open it in the editor and use the Knit HMTL button to generate the html file.

If you use plain R, you first have to install Pandoc. Then copy the following lines to the most recent version of R.

# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
    install.packages("rmarkdown")
}
library(rmarkdown)  # load rmarkdown package

# download original file if not already exists (to prevent overwriting)
if (!file.exists("answers.Rmd")) {
    download.file("http://www.let.rug.nl/wieling/BCN-Stats/lecture3/lab/answers/answers.Rmd", "answers.Rmd")
}

# generate output
render("answers.Rmd")  # generates html file with results

# view output in browser
browseURL(paste("file://", file.path(getwd(), "answers.html"), sep = ""))  # shows result

Correlation and regression - answers

Martijn Wieling (http://www.martijnwieling.nl)

Generation date: jun 10, 2016 - 20:12:45