1 Regression

1.1 Introduction

This file contains the answers to the lab session: https://www.let.rug.nl/wieling/Statistics/Regression/lab.

1.2 Importing the data

We will first download the data (saved as an R data frame in an .rda file) and load it into R.

download.file("http://www.let.rug.nl/wieling/Statistics/Regression/lab/tv.rda", "tv.rda")
load("tv.rda")  # an rda file can be loaded with the command load 

1.3 Structure of the data

Investigate the data with descriptive statistics and plots. Also calculate the correlations between the numerical variables.

str(tv)
# 'data.frame': 80 obs. of  7 variables:
#  $ subj         : Factor w/ 80 levels "1","2","3","4",..: 34 23 69 73 77 15 58 60 76 79 ...
#  $ mot.education: num  8 17 16 16 16 12 16 20 16 16 ...
#  $ tv.hours     : num  18 15 12 23 9 15 10 10 17 10 ...
#  $ daycare.hours: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ book.reading : num  2 4 5 5 5 3 5 3 4 3 ...
#  $ cdi          : num  97 99 99 99 99 100 100 100 100 100 ...
#  $ sex          : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...
summary(tv)
#       subj    mot.education     tv.hours    daycare.hours   book.reading 
#  1      : 1   Min.   : 8.0   Min.   : 2.0   Min.   : 0.0   Min.   :1.00  
#  2      : 1   1st Qu.:12.0   1st Qu.: 9.0   1st Qu.:10.0   1st Qu.:3.00  
#  3      : 1   Median :16.0   Median :11.0   Median :20.0   Median :4.00  
#  4      : 1   Mean   :14.8   Mean   :11.7   Mean   :20.4   Mean   :3.94  
#  5      : 1   3rd Qu.:16.0   3rd Qu.:14.0   3rd Qu.:31.2   3rd Qu.:5.00  
#  6      : 1   Max.   :22.0   Max.   :23.0   Max.   :41.0   Max.   :6.00  
#  (Other):74                                                              
#       cdi            sex    
#  Min.   : 97.0   female:40  
#  1st Qu.: 99.0   male  :40  
#  Median :100.0              
#  Mean   : 99.9              
#  3rd Qu.:101.0              
#  Max.   :103.0              
# 
head(tv)
#    subj mot.education tv.hours daycare.hours book.reading cdi    sex
# 34   34             8       18             0            2  97   male
# 23   23            17       15             0            4  99   male
# 69   69            16       12             0            5  99   male
# 73   73            16       23             0            5  99   male
# 77   77            16        9             0            5  99   male
# 15   15            12       15             0            3 100 female
table(tv$sex)
# 
# female   male 
#     40     40
cor(tv[, c("tv.hours", "mot.education", "book.reading", "daycare.hours")])
#               tv.hours mot.education book.reading daycare.hours
# tv.hours         1.000        -0.292      -0.1860       -0.1071
# mot.education   -0.292         1.000       0.4066       -0.0910
# book.reading    -0.186         0.407       1.0000       -0.0852
# daycare.hours   -0.107        -0.091      -0.0852        1.0000
par(mfrow = c(1, 2))
hist(tv$tv.hours)
boxplot(tv$tv.hours)

hist(tv$mot.education)
boxplot(tv$mot.education)

hist(tv$book.reading)
boxplot(tv$book.reading)

hist(tv$daycare.hours)
boxplot(tv$daycare.hours)
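
A scatterplot matrix gives a compact overview of all pairwise relations; a supplementary sketch (not required by the lab):

par(mfrow = c(1, 1))  # reset the plotting layout
pairs(tv[, c("tv.hours", "mot.education", "book.reading", "daycare.hours", "cdi")])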

1.4 Regression modeling

In this section, we will fit the best model for the data, predicting the cdi language development score on the basis of various predictors.

1.4.1 The best model without interactions

Fit the best model without taking into account interactions.

summary(m0 <- lm(cdi ~ tv.hours, data = tv))
# 
# Call:
# lm(formula = cdi ~ tv.hours, data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.6907 -0.6907 -0.0407  0.6343  3.1594 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 101.6400     0.3863   263.1  < 2e-16 ***
# tv.hours     -0.1500     0.0312    -4.8  7.4e-06 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 1.13 on 78 degrees of freedom
# Multiple R-squared:  0.228,   Adjusted R-squared:  0.218 
# F-statistic: 23.1 on 1 and 78 DF,  p-value: 7.44e-06
summary(m1 <- lm(cdi ~ tv.hours + sex, data = tv))
# 
# Call:
# lm(formula = cdi ~ tv.hours + sex, data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -1.9936 -0.5485  0.0416  0.5544  2.3397 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 101.5545     0.2937  345.83  < 2e-16 ***
# tv.hours     -0.0745     0.0257   -2.90   0.0049 ** 
# sexmale      -1.5921     0.2087   -7.63  5.2e-11 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.861 on 77 degrees of freedom
# Multiple R-squared:  0.56,    Adjusted R-squared:  0.549 
# F-statistic: 49.1 on 2 and 77 DF,  p-value: 1.8e-14
AIC(m0) - AIC(m1)
# [1] 43
summary(m2 <- lm(cdi ~ tv.hours + sex + mot.education, data = tv))
# 
# Call:
# lm(formula = cdi ~ tv.hours + sex + mot.education, data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -1.7963 -0.4824  0.0672  0.5274  2.2577 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    99.7868     0.5347  186.62  < 2e-16 ***
# tv.hours       -0.0534     0.0243   -2.20  0.03114 *  
# sexmale        -1.4936     0.1940   -7.70  4.1e-11 ***
# mot.education   0.0998     0.0260    3.83  0.00026 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.794 on 76 degrees of freedom
# Multiple R-squared:  0.632,   Adjusted R-squared:  0.617 
# F-statistic: 43.4 on 3 and 76 DF,  p-value: <2e-16
AIC(m1) - AIC(m2)
# [1] 12.1
summary(m3 <- lm(cdi ~ tv.hours + sex + mot.education + book.reading, data = tv))
# 
# Call:
# lm(formula = cdi ~ tv.hours + sex + mot.education + book.reading, 
#     data = tv)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -2.076 -0.457  0.070  0.453  2.175 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    99.4418     0.5440  182.80  < 2e-16 ***
# tv.hours       -0.0542     0.0237   -2.29   0.0250 *  
# sexmale        -1.3820     0.1957   -7.06  7.1e-10 ***
# mot.education   0.0783     0.0271    2.89   0.0051 ** 
# book.reading    0.1561     0.0702    2.22   0.0292 *  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.774 on 75 degrees of freedom
# Multiple R-squared:  0.654,   Adjusted R-squared:  0.636 
# F-statistic: 35.5 on 4 and 75 DF,  p-value: <2e-16
AIC(m2) - AIC(m3)
# [1] 3.1
summary(m4 <- lm(cdi ~ tv.hours + sex + mot.education + book.reading + daycare.hours,
    data = tv))
# 
# Call:
# lm(formula = cdi ~ tv.hours + sex + mot.education + book.reading + 
#     daycare.hours, data = tv)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -2.028 -0.479  0.171  0.452  2.093 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   99.04027    0.58484  169.35  < 2e-16 ***
# tv.hours      -0.04863    0.02362   -2.06   0.0430 *  
# sexmale       -1.38150    0.19315   -7.15  5.1e-10 ***
# mot.education  0.08281    0.02691    3.08   0.0029 ** 
# book.reading   0.16365    0.06945    2.36   0.0211 *  
# daycare.hours  0.01182    0.00683    1.73   0.0876 .  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.764 on 74 degrees of freedom
# Multiple R-squared:  0.668,   Adjusted R-squared:  0.645 
# F-statistic: 29.8 on 5 and 74 DF,  p-value: <2e-16
AIC(m3) - AIC(m4)  # difference < 2 AIC units, so the simpler model m3 is the best model
# [1] 1.18
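
As a cross-check on the AIC comparison (an optional aside), the two nested models can also be compared with an F-test; a non-significant result would likewise favor the simpler model m3:

anova(m3, m4)  # F-test for the contribution of daycare.hours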

1.4.2 Visualize the results

library(visreg)  # if not installed: install.packages('visreg')
par(mfrow = c(2, 2))
visreg(m3)

1.4.3 Assess the regression assumptions discussed in the lecture

# linearity
library(car)  # if not installed: install.packages('car')
par(mfrow = c(1, 3))
crPlot(m3, var = "tv.hours")  # linearity not perfect
crPlot(m3, var = "mot.education")
crPlot(m3, var = "book.reading")

par(mfrow = c(1, 1))

# autocorrelation in residuals: OK
acf(resid(m3))

# multicollinearity: OK
car::vif(m3)  # make sure car version is used, not rms version
#      tv.hours           sex mot.education  book.reading 
#          1.24          1.28          1.27          1.29
# homoscedasticity: OK
plot(fitted(m3), resid(m3))

ncvTest(m3)
# Non-constant Variance Score Test 
# Variance formula: ~ fitted.values 
# Chisquare = 0.00474, Df = 1, p = 0.9
ncvTest(m3, ~tv.hours)
# Non-constant Variance Score Test 
# Variance formula: ~ tv.hours 
# Chisquare = 0.0104, Df = 1, p = 0.9
ncvTest(m3, ~mot.education)
# Non-constant Variance Score Test 
# Variance formula: ~ mot.education 
# Chisquare = 0.103, Df = 1, p = 0.7
ncvTest(m3, ~book.reading)
# Non-constant Variance Score Test 
# Variance formula: ~ book.reading 
# Chisquare = 0.142, Df = 1, p = 0.7
ncvTest(m3, ~sex)
# Non-constant Variance Score Test 
# Variance formula: ~ sex 
# Chisquare = 0.0217, Df = 1, p = 0.9
# distribution of residuals: OK
qqnorm(resid(m3))
qqline(resid(m3))
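
The visual check can be supplemented with a formal normality test (an optional sketch, not part of the lab instructions):

shapiro.test(resid(m3))  # Shapiro-Wilk test of the residuals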

1.4.4 Are interactions necessary?

Fit the best model while taking into account interactions. For simplicity and speed, we’ll only investigate potential interactions with sex. Also visualize the results if an interaction is necessary.

summary(m4 <- lm(cdi ~ tv.hours + mot.education + book.reading + daycare.hours * sex,
    data = tv))
# 
# Call:
# lm(formula = cdi ~ tv.hours + mot.education + book.reading + 
#     daycare.hours * sex, data = tv)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -1.999 -0.527  0.154  0.400  2.132 
# 
# Coefficients:
#                       Estimate Std. Error t value Pr(>|t|)    
# (Intercept)           99.07330    0.58197  170.24  < 2e-16 ***
# tv.hours              -0.04857    0.02348   -2.07   0.0421 *  
# mot.education          0.08851    0.02708    3.27   0.0017 ** 
# book.reading           0.17872    0.06992    2.56   0.0127 *  
# daycare.hours          0.00261    0.00957    0.27   0.7855    
# sexmale               -1.74356    0.32759   -5.32  1.1e-06 ***
# daycare.hours:sexmale  0.01893    0.01388    1.36   0.1767    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.759 on 73 degrees of freedom
# Multiple R-squared:  0.676,   Adjusted R-squared:  0.65 
# F-statistic: 25.4 on 6 and 73 DF,  p-value: 4.55e-16
AIC(m3) - AIC(m4)  # difference < 2 AIC units: m3 is still the best model
# [1] 1.19
summary(m4 <- lm(cdi ~ tv.hours + mot.education + book.reading * sex, data = tv))
# 
# Call:
# lm(formula = cdi ~ tv.hours + mot.education + book.reading * 
#     sex, data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.0420 -0.4569  0.0724  0.4699  2.1632 
# 
# Coefficients:
#                      Estimate Std. Error t value Pr(>|t|)    
# (Intercept)           99.3498     0.6241  159.19   <2e-16 ***
# tv.hours              -0.0544     0.0239   -2.28   0.0255 *  
# mot.education          0.0789     0.0274    2.88   0.0051 ** 
# book.reading           0.1752     0.0943    1.86   0.0671 .  
# sexmale               -1.2218     0.5583   -2.19   0.0318 *  
# book.reading:sexmale  -0.0406     0.1323   -0.31   0.7600    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.779 on 74 degrees of freedom
# Multiple R-squared:  0.655,   Adjusted R-squared:  0.632 
# F-statistic: 28.1 on 5 and 74 DF,  p-value: 7.74e-16
AIC(m3) - AIC(m4)  # m3 is still the best model
# [1] -1.9
summary(m4 <- lm(cdi ~ tv.hours + mot.education * sex + book.reading, data = tv))
# 
# Call:
# lm(formula = cdi ~ tv.hours + mot.education * sex + book.reading, 
#     data = tv)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -2.021 -0.498  0.027  0.438  2.191 
# 
# Coefficients:
#                       Estimate Std. Error t value Pr(>|t|)    
# (Intercept)            99.9439     0.6962  143.56   <2e-16 ***
# tv.hours               -0.0551     0.0237   -2.33   0.0226 *  
# mot.education           0.0475     0.0381    1.25   0.2165    
# sexmale                -2.2277     0.7597   -2.93   0.0045 ** 
# book.reading            0.1533     0.0701    2.19   0.0319 *  
# mot.education:sexmale   0.0571     0.0496    1.15   0.2531    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.772 on 74 degrees of freedom
# Multiple R-squared:  0.661,   Adjusted R-squared:  0.638 
# F-statistic: 28.8 on 5 and 74 DF,  p-value: 4.26e-16
AIC(m3) - AIC(m4)  # m3 is still the best model
# [1] -0.578
summary(m4 <- lm(cdi ~ tv.hours * sex + mot.education + book.reading, data = tv))
# 
# Call:
# lm(formula = cdi ~ tv.hours * sex + mot.education + book.reading, 
#     data = tv)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.0725 -0.5370  0.0557  0.4938  2.1296 
# 
# Coefficients:
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)       99.1812     0.6169  160.77   <2e-16 ***
# tv.hours          -0.0296     0.0363   -0.82   0.4175    
# sexmale           -0.9001     0.5704   -1.58   0.1188    
# mot.education      0.0799     0.0272    2.93   0.0044 ** 
# book.reading       0.1531     0.0704    2.17   0.0329 *  
# tv.hours:sexmale  -0.0422     0.0469   -0.90   0.3714    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.775 on 74 degrees of freedom
# Multiple R-squared:  0.658,   Adjusted R-squared:  0.635 
# F-statistic: 28.5 on 5 and 74 DF,  p-value: 5.47e-16
AIC(m3) - AIC(m4)  # m3 remains the best model
# [1] -1.13

1.4.5 Effect size

Obtain the effect size, both of the full model and of the individual predictors.

library(lsr)  # if not installed: install.packages('lsr')
etaSquared(m3, type = 3)
#               eta.sq eta.sq.part
# tv.hours      0.0241      0.0652
# sex           0.2297      0.3993
# mot.education 0.0384      0.1000
# book.reading  0.0228      0.0618
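
The effect size of the full model is its (adjusted) R-squared, which can be extracted directly from the model summary:

summary(m3)$r.squared  # 0.654: proportion of variance explained by the full model
summary(m3)$adj.r.squared  # 0.636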

1.4.6 Model criticism

Apply model criticism by excluding observations with residuals outside 2.5 SD.

tv2 = tv[abs(scale(resid(m3))) < 2.5, ]
nrow(tv2)  # 2 outliers excluded
# [1] 78
summary(m3.2 <- lm(cdi ~ tv.hours + sex + mot.education + book.reading, data = tv2))
# 
# Call:
# lm(formula = cdi ~ tv.hours + sex + mot.education + book.reading, 
#     data = tv2)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -1.4693 -0.5174  0.0573  0.4393  1.5407 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    99.4963     0.4906  202.82  < 2e-16 ***
# tv.hours       -0.0599     0.0214   -2.80   0.0066 ** 
# sexmale        -1.2489     0.1791   -6.97  1.2e-09 ***
# mot.education   0.0686     0.0246    2.79   0.0068 ** 
# book.reading    0.1785     0.0643    2.78   0.0069 ** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.698 on 73 degrees of freedom
# Multiple R-squared:  0.683,   Adjusted R-squared:  0.665 
# F-statistic: 39.3 on 4 and 73 DF,  p-value: <2e-16
par(mfrow = c(1, 2))
qqnorm(resid(m3), main = "m3")
qqline(resid(m3))
qqnorm(resid(m3.2), main = "m3.2")
qqline(resid(m3.2))

1.4.7 Bootstrap sampling

Conduct bootstrap sampling with 1000 repetitions. Use set.seed(100) to set the seed of the random number generator.

set.seed(100)
m3.boot <- Boot(m3.2, R = 1000)  # Boot() comes from the car package (loaded above)
summary(m3.boot)
# 
# Number of bootstrap replications R = 1000 
#               original  bootBias bootSE bootMed
# (Intercept)    99.4963 -0.008852 0.4079 99.5114
# tv.hours       -0.0599 -0.000706 0.0200 -0.0610
# sexmale        -1.2489  0.008245 0.1519 -1.2425
# mot.education   0.0686  0.000323 0.0239  0.0679
# book.reading    0.1785  0.001639 0.0643  0.1810
confint(m3.boot, type = "norm")
# Bootstrap normal confidence intervals
# 
#                 2.5 %   97.5 %
# (Intercept)   98.7056 100.3047
# tv.hours      -0.0983  -0.0201
# sexmale       -1.5548  -0.9594
# mot.education  0.0215   0.1151
# book.reading   0.0508   0.3030
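
Besides the normal-approximation intervals shown above, confint() for a bootstrap object supports other interval types as well; for example, percentile intervals:

confint(m3.boot, type = "perc")  # percentile bootstrap confidence intervals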

2 Logistic regression

2.1 Introduction

In this assignment we will use logistic regression to assess how syllable-final pronunciation in New York during the 1950s and 1960s (the phoneme [r] vs. [ə], schwa) depended on socioeconomic class.

We use the following data (184 observations in total), tabulated here by outcome and status:

outcome  status  count
r        upper      30
schwa    upper       6
r        middle     20
schwa    middle     74
r        lower       4
schwa    lower      50

2.2 Create data frame

Use R code to generate these data without having to type all the observations (there are 184 observations…). Hint: you can use the function c() to concatenate vectors and the function rep() to repeat one element multiple times; e.g., rep("upper", 36) creates a vector in which "upper" is repeated 36 times. In the end you should have a data frame with 1 dependent variable (pronunciation, with 2 levels: schwa and r) and 1 predictor (status, with 3 levels: lower, middle and upper).

status <- c(rep("upper", 36), rep("middle", 94), rep("lower", 54))
pronunciation <- c(rep("r", 30), rep("schwa", 6), rep("r", 20), rep("schwa", 74), rep("r",
    4), rep("schwa", 50))

# create the data frame
dat <- data.frame(pronunciation, status)
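
As a quick sanity check (optional, not part of the lab instructions), the cell counts can be compared against the table in section 2.1:

table(dat$pronunciation, dat$status)  # should reproduce the tabulated counts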

Furthermore, make sure that both variables are coded as factors and show that this is indeed the case. Set lower as the reference level for the predictor status. Then add an additional variable IsR, which equals 1 if pronunciation equals [r] and 0 otherwise. This will be our dependent variable.

dat$status <- as.factor(dat$status)
dat$pronunciation <- as.factor(dat$pronunciation)
str(dat)
# 'data.frame': 184 obs. of  2 variables:
#  $ pronunciation: Factor w/ 2 levels "r","schwa": 1 1 1 1 1 1 1 1 1 1 ...
#  $ status       : Factor w/ 3 levels "lower","middle",..: 3 3 3 3 3 3 3 3 3 3 ...
dat$status <- relevel(dat$status, "lower")
dat$IsR <- (dat$pronunciation == "r") * 1

2.3 Logistic regression model

Now fit a logistic regression model to the data.

m <- glm(IsR ~ status, data = dat, family = "binomial")
summary(m)
# 
# Call:
# glm(formula = IsR ~ status, family = "binomial", data = dat)
# 
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -2.526      0.520   -4.86  1.2e-06 ***
# statusmiddle    1.217      0.578    2.11    0.035 *  
# statusupper     4.135      0.686    6.03  1.6e-09 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 222.73  on 183  degrees of freedom
# Residual deviance: 158.27  on 181  degrees of freedom
# AIC: 164.3
# 
# Number of Fisher Scoring iterations: 5

2.4 Visualize the results

Now visualize the results.

par(mfrow = c(1, 2))
visreg(m)
visreg(m, scale = "response")

2.5 Assumptions

Check the logistic regression assumptions.

# No quantitative predictors, so no linearity assumption to check

# Multicollinearity only needs to be assessed when there are multiple predictors

2.6 Goodness of fit

Evaluate how well the model performs by calculating the index of concordance.

library(rms)  # if not installed: install.packages('rms')
ml <- lrm(IsR ~ status, data = dat)
ml  # index of concordance = 0.805: well-performing model
# Logistic Regression Model
# 
# lrm(formula = IsR ~ status, data = dat)
# 
#                        Model Likelihood      Discrimination    Rank Discrim.    
#                              Ratio Test             Indexes          Indexes    
# Obs           184    LR chi2      64.46      R2       0.421    C       0.805    
#  0            130    d.f.             2      R2(2,184)0.288    Dxy     0.610    
#  1             54    Pr(> chi2) <0.0001    R2(2,114.5)0.421    gamma   0.829    
# max |deriv| 5e-05                            Brier    0.133    tau-a   0.254    
# 
#               Coef    S.E.   Wald Z Pr(>|Z|)
# Intercept     -2.5257 0.5196 -4.86  <0.0001 
# status=middle  1.2174 0.5775  2.11  0.0350  
# status=upper   4.1352 0.6856  6.03  <0.0001
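
As a small consistency check (an aside, not part of the lab): Somers' Dxy relates to the index of concordance C as Dxy = 2(C - 0.5).

2 * (0.805 - 0.5)  # = 0.61, matching the reported Dxy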

2.7 Interpretation

Given your model, what are the odds of r versus schwa when the predictor status is at its reference level (lower)? Hint: the estimates in a logistic regression model are in logits, i.e. the natural logarithm of the odds. To back-transform to odds, you need to invert the logarithm (how?). Add the R code which calculates these odds. Also calculate the probability of the outcome being r when status is at its reference level (lower).

exp(coef(m)["(Intercept)"])  # odds of r instead of schwa in low status
# (Intercept) 
#        0.08
plogis(coef(m)["(Intercept)"])  # probability of r instead of schwa in low status
# (Intercept) 
#      0.0741
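
The two quantities are directly related: a probability p corresponds to odds p / (1 - p), so the odds can also be converted back to a probability by hand:

odds <- exp(coef(m)["(Intercept)"])
odds/(1 + odds)  # 0.08 / 1.08 = 0.0741, identical to the plogis() result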

Which status is associated with the greatest probability of pronouncing [r] instead of [ə]? How many times higher are the odds compared to the reference level? Include the R code to calculate this.

exp(coef(m)["statusupper"])  # largest estimate: odds of r vs. schwa for upper compared to lower status
# statusupper 
#        62.5
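
Because status is the only predictor, the model is saturated and its estimates can be verified directly from the raw counts:

(30/6)/(4/50)  # observed odds ratio, upper vs. lower: 5 / 0.08 = 62.5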

3 Replication

From within RStudio, you can simply download this file using the commands:

# download the original file if it does not already exist (to prevent overwriting)
if (!file.exists('answers.Rmd')) {
    download.file('http://www.let.rug.nl/wieling/Statistics/Regression/lab/answers/answers.Rmd',
        'answers.Rmd')
}

Subsequently, open it in the editor and use the Knit HTML button to generate the html file.

If you use plain R, you first have to install Pandoc. Then paste the following lines into the most recent version of R.

# install the rmarkdown package if it is not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
    install.packages("rmarkdown")
}
library(rmarkdown)  # load the rmarkdown package

# download the original file if it does not already exist (to prevent overwriting)
if (!file.exists('answers.Rmd')) {
    download.file('http://www.let.rug.nl/wieling/Statistics/Regression/lab/answers/answers.Rmd',
        'answers.Rmd')
}

# generate output
render('answers.Rmd') # generates html file with results

# view output in browser
browseURL(paste('file://', file.path(getwd(),'answers.html'), sep='')) # shows result