This file contains the answers to the lab session: https://www.let.rug.nl/wieling/Statistics/Regression/lab. We first download the data (which was saved as an R data frame in an .rda file) and load it into R.
download.file("http://www.let.rug.nl/wieling/Statistics/Regression/lab/tv.rda", "tv.rda")
load("tv.rda") # an rda file can be loaded with the command load
Investigate the data with descriptive statistics and plots. Also calculate the correlations between the numerical variables.
str(tv)
# 'data.frame': 80 obs. of 7 variables:
# $ subj : Factor w/ 80 levels "1","2","3","4",..: 34 23 69 73 77 15 58 60 76 79 ...
# $ mot.education: num 8 17 16 16 16 12 16 20 16 16 ...
# $ tv.hours : num 18 15 12 23 9 15 10 10 17 10 ...
# $ daycare.hours: num 0 0 0 0 0 0 0 0 0 0 ...
# $ book.reading : num 2 4 5 5 5 3 5 3 4 3 ...
# $ cdi : num 97 99 99 99 99 100 100 100 100 100 ...
# $ sex : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...
summary(tv)
# subj mot.education tv.hours daycare.hours book.reading
# 1 : 1 Min. : 8.0 Min. : 2.0 Min. : 0.0 Min. :1.00
# 2 : 1 1st Qu.:12.0 1st Qu.: 9.0 1st Qu.:10.0 1st Qu.:3.00
# 3 : 1 Median :16.0 Median :11.0 Median :20.0 Median :4.00
# 4 : 1 Mean :14.8 Mean :11.7 Mean :20.4 Mean :3.94
# 5 : 1 3rd Qu.:16.0 3rd Qu.:14.0 3rd Qu.:31.2 3rd Qu.:5.00
# 6 : 1 Max. :22.0 Max. :23.0 Max. :41.0 Max. :6.00
# (Other):74
# cdi sex
# Min. : 97.0 female:40
# 1st Qu.: 99.0 male :40
# Median :100.0
# Mean : 99.9
# 3rd Qu.:101.0
# Max. :103.0
#
head(tv)
# subj mot.education tv.hours daycare.hours book.reading cdi sex
# 34 34 8 18 0 2 97 male
# 23 23 17 15 0 4 99 male
# 69 69 16 12 0 5 99 male
# 73 73 16 23 0 5 99 male
# 77 77 16 9 0 5 99 male
# 15 15 12 15 0 3 100 female
table(tv$sex)
#
# female male
# 40 40
cor(tv[, c("tv.hours", "mot.education", "book.reading", "daycare.hours")])
# tv.hours mot.education book.reading daycare.hours
# tv.hours 1.000 -0.292 -0.1860 -0.1071
# mot.education -0.292 1.000 0.4066 -0.0910
# book.reading -0.186 0.407 1.0000 -0.0852
# daycare.hours -0.107 -0.091 -0.0852 1.0000
par(mfrow = c(1, 2))
hist(tv$tv.hours)
boxplot(tv$tv.hours)
hist(tv$mot.education)
boxplot(tv$mot.education)
hist(tv$book.reading)
boxplot(tv$book.reading)
hist(tv$daycare.hours)
boxplot(tv$daycare.hours)
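The repeated hist/boxplot pairs above can also be generated with a loop over the variable names; this is a sketch that assumes the tv data frame loaded earlier:

```r
# same exploratory plots as above, produced in a loop over the numeric variables
par(mfrow = c(1, 2))
for (v in c("tv.hours", "mot.education", "book.reading", "daycare.hours")) {
  hist(tv[[v]], main = v, xlab = v)
  boxplot(tv[[v]], main = v)
}
```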
In this section, we will fit the best model for the data, predicting the cdi language development score on the basis of various predictors.
Fit the best model without taking into account interactions.
summary(m0 <- lm(cdi ~ tv.hours, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.6907 -0.6907 -0.0407 0.6343 3.1594
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 101.6400 0.3863 263.1 < 2e-16 ***
# tv.hours -0.1500 0.0312 -4.8 7.4e-06 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 1.13 on 78 degrees of freedom
# Multiple R-squared: 0.228, Adjusted R-squared: 0.218
# F-statistic: 23.1 on 1 and 78 DF, p-value: 7.44e-06
summary(m1 <- lm(cdi ~ tv.hours + sex, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + sex, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.9936 -0.5485 0.0416 0.5544 2.3397
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 101.5545 0.2937 345.83 < 2e-16 ***
# tv.hours -0.0745 0.0257 -2.90 0.0049 **
# sexmale -1.5921 0.2087 -7.63 5.2e-11 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.861 on 77 degrees of freedom
# Multiple R-squared: 0.56, Adjusted R-squared: 0.549
# F-statistic: 49.1 on 2 and 77 DF, p-value: 1.8e-14
AIC(m0) - AIC(m1)
# [1] 43
summary(m2 <- lm(cdi ~ tv.hours + sex + mot.education, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + sex + mot.education, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.7963 -0.4824 0.0672 0.5274 2.2577
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.7868 0.5347 186.62 < 2e-16 ***
# tv.hours -0.0534 0.0243 -2.20 0.03114 *
# sexmale -1.4936 0.1940 -7.70 4.1e-11 ***
# mot.education 0.0998 0.0260 3.83 0.00026 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.794 on 76 degrees of freedom
# Multiple R-squared: 0.632, Adjusted R-squared: 0.617
# F-statistic: 43.4 on 3 and 76 DF, p-value: <2e-16
AIC(m1) - AIC(m2)
# [1] 12.1
summary(m3 <- lm(cdi ~ tv.hours + sex + mot.education + book.reading, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + sex + mot.education + book.reading,
# data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.076 -0.457 0.070 0.453 2.175
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.4418 0.5440 182.80 < 2e-16 ***
# tv.hours -0.0542 0.0237 -2.29 0.0250 *
# sexmale -1.3820 0.1957 -7.06 7.1e-10 ***
# mot.education 0.0783 0.0271 2.89 0.0051 **
# book.reading 0.1561 0.0702 2.22 0.0292 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.774 on 75 degrees of freedom
# Multiple R-squared: 0.654, Adjusted R-squared: 0.636
# F-statistic: 35.5 on 4 and 75 DF, p-value: <2e-16
AIC(m2) - AIC(m3)
# [1] 3.1
summary(m4 <- lm(cdi ~ tv.hours + sex + mot.education + book.reading + daycare.hours,
data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + sex + mot.education + book.reading +
# daycare.hours, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.028 -0.479 0.171 0.452 2.093
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.04027 0.58484 169.35 < 2e-16 ***
# tv.hours -0.04863 0.02362 -2.06 0.0430 *
# sexmale -1.38150 0.19315 -7.15 5.1e-10 ***
# mot.education 0.08281 0.02691 3.08 0.0029 **
# book.reading 0.16365 0.06945 2.36 0.0211 *
# daycare.hours 0.01182 0.00683 1.73 0.0876 .
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.764 on 74 degrees of freedom
# Multiple R-squared: 0.668, Adjusted R-squared: 0.645
# F-statistic: 29.8 on 5 and 74 DF, p-value: <2e-16
AIC(m3) - AIC(m4)  # difference < 2, so we retain the simpler model m3
# [1] 1.18
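As a cross-check of the manual selection above, R's built-in step() searches the same predictor set by minimizing AIC directly. Note that step() keeps any predictor that lowers AIC at all, so it may retain daycare.hours, which the stricter 2-unit threshold used above excluded. A sketch, assuming the tv data frame:

```r
# automated AIC-based model selection over all five predictors
m.full <- lm(cdi ~ tv.hours + sex + mot.education + book.reading + daycare.hours,
             data = tv)
step(m.full, direction = "both", trace = 0)
```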
library(visreg) # if not installed: install.packages('visreg')
par(mfrow = c(2, 2))
visreg(m3)
# check linearity of the partial effects with component + residual plots
library(car) # if not installed: install.packages('car')
par(mfrow = c(1, 3))
crPlot(m3, var = "tv.hours") # linearity not perfect
crPlot(m3, var = "mot.education")
crPlot(m3, var = "book.reading")
par(mfrow = c(1, 1))
# autocorrelation in residuals: OK
acf(resid(m3))
# multicollinearity: OK
car::vif(m3) # make sure car version is used, not rms version
# tv.hours sex mot.education book.reading
# 1.24 1.28 1.27 1.29
# homoscedasticity: OK
plot(fitted(m3), resid(m3))
ncvTest(m3)
# Non-constant Variance Score Test
# Variance formula: ~ fitted.values
# Chisquare = 0.00474, Df = 1, p = 0.9
ncvTest(m3, ~tv.hours)
# Non-constant Variance Score Test
# Variance formula: ~ tv.hours
# Chisquare = 0.0104, Df = 1, p = 0.9
ncvTest(m3, ~mot.education)
# Non-constant Variance Score Test
# Variance formula: ~ mot.education
# Chisquare = 0.103, Df = 1, p = 0.7
ncvTest(m3, ~book.reading)
# Non-constant Variance Score Test
# Variance formula: ~ book.reading
# Chisquare = 0.142, Df = 1, p = 0.7
ncvTest(m3, ~sex)
# Non-constant Variance Score Test
# Variance formula: ~ sex
# Chisquare = 0.0217, Df = 1, p = 0.9
# distribution of residuals: OK
qqnorm(resid(m3))
qqline(resid(m3))
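Several of these checks (homoscedasticity, residual distribution, influential points) can also be inspected at a glance with R's standard diagnostic plots for lm objects:

```r
# residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(m3)
```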
Fit the best model while taking into account interactions. For simplicity and speed, we’ll only investigate potential interactions with sex. Also visualize the results if an interaction is necessary.
summary(m4 <- lm(cdi ~ tv.hours + mot.education + book.reading + daycare.hours * sex,
data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + mot.education + book.reading +
# daycare.hours * sex, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.999 -0.527 0.154 0.400 2.132
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.07330 0.58197 170.24 < 2e-16 ***
# tv.hours -0.04857 0.02348 -2.07 0.0421 *
# mot.education 0.08851 0.02708 3.27 0.0017 **
# book.reading 0.17872 0.06992 2.56 0.0127 *
# daycare.hours 0.00261 0.00957 0.27 0.7855
# sexmale -1.74356 0.32759 -5.32 1.1e-06 ***
# daycare.hours:sexmale 0.01893 0.01388 1.36 0.1767
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.759 on 73 degrees of freedom
# Multiple R-squared: 0.676, Adjusted R-squared: 0.65
# F-statistic: 25.4 on 6 and 73 DF, p-value: 4.55e-16
AIC(m3) - AIC(m4) # m3 is still the best model
# [1] 1.19
summary(m4 <- lm(cdi ~ tv.hours + mot.education + book.reading * sex, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + mot.education + book.reading *
# sex, data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.0420 -0.4569 0.0724 0.4699 2.1632
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.3498 0.6241 159.19 <2e-16 ***
# tv.hours -0.0544 0.0239 -2.28 0.0255 *
# mot.education 0.0789 0.0274 2.88 0.0051 **
# book.reading 0.1752 0.0943 1.86 0.0671 .
# sexmale -1.2218 0.5583 -2.19 0.0318 *
# book.reading:sexmale -0.0406 0.1323 -0.31 0.7600
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.779 on 74 degrees of freedom
# Multiple R-squared: 0.655, Adjusted R-squared: 0.632
# F-statistic: 28.1 on 5 and 74 DF, p-value: 7.74e-16
AIC(m3) - AIC(m4) # m3 is still the best model
# [1] -1.9
summary(m4 <- lm(cdi ~ tv.hours + mot.education * sex + book.reading, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours + mot.education * sex + book.reading,
# data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.021 -0.498 0.027 0.438 2.191
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.9439 0.6962 143.56 <2e-16 ***
# tv.hours -0.0551 0.0237 -2.33 0.0226 *
# mot.education 0.0475 0.0381 1.25 0.2165
# sexmale -2.2277 0.7597 -2.93 0.0045 **
# book.reading 0.1533 0.0701 2.19 0.0319 *
# mot.education:sexmale 0.0571 0.0496 1.15 0.2531
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.772 on 74 degrees of freedom
# Multiple R-squared: 0.661, Adjusted R-squared: 0.638
# F-statistic: 28.8 on 5 and 74 DF, p-value: 4.26e-16
AIC(m3) - AIC(m4) # m3 is still the best model
# [1] -0.578
summary(m4 <- lm(cdi ~ tv.hours * sex + mot.education + book.reading, data = tv))
#
# Call:
# lm(formula = cdi ~ tv.hours * sex + mot.education + book.reading,
# data = tv)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.0725 -0.5370 0.0557 0.4938 2.1296
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.1812 0.6169 160.77 <2e-16 ***
# tv.hours -0.0296 0.0363 -0.82 0.4175
# sexmale -0.9001 0.5704 -1.58 0.1188
# mot.education 0.0799 0.0272 2.93 0.0044 **
# book.reading 0.1531 0.0704 2.17 0.0329 *
# tv.hours:sexmale -0.0422 0.0469 -0.90 0.3714
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.775 on 74 degrees of freedom
# Multiple R-squared: 0.658, Adjusted R-squared: 0.635
# F-statistic: 28.5 on 5 and 74 DF, p-value: 5.47e-16
AIC(m3) - AIC(m4) # m3 remains the best model
# [1] -1.13
Obtain the effect size, both of the full model and of the individual predictors.
library(lsr)
etaSquared(m3, type = 3)
# eta.sq eta.sq.part
# tv.hours 0.0241 0.0652
# sex 0.2297 0.3993
# mot.education 0.0384 0.1000
# book.reading 0.0228 0.0618
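For the effect size of the full model, one option is Cohen's f², computed from the model's R²; a minimal sketch assuming m3 as fitted above:

```r
# Cohen's f^2 for the full model: f^2 = R^2 / (1 - R^2)
r2 <- summary(m3)$r.squared
r2 / (1 - r2)
# with R^2 = 0.654 this is about 1.89, a very large effect by Cohen's guidelines
```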
Apply model criticism by excluding observations with residuals outside 2.5 SD.
tv2 = tv[abs(scale(resid(m3))) < 2.5, ]
nrow(tv2) # 2 outliers excluded
# [1] 78
summary(m3.2 <- lm(cdi ~ tv.hours + sex + mot.education + book.reading, data = tv2))
#
# Call:
# lm(formula = cdi ~ tv.hours + sex + mot.education + book.reading,
# data = tv2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.4693 -0.5174 0.0573 0.4393 1.5407
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 99.4963 0.4906 202.82 < 2e-16 ***
# tv.hours -0.0599 0.0214 -2.80 0.0066 **
# sexmale -1.2489 0.1791 -6.97 1.2e-09 ***
# mot.education 0.0686 0.0246 2.79 0.0068 **
# book.reading 0.1785 0.0643 2.78 0.0069 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.698 on 73 degrees of freedom
# Multiple R-squared: 0.683, Adjusted R-squared: 0.665
# F-statistic: 39.3 on 4 and 73 DF, p-value: <2e-16
par(mfrow = c(1, 2))
qqnorm(resid(m3), main = "m3")
qqline(resid(m3))
qqnorm(resid(m3.2), main = "m3.2")
qqline(resid(m3.2))
Conduct bootstrap sampling with 1000 repetitions. Use set.seed(100) to set the seed of the random number generator.
set.seed(100)
m3.boot <- Boot(m3.2, R = 1000)
summary(m3.boot)
#
# Number of bootstrap replications R = 1000
# original bootBias bootSE bootMed
# (Intercept) 99.4963 -0.008852 0.4079 99.5114
# tv.hours -0.0599 -0.000706 0.0200 -0.0610
# sexmale -1.2489 0.008245 0.1519 -1.2425
# mot.education 0.0686 0.000323 0.0239 0.0679
# book.reading 0.1785 0.001639 0.0643 0.1810
confint(m3.boot, type = "norm")
# Bootstrap normal confidence intervals
#
# 2.5 % 97.5 %
# (Intercept) 98.7056 100.3047
# tv.hours -0.0983 -0.0201
# sexmale -1.5548 -0.9594
# mot.education 0.0215 0.1151
# book.reading 0.0508 0.3030
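Besides the normal-approximation intervals above, confint() on a Boot object also supports percentile-based intervals (a sketch; the exact bounds depend on the seed):

```r
# bootstrap percentile confidence intervals for the same coefficients
confint(m3.boot, type = "perc")
```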
In this assignment we will use logistic regression to assess how syllable-final pronunciation (the phoneme [r] vs. [ə], schwa) in New York during the 1950s and 1960s depended on socioeconomic class.
We use the following data:
outcome | status |
---|---|
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
schwa | upper |
schwa | upper |
schwa | upper |
schwa | upper |
schwa | upper |
schwa | upper |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
r | lower |
r | lower |
r | lower |
r | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
Use R code to generate this data without having to type all the observations (there are 184 observations…). Hint: you can use the function c() to concatenate information and the function rep() to repeat one element multiple times, e.g. rep("upper", 36) will create a vector in which "upper" is repeated 36 times. In the end you should have a data frame with 1 dependent variable (pronunciation, with 2 levels: schwa and r) and 1 predictor (status, with 3 levels: lower, middle and upper).
status <- c(rep("upper", 36), rep("middle", 94), rep("lower", 54))
pronunciation <- c(rep("r", 30), rep("schwa", 6), rep("r", 20), rep("schwa", 74), rep("r",
4), rep("schwa", 50))
# create the data frame
dat <- data.frame(pronunciation, status)
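A quick cross-tabulation confirms that the generated data reproduce the counts in the table above (54 tokens of [r], 130 of schwa, 184 in total):

```r
# verify the cell counts against the original table
xtabs(~ pronunciation + status, dat)
#              status
# pronunciation lower middle upper
#         r         4     20    30
#         schwa    50     74     6
```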
Furthermore, make sure that both variables are coded as factors and show that this indeed is the case. Set lower as your reference level for the predictor status. Now add an additional predictor IsR which is equal to 1 if pronunciation equals [r] and 0 otherwise. This will be our dependent variable.
dat$status <- as.factor(dat$status)
dat$pronunciation <- as.factor(dat$pronunciation)
str(dat)
# 'data.frame': 184 obs. of 2 variables:
# $ pronunciation: Factor w/ 2 levels "r","schwa": 1 1 1 1 1 1 1 1 1 1 ...
# $ status : Factor w/ 3 levels "lower","middle",..: 3 3 3 3 3 3 3 3 3 3 ...
dat$status <- relevel(dat$status, "lower")
dat$IsR <- (dat$pronunciation == "r") * 1
Now fit a logistic regression model to the data.
m <- glm(IsR ~ status, data = dat, family = "binomial")
summary(m)
#
# Call:
# glm(formula = IsR ~ status, family = "binomial", data = dat)
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.526 0.520 -4.86 1.2e-06 ***
# statusmiddle 1.217 0.578 2.11 0.035 *
# statusupper 4.135 0.686 6.03 1.6e-09 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 222.73 on 183 degrees of freedom
# Residual deviance: 158.27 on 181 degrees of freedom
# AIC: 164.3
#
# Number of Fisher Scoring iterations: 5
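Because status is the only predictor, the model's fitted probabilities of [r] per group simply reproduce the observed proportions; this can be checked with predict() (assuming m and dat as above):

```r
# predicted probability of [r] for each status level
predict(m, newdata = data.frame(status = c("lower", "middle", "upper")),
        type = "response")
# lower: 4/54 = 0.074, middle: 20/94 = 0.213, upper: 30/36 = 0.833
```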
Now visualize the results.
par(mfrow = c(1, 2))
visreg(m)
visreg(m, scale = "response")
Check the logistic regression assumptions.
# No quantitative predictors, so no linearity assumption to check
# Multicollinearity only needs to be assessed when there are multiple predictors
Evaluate how well the model performs, by calculating the index of concordance.
library(rms) # if not installed: install.packages('rms')
ml <- lrm(IsR ~ status, data = dat)
ml # index of concordance = 0.805: well-performing model
# Logistic Regression Model
#
# lrm(formula = IsR ~ status, data = dat)
#
# Model Likelihood Discrimination Rank Discrim.
# Ratio Test Indexes Indexes
# Obs 184 LR chi2 64.46 R2 0.421 C 0.805
# 0 130 d.f. 2 R2(2,184)0.288 Dxy 0.610
# 1 54 Pr(> chi2) <0.0001 R2(2,114.5)0.421 gamma 0.829
# max |deriv| 5e-05 Brier 0.133 tau-a 0.254
#
# Coef S.E. Wald Z Pr(>|Z|)
# Intercept -2.5257 0.5196 -4.86 <0.0001
# status=middle 1.2174 0.5775 2.11 0.0350
# status=upper 4.1352 0.6856 6.03 <0.0001
Given your model, what are the odds of r to schwa when the predictor status is at its reference level (lower)? Hint: the estimates in a logistic regression model are in logits, which is the natural logarithm of the odds. To back-transform to odds, you'll need to invert the logarithm (how?). Add the R code which calculates these odds. Also calculate the probability of the outcome being r when status is at its reference level (lower).
exp(coef(m)["(Intercept)"]) # odds of r instead of schwa in low status
# (Intercept)
# 0.08
plogis(coef(m)["(Intercept)"]) # probability of r instead of schwa in low status
# (Intercept)
# 0.0741
Which status is associated with a greater probability of pronouncing the [r] instead of [ə]? How many times higher are the odds compared to the reference level? Include the R code to calculate this.
exp(coef(m)["statusupper"]) # largest estimate: odds of r vs. schwa for upper relative to lower status
# statusupper
# 62.5
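To attach uncertainty to this odds ratio, the profile-likelihood confidence intervals of the coefficients can be exponentiated (a sketch assuming m as above):

```r
# 95% confidence intervals for the odds ratios
exp(confint(m))
```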
From within RStudio, you can simply download this file using the commands:
# download the original file if it does not already exist (to prevent overwriting)
if (!file.exists('answers.Rmd')) {
download.file('http://www.let.rug.nl/wieling/Statistics/Regression/lab/answers/answers.Rmd',
'answers.Rmd')
}
Subsequently, open it in the editor and use the Knit HTML button to generate the HTML file.
If you use plain R, you first have to install Pandoc. Then run the following lines in a recent version of R.
# install rmarkdown package if not installed
if(!"rmarkdown" %in% rownames(installed.packages())) {
install.packages("rmarkdown")
}
library(rmarkdown) # load rmarkdown package
# download the original file if it does not already exist (to prevent overwriting)
if (!file.exists('answers.Rmd')) {
download.file('http://www.let.rug.nl/wieling/Statistics/Regression/lab/answers/answers.Rmd',
'answers.Rmd')
}
# generate output
render('answers.Rmd') # generates html file with results
# view output in browser
browseURL(paste('file://', file.path(getwd(),'answers.html'), sep='')) # shows result