The data we will use in the first part is a hypothetical study on
child language acquisition (created by a colleague of
mine and slightly adapted by me). We want to investigate the effects
of amount of time spend in front of TV to two-year-old children’s
language development. The response variable in this data set,
cdi
, is a standard measure of children’s language abilities
based on parental reports. The predictor of interest is
tv.hours
, which is the weekly hours of TV time for each
child.
You will have to fill in most commands yourself, but this should be
feasible given the slides of the lecture, which can be viewed here: https://www.let.rug.nl/wieling/Statistics/Regression.
While you can just enter the commands in RStudio, it is also possible to
modify the source of this so-called R-markdown file directly in RStudio
and press the “Knit HTML” button to generate an html file which contains
both the commands you’ve used and their output. You can download the
file to your current working directory in R by pasting the following
command:
download.file('http://www.let.rug.nl/wieling/Statistics/Regression/lab/lab.Rmd', 'lab.Rmd')
.
You can then open this file in RStudio. In this file, all R commands
which are located within chunks (beginning and ending with three
backticks) will be evaluated. Creating an R markdown file is very useful
as your analysis becomes reproducible and easy to check for others. Note
that chunks have options, with which you can customize the output. See
for more information: https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf
Note that the significance threshold \(\alpha\) is set to 0.05 as usual.
We will first download the data (which was saved as a dataframe in R:
an .rda
file), and load it into R.
download.file("http://www.let.rug.nl/wieling/Statistics/Regression/lab/tv.rda", "tv.rda")
load("tv.rda") # an rda file can be loaded with the command load
Investigate the data with descriptive statistics and plots. Also calculate the correlations between the numerical variables.
# your code goes here
In this section, we will fit the best model for the data, predicting
the cdi
language development score on the basis of various
predictors.
Fit the best model without taking into account interactions.
# your code goes here
# your code goes here
# linearity
# autocorrelation in residuals
# multicollinearity
# homoscedasticity
# distribution of residuals
Fit the best model while taking into account interactions. For simplicity and speed, we’ll only investigate potential interactions with sex. Also visualize the results if an interaction is necessary.
# your code goes here
Obtain the effect size, both of the full model and the individual predictors
# your code goes here
# Please note that installing 'lmSupport' using install.packages('lmSupport') does not
# work anymore, instead follow the following steps (If using Windows: download and
# install RTools from:
# https://cran.r-project.org/bin/windows/Rtools/rtools42/rtools.html) Then run the
# following commands from R:
# install.packages(c('psych','AICcmodavg','gplots','gvlma','pwr'))
# install.packages('https://cran.r-project.org/src/contrib/Archive/lmSupport/lmSupport_2.9.13.tar.gz',
# repos=NULL, type='source')
Apply model criticism by excluding observations with residuals outside 2.5 SD.
# your code goes here
Conduct bootstrap sampling with 1000 repetitions. Use
set.seed(100)
to set the seed of the random number
generator.
# your code goes here
In this assignment we will use logistic regression to assess the differences in pronunciation at the end of syllables in New York during the 1950s and 1960s depending on the socioeconomic class (phoneme [r] vs [ə] (schwa)).
We use the following data:
outcome | status |
---|---|
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
r | upper |
schwa | upper |
schwa | upper |
schwa | upper |
schwa | upper |
schwa | upper |
schwa | upper |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
r | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
schwa | middle |
r | lower |
r | lower |
r | lower |
r | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
schwa | lower |
Use R code to generate this data without having to type all the
observations (there are 184 observations…). Hint: you can use the
function c()
to concatenate information and function
rep()
to repeat one element multiple times,
e.g. rep("upper",36)
will create a vector in which
"upper"
is repeated 36 times. In the end you should have a
data frame with 1 dependent variable (pronunciation
, with 2
levels: schwa
and r
) and 1 predictor
(status
, with 3 levels: lower
,
middle
and upper
).
# your code goes here
Furthermore, make sure that both variables are coded as factors and
show that this indeed is the case. Set lower
as your
reference level for the predictor status
. Now add an
additional predictor IsR
which is equal to 1 if
pronunciation
equal [r] and 0 otherwise. This will be our
dependent variable.
# your code goes here
Now fit a binomial logistic regression model to the data.
# your code goes here
Now visualize the results.
# your code goes here
Check the logistic regression assumptions.
# your code goes here
Evaluate how well the model performs, by calculating the index of concordance.
# your code goes here
Given your model, what are the odds of r
to
schwa
when the predictor status
is at its
reference level (lower
)? Hint: the estimates in a logistic
regression model are in logits, which is the natural logarithm of the
odds. To back transform to odds, you’ll need to invert the logartithm
(how?). Add the R code which calculates these odds. Also calculate the
probability of the outcome being r
when status is
at its reference level (lower
).
# your code goes here
Which status
is associated with a greater probability of
pronouncing the [r], instead of [ə]? What are the odds of hearing [r]
with this status as opposed to the lower
status?
# your code goes here
The answers to the questions in this file can be viewed here: https://www.let.rug.nl/wieling/Statistics/Regression/lab/answers. The associated R markdown file can be downloaded here: https://www.let.rug.nl/wieling/Statistics/Regression/lab/answers/answers.Rmd
From within RStudio, you can simply download this file using the commands:
# download original file if not already exists (to prevent overwriting)
if (!file.exists("lab.Rmd")) {
download.file("http://www.let.rug.nl/wieling/Statistics/Regression/lab/lab.Rmd", "lab.Rmd")
}
Subsequently, open it in the editor and use the Knit HMTL button to generate the html file.
If you use plain R, you first have to install Pandoc. Then copy the following lines to the most recent version of R.
# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
install.packages("rmarkdown")
}
library(rmarkdown) # load rmarkdown package
# download original file if not already exists (to prevent overwriting)
if (!file.exists("lab.Rmd")) {
download.file("http://www.let.rug.nl/wieling/Statistics/Regression/lab/lab.Rmd", "lab.Rmd")
}
# generate output
render("lab.Rmd") # generates html file with results
# view output in browser
browseURL(paste("file://", file.path(getwd(), "lab.html"), sep = "")) # shows result