1 Regression

1.1 Introduction

The data we will use in the first part is a hypothetical study on child language acquisition (created by a colleague of mine and slightly adapted by me). We want to investigate the effects of amount of time spend in front of TV to two-year-old children’s language development. The response variable in this data set, cdi, is a standard measure of children’s language abilities based on parental reports. The predictor of interest is tv.hours, which is the weekly hours of TV time for each child.

You will have to fill in most commands yourself, but this should be feasible given the slides of the lecture, which can be viewed here: https://www.let.rug.nl/wieling/Statistics/Regression. While you can just enter the commands in RStudio, it is also possible to modify the source of this so-called R-markdown file directly in RStudio and press the “Knit HTML” button to generate an html file which contains both the commands you’ve used and their output. You can download the file to your current working directory in R by pasting the following command: download.file('http://www.let.rug.nl/wieling/Statistics/Regression/lab/lab.Rmd', 'lab.Rmd'). You can then open this file in RStudio. In this file, all R commands which are located within chunks (beginning and ending with three backticks) will be evaluated. Creating an R markdown file is very useful as your analysis becomes reproducible and easy to check for others. Note that chunks have options, with which you can customize the output. See for more information: https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf

Note that the significance threshold \(\alpha\) is set to 0.05 as usual.

1.2 Importing the data

We will first download the data (which was saved as a dataframe in R: an .rda file), and load it into R.

download.file("http://www.let.rug.nl/wieling/Statistics/Regression/lab/tv.rda", "tv.rda")
load("tv.rda")  # an rda file can be loaded with the command load 

1.3 Structure of the data

Investigate the data with descriptive statistics and plots. Also calculate the correlations between the numerical variables.

# your code goes here

1.4 Regression modeling

In this section, we will fit the best model for the data, predicting the cdi language development score on the basis of various predictors.

1.4.1 The best model without interactions

Fit the best model without taking into account interactions.

# your code goes here

1.4.2 Visualize the results

# your code goes here

1.4.3 Assess the regression assumptions discussed in the lecture

# linearity

# autocorrelation in residuals

# multicollinearity

# homoscedasticity

# distribution of residuals

1.4.4 Are interactions necessary?

Fit the best model while taking into account interactions. For simplicity and speed, we’ll only investigate potential interactions with sex. Also visualize the results if an interaction is necessary.

# your code goes here

1.4.5 Effect size

Obtain the effect size, both of the full model and the individual predictors

# your code goes here

# Please note that installing 'lmSupport' using install.packages('lmSupport') does not
# work anymore, instead follow the following steps (If using Windows: download and
# install RTools from:
# https://cran.r-project.org/bin/windows/Rtools/rtools42/rtools.html) Then run the
# following commands from R:
# install.packages(c('psych','AICcmodavg','gplots','gvlma','pwr'))
# install.packages('https://cran.r-project.org/src/contrib/Archive/lmSupport/lmSupport_2.9.13.tar.gz',
# repos=NULL, type='source')

1.4.6 Model criticism

Apply model criticism by excluding observations with residuals outside 2.5 SD.

# your code goes here

1.4.7 Bootstrap sampling

Conduct bootstrap sampling with 1000 repetitions. Use set.seed(100) to set the seed of the random number generator.

# your code goes here

2 Logistic regression

2.1 Introduction

In this assignment we will use logistic regression to assess the differences in pronunciation at the end of syllables in New York during the 1950s and 1960s depending on the socioeconomic class (phoneme [r] vs [ə] (schwa)).

We use the following data:

outcome status
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
r upper
schwa upper
schwa upper
schwa upper
schwa upper
schwa upper
schwa upper
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
r middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
schwa middle
r lower
r lower
r lower
r lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower
schwa lower

2.2 Create data frame

Use R code to generate this data without having to type all the observations (there are 184 observations…). Hint: you can use the function c() to concatenate information and function rep() to repeat one element multiple times, e.g. rep("upper",36) will create a vector in which "upper" is repeated 36 times. In the end you should have a data frame with 1 dependent variable (pronunciation, with 2 levels: schwa and r) and 1 predictor (status, with 3 levels: lower, middle and upper).

# your code goes here

Furthermore, make sure that both variables are coded as factors and show that this indeed is the case. Set lower as your reference level for the predictor status. Now add an additional predictor IsR which is equal to 1 if pronunciation equal [r] and 0 otherwise. This will be our dependent variable.

# your code goes here

2.3 Logistic regression model

Now fit a binomial logistic regression model to the data.

# your code goes here

2.4 Visualize the results

Now visualize the results.

# your code goes here

2.5 Assumptions

Check the logistic regression assumptions.

# your code goes here

2.6 Goodness of fit

Evaluate how well the model performs, by calculating the index of concordance.

# your code goes here

2.7 Interpretation

Given your model, what are the odds of r to schwa when the predictor status is at its reference level (lower)? Hint: the estimates in a logistic regression model are in logits, which is the natural logarithm of the odds. To back transform to odds, you’ll need to invert the logartithm (how?). Add the R code which calculates these odds. Also calculate the probability of the outcome being r when status is at its reference level (lower).

# your code goes here

Which status is associated with a greater probability of pronouncing the [r], instead of [ə]? What are the odds of hearing [r] with this status as opposed to the lower status?

# your code goes here

3 Answers

The answers to the questions in this file can be viewed here: https://www.let.rug.nl/wieling/Statistics/Regression/lab/answers. The associated R markdown file can be downloaded here: https://www.let.rug.nl/wieling/Statistics/Regression/lab/answers/answers.Rmd

4 Replication

From within RStudio, you can simply download this file using the commands:

# download original file if not already exists (to prevent overwriting)
if (!file.exists("lab.Rmd")) {
    download.file("http://www.let.rug.nl/wieling/Statistics/Regression/lab/lab.Rmd", "lab.Rmd")
}

Subsequently, open it in the editor and use the Knit HMTL button to generate the html file.

If you use plain R, you first have to install Pandoc. Then copy the following lines to the most recent version of R.

# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
    install.packages("rmarkdown")
}
library(rmarkdown)  # load rmarkdown package

# download original file if not already exists (to prevent overwriting)
if (!file.exists("lab.Rmd")) {
    download.file("http://www.let.rug.nl/wieling/Statistics/Regression/lab/lab.Rmd", "lab.Rmd")
}

# generate output
render("lab.Rmd")  # generates html file with results

# view output in browser
browseURL(paste("file://", file.path(getwd(), "lab.html"), sep = ""))  # shows result