1 Introduction

This file contains the answers of the lab session: http://www.let.rug.nl/wieling/statscourse/CrashCourseR/lab.

2 Importing the data

We will first download a csv file generated in Excel and import it into R.

download.file("http://www.let.rug.nl/wieling/statscourse/CrashCourseR/lab/mtcars.csv", "mtcars.csv")

# now import the data yourself into an R data frame with the name dat using the function: read.csv2()
dat <- read.csv2("mtcars.csv")

3 Structure of the data

Note that this dataset is similar to the mtcars dataset standard available in R, so the description of the columns can be obtained with ?mtcars. There is one addition column ‘region’ which contains the region which the car maker originated from. In the following, you will look at the structure of the data using various functions.

# Look at the structure of the data using the functions: str, summary and head
str(dat)
# 'data.frame': 32 obs. of  13 variables:
#  $ model : Factor w/ 32 levels "AMC Javelin",..: 18 19 5 13 14 31 7 21 20 22 ...
#  $ mpg   : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#  $ cyl   : int  6 6 4 6 8 6 8 4 4 6 ...
#  $ disp  : num  160 160 108 258 360 ...
#  $ hp    : int  110 110 93 110 175 105 245 62 95 123 ...
#  $ drat  : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt    : num  2.62 2.88 2.32 3.21 3.44 ...
#  $ qsec  : num  16.5 17 18.6 19.4 17 ...
#  $ vs    : int  0 0 1 1 0 1 0 1 1 1 ...
#  $ am    : int  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear  : int  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb  : int  4 4 1 1 2 1 4 2 2 4 ...
#  $ region: Factor w/ 3 levels "Asia","Europe",..: 1 1 1 3 3 3 3 2 2 2 ...
summary(dat)
#                 model         mpg            cyl            disp           hp           drat     
#  AMC Javelin       : 1   Min.   :10.4   Min.   :4.00   Min.   : 71   Min.   : 52   Min.   :2.76  
#  Cadillac Fleetwood: 1   1st Qu.:15.4   1st Qu.:4.00   1st Qu.:121   1st Qu.: 96   1st Qu.:3.08  
#  Camaro Z28        : 1   Median :19.2   Median :6.00   Median :196   Median :123   Median :3.69  
#  Chrysler Imperial : 1   Mean   :20.1   Mean   :6.19   Mean   :231   Mean   :147   Mean   :3.60  
#  Datsun 710        : 1   3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326   3rd Qu.:180   3rd Qu.:3.92  
#  Dodge Challenger  : 1   Max.   :33.9   Max.   :8.00   Max.   :472   Max.   :335   Max.   :4.93  
#  (Other)           :26                                                                           
#        wt            qsec            vs              am             gear           carb     
#  Min.   :1.51   Min.   :14.5   Min.   :0.000   Min.   :0.000   Min.   :3.00   Min.   :1.00  
#  1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.00  
#  Median :3.33   Median :17.7   Median :0.000   Median :0.000   Median :4.00   Median :2.00  
#  Mean   :3.22   Mean   :17.9   Mean   :0.438   Mean   :0.406   Mean   :3.69   Mean   :2.81  
#  3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:4.00  
#  Max.   :5.42   Max.   :22.9   Max.   :1.000   Max.   :1.000   Max.   :5.00   Max.   :8.00  
#                                                                                             
#     region  
#  Asia  : 9  
#  Europe:14  
#  USA   : 9  
#             
#             
#             
# 
head(dat)
#               model  mpg cyl disp  hp drat   wt qsec vs am gear carb region
# 1         Mazda RX4 21.0   6  160 110 3.90 2.62 16.5  0  1    4    4   Asia
# 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 17.0  0  1    4    4   Asia
# 3        Datsun 710 22.8   4  108  93 3.85 2.32 18.6  1  1    4    1   Asia
# 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.4  1  0    3    1    USA
# 5 Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2    USA
# 6           Valiant 18.1   6  225 105 2.76 3.46 20.2  1  0    3    1    USA

4 Modifying the data

In this section, you will add two columns to the data.

# Add a column to the data relHP which should contain the hp of the car divided by the weight (column
# wt)
dat$relHP <- dat$hp/dat$wt

# Next, add a column to the data named sportscar which is TRUE when the relHP > 42 and FALSE
# otherwise
dat$sportscar <- FALSE
dat[dat$relHP > 42, ]$sportscar <- TRUE

# Look at the data using head
head(dat)
#               model  mpg cyl disp  hp drat   wt qsec vs am gear carb region relHP sportscar
# 1         Mazda RX4 21.0   6  160 110 3.90 2.62 16.5  0  1    4    4   Asia  42.0     FALSE
# 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 17.0  0  1    4    4   Asia  38.3     FALSE
# 3        Datsun 710 22.8   4  108  93 3.85 2.32 18.6  1  1    4    1   Asia  40.1     FALSE
# 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.4  1  0    3    1    USA  34.2     FALSE
# 5 Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2    USA  50.9      TRUE
# 6           Valiant 18.1   6  225 105 2.76 3.46 20.2  1  0    3    1    USA  30.3     FALSE

5 Investigating the data

In this section, we will look at the variables in more detail. Specifically, we will look at measures of spread and central tendency, and frequency tables for individual variables. Furthermore, we will investigate the relationship between pairs of variables.

# How many sportscars are there (according to our definition)? Hint: use table()
table(dat$sportscar)
# 
# FALSE  TRUE 
#    17    15
# What is the mean weight of the cars?
mean(dat$wt)
# [1] 3.22
# What is the standard deviation of the weight of the cars?
sd(dat$wt)
# [1] 0.978
# How many cars have 6 cylinders?
table(dat$cyl)
# 
#  4  6  8 
# 11  7 14
# What is the correlation between weight and horsepower?
cor(dat$wt, dat$hp)
# [1] 0.659
# How are being a sportscar and the number of gears related?
table(dat$sportscar, dat$gear)
#        
#          3  4  5
#   FALSE  5 12  0
#   TRUE  10  0  5

6 Visualizing the data

In this section, we will look at the variables in more detail through visualization.

# Create a boxplot with the weight for sportscars
boxplot(dat[dat$sportscar, ]$wt)

# Create a boxplot with the weight, separately for the number of cylinders Hint: boxplot can also be
# used with the formula interface: wt ~ cyl, data=dat
boxplot(wt ~ cyl, data = dat)

# Show the histogram for relHP
hist(dat$relHP)

# Show the histogram for wt and hp next to each other.  Set the color of the bars to 'red' for wt and
# 'blue' for hp.  Hint: use par() to place the graphs besides each other and use ?hist to see what
# parameter to use for the color
par(mfrow = c(1, 2))
hist(dat$wt, col = "red")
hist(dat$hp, col = "blue")

# Show the Q-Q plot of qsec (time for driving 1/4 mile)
par(mfrow = c(1, 1))
qqnorm(dat$qsec)
qqline(dat$qsec)

# Create a new data frame named 'tmp' excluding the outlier
tmp <- dat[!dat$qsec > 22, ]
dim(tmp)
# [1] 31 15
# Create a barplot contrasting automatic vs. manual transmission (column 'am') Give the plot a
# header: 'Transmission' and provide names below the bars: 'A' and 'M'
counts <- table(dat$am)
barplot(counts, main = "Transmission", names = c("A", "M"))

# Create a segmented barplot showing the relationship between being a sportscar and the type of
# transmission
counts <- table(dat$sportscar, dat$am)
barplot(counts, xlab = "Transmission", col = c("blue", "red"), legend = c("regular", "sport"), names = c("A", 
    "M"))

7 Basic statistical analyses

In this section, we will conduct basic statistical analyses of the data.

7.1 Comparing one or two means

# We will investigate if the weight in the sample significantly differs from 3 (x1000) lbs. However,
# before conducting a t-test we need to assess if the variable is normally distributed

# Assess if wt is normally distributed
qqnorm(dat$wt)
qqline(dat$wt)

shapiro.test(dat$wt)  # OK
# 
#   Shapiro-Wilk normality test
# 
# data:  dat$wt
# W = 0.9, p-value = 0.09
# Run the one-sample t-test
t.test(dat$wt, mu = 3, alternative = "two.sided")
# 
#   One Sample t-test
# 
# data:  dat$wt
# t = 1, df = 30, p-value = 0.2
# alternative hypothesis: true mean is not equal to 3
# 95 percent confidence interval:
#  2.86 3.57
# sample estimates:
# mean of x 
#      3.22
# Assess if the weight of sportscars differs significantly from non-sportscars.  First assess if the
# distribution of wt of both groups is approximately normal.  If not, use an appropriate
# non-parametric alternative
shapiro.test(dat[dat$sportscar, ]$wt)  # OK
# 
#   Shapiro-Wilk normality test
# 
# data:  dat[dat$sportscar, ]$wt
# W = 0.9, p-value = 0.1
shapiro.test(dat[!dat$sportscar, ]$wt)  # not OK
# 
#   Shapiro-Wilk normality test
# 
# data:  dat[!dat$sportscar, ]$wt
# W = 0.9, p-value = 0.04
wilcox.test(dat[dat$sportscar, ]$wt, dat[!dat$sportscar, ]$wt, alternative = "two.sided")
# Warning in wilcox.test.default(dat[dat$sportscar, ]$wt, dat[!dat$sportscar, : cannot compute exact
# p-value with ties
# 
#   Wilcoxon rank sum test with continuity correction
# 
# data:  dat[dat$sportscar, ]$wt and dat[!dat$sportscar, ]$wt
# W = 200, p-value = 0.05
# alternative hypothesis: true location shift is not equal to 0
# N.B. to get exact p-values in the presence of ties, the help indicates wilcox_ties of the package
# coin can be used:
library(coin)
wilcox_test(wt ~ sportscar, data = dat)
# Error in check(object): 'object' does not represent a two-sample problem (maybe the grouping variable is not a factor?)
# As sportscar is not a factor (it is a logical variable), it gives an error.  The solution is to
# convert it to a factor
dat$sportscar <- as.factor(dat$sportscar)
wilcox_test(wt ~ sportscar, data = dat)
# 
#   Asymptotic Wilcoxon-Mann-Whitney Test
# 
# data:  wt by sportscar (FALSE, TRUE)
# Z = -2, p-value = 0.04
# alternative hypothesis: true mu is not equal to 0

7.2 Assessing categorical dependence

# Assess if there is a dependency between transmission and sportscar.
chisq.test(table(dat$am, dat$sportscar))
# 
#   Pearson's Chi-squared test with Yates' continuity correction
# 
# data:  table(dat$am, dat$sportscar)
# X-squared = 0.2, df = 1, p-value = 0.7

7.3 Comparing three or more means

# Assess if the region of the car maker influences the weight.  If the region influences the weight
# also assess which regions differ.
summary(result <- aov(wt ~ region, data = dat))
#             Df Sum Sq Mean Sq F value Pr(>F)   
# region       2   8.81    4.41    6.13  0.006 **
# Residuals   29  20.86    0.72                  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(result)
#   Tukey multiple comparisons of means
#     95% family-wise confidence level
# 
# Fit: aov(formula = wt ~ region, data = dat)
# 
# $region
#              diff    lwr  upr p adj
# Europe-Asia 0.218 -0.677 1.11 0.820
# USA-Asia    1.283  0.295 2.27 0.009
# USA-Europe  1.065  0.170 1.96 0.017
# Investigate if the region of the car maker and the type of car (sportscar or not) influence the
# weight. As we will use factorial anova, all predictors need to be factors. Currently sportscar is a
# logical predictor (TRUE or FALSE), so we need to convert it to a factor with: dat$sportscar <-
# as.factor(dat$sportscar).
dat$sportscar <- as.factor(dat$sportscar)

# Interaction plot:
with(dat, interaction.plot(sportscar, region, wt, col = c("blue", "red"), type = "b"))

# First we check if the data is balanced
table(dat$region, dat$sportscar)  # not balanced
#         
#          FALSE TRUE
#   Asia       6    3
#   Europe     7    7
#   USA        4    5
# This means that the SS type is important, so we use type 3:
library(car)
op <- options(contrasts = c("contr.sum", "contr.poly"))  # orthogonal contrasts
Anova(aov(wt ~ sportscar * region, data = dat), type = 3)
# Anova Table (Type III tests)
# 
# Response: wt
#                  Sum Sq Df F value Pr(>F)    
# (Intercept)         327  1  528.85 <2e-16 ***
# sportscar             1  1    1.57 0.2212    
# region                7  2    6.06 0.0069 ** 
# sportscar:region      4  2    3.28 0.0536 .  
# Residuals            16 26                   
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# interaction not significant, so exclude interaction
Anova(aov(wt ~ sportscar + region, data = dat), type = 3)
# Anova Table (Type III tests)
# 
# Response: wt
#             Sum Sq Df F value Pr(>F)    
# (Intercept)    327  1  454.16 <2e-16 ***
# sportscar        1  1    1.02 0.3207    
# region           8  2    5.58 0.0091 ** 
# Residuals       20 28                   
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# sportscar not significant, so the best model, is the one above (result)

# Finally assess if adding the number of carburators (carb) is a signiciant covariate.
Anova(aov(wt ~ carb + region, data = dat), type = 3)  # yes
# Anova Table (Type III tests)
# 
# Response: wt
#             Sum Sq Df F value  Pr(>F)    
# (Intercept)   50.7  1   88.30 3.7e-10 ***
# carb           4.8  1    8.31  0.0075 ** 
# region         8.2  2    7.10  0.0032 ** 
# Residuals     16.1 28                    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
options(op)  # reset contrasts to treatment coding

7.4 Multiple predictors

# Assess how well the weight of a sportscar can be predicted by the number of cylinders, the region,
# and the type of transmission.
summary(model <- lm(wt ~ cyl + region + am, data = dat))
# 
# Call:
# lm(formula = wt ~ cyl + region + am, data = dat)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -0.8106 -0.3082 -0.0127  0.2445  1.1734 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    1.4671     0.4862    3.02  0.00550 ** 
# cyl            0.2910     0.0668    4.35  0.00017 ***
# regionEurope   0.2393     0.2254    1.06  0.29773    
# regionUSA      0.4552     0.2744    1.66  0.10874    
# am            -0.6976     0.2254   -3.09  0.00455 ** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.525 on 27 degrees of freedom
# Multiple R-squared:  0.749,   Adjusted R-squared:  0.712 
# F-statistic: 20.2 on 4 and 27 DF,  p-value: 8.63e-08
# Is region necessary in the model?  Hint: use model comparison
m0 <- lm(wt ~ cyl + am, data = dat)
anova(m0, model)  # not necessary
# Analysis of Variance Table
# 
# Model 1: wt ~ cyl + am
# Model 2: wt ~ cyl + region + am
#   Res.Df  RSS Df Sum of Sq    F Pr(>F)
# 1     29 8.22                         
# 2     27 7.44  2     0.781 1.42   0.26
summary(m0)
# 
# Call:
# lm(formula = wt ~ cyl + am, data = dat)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -0.6676 -0.3343 -0.0581  0.1866  1.3214 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   1.5665     0.4532    3.46   0.0017 ** 
# cyl           0.3170     0.0628    5.05  2.2e-05 ***
# am           -0.7649     0.2248   -3.40   0.0020 ** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.533 on 29 degrees of freedom
# Multiple R-squared:  0.723,   Adjusted R-squared:  0.704 
# F-statistic: 37.8 on 2 and 29 DF,  p-value: 8.28e-09
# Interpretation: more cylinders => heavier, manual transmission (am=1): lighter

8 Replication

From within RStudio, you can simply download this file using the command download.file('http://www.let.rug.nl/wieling/statscourse/CrashCourseR/lab/answers/answers.Rmd', 'answers.Rmd'), open it in the editor and use the Knit HMTL button to generate the html file. If you use plain R, you first have to install Pandoc. Then copy the following lines to the most recent version of R. (Or use the Knit HTML button when viewing the file answers.Rmd in RStudio.)

# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
    install.packages("rmarkdown", repos="http://cran.us.r-project.org")
}
library(rmarkdown)  # load rmarkdown package

# download original file if not already exists (to prevent overwriting)
if (!file.exists("answers.Rmd")) {
    download.file("http://www.let.rug.nl/wieling/statscourse/CrashCourseR/lab/answers/answers.Rmd", "answers.Rmd")
}

# generate output
render("lab.Rmd")  # generates html file with results

# view output in browser
browseURL(paste("file://", file.path(getwd(), "answers.html"), sep = ""))  # shows result