This file contains the answers of the lab session: http://www.let.rug.nl/wieling/statscourse/CrashCourseR/lab.
We will first download a csv file generated in Excel and import it into R.
download.file("http://www.let.rug.nl/wieling/statscourse/CrashCourseR/lab/mtcars.csv", "mtcars.csv")
# now import the data yourself into an R data frame with the name dat using the function: read.csv2()
dat <- read.csv2("mtcars.csv")
Note that this dataset is similar to the mtcars dataset standard available in R, so the description of the columns can be obtained with ?mtcars
. There is one addition column ‘region’ which contains the region which the car maker originated from. In the following, you will look at the structure of the data using various functions.
# Look at the structure of the data using the functions: str, summary and head
str(dat)
# 'data.frame': 32 obs. of 13 variables:
# $ model : Factor w/ 32 levels "AMC Javelin",..: 18 19 5 13 14 31 7 21 20 22 ...
# $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
# $ disp : num 160 160 108 258 360 ...
# $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
# $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
# $ qsec : num 16.5 17 18.6 19.4 17 ...
# $ vs : int 0 0 1 1 0 1 0 1 1 1 ...
# $ am : int 1 1 1 0 0 0 0 0 0 0 ...
# $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
# $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
# $ region: Factor w/ 3 levels "Asia","Europe",..: 1 1 1 3 3 3 3 2 2 2 ...
summary(dat)
# model mpg cyl disp hp drat
# AMC Javelin : 1 Min. :10.4 Min. :4.00 Min. : 71 Min. : 52 Min. :2.76
# Cadillac Fleetwood: 1 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:121 1st Qu.: 96 1st Qu.:3.08
# Camaro Z28 : 1 Median :19.2 Median :6.00 Median :196 Median :123 Median :3.69
# Chrysler Imperial : 1 Mean :20.1 Mean :6.19 Mean :231 Mean :147 Mean :3.60
# Datsun 710 : 1 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326 3rd Qu.:180 3rd Qu.:3.92
# Dodge Challenger : 1 Max. :33.9 Max. :8.00 Max. :472 Max. :335 Max. :4.93
# (Other) :26
# wt qsec vs am gear carb
# Min. :1.51 Min. :14.5 Min. :0.000 Min. :0.000 Min. :3.00 Min. :1.00
# 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.00
# Median :3.33 Median :17.7 Median :0.000 Median :0.000 Median :4.00 Median :2.00
# Mean :3.22 Mean :17.9 Mean :0.438 Mean :0.406 Mean :3.69 Mean :2.81
# 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:4.00
# Max. :5.42 Max. :22.9 Max. :1.000 Max. :1.000 Max. :5.00 Max. :8.00
#
# region
# Asia : 9
# Europe:14
# USA : 9
#
#
#
#
head(dat)
# model mpg cyl disp hp drat wt qsec vs am gear carb region
# 1 Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4 Asia
# 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4 Asia
# 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 Asia
# 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1 USA
# 5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 USA
# 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 USA
In this section, you will add two columns to the data.
# Add a column to the data relHP which should contain the hp of the car divided by the weight (column
# wt)
dat$relHP <- dat$hp/dat$wt
# Next, add a column to the data named sportscar which is TRUE when the relHP > 42 and FALSE
# otherwise
dat$sportscar <- FALSE
dat[dat$relHP > 42, ]$sportscar <- TRUE
# Look at the data using head
head(dat)
# model mpg cyl disp hp drat wt qsec vs am gear carb region relHP sportscar
# 1 Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4 Asia 42.0 FALSE
# 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4 Asia 38.3 FALSE
# 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 Asia 40.1 FALSE
# 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1 USA 34.2 FALSE
# 5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 USA 50.9 TRUE
# 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 USA 30.3 FALSE
In this section, we will look at the variables in more detail. Specifically, we will look at measures of spread and central tendency, and frequency tables for individual variables. Furthermore, we will investigate the relationship between pairs of variables.
# How many sportscars are there (according to our definition)? Hint: use table()
table(dat$sportscar)
#
# FALSE TRUE
# 17 15
# What is the mean weight of the cars?
mean(dat$wt)
# [1] 3.22
# What is the standard deviation of the weight of the cars?
sd(dat$wt)
# [1] 0.978
# How many cars have 6 cylinders?
table(dat$cyl)
#
# 4 6 8
# 11 7 14
# What is the correlation between weight and horsepower?
cor(dat$wt, dat$hp)
# [1] 0.659
# How are being a sportscar and the number of gears related?
table(dat$sportscar, dat$gear)
#
# 3 4 5
# FALSE 5 12 0
# TRUE 10 0 5
In this section, we will look at the variables in more detail through visualization.
# Create a boxplot with the weight for sportscars
boxplot(dat[dat$sportscar, ]$wt)
# Create a boxplot with the weight, separately for the number of cylinders Hint: boxplot can also be
# used with the formula interface: wt ~ cyl, data=dat
boxplot(wt ~ cyl, data = dat)
# Show the histogram for relHP
hist(dat$relHP)
# Show the histogram for wt and hp next to each other. Set the color of the bars to 'red' for wt and
# 'blue' for hp. Hint: use par() to place the graphs besides each other and use ?hist to see what
# parameter to use for the color
par(mfrow = c(1, 2))
hist(dat$wt, col = "red")
hist(dat$hp, col = "blue")
# Show the Q-Q plot of qsec (time for driving 1/4 mile)
par(mfrow = c(1, 1))
qqnorm(dat$qsec)
qqline(dat$qsec)
# Create a new data frame named 'tmp' excluding the outlier
tmp <- dat[!dat$qsec > 22, ]
dim(tmp)
# [1] 31 15
# Create a barplot contrasting automatic vs. manual transmission (column 'am') Give the plot a
# header: 'Transmission' and provide names below the bars: 'A' and 'M'
counts <- table(dat$am)
barplot(counts, main = "Transmission", names = c("A", "M"))
# Create a segmented barplot showing the relationship between being a sportscar and the type of
# transmission
counts <- table(dat$sportscar, dat$am)
barplot(counts, xlab = "Transmission", col = c("blue", "red"), legend = c("regular", "sport"), names = c("A",
"M"))
In this section, we will conduct basic statistical analyses of the data.
# We will investigate if the weight in the sample significantly differs from 3 (x1000) lbs. However,
# before conducting a t-test we need to assess if the variable is normally distributed
# Assess if wt is normally distributed
qqnorm(dat$wt)
qqline(dat$wt)
shapiro.test(dat$wt) # OK
#
# Shapiro-Wilk normality test
#
# data: dat$wt
# W = 0.9, p-value = 0.09
# Run the one-sample t-test
t.test(dat$wt, mu = 3, alternative = "two.sided")
#
# One Sample t-test
#
# data: dat$wt
# t = 1, df = 30, p-value = 0.2
# alternative hypothesis: true mean is not equal to 3
# 95 percent confidence interval:
# 2.86 3.57
# sample estimates:
# mean of x
# 3.22
# Assess if the weight of sportscars differs significantly from non-sportscars. First assess if the
# distribution of wt of both groups is approximately normal. If not, use an appropriate
# non-parametric alternative
shapiro.test(dat[dat$sportscar, ]$wt) # OK
#
# Shapiro-Wilk normality test
#
# data: dat[dat$sportscar, ]$wt
# W = 0.9, p-value = 0.1
shapiro.test(dat[!dat$sportscar, ]$wt) # not OK
#
# Shapiro-Wilk normality test
#
# data: dat[!dat$sportscar, ]$wt
# W = 0.9, p-value = 0.04
wilcox.test(dat[dat$sportscar, ]$wt, dat[!dat$sportscar, ]$wt, alternative = "two.sided")
# Warning in wilcox.test.default(dat[dat$sportscar, ]$wt, dat[!dat$sportscar, : cannot compute exact
# p-value with ties
#
# Wilcoxon rank sum test with continuity correction
#
# data: dat[dat$sportscar, ]$wt and dat[!dat$sportscar, ]$wt
# W = 200, p-value = 0.05
# alternative hypothesis: true location shift is not equal to 0
# N.B. to get exact p-values in the presence of ties, the help indicates wilcox_ties of the package
# coin can be used:
library(coin)
wilcox_test(wt ~ sportscar, data = dat)
# Error in check(object): 'object' does not represent a two-sample problem (maybe the grouping variable is not a factor?)
# As sportscar is not a factor (it is a logical variable), it gives an error. The solution is to
# convert it to a factor
dat$sportscar <- as.factor(dat$sportscar)
wilcox_test(wt ~ sportscar, data = dat)
#
# Asymptotic Wilcoxon-Mann-Whitney Test
#
# data: wt by sportscar (FALSE, TRUE)
# Z = -2, p-value = 0.04
# alternative hypothesis: true mu is not equal to 0
# Assess if there is a dependency between transmission and sportscar.
chisq.test(table(dat$am, dat$sportscar))
#
# Pearson's Chi-squared test with Yates' continuity correction
#
# data: table(dat$am, dat$sportscar)
# X-squared = 0.2, df = 1, p-value = 0.7
# Assess if the region of the car maker influences the weight. If the region influences the weight
# also assess which regions differ.
summary(result <- aov(wt ~ region, data = dat))
# Df Sum Sq Mean Sq F value Pr(>F)
# region 2 8.81 4.41 6.13 0.006 **
# Residuals 29 20.86 0.72
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(result)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
# Fit: aov(formula = wt ~ region, data = dat)
#
# $region
# diff lwr upr p adj
# Europe-Asia 0.218 -0.677 1.11 0.820
# USA-Asia 1.283 0.295 2.27 0.009
# USA-Europe 1.065 0.170 1.96 0.017
# Investigate if the region of the car maker and the type of car (sportscar or not) influence the
# weight. As we will use factorial anova, all predictors need to be factors. Currently sportscar is a
# logical predictor (TRUE or FALSE), so we need to convert it to a factor with: dat$sportscar <-
# as.factor(dat$sportscar).
dat$sportscar <- as.factor(dat$sportscar)
# Interaction plot:
with(dat, interaction.plot(sportscar, region, wt, col = c("blue", "red"), type = "b"))
# First we check if the data is balanced
table(dat$region, dat$sportscar) # not balanced
#
# FALSE TRUE
# Asia 6 3
# Europe 7 7
# USA 4 5
# This means that the SS type is important, so we use type 3:
library(car)
op <- options(contrasts = c("contr.sum", "contr.poly")) # orthogonal contrasts
Anova(aov(wt ~ sportscar * region, data = dat), type = 3)
# Anova Table (Type III tests)
#
# Response: wt
# Sum Sq Df F value Pr(>F)
# (Intercept) 327 1 528.85 <2e-16 ***
# sportscar 1 1 1.57 0.2212
# region 7 2 6.06 0.0069 **
# sportscar:region 4 2 3.28 0.0536 .
# Residuals 16 26
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# interaction not significant, so exclude interaction
Anova(aov(wt ~ sportscar + region, data = dat), type = 3)
# Anova Table (Type III tests)
#
# Response: wt
# Sum Sq Df F value Pr(>F)
# (Intercept) 327 1 454.16 <2e-16 ***
# sportscar 1 1 1.02 0.3207
# region 8 2 5.58 0.0091 **
# Residuals 20 28
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# sportscar not significant, so the best model, is the one above (result)
# Finally assess if adding the number of carburators (carb) is a signiciant covariate.
Anova(aov(wt ~ carb + region, data = dat), type = 3) # yes
# Anova Table (Type III tests)
#
# Response: wt
# Sum Sq Df F value Pr(>F)
# (Intercept) 50.7 1 88.30 3.7e-10 ***
# carb 4.8 1 8.31 0.0075 **
# region 8.2 2 7.10 0.0032 **
# Residuals 16.1 28
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
options(op) # reset contrasts to treatment coding
# Assess how well the weight of a sportscar can be predicted by the number of cylinders, the region,
# and the type of transmission.
summary(model <- lm(wt ~ cyl + region + am, data = dat))
#
# Call:
# lm(formula = wt ~ cyl + region + am, data = dat)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8106 -0.3082 -0.0127 0.2445 1.1734
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.4671 0.4862 3.02 0.00550 **
# cyl 0.2910 0.0668 4.35 0.00017 ***
# regionEurope 0.2393 0.2254 1.06 0.29773
# regionUSA 0.4552 0.2744 1.66 0.10874
# am -0.6976 0.2254 -3.09 0.00455 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.525 on 27 degrees of freedom
# Multiple R-squared: 0.749, Adjusted R-squared: 0.712
# F-statistic: 20.2 on 4 and 27 DF, p-value: 8.63e-08
# Is region necessary in the model? Hint: use model comparison
m0 <- lm(wt ~ cyl + am, data = dat)
anova(m0, model) # not necessary
# Analysis of Variance Table
#
# Model 1: wt ~ cyl + am
# Model 2: wt ~ cyl + region + am
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 29 8.22
# 2 27 7.44 2 0.781 1.42 0.26
summary(m0)
#
# Call:
# lm(formula = wt ~ cyl + am, data = dat)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.6676 -0.3343 -0.0581 0.1866 1.3214
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.5665 0.4532 3.46 0.0017 **
# cyl 0.3170 0.0628 5.05 2.2e-05 ***
# am -0.7649 0.2248 -3.40 0.0020 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.533 on 29 degrees of freedom
# Multiple R-squared: 0.723, Adjusted R-squared: 0.704
# F-statistic: 37.8 on 2 and 29 DF, p-value: 8.28e-09
# Interpretation: more cylinders => heavier, manual transmission (am=1): lighter
From within RStudio, you can simply download this file using the command download.file('http://www.let.rug.nl/wieling/statscourse/CrashCourseR/lab/answers/answers.Rmd', 'answers.Rmd')
, open it in the editor and use the Knit HMTL button to generate the html file. If you use plain R, you first have to install Pandoc. Then copy the following lines to the most recent version of R. (Or use the Knit HTML
button when viewing the file answers.Rmd
in RStudio.)
# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
install.packages("rmarkdown", repos="http://cran.us.r-project.org")
}
library(rmarkdown) # load rmarkdown package
# download original file if not already exists (to prevent overwriting)
if (!file.exists("answers.Rmd")) {
download.file("http://www.let.rug.nl/wieling/statscourse/CrashCourseR/lab/answers/answers.Rmd", "answers.Rmd")
}
# generate output
render("lab.Rmd") # generates html file with results
# view output in browser
browseURL(paste("file://", file.path(getwd(), "answers.html"), sep = "")) # shows result