This file contains the answers to the lab session: https://www.let.rug.nl/wieling/Statistics/Intro-R/lab.
We will first download a CSV file generated in Excel and import it into R.
download.file("http://www.let.rug.nl/wieling/Statistics/Intro-R/lab/mtcars.csv",
"mtcars.csv")
# now import the data yourself into an R data frame with the name dat
# using the function: read.csv2()
dat <- read.csv2("mtcars.csv")
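The choice of read.csv2() matters here: it expects semicolon-separated fields and a comma as decimal mark, which is what Excel writes in many European locales (presumably the format of mtcars.csv, given the instruction above). A minimal sketch of the difference, using a small made-up file whose contents and column names are illustrative and not taken from the lab data:

```r
# Write a tiny semicolon-separated file with decimal commas (made-up data)
tmpfile <- tempfile(fileext = ".csv")
writeLines(c("model;mpg;cyl", "Car A;21,5;6", "Car B;18,1;8"), tmpfile)

# read.csv2() defaults to sep = ";" and dec = ","
ok <- read.csv2(tmpfile)
str(ok)

# read.csv() defaults to sep = "," and dec = ".", so the same file is
# split at the decimal commas instead of at the semicolons
bad <- read.csv(tmpfile)
str(bad)
```

If str() shows fewer columns than expected, or numbers stored as text, the separator or decimal mark is the first thing to check.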
Note that this dataset is similar to the mtcars dataset that ships with
R, so the description of the columns can be obtained with ?mtcars.
There is one additional column, ‘region’, which contains the region the
car maker originates from. In the following, you will look at the
structure of the data using various functions.
# Look at the structure of the data using the functions: str, summary and
# head
str(dat)
# 'data.frame': 32 obs. of 13 variables:
# $ model : chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
# $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
# $ disp : num 160 160 108 258 360 ...
# $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
# $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
# $ qsec : num 16.5 17 18.6 19.4 17 ...
# $ vs : int 0 0 1 1 0 1 0 1 1 1 ...
# $ am : int 1 1 1 0 0 0 0 0 0 0 ...
# $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
# $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
# $ region: chr "Asia" "Asia" "Asia" "USA" ...
summary(dat)
# model mpg cyl disp
# Length:32 Min. :10.4 Min. :4.00 Min. : 71.1
# Class :character 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:120.8
# Mode :character Median :19.2 Median :6.00 Median :196.3
# Mean :20.1 Mean :6.19 Mean :230.7
# 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326.0
# Max. :33.9 Max. :8.00 Max. :472.0
# hp drat wt qsec vs
# Min. : 52.0 Min. :2.76 Min. :1.51 Min. :14.5 Min. :0.000
# 1st Qu.: 96.5 1st Qu.:3.08 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000
# Median :123.0 Median :3.69 Median :3.33 Median :17.7 Median :0.000
# Mean :146.7 Mean :3.60 Mean :3.22 Mean :17.8 Mean :0.438
# 3rd Qu.:180.0 3rd Qu.:3.92 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000
# Max. :335.0 Max. :4.93 Max. :5.42 Max. :22.9 Max. :1.000
# am gear carb region
# Min. :0.000 Min. :3.00 Min. :1.00 Length:32
# 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.00 Class :character
# Median :0.000 Median :4.00 Median :2.00 Mode :character
# Mean :0.406 Mean :3.69 Mean :2.81
# 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:4.00
# Max. :1.000 Max. :5.00 Max. :8.00
head(dat)
# model mpg cyl disp hp drat wt qsec vs am gear carb region
# 1 Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4 Asia
# 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4 Asia
# 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 Asia
# 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1 USA
# 5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 USA
# 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 USA
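As the output above shows, summary() reports only length and class for character columns such as region. A minimal sketch, on a small made-up data frame (not the lab data), of how converting such a column to a factor makes summary() report counts per level instead:

```r
# Made-up example data; the real 'dat' has the same kind of character column
cars <- data.frame(model = c("A", "B", "C", "D"),
                   region = c("Asia", "USA", "USA", "Europe"),
                   stringsAsFactors = FALSE)

summary(cars$region)          # character: only Length, Class, Mode

cars$region <- factor(cars$region)
levels(cars$region)           # levels are sorted alphabetically
summary(cars$region)          # factor: counts per level
```

Whether to convert depends on the analysis; table(cars$region) gives the same counts without changing the column type.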
In this section, you will add two columns to the data.
# Add a column named relHP to the data, containing the hp of the car
# divided by its weight (column wt)
dat$relHP <- dat$hp/dat$wt
# Next, add a column to the data named sportscar which is TRUE when the
# relHP > 42 and FALSE otherwise
dat$sportscar <- FALSE
dat[dat$relHP > 42, ]$sportscar <- TRUE
# Look at the data using head
head(dat)
# model mpg cyl disp hp drat wt qsec vs am gear carb region
# 1 Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4 Asia
# 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4 Asia
# 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 Asia
# 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1 USA
# 5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 USA
# 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 USA
# relHP sportscar
# 1 42.0 FALSE
# 2 38.3 FALSE
# 3 40.1 FALSE
# 4 34.2 FALSE
# 5 50.9 TRUE
# 6 30.3 FALSE
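The two-step assignment above can also be written as a single vectorized comparison, since relHP > 42 already yields a logical vector. A sketch using the built-in mtcars data as a stand-in for dat (an assumption: the lab file is described only as similar to it); the column names sportscar2 and type are made up for illustration:

```r
d <- mtcars                      # built-in dataset, stand-in for 'dat'
d$relHP <- d$hp / d$wt

# Two-step version, as in the lab answer
d$sportscar <- FALSE
d[d$relHP > 42, ]$sportscar <- TRUE

# One-step version: the comparison itself is the logical column
d$sportscar2 <- d$relHP > 42
identical(d$sportscar, d$sportscar2)

# ifelse() maps the condition onto arbitrary labels ('type' is a made-up name)
d$type <- ifelse(d$relHP > 42, "sport", "regular")
table(d$type)
```

The one-step form also behaves more gracefully when relHP contains missing values, because the result is simply NA for those rows rather than an error in the subset assignment.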
In this section, we will look at the variables in more detail. Specifically, we will look at measures of spread and central tendency, and frequency tables for individual variables. Furthermore, we will investigate the relationship between pairs of variables.
# How many sportscars are there (according to our definition)? Hint: use
# table()
table(dat$sportscar)
#
# FALSE TRUE
# 17 15
# What is the mean weight of the cars?
mean(dat$wt)
# [1] 3.22
# What is the standard deviation of the weight of the cars?
sd(dat$wt)
# [1] 0.978
# How many cars have 6 cylinders?
table(dat$cyl)
#
# 4 6 8
# 11 7 14
# What is the correlation between weight and horsepower?
cor(dat$wt, dat$hp)
# [1] 0.659
# How are being a sportscar and the number of gears related?
table(dat$sportscar, dat$gear)
#
# 3 4 5
# FALSE 5 12 0
# TRUE 10 0 5
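Counts like these are often easier to compare as proportions, and a numeric variable can be summarized per group in the same spirit. A sketch with the built-in mtcars data as a stand-in for dat (an assumption; the lab file adds the model and region columns), using prop.table() and tapply():

```r
d <- mtcars
d$sportscar <- d$hp / d$wt > 42

tab <- table(d$sportscar, d$gear)
rowprop <- prop.table(tab, margin = 1)   # row proportions: each row sums to 1
round(rowprop, 2)

# Mean and standard deviation of weight per number of cylinders
wt_mean <- tapply(d$wt, d$cyl, mean)
wt_sd   <- tapply(d$wt, d$cyl, sd)
wt_mean
wt_sd
```

Using margin = 2 instead gives column proportions, that is, the share of sportscars within each number of gears.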
In this section, we will look at the variables in more detail through visualization.
# Create a boxplot with the weight for sportscars
boxplot(dat[dat$sportscar, ]$wt)
# Create a boxplot with the weight, separately for the number of cylinders
# Hint: boxplot can also be used with the formula interface: wt ~ cyl,
# data=dat
boxplot(wt ~ cyl, data = dat)
# Show the histogram for relHP
hist(dat$relHP)
# Show the histograms for wt and hp next to each other. Set the color of
# the bars to 'red' for wt and 'blue' for hp. Hint: use par() to place
# the graphs beside each other and use ?hist to see which parameter to use
# for the color
par(mfrow = c(1, 2))
hist(dat$wt, col = "red")
hist(dat$hp, col = "blue")
# Show the Q-Q plot of qsec (time for driving 1/4 mile)
par(mfrow = c(1, 1))
qqnorm(dat$qsec)
qqline(dat$qsec)
# Create a new data frame named 'tmp' excluding the outlier (the car with
# qsec > 22)
tmp <- dat[dat$qsec <= 22, ]
dim(tmp)
# [1] 31 15
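Two small variations on the steps above may be useful. Because par() changes persist for the rest of the session, the old settings can be stored and restored instead of resetting mfrow by hand; and the outlier can be located with which.max() rather than a hard-coded threshold. A sketch, again using the built-in mtcars as a stand-in for dat (an assumption):

```r
d <- mtcars

op <- par(mfrow = c(1, 2))       # par() returns the previous settings
hist(d$wt, col = "red", main = "wt")
hist(d$hp, col = "blue", main = "hp")
par(op)                          # restore the previous layout

# Identify the row with the largest qsec before excluding it
outlier <- rownames(d)[which.max(d$qsec)]
outlier
tmp <- d[rownames(d) != outlier, ]
dim(tmp)
```

Dropping the single largest value is only sensible when the Q-Q plot shows one clear outlier, as it does here.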
# Create a barplot contrasting automatic vs. manual transmission (column
# 'am'). Give the plot a header: 'Transmission' and provide names below the
# bars: 'A' and 'M'
counts <- table(dat$am)
barplot(counts, main = "Transmission", names.arg = c("A", "M"))
# Create a segmented barplot showing the relationship between being a
# sportscar and the type of transmission
counts <- table(dat$sportscar, dat$am)
barplot(counts, xlab = "Transmission", col = c("blue", "red"),
        legend.text = c("regular", "sport"), names.arg = c("A", "M"))
From within RStudio, you can download this file using the command
download.file('http://www.let.rug.nl/wieling/Statistics/Intro-R/lab/answers/answers.Rmd', 'answers.Rmd')
Then open it in the editor and use the Knit HTML button to generate the HTML
file. If you use plain R, you first have to install Pandoc. Then run the following lines in
the most recent version of R.
# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
install.packages("rmarkdown")
}
library(rmarkdown) # load rmarkdown package
# download the original file only if it does not exist yet (to prevent overwriting)
if (!file.exists("answers.Rmd")) {
download.file("http://www.let.rug.nl/wieling/Statistics/Intro-R/lab/answers/answers.Rmd",
"answers.Rmd")
}
# generate output
render("answers.Rmd") # generates html file with results
# view output in browser
browseURL(paste("file://", file.path(getwd(), "answers.html"), sep = "")) # shows result