1 Introduction

This file contains the answers to the lab session: https://www.let.rug.nl/wieling/Statistics/Intro-R/lab.

2 Importing the data

We will first download a CSV file generated in Excel and then import it into R.

download.file("http://www.let.rug.nl/wieling/Statistics/Intro-R/lab/mtcars.csv",
    "mtcars.csv")

# now import the data yourself into an R data frame with the name dat
# using the function: read.csv2()
dat <- read.csv2("mtcars.csv")
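The exercise asks for read.csv2() rather than read.csv(). A minimal sketch of the difference, using made-up inline text instead of mtcars.csv (the column names a and b and their values are purely illustrative): read.csv2() expects semicolons between fields and a comma as the decimal mark, whereas read.csv() expects commas between fields and a point as the decimal mark.

```r
# Hypothetical two-column example; 'a' and 'b' are illustrative names,
# not columns of mtcars.csv
txt <- "a;b\n1,5;x\n2,25;y"

# read.csv2() defaults to sep = ";" and dec = ",", so 'a' becomes numeric
ok <- read.csv2(text = txt)
str(ok)

# read.csv() defaults to sep = "," and dec = ".", so the same text is not
# split at the semicolons and the intended two columns are lost
wrong <- read.csv(text = txt)
str(wrong)
```

If str() shows columns that were meant to be numeric as character, the separator or decimal mark is a likely first thing to check.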

3 Exploring the structure of the data

Note that this dataset is similar to the mtcars dataset that ships with R, so the description of the columns can be obtained with ?mtcars. The car names, which are row names in mtcars, are stored here in the column ‘model’. There is one additional column, ‘region’, which contains the region the car maker originated from. In the following, you will look at the structure of the data using various functions.

# Look at the structure of the data using the functions: str, summary and
# head
str(dat)
# 'data.frame': 32 obs. of  13 variables:
#  $ model : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#  $ mpg   : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#  $ cyl   : int  6 6 4 6 8 6 8 4 4 6 ...
#  $ disp  : num  160 160 108 258 360 ...
#  $ hp    : int  110 110 93 110 175 105 245 62 95 123 ...
#  $ drat  : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt    : num  2.62 2.88 2.32 3.21 3.44 ...
#  $ qsec  : num  16.5 17 18.6 19.4 17 ...
#  $ vs    : int  0 0 1 1 0 1 0 1 1 1 ...
#  $ am    : int  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear  : int  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb  : int  4 4 1 1 2 1 4 2 2 4 ...
#  $ region: chr  "Asia" "Asia" "Asia" "USA" ...
summary(dat)
#     model                mpg            cyl            disp      
#  Length:32          Min.   :10.4   Min.   :4.00   Min.   : 71.1  
#  Class :character   1st Qu.:15.4   1st Qu.:4.00   1st Qu.:120.8  
#  Mode  :character   Median :19.2   Median :6.00   Median :196.3  
#                     Mean   :20.1   Mean   :6.19   Mean   :230.7  
#                     3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326.0  
#                     Max.   :33.9   Max.   :8.00   Max.   :472.0  
#        hp             drat            wt            qsec            vs       
#  Min.   : 52.0   Min.   :2.76   Min.   :1.51   Min.   :14.5   Min.   :0.000  
#  1st Qu.: 96.5   1st Qu.:3.08   1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000  
#  Median :123.0   Median :3.69   Median :3.33   Median :17.7   Median :0.000  
#  Mean   :146.7   Mean   :3.60   Mean   :3.22   Mean   :17.8   Mean   :0.438  
#  3rd Qu.:180.0   3rd Qu.:3.92   3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000  
#  Max.   :335.0   Max.   :4.93   Max.   :5.42   Max.   :22.9   Max.   :1.000  
#        am             gear           carb         region         
#  Min.   :0.000   Min.   :3.00   Min.   :1.00   Length:32         
#  1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.00   Class :character  
#  Median :0.000   Median :4.00   Median :2.00   Mode  :character  
#  Mean   :0.406   Mean   :3.69   Mean   :2.81                     
#  3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:4.00                     
#  Max.   :1.000   Max.   :5.00   Max.   :8.00
head(dat)
#               model  mpg cyl disp  hp drat   wt qsec vs am gear carb region
# 1         Mazda RX4 21.0   6  160 110 3.90 2.62 16.5  0  1    4    4   Asia
# 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 17.0  0  1    4    4   Asia
# 3        Datsun 710 22.8   4  108  93 3.85 2.32 18.6  1  1    4    1   Asia
# 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.4  1  0    3    1    USA
# 5 Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2    USA
# 6           Valiant 18.1   6  225 105 2.76 3.46 20.2  1  0    3    1    USA
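Besides str, summary and head, a few other base R calls can help when checking a freshly imported data frame. The sketch below uses the built-in mtcars data as a stand-in for dat (an assumption, since the downloaded file has two extra columns), and the name cars is only illustrative.

```r
# Stand-in for dat: the built-in mtcars data (assumption: same numeric
# columns as the downloaded file, without 'model' and 'region')
cars <- mtcars

# Class of every column, as a named character vector
col_classes <- sapply(cars, class)
print(col_classes)

# Number of missing values per column
na_counts <- colSums(is.na(cars))
print(na_counts)

# Dimensions: number of rows (cars) and columns (variables)
dim(cars)
```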

4 Modifying the data

In this section, you will add two columns to the data.

# Add a column relHP to the data, containing the hp of the car divided by
# its weight (column wt)
dat$relHP <- dat$hp/dat$wt

# Next, add a column to the data named sportscar which is TRUE when the
# relHP > 42 and FALSE otherwise
dat$sportscar <- FALSE
dat[dat$relHP > 42, ]$sportscar <- TRUE

# Look at the data using head
head(dat)
#               model  mpg cyl disp  hp drat   wt qsec vs am gear carb region
# 1         Mazda RX4 21.0   6  160 110 3.90 2.62 16.5  0  1    4    4   Asia
# 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 17.0  0  1    4    4   Asia
# 3        Datsun 710 22.8   4  108  93 3.85 2.32 18.6  1  1    4    1   Asia
# 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.4  1  0    3    1    USA
# 5 Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2    USA
# 6           Valiant 18.1   6  225 105 2.76 3.46 20.2  1  0    3    1    USA
#   relHP sportscar
# 1  42.0     FALSE
# 2  38.3     FALSE
# 3  40.1     FALSE
# 4  34.2     FALSE
# 5  50.9      TRUE
# 6  30.3     FALSE
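The sportscar column can also be created in a single step, because a comparison in R already returns a logical vector. The following sketch uses the built-in mtcars data as a stand-in for dat (an assumption) and the illustrative names cars and sportscar2 to check that both approaches agree.

```r
# Stand-in for dat (assumption: hp and wt match the downloaded file)
cars <- mtcars
cars$relHP <- cars$hp / cars$wt

# Two-step approach, as in the answer above
cars$sportscar <- FALSE
cars[cars$relHP > 42, ]$sportscar <- TRUE

# One-step alternative: the comparison itself yields TRUE/FALSE per row
cars$sportscar2 <- cars$relHP > 42

# Check that both columns agree for every car
all(cars$sportscar == cars$sportscar2)
```

The one-step form has the added benefit that it does not depend on at least one row exceeding the threshold.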

5 Investigating the data

In this section, we will look at the variables in more detail. Specifically, we will look at measures of spread and central tendency, and frequency tables for individual variables. Furthermore, we will investigate the relationship between pairs of variables.

# How many sportscars are there (according to our definition)? Hint: use
# table()
table(dat$sportscar)
# 
# FALSE  TRUE 
#    17    15
# What is the mean weight of the cars?
mean(dat$wt)
# [1] 3.22
# What is the standard deviation of the weight of the cars?
sd(dat$wt)
# [1] 0.978
# How many cars have 6 cylinders?
table(dat$cyl)
# 
#  4  6  8 
# 11  7 14
# What is the correlation between weight and horsepower?
cor(dat$wt, dat$hp)
# [1] 0.659
# How are being a sportscar and the number of gears related?
table(dat$sportscar, dat$gear)
#        
#          3  4  5
#   FALSE  5 12  0
#   TRUE  10  0  5
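Raw counts in a cross-tabulation can be hard to compare when the groups differ in size. As a possible follow-up, the sketch below converts the counts to row proportions with prop.table() and compares the mean weight per group with aggregate(). It uses the built-in mtcars data as a stand-in for dat (an assumption), so the exact numbers may differ from those above.

```r
# Stand-in for dat, with the sportscar definition used in this lab
cars <- mtcars
cars$sportscar <- cars$hp / cars$wt > 42

# Cross-tabulation of sportscar status by number of gears
tab <- table(sportscar = cars$sportscar, gear = cars$gear)

# Row proportions: within each sportscar group, the share of each gear count
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)

# Mean weight for sportscars and for regular cars
aggregate(wt ~ sportscar, data = cars, FUN = mean)
```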

6 Visualizing the data

In this section, we will look at the variables in more detail through visualization.

# Create a boxplot with the weight for sportscars
boxplot(dat[dat$sportscar, ]$wt)

# Create a boxplot with the weight, separately for the number of cylinders
# Hint: boxplot can also be used with the formula interface: wt ~ cyl,
# data=dat
boxplot(wt ~ cyl, data = dat)

# Show the histogram for relHP
hist(dat$relHP)

# Show the histogram for wt and hp next to each other.  Set the color of
# the bars to 'red' for wt and 'blue' for hp.  Hint: use par() to place
# the graphs beside each other and use ?hist to see what parameter to use
# for the color
par(mfrow = c(1, 2))
hist(dat$wt, col = "red")
hist(dat$hp, col = "blue")

# Show the Q-Q plot of qsec (time for driving 1/4 mile)
par(mfrow = c(1, 1))
qqnorm(dat$qsec)
qqline(dat$qsec)

# Create a new data frame named 'tmp' excluding the outlier (the car with
# qsec > 22)
tmp <- dat[dat$qsec <= 22, ]
dim(tmp)
# [1] 31 15
# Create a barplot contrasting automatic vs. manual transmission (column
# 'am'). Give the plot a header: 'Transmission' and provide names below the
# bars: 'A' and 'M'
counts <- table(dat$am)
barplot(counts, main = "Transmission", names.arg = c("A", "M"))

# Create a segmented barplot showing the relationship between being a
# sportscar and the type of transmission
counts <- table(dat$sportscar, dat$am)
barplot(counts, xlab = "Transmission", col = c("blue", "red"), legend.text = c("regular",
    "sport"), names.arg = c("A", "M"))
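The relationship between weight and horsepower, summarized by the correlation in section 5, can also be inspected visually. A minimal sketch of a scatter plot with a fitted regression line, using the built-in mtcars data as a stand-in for dat (an assumption); the axis labels follow the descriptions in ?mtcars.

```r
# Stand-in for dat
cars <- mtcars

# Scatter plot of horsepower against weight
plot(hp ~ wt, data = cars, xlab = "Weight (1000 lbs)",
    ylab = "Gross horsepower", main = "Horsepower by weight")

# Add the least-squares regression line of hp on wt
fit <- lm(hp ~ wt, data = cars)
abline(fit, col = "red")
```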

7 Replication

From within RStudio, you can simply download this file using the command download.file('http://www.let.rug.nl/wieling/Statistics/Intro-R/lab/answers/answers.Rmd', 'answers.Rmd'), open it in the editor and use the Knit HTML button to generate the HTML file. If you use plain R, you first have to install Pandoc. Then run the following lines in the most recent version of R.

# install rmarkdown package if not installed
if (!"rmarkdown" %in% rownames(installed.packages())) {
    install.packages("rmarkdown")
}
library(rmarkdown)  # load rmarkdown package

# download the original file if it does not already exist (to prevent overwriting)
if (!file.exists("answers.Rmd")) {
    download.file("http://www.let.rug.nl/wieling/Statistics/Intro-R/lab/answers/answers.Rmd",
        "answers.Rmd")
}

# generate output
render("answers.Rmd")  # generates html file with results

# view output in browser
browseURL(paste("file://", file.path(getwd(), "answers.html"), sep = ""))  # shows result
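Since plain R needs Pandoc for rendering, it may help to check whether it can be found before calling render(). The sketch below is one possible check and assumes the rmarkdown package's pandoc_available() function; the variable name has_pandoc is illustrative.

```r
# Default to FALSE so the check also works when rmarkdown is not installed
has_pandoc <- FALSE

# Only query Pandoc when the rmarkdown package can be loaded
if (requireNamespace("rmarkdown", quietly = TRUE)) {
    has_pandoc <- rmarkdown::pandoc_available()
}

if (has_pandoc) {
    message("Pandoc found; render() should be able to create the HTML file.")
} else {
    message("Pandoc or the rmarkdown package is missing; install it first.")
}
```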