# Statistiek I

## Introduction to R

Martijn Wieling
University of Groningen

## This lecture

• General information about the course
• Introduction to dataset used in the course
• Why use statistics?
• Introduction to RStudio and R
• R as calculator
• Variables
• Functions and help
• Importing data in R
• Viewing and modifying data
• Visualization in R
• Statistics in R

## General information about the course setup (2)

• 7 weekly live lectures / Q&A sessions:
• If no capacity restriction: in-person lectures
• If (25-100) capacity restriction: in-person Q&A sessions
• Otherwise: online Q&A sessions and online lab sessions
• Slides and lecture recordings (2021) are made available via Nestor
• 7 weekly online/in-person lab sessions
• You should have registered via Nestor for a lab session
• No switching when groups are full
• Attendance (if online: with enabled webcam/mic when talking to lab teacher) obligatory!
• Finishing lab exercises results in at most 1 bonus point
• Only when final test score $$T$$ is at least 5.0
• Calculation: $$T + 0.5 \times P$$, where $$P$$ is the lab score (max 2)
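
The bonus rule can be sketched as a small R function. This is only an illustration: the function name `final_grade` is hypothetical, and it assumes that P is the lab score on a scale from 0 to 2 and that no bonus is added when T is below 5.0.

```r
# Hypothetical helper, not part of the course material.
# test_score: final test score T; lab_score: lab score P (assumed 0 to 2)
final_grade <- function(test_score, lab_score) {
    bonus <- 0.5 * min(lab_score, 2)  # at most 0.5 * 2 = 1 bonus point
    if (test_score >= 5.0) {
        test_score + bonus
    } else {
        test_score  # the bonus only counts when T is at least 5.0
    }
}

final_grade(6.5, 2)  # 6.5 + 0.5 * 2
final_grade(4.8, 2)  # below 5.0: no bonus
```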

## Language policy in this course

• Lecture slides, book and lab sessions are in English. Why?
• Most statistics terminology you encounter will be in English
• Statistiek II will be completely in English (English teacher)
• You may choose the language of the lab reports (either Dutch or English)
• The final exam is in Dutch
• (but with English-Dutch translation of statistical terminology when necessary)

## Some remarks about the lab sessions

• Lab reports due one week after lab
• Late or sub-optimal lab report: (max.) 1 point (out of 2)
• Lab report over 1 week late or insufficient: 0 points
• Exam requirement: average lab score of 1 point, with at most two times 0 points
• Lab attendance required:
• You can miss at most 1 lab session, but timely lab report ($$\geq$$ 1 point) required
• If you miss 2 or more lab sessions: course failed
• Why required lab attendance?
• Actively participating in lab session increases chances of passing the exam!
• Lab score correlates significantly ($$r = .4, p < .001, N \approx 400$$) with exam grade
• Lab score of last year may be used, but only as a whole (same restrictions)

## Goals of this course

• Understand basics of descriptive and inferential statistics
• Emphasis on statistical reasoning
• Practical approach, but some mathematics to help understand the concepts
• Understand and apply basic statistical analyses
• Report on results of statistical analyses
• Understand reports using statistical analyses
• Conduct basic statistical analyses in R

## Statistics in Language Studies

• Many experiments are conducted in communication, information science and linguistics
• Effect of comics vs. normal text on understanding?
• Effect of algorithm on quality of automatically generated summary?
• Influence of gender on learning a second language?
• Availability of online data increases opportunities for statistical analysis (e.g., Digital Humanities)
• In this course we will mainly work with the data collected in the survey
• But other examples will be given as well

## The dataset used during this course

• Data on the basis of your answers to the initial survey
• Age, gender, handedness, study, etc.
• We are interested in investigating which factors are related to English proficiency
• Proficiency is measured via your English grade and via an automatically calculated approximate measure (English score) based on your input
• Data also includes a subset of survey results from earlier years

## Some of our data (1)

participant year gender bl_edu study english_grade english_score
103 2017 M N IS 8 7.10
104 2017 F N OTHER 9 7.76
105 2017 F N OTHER 7 5.68
106 2017 F N CIS 7 7.31
107 2017 F N LING 7 7.95
108 2017 M N OTHER 7 7.51
109 2017 F N IS 7 6.97
110 2017 F N CIS 6 6.22
111 2017 M N OTHER 8 8.71
112 2017 F N LING 7 6.78
113 2017 F N CIS 6 5.94
114 2017 M N CIS 6 6.18
115 2017 F N LING 8 8.00

## Some of our data (2)

participant year gender bl_edu study english_grade english_score
198 2018 F N LING 6 5.19
199 2018 M N LING 7 6.82
200 2018 M N LING 8 8.21
201 2018 F N CIS 7 7.34
202 2018 F N LING 7 6.59
203 2018 F N LING 8 7.55
204 2018 F N LING 7 7.19
205 2018 F Y LING 8 7.63
206 2018 F N LING 6 6.58
207 2018 M N IS 8 8.89
208 2018 M N CIS 7 6.76
209 2018 F N OTHER 8 8.18
210 2018 M N CIS 6 6.33

## Some of our data (3)

participant year gender bl_edu study english_grade english_score
1 2019 M N LING 5 6.10
2 2019 F N CIS 6 6.67
3 2019 F N CIS 7 7.42
4 2019 F N LING 8 9.10
5 2019 F N CIS 7 7.47
6 2019 M N LING 8 8.14
7 2019 F N LING 8 7.65
8 2019 F N CIS 6 7.35
9 2019 F N LING 8 8.54
10 2019 M N IS 8 8.39
11 2019 F N LING 7 7.98
12 2019 M N OTHER 7 6.15
13 2019 F N LING 7 5.60

## Some of our data (4)

participant year gender bl_edu study english_grade english_score
413 2020 M N OTHER 6 6.86
414 2020 M N LING 8 8.07
415 2020 M N IS 8 7.72
416 2020 F N LING 7 7.53
417 2020 F Y CIS 8 9.23
418 2020 M N IS 7 7.64
419 2020 M N IS 7 7.82
420 2020 F N LING 8 8.65
421 2020 M N LING 9 9.09
422 2020 M N IS 6 7.61
423 2020 F N LING 8 8.26
424 2020 F N LING 7 6.67
425 2020 F N LING 8 8.38

## Some of our data (5)

participant year gender bl_edu study english_grade english_score
333 2021 F N LING 6 5.94
334 2021 F N LING 7 7.25
335 2021 F N LING 5 5.65
336 2021 M N IS 8 7.16
337 2021 F N LING 6 7.21
338 2021 F Y CIS 8 8.16
339 2021 F N LING 7 8.12
340 2021 F N CIS 6 5.66
341 2021 M Y IS 7 7.60
342 2021 F Y CIS 8 8.31
343 2021 F N LING 7 7.09
344 2021 M N LING 7 6.88
345 2021 F N LING 7 6.65

## We use statistics because...

• We would like to make sense of (in this course: your) data
• For this we need to:
• Summarize the data (descriptive statistics)
• Assess relationships in our data (inferential statistics)
• (During other courses, some of you will have already encountered inferential statistical tests such as the $$t$$-test or chi-square test which can be used for this)
• A requirement is that the data are variable (there must be variation)
• Note that statistics is not mathematics (it's data analysis)!

## What you will learn during this course (1)

• Descriptive statistics (describe data without conclusions)
• Measures of central tendency and spread
• E.g.: mean English grade (7.3), and range of English grades (5 - 9.5)
• Visualization
• E.g.: showing number of participants per study with a bar plot

## What you will learn during this course (2)

• Inferential statistics (link findings based on sample to population)
• Comparing 2 groups (or 1 group with a value)
• E.g.: is the pronunciation of women better than that of men?
• Associations between 2 variables
• E.g.: are gender and handedness related?
• Internal consistency of questions in a survey
• E.g.: does a group of questions measure a single construct?
• How to do statistics in R (this lecture)
• And how to make reproducible lab reports in R

## Why do we use R?

• Very nice reproducible lab reports (no copy-paste necessary!)
• Other advantages of R compared to (e.g.,) SPSS
• Free for everybody
• Customizable: people can create their own statistical functions
• State-of-the-art statistical methods are integrated very quickly
• Disadvantages of R
• No substantial graphical user interface: typing instead of clicking
• Takes more time to learn
• State-of-the-art statistical methods sometimes contain bugs

## Basic functionality: R as calculator

# Addition (this is a comment: preceded by '#')
5 + 5

# [1] 10

# Multiplication
5 * 3

# [1] 15

# Division
5/3

# [1] 1.6667
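
A few more operators follow the same pattern. The sketch below only uses base R, and the numbers are arbitrary.

```r
# Exponentiation
2^3

# Parentheses change the order of evaluation
5 + 5 * 3    # multiplication first
(5 + 5) * 3  # addition first

# Square root via a built-in function
sqrt(16)

# Remainder and integer division
5 %% 3
5 %/% 3

# Rounding the result of a division to 4 decimals
round(5/3, 4)
```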


## Basic functionality: using variables

a <- 5  # store a single value; instead of '<-' you can also use '='
a  # display the value

# [1] 5

b <- a * a  # b contains the value of multiplying a with itself
b

# [1] 25

(d <- NA)  # set value of d to missing (NA) and show value

# [1] NA
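
Missing values behave differently from ordinary numbers. The following sketch reuses the variables `a` and `d` from this slide; the variable `x` is introduced only for illustration.

```r
a <- 5
d <- NA

a + d     # a calculation involving a missing value is itself missing
is.na(d)  # test whether a value is missing
is.na(a)

x = 10    # '=' also assigns a value, just like '<-'
x
```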


## Storing multiple values in a variable

b <- c(2, 4, 6, 7, 8)  # store a series of values in a vector (reusing variable b)
b

# [1] 2 4 6 7 8

b[4] <- a  # assign value 5 (stored in 'a') to the 4th element of vector b
b

# [1] 2 4 6 5 8

b <- c(b, NA)  # add element NA to b
b

# [1]  2  4  6  5  8 NA
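
Vectors can also be indexed with several positions at once, and arithmetic is applied to each element. A short sketch using the vector from this slide:

```r
b <- c(2, 4, 6, 5, 8, NA)  # the vector as it stands at the end of this slide

length(b)   # number of elements, including the NA
b[c(1, 3)]  # select the 1st and 3rd element
b[2:4]      # select elements 2 to 4
b[-6]       # a negative index drops that element (here: the NA)
b * 2       # arithmetic is applied to every element
```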


## Basic functionality: using functions

b  # show values in variable b (b contains a vector: a series of values)

# [1]  2  4  6  5  8 NA

mn <- mean(b)  # calculating the mean and storing in variable mn
mn

# [1] NA

# mn is NA (missing) as one of the values is missing
mean(b, na.rm = TRUE)  # we can use the function parameter na.rm to ignore NAs

# [1] 5

# But which parameters does a function have: use help!
help(mean)  # alternatively: ?mean
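
To see what na.rm = TRUE does, the mean can be reproduced step by step. This is a sketch for illustration; the names `n_valid` and `total` are not part of the lecture code.

```r
b <- c(2, 4, 6, 5, 8, NA)

n_valid <- sum(!is.na(b))      # number of non-missing values
total <- sum(b, na.rm = TRUE)  # sum while ignoring the NA
total / n_valid                # same result as mean(b, na.rm = TRUE)

# Arguments can be given by name or by position
round(x = 3.14159, digits = 2)
round(3.14159, 2)
```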


## Getting data into R: importing a data set

setwd("C:/Users/Martijn/Desktop/Statistiek-I/HC1")  # set working directory
dat <- read.csv("survey.csv", sep = ";", dec = ".")  # reads csv file from work dir
str(dat)  # shows structure of the data frame (i.e. a 2-dimensional table)

# 'data.frame': 500 obs. of  7 variables:
#  $ participant  : int  1 2 3 4 5 6 7 8 9 10 ...
#  $ year         : num  2017 2017 2017 2017 2017 ...
#  $ gender       : chr  "F" "M" "M" "F" ...
#  $ bl_edu       : chr  "N" "N" "N" "N" ...
#  $ study        : chr  "LING" "LING" "LING" "CIS" ...
#  $ english_grade: num  6 7 8 7 7 8 7 8 6 8 ...
#  $ english_score: num  4.31 6.42 8.17 6.99 6.12 ...

dim(dat)  # number of rows and columns of data set

# [1] 500   7


## Investigating imported data set: using head

head(dat)  # show first few rows of dat

#   participant year gender bl_edu study english_grade english_score
# 1           1 2017      F      N  LING             6        4.3055
# 2           2 2017      M      N  LING             7        6.4152
# 3           3 2017      M      N  LING             8        8.1696
# 4           4 2017      F      N   CIS             7        6.9925
# 5           5 2017      F      N  LING             7        6.1165
# 6           6 2017      F      N  LING             8        7.3538


## Investigating imported data set: RStudio viewer

## Accessing table data using numbers

• Access parts of table by specifying row and/or column numbers
• dat[a,b]:
• a indicates the selected rows of dat
• b indicates the selected columns of dat

dat[1, ]  # values in first row (dat[,1]: values in first column)

#   participant year gender bl_edu study english_grade english_score
# 1           1 2017      F      N  LING             6        4.3055

dat[c(1, 5), c(1, 2, 3)]  # values in rows 1 and 5 and columns 1, 2 and 3

#   participant year gender
# 1           1 2017      F
# 5           5 2017      F


## Accessing table data using names (1)

• Additionally, we can access parts of the table by specifying the names of the columns we want to look at

dat[c(1, 3, 5), c("participant", "study")]  # rows 1, 3 and 5, and two named columns

#   participant study
# 1           1  LING
# 3           3  LING
# 5           5  LING


## Accessing table data using names (2)

• We may also select a single column by its name (not number) using the $ operator
• E.g., dat$gender accesses the column gender of dat

head(dat$gender, 200)  # show gender of first 200 students

#   [1] "F" "M" "M" "F" "F" "F" "F" "F" "F" "M" "M" "F" "M" "F" "F" "F" "F" "F" "F" "F"
#  [21] "F" "F" "F" "F" "M" "F" "M" "M" "F" "F" "F" "F" "F" "F" "F" "F" "M" "M" "F" "F"
#  [41] "F" "M" "F" "M" "F" "F" "M" "M" "M" "F" "F" "F" "M" "M" "F" "F" "F" "F" "F" "F"
#  [61] "F" "F" "M" "F" "M" "F" "F" "F" "M" "F" "F" "M" "M" "M" "M" "F" "F" "F" "F" "F"
#  [81] "F" "F" "F" "F" "F" "F" "M" "F" "F" "M" "M" "F" "M" "F" "M" "M" "F" "F" "F" "M"
# [101] "F" "F" "F" "F" "M" "M" "F" "F" "F" "F" "M" "F" "F" "F" "M" "F" "F" "M" "M" "F"
# [121] "M" "M" "F" "F" "F" "F" "M" "F" "F" "F" "M" "F" "F" "M" "M" "F" "M" "F" "F" "M"
# [141] "M" "F" "F" "M" "F" "F" "F" "F" "F" "F" "F" "M" "F" "F" "F" "F" "M" "F" "F" "F"
# [161] "F" "F" "F" "M" "M" "F" "F" "F" "M" "F" "F" "M" "F" "F" "F" "F" "F" "F" "F" "F"
# [181] "F" "F" "F" "M" "F" "F" "M" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F"


## Storing accessed table data

tmp <- dat[5:8, c(1, 3)]  # store columns 1 and 3 for rows 5 to 8 in variable tmp
tmp  # show what is stored in variable tmp

#   participant gender
# 5           5      F
# 6           6      F
# 7           7      F
# 8           8      F
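
A stored selection is a copy: changing it leaves the original table untouched. The sketch below uses a small stand-in table named `toy` (a hypothetical name) holding the participant number and gender of the first eight students shown earlier.

```r
# Stand-in for dat: first eight participants and their gender
toy <- data.frame(participant = 1:8,
    gender = c("F", "M", "M", "F", "F", "F", "F", "F"),
    stringsAsFactors = FALSE)

5:8                       # shorthand for c(5, 6, 7, 8)
tmp <- toy[5:8, c(1, 2)]  # copy rows 5 to 8
tmp$gender <- "unknown"   # change the copy only
toy$gender[5:8]           # the original is unchanged
rownames(tmp)             # row labels still refer to the original rows
```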


## Accessing table data using conditional indexing (1)

• Conditional indexing allows us to select parts of the data on the basis of conditions

tmp <- dat[dat$gender == "M", ]  # only observations for male participants
head(tmp)

#    participant year gender bl_edu study english_grade english_score
# 2            2 2017      M      N  LING             7        6.4152
# 3            3 2017      M      N  LING             8        8.1696
# 10          10 2017      M      N    IS             8        8.9375
# 11          11 2017      M      N   CIS             7        6.2686
# 13          13 2017      M      N   CIS             6        5.7744
# 25          25 2017      M      N OTHER             9        8.3227


## Accessing table data using conditional indexing (2)

• Methods to combine conditions:
• and: &
• or: |

# only participants who study IS *and* are male
tmp <- dat[dat$gender == "M" & dat$study == "IS", ]
head(tmp)

#    participant year gender bl_edu study english_grade english_score
# 10          10 2017      M      N    IS           8.0        8.9375
# 27          27 2017      M      N    IS           8.0        9.0563
# 28          28 2017      M      N    IS           7.0        7.9433
# 37          37 2017      M      N    IS           8.1        8.6273
# 44          44 2017      M      N    IS           6.0        6.1372
# 47          47 2017      M      N    IS           9.0        9.0493


## Accessing table data using conditional indexing (3)

• Invert a condition with ! (not)
• is not equal to: !=

# only women (i.e. not men) *or* everybody with an English grade over 7
tmp <- dat[dat$gender != "M" | dat$english_grade > 7, ]
tail(tmp)  # tail shows final 6 rows

#     participant year gender bl_edu study english_grade english_score
# 495         405 2021      F      N   CIS           8.0        8.2097
# 496         406 2021      F      N  LING           7.0        7.7812
# 497         407 2021      F      N OTHER           7.0        8.1330
# 498         408 2021      F      N OTHER           8.0        9.3621
# 499         409 2021      M      N    IS           7.5        7.9784
# 500         410 2021      F      N  LING           7.0        7.6600


## Question 6

## Supplementing the data: adding columns (1)

• Frequently, you need to add columns to the data, for example by computing a new value on the basis of the values in two columns
• the operator $ helps us to do that

# new column 'diff': English grade - English proficiency score
dat$diff <- dat$english_grade - dat$english_score
head(dat)

#   participant year gender bl_edu study english_grade english_score       diff
# 1           1 2017      F      N  LING             6        4.3055  1.6945488
# 2           2 2017      M      N  LING             7        6.4152  0.5847601
# 3           3 2017      M      N  LING             8        8.1696 -0.1695835
# 4           4 2017      F      N   CIS             7        6.9925  0.0075054
# 5           5 2017      F      N  LING             7        6.1165  0.8835361
# 6           6 2017      F      N  LING             8        7.3538  0.6461548


## Supplementing the data: adding columns (2)

• Conditional indexing allows us additional flexibility

dat$pass_fail <- "PASS"  # new column, initially PASS for everybody
dat[dat$english_grade < 5.5, ]$pass_fail <- "FAIL"  # if grade too low, then FAIL
tail(dat[dat$english_grade > 4 & dat$english_grade < 6, 2:9])  # show subset of data

#     year gender bl_edu study english_grade english_score     diff pass_fail
# 363 2020      F      N  LING           5.0        6.1166 -1.11660      FAIL
# 377 2020      F      Y   CIS           5.0        4.3000  0.70000      FAIL
# 399 2020      F      N  LING           5.8        6.0576 -0.25764      PASS
# 403 2020      F      N  LING           5.8        5.1720  0.62797      PASS
# 425 2021      F      N  LING           5.0        5.6488 -0.64878      FAIL
# 482 2021      F      N   CIS           5.6        5.9877 -0.38772      PASS
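
Conditional indexing works because a condition yields one TRUE or FALSE per row. The sketch below makes this visible on a small table with made-up values (`toy` is a hypothetical name), and shows ifelse() as an alternative way to fill the pass/fail column in one step; the lecture itself uses the two-step approach above.

```r
# Toy data frame; the values are made up for this illustration
toy <- data.frame(gender = c("F", "M", "M", "F", "M"),
    study = c("LING", "IS", "CIS", "IS", "IS"),
    english_grade = c(5, 8, 7, 6, 5.4),
    stringsAsFactors = FALSE)

# A condition yields one TRUE/FALSE per row; rows with TRUE are kept
toy$gender == "M"
toy[toy$gender == "M" & toy$study == "IS", ]      # both conditions hold
toy[toy$gender != "M" | toy$english_grade > 7, ]  # at least one condition holds
toy[!(toy$study == "IS"), ]                       # '!' inverts the condition

# ifelse() picks "FAIL" or "PASS" for each row separately
toy$pass_fail <- ifelse(toy$english_grade < 5.5, "FAIL", "PASS")
toy
```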


## Visualization in R

• Many basic visualization options are available in R
• In this course, we will learn how to use the following functions (list not exhaustive):
• barplot() (illustrated in the following)
• plot()
• boxplot()
• hist()
• qqnorm() and qqline()
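
The other plotting functions in this list are called in much the same way as barplot(). The sketch below is only a preview: it uses the English grades and scores of the 13 participants listed under "Some of our data (1)", and the variable names `grade` and `score` are chosen for this example.

```r
# Values copied from the 13 rows shown in "Some of our data (1)"
grade <- c(8, 9, 7, 7, 7, 7, 7, 6, 8, 7, 6, 6, 8)
score <- c(7.10, 7.76, 5.68, 7.31, 7.95, 7.51, 6.97, 6.22, 8.71, 6.78, 5.94, 6.18, 8.00)

hist(score, main = "English score", xlab = "Score")  # distribution of one variable
boxplot(score, ylab = "Score")                       # median, quartiles and extremes
plot(grade, score, xlab = "English grade", ylab = "English score")  # scatter plot
qqnorm(score)  # compare the distribution to a normal distribution
qqline(score)  # add a reference line to the normal quantile plot
```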

## Example of using a plotting function: barplot()

• The bar plot is used to visualize frequencies of categorical variables
(counts <- table(dat$gender))  # first create frequency table

#
#   F   M
# 350 150

barplot(counts)


## Graphical parameters

• There are various graphical parameters to allow you to customize your graphics

barplot(counts, col = c("pink", "lightblue"), ylim = c(0, 350), main = "My barplot",
    xlab = "Gender", ylab = "Frequency")


## Question 8

## Another example: segmented bar plot

(counts <- table(dat$gender, dat$study))

#
#     CIS  IS LING OTHER
#   F 101  26  180    43
#   M  23  96   16    15

barplot(counts, col = c("pink", "lightblue"), legend = c("F", "M"), ylim = c(0, 185))


## Statistics in R

• The main purpose of R is to conduct statistical analyses
• Many different functions to obtain descriptive and inferential statistics are available in R
• The following examples illustrate how to conduct some statistical analyses in R
• The why and what is covered in the next lectures

## Descriptive statistics: central tendency

mean(dat$english_score)  # mean of all people for English score

# [1] 7.466

mean(dat[dat$gender == "F", ]$english_score)  # mean of women for English score

# [1] 7.3613

median(dat$english_score)  # median of all people for English score

# [1] 7.5227


## Descriptive statistics: spread

min(dat$english_score)  # minimum value

# [1] 4.3

max(dat$english_score)  # maximum value

# [1] 10

var(dat$english_score)  # variance: average squared deviation from mean

# [1] 1.1671
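
The variance can be reproduced by hand to see what var() computes. Note that var() divides by n - 1 rather than n, so it is only approximately the average squared deviation. The sketch uses the English scores of the 13 participants listed under "Some of our data (1)"; the variable names are chosen for this example.

```r
# English scores copied from the 13 rows shown in "Some of our data (1)"
score <- c(7.10, 7.76, 5.68, 7.31, 7.95, 7.51, 6.97, 6.22, 8.71, 6.78, 5.94, 6.18, 8.00)

n <- length(score)
deviations <- score - mean(score)        # distance of each value to the mean
variance <- sum(deviations^2) / (n - 1)  # divide by n - 1, as var() does
variance
var(score)      # same value
sqrt(variance)  # square root of the variance: the standard deviation
```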

sd(dat$english_score)  # standard deviation (square root of variance)

# [1] 1.0803


## Descriptive statistics: frequency tables

table(dat$gender)

#
#   F   M
# 350 150

table(dat$study)

#
#   CIS    IS  LING OTHER
#   124   122   196    58


## Question 9

## Descriptive statistics: cross-tables

table(dat$gender, dat$study)

#
#     CIS  IS LING OTHER
#   F 101  26  180    43
#   M  23  96   16    15

table(dat$gender, dat$bl_edu)

#
#       N   Y
#   F 316  34
#   M 136  14


## Inferential statistics

• A large number of statistical inference functions are available in R
• In this course we will cover the following functions:
• t.test() for a $$t$$-test (single sample, paired, independent)
• wilcox.test() for non-parametric alternatives to the $$t$$-test (Mann Whitney U test, Wilcoxon signed-rank test)
• binom.test() for the sign test
• chisq.test() for the chi-square test
• cor() for the correlation
• alpha() (from package psych) for Cronbach's $$\alpha$$

## Example of inferential statistics: $$t$$-test

• Assessing average group differences for a numerical variable (lecture 4)

t.test(english_grade ~ bl_edu, data = dat)

#
#  Welch Two Sample t-test
#
# data:  english_grade by bl_edu
# t = -3.21, df = 60, p-value = 0.0021
# alternative hypothesis: true difference in means between group N and group Y is not equal to 0
# 95 percent confidence interval:
#  -0.62390 -0.14531
# sample estimates:
# mean in group N mean in group Y
#          7.2425          7.6271


## Example of inferential statistics: correlation

• Assessing strength of the relationship between two numerical variables (lecture 6)

cor(dat$english_score, dat$english_grade)

# [1] 0.72543
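
What cor() computes can be checked by hand on a small subset. The sketch below assumes the default (Pearson) correlation and uses the 13 participants listed under "Some of our data (1)", so its result will differ from the value for the full data set; the name `r_by_hand` is made up for this example.

```r
# Values copied from the 13 rows shown in "Some of our data (1)"
grade <- c(8, 9, 7, 7, 7, 7, 7, 6, 8, 7, 6, 6, 8)
score <- c(7.10, 7.76, 5.68, 7.31, 7.95, 7.51, 6.97, 6.22, 8.71, 6.78, 5.94, 6.18, 8.00)

n <- length(grade)

# Sum of the products of deviations, scaled by (n - 1) and both standard deviations
r_by_hand <- sum((grade - mean(grade)) * (score - mean(score))) /
    ((n - 1) * sd(grade) * sd(score))
r_by_hand
cor(grade, score)  # same value
```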


## Recap

• In this lecture, we've covered the basics of R
• R as calculator
• Variables
• Functions and help
• Importing data in R
• Viewing and modifying data
• Some examples of visualizations
• Some examples of descriptive and inferential statistics
• See Levshina (Ch. 2) for more information about the functionality of R
• In the lab session, you will experiment with using R
• Next lecture: Descriptive statistics
• Evaluation: if at least 50 people evaluate the lecture, an exam-like question becomes visible