Martijn Wieling

University of Groningen

- General information about the course
- Introduction to dataset used in the course
- Why use statistics?
- Introduction to RStudio and `R`
- `R` as calculator
- Variables
- Functions and help
- Importing data in `R`
- Viewing and modifying data
- Visualization in `R`
- Statistics in `R`

- Important information (including *studiehandleiding* and *FAQ*) on Nestor!
- Teacher:
- Martijn Wieling, m.b.wieling@rug.nl

- Teaching assistants (lab sessions):
- Sanne Oud, s.a.oud.1@student.rug.nl
- Anne Krakers, a.m.c.krakers.1@student.rug.nl

- 7 weekly live lectures / Q&A sessions:
- If no capacity restriction: in-person lectures
- If (25-100) capacity restriction: in-person Q&A sessions
- Otherwise: online Q&A sessions and online lab sessions
- Slides and lecture recordings (2021) are made available via Nestor

- 7 weekly online/in-person lab sessions
- You should have registered via Nestor for a lab session
- If problems, contact secretary: sec-MILLC@rug.nl

- No switching when groups are full
- Attendance (if online: with enabled webcam/mic when talking to lab teacher) **obligatory**!
- Finishing lab exercises results in at most **1 bonus point**
- Only when final test score \(T\) at least 5.0
- Calculation: \(T + 0.5 \times P \textrm{ (max 2)}\)

- Lecture slides, book and lab sessions are in *English*. **Why?**
- Most statistics terminology you encounter will be in English
- *Statistiek II* will be completely in English (English teacher)

- You may choose the language of the lab reports (either Dutch or English)
- The final exam is in **Dutch**
- (but with English-Dutch translation of statistical terminology when necessary)

- Lab reports due one week after lab
- Late or sub-optimal lab report: (max.) 1 point (out of 2)
- Lab report over 1 week late or insufficient: 0 points
- Exam requirement: average lab score of 1 point, with at most *two* times 0 points
- Lab attendance required:
- You can miss at most 1 lab session, but timely lab report (\(\geq\) 1 point) **required**
- If you miss 2 or more lab sessions: **course failed**

- Why required lab attendance? **Actively participating in lab session increases chances of passing the exam!**
- Lab score correlates significantly (\(r = .4, p < .001, N \approx 400\)) with exam grade

- Lab score of last year may be used, but only as a whole (same restrictions)

- Understand basics of descriptive and inferential statistics
- Emphasis on statistical **reasoning**
- Practical approach, but some mathematics to help understand the concepts

- Understand and apply basic statistical analyses
- Report on results of statistical analyses
- Understand reports using statistical analyses
- Conduct basic statistical analyses in `R`

- Many experiments are conducted in communication, information science and linguistics
- Effect of comics vs. normal text on understanding?
- Effect of algorithm on quality of automatically generated summary?
- Influence of gender on learning a second language?

- Availability of online data increases opportunities for statistical analysis (e.g., Digital Humanities)
- In this course we will mainly work with the data collected in the survey
- But other examples will be given as well

- Data on the basis of *your* answers to the initial survey
- Age, gender, handedness, study, etc.
- Information about your language history
- Information about your English use and (subjective) proficiency

- We are interested in investigating which factors are related to English proficiency
- Measured by your English grade in high school
- And via an automatically calculated approximate measure of proficiency (English score) based on your input

- Data also includes a subset of survey results from earlier years

participant | year | gender | bl_edu | study | english_grade | english_score |
---|---|---|---|---|---|---|
103 | 2017 | M | N | IS | 8 | 7.10 |
104 | 2017 | F | N | OTHER | 9 | 7.76 |
105 | 2017 | F | N | OTHER | 7 | 5.68 |
106 | 2017 | F | N | CIS | 7 | 7.31 |
107 | 2017 | F | N | LING | 7 | 7.95 |
108 | 2017 | M | N | OTHER | 7 | 7.51 |
109 | 2017 | F | N | IS | 7 | 6.97 |
110 | 2017 | F | N | CIS | 6 | 6.22 |
111 | 2017 | M | N | OTHER | 8 | 8.71 |
112 | 2017 | F | N | LING | 7 | 6.78 |
113 | 2017 | F | N | CIS | 6 | 5.94 |
114 | 2017 | M | N | CIS | 6 | 6.18 |
115 | 2017 | F | N | LING | 8 | 8.00 |

participant | year | gender | bl_edu | study | english_grade | english_score |
---|---|---|---|---|---|---|
198 | 2018 | F | N | LING | 6 | 5.19 |
199 | 2018 | M | N | LING | 7 | 6.82 |
200 | 2018 | M | N | LING | 8 | 8.21 |
201 | 2018 | F | N | CIS | 7 | 7.34 |
202 | 2018 | F | N | LING | 7 | 6.59 |
203 | 2018 | F | N | LING | 8 | 7.55 |
204 | 2018 | F | N | LING | 7 | 7.19 |
205 | 2018 | F | Y | LING | 8 | 7.63 |
206 | 2018 | F | N | LING | 6 | 6.58 |
207 | 2018 | M | N | IS | 8 | 8.89 |
208 | 2018 | M | N | CIS | 7 | 6.76 |
209 | 2018 | F | N | OTHER | 8 | 8.18 |
210 | 2018 | M | N | CIS | 6 | 6.33 |

participant | year | gender | bl_edu | study | english_grade | english_score |
---|---|---|---|---|---|---|
1 | 2019 | M | N | LING | 5 | 6.10 |
2 | 2019 | F | N | CIS | 6 | 6.67 |
3 | 2019 | F | N | CIS | 7 | 7.42 |
4 | 2019 | F | N | LING | 8 | 9.10 |
5 | 2019 | F | N | CIS | 7 | 7.47 |
6 | 2019 | M | N | LING | 8 | 8.14 |
7 | 2019 | F | N | LING | 8 | 7.65 |
8 | 2019 | F | N | CIS | 6 | 7.35 |
9 | 2019 | F | N | LING | 8 | 8.54 |
10 | 2019 | M | N | IS | 8 | 8.39 |
11 | 2019 | F | N | LING | 7 | 7.98 |
12 | 2019 | M | N | OTHER | 7 | 6.15 |
13 | 2019 | F | N | LING | 7 | 5.60 |

participant | year | gender | bl_edu | study | english_grade | english_score |
---|---|---|---|---|---|---|
413 | 2020 | M | N | OTHER | 6 | 6.86 |
414 | 2020 | M | N | LING | 8 | 8.07 |
415 | 2020 | M | N | IS | 8 | 7.72 |
416 | 2020 | F | N | LING | 7 | 7.53 |
417 | 2020 | F | Y | CIS | 8 | 9.23 |
418 | 2020 | M | N | IS | 7 | 7.64 |
419 | 2020 | M | N | IS | 7 | 7.82 |
420 | 2020 | F | N | LING | 8 | 8.65 |
421 | 2020 | M | N | LING | 9 | 9.09 |
422 | 2020 | M | N | IS | 6 | 7.61 |
423 | 2020 | F | N | LING | 8 | 8.26 |
424 | 2020 | F | N | LING | 7 | 6.67 |
425 | 2020 | F | N | LING | 8 | 8.38 |

participant | year | gender | bl_edu | study | english_grade | english_score |
---|---|---|---|---|---|---|
333 | 2021 | F | N | LING | 6 | 5.94 |
334 | 2021 | F | N | LING | 7 | 7.25 |
335 | 2021 | F | N | LING | 5 | 5.65 |
336 | 2021 | M | N | IS | 8 | 7.16 |
337 | 2021 | F | N | LING | 6 | 7.21 |
338 | 2021 | F | Y | CIS | 8 | 8.16 |
339 | 2021 | F | N | LING | 7 | 8.12 |
340 | 2021 | F | N | CIS | 6 | 5.66 |
341 | 2021 | M | Y | IS | 7 | 7.60 |
342 | 2021 | F | Y | CIS | 8 | 8.31 |
343 | 2021 | F | N | LING | 7 | 7.09 |
344 | 2021 | M | N | LING | 7 | 6.88 |
345 | 2021 | F | N | LING | 7 | 6.65 |

- We would like to make sense of (in this course: **your**) data
- For this we need to:
- Summarize the data (*descriptive statistics*)
- Assess relationships in our data (*inferential statistics*)
- (During other courses, some of you will have already encountered inferential statistical tests such as the \(t\)-test or chi-square test which can be used for this)

- The requirement of the data is that it is **variable** (there must be variation)
- Note that statistics is not mathematics (it's data analysis)!

- **Descriptive statistics** (describe data without conclusions)
- Measures of central tendency and spread
- E.g.: mean English grade (7.3), and range of English grades (5 - 9.5)

- Visualization
- E.g.: showing number of participants per study with a bar plot

- **Inferential statistics** (link findings based on *sample* to *population*)
- Comparing 2 groups (or 1 group with a value)
- E.g.: pronunciation of women better than men?

- Associations between 2 variables
- E.g.: are gender and handedness related?

- Internal consistency of questions in a survey
- E.g.: does a group of questions measure a single construct?

- **How to do statistics in** `R` (this lecture)
- And how to make reproducible lab reports in `R`

Why use `R`?

- *Very nice* reproducible lab reports (no copy-paste necessary!)
- Other advantages of `R` compared to (e.g.,) SPSS
- **Free** for everybody
- **Customizable**: people can create their own statistical functions
- **State-of-the-art** statistical methods are integrated very quickly

- Also some disadvantages:
- No substantial graphical user interface: **typing** instead of clicking
- Takes more time to learn
- State-of-the-art statistical methods sometimes contain bugs

`R` as calculator

```
# Addition (this is a comment: preceded by '#')
5 + 5
```

```
# [1] 10
```

```
# Multiplication
5 * 3
```

```
# [1] 15
```

```
# Division
5/3
```

```
# [1] 1.6667
```
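A few more arithmetic operations can be tried in the same way. The following is a minimal sketch; the numbers are arbitrary illustrations and not taken from the lecture.

```r
# Subtraction
5 - 3
# Exponentiation: 5 to the power of 3
5^3
# Parentheses control the order of operations
(5 + 5) / 2
# Square root via a built-in function
sqrt(25)
# Remainder after division (modulo) and integer division
5 %% 3
5 %/% 3
```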

```
a <- 5 # store a single value; instead of '<-' you can also use '='
a # display the value
```

```
# [1] 5
```

```
b <- a * a # b contains the value of multiplying a with itself
b
```

```
# [1] 25
```

```
(d <- NA) # set value of d to missing (NA) and show value
```

```
# [1] NA
```

```
b <- c(2, 4, 6, 7, 8) # store a series of values in a vector (reusing variable b)
b
```

```
# [1] 2 4 6 7 8
```

```
b[4] <- a # assign value 5 (stored in 'a') to the 4th element of vector b
b
```

```
# [1] 2 4 6 5 8
```

```
b <- c(b, NA) # add element NA to b
b
```

```
# [1] 2 4 6 5 8 NA
```
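As a small sketch of what can be done with such a vector: arithmetic is applied to every element at once, and `is.na()` identifies missing values. The vector is recreated here so that the example runs on its own.

```r
b <- c(2, 4, 6, 5, 8, NA)  # same values as the vector above
length(b)                  # number of elements, including the NA
b * 2                      # arithmetic is applied to each element; NA stays NA
is.na(b)                   # TRUE for each missing element
b[!is.na(b)]               # keep only the non-missing elements
```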

```
b # show values in variable b (b contains a vector: a list of values)
```

```
# [1] 2 4 6 5 8 NA
```

```
mn <- mean(b) # calculating the mean and storing in variable mn
mn
```

```
# [1] NA
```

```
# mn is NA (missing) as one of the values is missing
mean(b, na.rm = TRUE) # we can use the function parameter na.rm to ignore NAs
```

```
# [1] 5
```

```
# But which parameters does a function have: use help!
help(mean) # alternatively: ?mean
```
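The parameter `na.rm` is not specific to `mean()`. The sketch below shows it with `sum()` and `max()`, and uses `args()` as a quick way to list the parameters of a function. The variable names `total` and `largest` are chosen for illustration only.

```r
b <- c(2, 4, 6, 5, 8, NA)        # same values as the vector above
sum(b)                           # NA, because one value is missing
total <- sum(b, na.rm = TRUE)    # sum of the non-missing values
largest <- max(b, na.rm = TRUE)  # largest non-missing value
total
largest
args(sd)                         # shows the parameters of a function (here: sd)
```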

`R`: exporting a data set

`R`: importing a data set

```
setwd("C:/Users/Martijn/Desktop/Statistiek-I/HC1") # set working directory
dat <- read.csv("survey.csv", sep = ";", dec = ".") # reads csv file from work dir
str(dat) # shows structure of the data frame (i.e. table is 2-dimensional)
```

```
# 'data.frame': 500 obs. of 7 variables:
# $ participant : int 1 2 3 4 5 6 7 8 9 10 ...
# $ year : num 2017 2017 2017 2017 2017 ...
# $ gender : chr "F" "M" "M" "F" ...
# $ bl_edu : chr "N" "N" "N" "N" ...
# $ study : chr "LING" "LING" "LING" "CIS" ...
# $ english_grade: num 6 7 8 7 7 8 7 8 6 8 ...
# $ english_score: num 4.31 6.42 8.17 6.99 6.12 ...
```

```
dim(dat) # number of rows and columns of data set
```

```
# [1] 500 7
```
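To try importing without access to `survey.csv`, one can first write a small file and then read it back. This is a sketch under assumptions: the data frame `toy` and its values are made up, and the file is written to a temporary location instead of a fixed working directory.

```r
# Hypothetical toy data, not the course survey
toy <- data.frame(participant = 1:3,
                  study = c("LING", "CIS", "IS"),
                  english_score = c(6.5, 7.25, 8))

# Write a semicolon-separated file to a temporary location
path <- tempfile(fileext = ".csv")
write.table(toy, path, sep = ";", dec = ".", row.names = FALSE)

# Read it back with the same separator and decimal mark
toy2 <- read.csv(path, sep = ";", dec = ".")
str(toy2)   # structure of the imported data frame
dim(toy2)   # number of rows and columns
```

The values of `sep` and `dec` have to match how the file was written; a file exported with a comma as decimal mark would need `dec = ","`.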

`head`

```
head(dat) # show first few rows of dat
```

```
# participant year gender bl_edu study english_grade english_score
# 1 1 2017 F N LING 6 4.3055
# 2 2 2017 M N LING 7 6.4152
# 3 3 2017 M N LING 8 8.1696
# 4 4 2017 F N CIS 7 6.9925
# 5 5 2017 F N LING 7 6.1165
# 6 6 2017 F N LING 8 7.3538
```

- Access parts of table by specifying row and/or column numbers
- `dat[a,b]`:
- `a` indicates the selected rows of `dat`
- `b` indicates the selected columns of `dat`

```
dat[1, ] # values in first row (dat[,1]: values in first column)
```

```
# participant year gender bl_edu study english_grade english_score
# 1 1 2017 F N LING 6 4.3055
```

```
dat[c(1, 5), c(1, 2, 3)] # values in rows 1 and 5 and columns 1, 2 and 3
```

```
# participant year gender
# 1 1 2017 F
# 5 5 2017 F
```

- Additionally, we can access parts of the table by specifying the names of the columns we want to look at

```
dat[c(1, 3, 5), c("participant", "study")] # rows 1, 3 and 5, and two named columns
```

```
# participant study
# 1 1 LING
# 3 3 LING
# 5 5 LING
```
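The same indexing patterns can be tried on a small, self-contained data frame. The data frame `toy` below is hypothetical; the negative index in the last lines is an additional base `R` feature not shown above.

```r
# Hypothetical toy data frame (not the course survey)
toy <- data.frame(participant = 1:5,
                  gender = c("F", "M", "M", "F", "F"),
                  study = c("LING", "LING", "IS", "CIS", "LING"))

toy[2, ]                        # second row, all columns
toy[, 2]                        # second column, returned as a vector
toy[2:4, c("gender", "study")]  # rows 2 to 4 of two named columns
toy[-1, ]                       # negative index: everything except row 1
nrow(toy[-1, ])                 # number of remaining rows
```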

- We may also select a single column by its name (not number) using the `$` operator
- E.g., `dat$gender` accesses the column `gender` of `dat`

```
head(dat$gender, 200) # show gender of first 200 students
```

```
# [1] "F" "M" "M" "F" "F" "F" "F" "F" "F" "M" "M" "F" "M" "F" "F" "F" "F" "F" "F" "F"
# [21] "F" "F" "F" "F" "M" "F" "M" "M" "F" "F" "F" "F" "F" "F" "F" "F" "M" "M" "F" "F"
# [41] "F" "M" "F" "M" "F" "F" "M" "M" "M" "F" "F" "F" "M" "M" "F" "F" "F" "F" "F" "F"
# [61] "F" "F" "M" "F" "M" "F" "F" "F" "M" "F" "F" "M" "M" "M" "M" "F" "F" "F" "F" "F"
# [81] "F" "F" "F" "F" "F" "F" "M" "F" "F" "M" "M" "F" "M" "F" "M" "M" "F" "F" "F" "M"
# [101] "F" "F" "F" "F" "M" "M" "F" "F" "F" "F" "M" "F" "F" "F" "M" "F" "F" "M" "M" "F"
# [121] "M" "M" "F" "F" "F" "F" "M" "F" "F" "F" "M" "F" "F" "M" "M" "F" "M" "F" "F" "M"
# [141] "M" "F" "F" "M" "F" "F" "F" "F" "F" "F" "F" "M" "F" "F" "F" "F" "M" "F" "F" "F"
# [161] "F" "F" "F" "M" "M" "F" "F" "F" "M" "F" "F" "M" "F" "F" "F" "F" "F" "F" "F" "F"
# [181] "F" "F" "F" "M" "F" "F" "M" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F"
```

```
tmp <- dat[5:8, c(1, 3)] # store columns 1 and 3 for rows 5 to 8 in variable tmp
tmp # show what is stored in variable tmp
```

```
# participant gender
# 5 5 F
# 6 6 F
# 7 7 F
# 8 8 F
```
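For comparison, a single column can be extracted in several equivalent ways. The sketch below uses a hypothetical data frame `toy`; the double-bracket form is an alternative from base `R` that is handy when the column name is stored in a variable.

```r
# Hypothetical toy data frame (not the course survey)
toy <- data.frame(participant = 1:4, gender = c("F", "M", "F", "F"))

g1 <- toy$gender        # column by name with $
g2 <- toy[["gender"]]   # same column with double brackets
g3 <- toy[, "gender"]   # same column with row/column indexing
identical(g1, g2)
identical(g1, g3)

col <- "gender"         # a column name stored in a variable
toy[[col]]              # works with [[ ]]; toy$col would look for a column named 'col'
```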

- Conditional indexing allows us to select parts of the data on the basis of **conditions**

```
tmp <- dat[dat$gender == "M", ] # only observations for male participants
head(tmp)
```

```
# participant year gender bl_edu study english_grade english_score
# 2 2 2017 M N LING 7 6.4152
# 3 3 2017 M N LING 8 8.1696
# 10 10 2017 M N IS 8 8.9375
# 11 11 2017 M N CIS 7 6.2686
# 13 13 2017 M N CIS 6 5.7744
# 25 25 2017 M N OTHER 9 8.3227
```

- Methods to combine conditions:
- **and**: `&`
- **or**: `|`

```
# only participants who study IS *and* are male
tmp <- dat[dat$gender == "M" & dat$study == "IS", ]
head(tmp)
```

```
# participant year gender bl_edu study english_grade english_score
# 10 10 2017 M N IS 8.0 8.9375
# 27 27 2017 M N IS 8.0 9.0563
# 28 28 2017 M N IS 7.0 7.9433
# 37 37 2017 M N IS 8.1 8.6273
# 44 44 2017 M N IS 6.0 6.1372
# 47 47 2017 M N IS 9.0 9.0493
```

- Invert a condition with `!` (not)
- **is not equal to**: `!=`

```
# only women (i.e. not men) *or* everybody with an English grade over 7
tmp <- dat[dat$gender != "M" | dat$english_grade > 7, ]
tail(tmp) # tail shows final 6 rows
```

```
# participant year gender bl_edu study english_grade english_score
# 495 405 2021 F N CIS 8.0 8.2097
# 496 406 2021 F N LING 7.0 7.7812
# 497 407 2021 F N OTHER 7.0 8.1330
# 498 408 2021 F N OTHER 8.0 9.3621
# 499 409 2021 M N IS 7.5 7.9784
# 500 410 2021 F N LING 7.0 7.6600
```
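Conditions can be explored on a small example before applying them to the full data set. The sketch below uses a hypothetical data frame `toy`; `subset()` and `%in%` are base `R` alternatives that are not part of the examples above.

```r
# Hypothetical toy data frame (not the course survey)
toy <- data.frame(participant = 1:6,
                  gender = c("F", "M", "M", "F", "F", "M"),
                  study = c("LING", "IS", "CIS", "CIS", "OTHER", "IS"),
                  english_grade = c(6, 8, 7, 9, 7, 6))

# Male IS students, written with conditional indexing and with subset()
a <- toy[toy$gender == "M" & toy$study == "IS", ]
b <- subset(toy, gender == "M" & study == "IS")
a
b

# %in% checks membership in a set of values (shorthand for several '|' conditions)
toy[toy$study %in% c("IS", "CIS"), ]

# A condition is a vector of TRUE/FALSE values, so sum() counts the matching rows
sum(toy$english_grade > 7)
```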

- Frequently, you need to add columns to the data, for example by computing a new value on the basis of the values in two columns
- The operator `$` helps us to do that

```
# new column 'diff': English grade - English proficiency score
dat$diff <- dat$english_grade - dat$english_score
head(dat)
```

```
# participant year gender bl_edu study english_grade english_score diff
# 1 1 2017 F N LING 6 4.3055 1.6945488
# 2 2 2017 M N LING 7 6.4152 0.5847601
# 3 3 2017 M N LING 8 8.1696 -0.1695835
# 4 4 2017 F N CIS 7 6.9925 0.0075054
# 5 5 2017 F N LING 7 6.1165 0.8835361
# 6 6 2017 F N LING 8 7.3538 0.6461548
```

- Conditional indexing allows us additional flexibility

```
dat$pass_fail <- "PASS" # new column, initially PASS for everybody
dat[dat$english_grade < 5.5, ]$pass_fail <- "FAIL" # if grade too low, then FAIL
tail(dat[dat$english_grade > 4 & dat$english_grade < 6, 2:9]) # show subset of data
```

```
# year gender bl_edu study english_grade english_score diff pass_fail
# 363 2020 F N LING 5.0 6.1166 -1.11660 FAIL
# 377 2020 F Y CIS 5.0 4.3000 0.70000 FAIL
# 399 2020 F N LING 5.8 6.0576 -0.25764 PASS
# 403 2020 F N LING 5.8 5.1720 0.62797 PASS
# 425 2021 F N LING 5.0 5.6488 -0.64878 FAIL
# 482 2021 F N CIS 5.6 5.9877 -0.38772 PASS
```
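A column based on a condition can also be created in one step with `ifelse()`. This is a sketch with made-up grades in a hypothetical data frame `toy`; the first approach follows the same idea as above (start with PASS, then overwrite), the second is an alternative.

```r
# Hypothetical grades (not the course survey)
toy <- data.frame(english_grade = c(5.0, 5.8, 7.5, 5.4, 9.0))

# Two steps: PASS for everybody, then FAIL where the grade is too low
toy$pass_fail <- "PASS"
toy$pass_fail[toy$english_grade < 5.5] <- "FAIL"

# One step: ifelse(condition, value if TRUE, value if FALSE)
toy$pass_fail2 <- ifelse(toy$english_grade < 5.5, "FAIL", "PASS")

toy
table(toy$pass_fail)   # frequency of FAIL and PASS
```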

Visualization in `R`

- Many basic visualization options are available in `R`
- In this course, we will learn how to use the following functions (list not exhaustive):
- `barplot()` (illustrated in the following)
- `plot()`
- `boxplot()`
- `hist()`
- `qqnorm()` and `qqline()`

- The bar plot is used to visualize frequencies of categorical variables

```
(counts <- table(dat$gender)) # first create frequency table
```

```
#
# F M
# 350 150
```

```
barplot(counts)
```

- There are various graphical parameters to allow you to customize your graphics

```
barplot(counts, col = c("pink", "lightblue"), ylim = c(0, 350), main = "My barplot",
xlab = "Gender", ylab = "Frequency")
```

```
(counts <- table(dat$gender, dat$study))
```

```
#
# CIS IS LING OTHER
# F 101 26 180 43
# M 23 96 16 15
```

```
barplot(counts, col = c("pink", "lightblue"), legend.text = c("F", "M"), ylim = c(0, 185))
```
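By default, a two-way table is drawn as stacked bars. A possible variant, sketched below, places the bars side by side with `beside = TRUE`. The counts are copied from the gender-by-study table above and entered by hand so that the example runs without the survey file; the axis limits and labels are illustrative choices.

```r
# Counts copied from the gender-by-study table above (filled column by column)
counts <- matrix(c(101, 23, 26, 96, 180, 16, 43, 15), nrow = 2,
                 dimnames = list(c("F", "M"), c("CIS", "IS", "LING", "OTHER")))

# beside = TRUE draws the bars of each gender next to each other instead of stacked;
# legend.text = TRUE uses the row names ("F", "M") for the legend
barplot(counts, beside = TRUE, col = c("pink", "lightblue"),
        legend.text = TRUE, ylim = c(0, 200),
        xlab = "Study", ylab = "Frequency")
```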

Statistics in `R`

- The main purpose of `R` is to conduct statistical analyses
- Many different functions to obtain descriptive and inferential statistics are available in `R`
- The following examples illustrate **how** to conduct some statistical analyses in `R`
- The **why** and **what** is covered in the next lectures

```
mean(dat$english_score) # mean of all people for English score
```

```
# [1] 7.466
```

```
mean(dat[dat$gender == "F", ]$english_score) # mean of women for English score
```

```
# [1] 7.3613
```

```
median(dat$english_score) # median of all people for English score
```

```
# [1] 7.5227
```

```
min(dat$english_score) # minimum value
```

```
# [1] 4.3
```

```
max(dat$english_score) # maximum value
```

```
# [1] 10
```

```
var(dat$english_score) # sample variance: sum of squared deviations from mean, divided by n - 1
```

```
# [1] 1.1671
```

```
sd(dat$english_score) # standard deviation (square root of variance)
```

```
# [1] 1.0803
```
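To see what `var()` and `sd()` compute, the values can be reproduced by hand. The sketch below uses a short made-up vector `x`; note that `var()` divides the sum of squared deviations by \(n - 1\), not by \(n\).

```r
x <- c(6, 7, 8, 7, 7)               # hypothetical scores
n <- length(x)
dev <- x - mean(x)                  # deviations from the mean
manual_var <- sum(dev^2) / (n - 1)  # sum of squared deviations divided by n - 1
manual_var
var(x)                              # same value as manual_var
sqrt(manual_var)
sd(x)                               # same value as sqrt(manual_var)
```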

```
table(dat$gender)
```

```
#
# F M
# 350 150
```

```
table(dat$study)
```

```
#
# CIS IS LING OTHER
# 124 122 196 58
```

```
table(dat$gender, dat$study)
```

```
#
# CIS IS LING OTHER
# F 101 26 180 43
# M 23 96 16 15
```

```
table(dat$gender, dat$bl_edu)
```

```
#
# N Y
# F 316 34
# M 136 14
```
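Frequencies are often easier to interpret as proportions. A minimal sketch using `prop.table()`, with the gender counts copied from the table above and entered by hand:

```r
# Counts copied from the gender table above
counts <- as.table(c(F = 350, M = 150))
props <- prop.table(counts)   # each count divided by the total
props
round(100 * props)            # the same as percentages
```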

- A large number of statistical inference functions are available in `R`
- In this course we will cover the following functions:
- `t.test()` for a \(t\)-test (single sample, paired, independent)
- `wilcox.test()` for non-parametric alternatives to the \(t\)-test (Mann-Whitney U test, Wilcoxon signed-rank test)
- `binom.test()` for the sign test
- `chisq.test()` for the chi-square test
- `cor()` for the correlation
- `alpha()` (from package `psych`) for Cronbach's \(\alpha\)
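As an illustration of how one of these functions is called, the sketch below applies `chisq.test()` to the gender-by-`bl_edu` counts from the table shown earlier, entered by hand. How to interpret the result is a topic for later lectures; this only shows the mechanics.

```r
# Counts copied from the gender-by-bl_edu table (filled column by column)
counts <- matrix(c(316, 136, 34, 14), nrow = 2,
                 dimnames = list(gender = c("F", "M"), bl_edu = c("N", "Y")))

res <- chisq.test(counts)  # chi-square test on a table of counts
res$expected               # counts expected if the two variables were unrelated
res$p.value                # p-value of the test
```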

- Assessing average group differences for a numerical variable (lecture 4)

```
t.test(english_grade ~ bl_edu, data = dat)
```

```
#
# Welch Two Sample t-test
#
# data: english_grade by bl_edu
# t = -3.21, df = 60, p-value = 0.0021
# alternative hypothesis: true difference in means between group N and group Y is not equal to 0
# 95 percent confidence interval:
# -0.62390 -0.14531
# sample estimates:
# mean in group N mean in group Y
# 7.2425 7.6271
```
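The result of `t.test()` can also be stored in a variable, so that individual values can be reused (for example in a reproducible lab report). The sketch below uses simulated data in a hypothetical data frame `toy`, not the course survey.

```r
set.seed(1)  # make the simulated data reproducible
# Hypothetical grades for two groups (simulated)
toy <- data.frame(grade = c(rnorm(30, mean = 7.0), rnorm(30, mean = 7.5)),
                  group = rep(c("N", "Y"), each = 30))

res <- t.test(grade ~ group, data = toy)  # Welch test by default
res$statistic   # t-value
res$parameter   # degrees of freedom
res$p.value     # p-value
res$conf.int    # 95% confidence interval of the difference in means
res$estimate    # the two group means
```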

- Assessing strength of the relationship between two numerical variables (lecture 6)

```
cor(dat$english_score, dat$english_grade)
```

```
# [1] 0.72543
```
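`cor()` returns only the correlation coefficient. If a \(p\)-value or confidence interval is needed, base `R` also offers `cor.test()`. The sketch below uses simulated values `x` and `y`, not the course survey.

```r
set.seed(1)                          # make the simulated data reproducible
x <- rnorm(50)
y <- 0.7 * x + rnorm(50, sd = 0.5)   # y depends on x plus random noise

r <- cor(x, y)         # correlation coefficient
r
cor(y, x)              # the order of the variables does not matter
res <- cor.test(x, y)  # additionally provides a p-value and confidence interval
res$estimate
res$p.value
```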

- In this lecture, we've covered the basics of `R`
- `R` as calculator
- Variables
- Functions and help
- Importing data in `R`
- Viewing and modifying data
- Some examples of visualizations
- Some examples of descriptive and inferential statistics

- See Levshina (Ch. 2) for more information about the functionality of `R`
- In the lab session, you will experiment with using `R`
- Next lecture: **Descriptive statistics**
- Evaluation: if at least 50 people evaluate the lecture, exam-like question visible