Martijn Wieling

University of Groningen

- General information about the course
- Introduction to dataset used in the course
- Why use statistics?
- Introduction to RStudio and `R`
- `R` as calculator
- Variables
- Functions and help
- Importing data in `R`
- Viewing and modifying data
- Visualization in `R`
- Statistics in `R`

- Important information (including *studiehandleiding* (study guide) and *FAQ*) on Nestor!
- Teacher:
- Martijn Wieling, m.b.wieling@rug.nl, H1311.434

- Teaching assistants (lab sessions):
- Eveline Schmidt, e.l.schmidt@student.rug.nl (Thu. 11-13, Fri. CIW PM)
- Raoul Buurke, r.s.s.j.buurke@student.rug.nl (Thu. 13-15)
- Jeffry Frikken, j.k.frikken@student.rug.nl (Thu. 15-17)
- Tanya Palsma, t.palsma@student.rug.nl (Fri. IK2)
- Aline Oelen, a.i.oelen@student.rug.nl (Fri. 13-15)

- 7 weekly lectures
- Slides (and 2017-2018 recordings) will be made available via Nestor
- Interactive *Mentimeter* questions during lecture

- 7 weekly lab sessions
- You should have registered via Nestor for a group associated with your study
- If problems, contact secretary: sec-MILLC@rug.nl

- Attendance **obligatory**!
- Finishing lab exercises results in at most **1 bonus point**
- Only when final test score \(T\) at least 5.0
- Calculation: \(T + 0.25 \times P \textrm{(max 2)} \times H \textrm{(max 2)}\)

- 3 optional lectures
- Feb. 19, March 5, March 19, 17:00 - 19:00, Zernikezaal, Academy building
- All exercises have to be made **in advance** (see Nestor)!

- Lecture slides, book and lab sessions are in *English*. **Why?**
- Most statistics terminology you encounter will be in English
- *Statistiek II* will be completely in English (English teacher)

- You may choose the language of the lab reports (either Dutch or English)
- The final exam is in **Dutch**
- (but with English-Dutch translation of statistical terminology when necessary)

- Lab reports due one week after lab
- Late or sub-optimal lab report: (max.) 1 point (out of 2)
- Lab report over 1 week late or insufficient: 0 points
- Exam requirement: average lab score of 1 point, with at most three times 0 points
- Lab attendance required:
- You can miss at most 1 lab session, but timely lab report (\(\geq\) 1 point) **required**
- If you miss 2 or more lab sessions: **course failed**

- Why required lab attendance? **Actively participating in lab session increases chances of passing the exam!**

- Lab score of last year may be used, but only as a whole (same restrictions)

- Understand basics of descriptive and inferential statistics
- Emphasis on statistical **reasoning**
- Practical approach, but some mathematics to help understand the concepts

- Understand and apply basic statistical analyses
- Report on results of statistical analyses
- Understand reports using statistical analyses
- Conduct basic statistical analyses in `R`

- Many experiments are conducted in communication, information science and linguistics
- Effect of comics vs. normal text on understanding?
- Effect of algorithm on quality of automatically generated summary?
- Influence of gender on learning a second language?

- Availability of online data increases opportunities for statistical analysis (e.g., Digital Humanities)
- In this course we will mainly work with the data collected in the survey
- But other examples will be given as well

- Data on the basis of *your* answers to the initial survey
- Age, gender, handedness, study, etc.
- Information about your language history
- Information about your English use and (subjective) proficiency

- We are interested in investigating which factors are related to English proficiency
- Measured by your English grade in high school
- And via an automatically calculated approximate measure of proficiency (English score) based on your input

- Data also includes a subset of survey results from earlier years

participant | year | gender | bl_edu | study | english_grade | english_score
---|---|---|---|---|---|---
1 | 2017 | F | N | LING | 6 | 4.31
2 | 2017 | M | N | LING | 7 | 6.42
3 | 2017 | M | N | LING | 8 | 8.17
4 | 2017 | F | N | CIS | 7 | 6.99
5 | 2017 | F | N | LING | 7 | 6.12
6 | 2017 | F | N | LING | 8 | 7.35
7 | 2017 | F | N | LING | 7 | 6.78
8 | 2017 | F | Y | LING | 8 | 7.43
9 | 2017 | F | N | LING | 6 | 6.04
10 | 2017 | M | N | IS | 8 | 8.94
11 | 2017 | M | N | CIS | 7 | 6.27
12 | 2017 | F | N | OTHER | 8 | 7.99
13 | 2017 | M | N | CIS | 6 | 5.77
14 | 2017 | F | N | OTHER | 7 | 6.78
15 | 2017 | F | N | OTHER | 8 | 8.45
16 | 2017 | F | N | CIS | 7 | 7.11
17 | 2017 | F | N | LING | 8 | 8.47
18 | 2017 | F | N | IS | 8 | 7.48
19 | 2017 | F | N | LING | 8 | 7.52
20 | 2017 | F | N | LING | 8 | 7.58
21 | 2017 | F | Y | OTHER | 7 | 8.99
22 | 2017 | F | N | CIS | 8 | 8.70
23 | 2017 | F | N | IS | 6 | 4.52
24 | 2017 | F | N | LING | 7 | 6.83
25 | 2017 | M | N | OTHER | 9 | 8.32
26 | 2017 | F | N | LING | 7 | 6.80
27 | 2018 | M | N | IS | 8 | 9.06
28 | 2018 | M | N | IS | 7 | 7.94
29 | 2018 | F | N | CIS | 7 | 7.23
30 | 2018 | F | N | CIS | 7 | 7.99
31 | 2018 | F | N | IS | 7 | 7.42
32 | 2018 | F | Y | IS | 8 | 6.50
33 | 2018 | F | N | LING | 7 | 7.34
34 | 2018 | F | N | OTHER | 8 | 8.82
35 | 2018 | F | N | LING | 9 | 8.63
36 | 2018 | F | N | LING | 8 | 7.94
37 | 2018 | M | N | IS | 8 | 8.63
38 | 2018 | M | N | LING | 9 | 7.89
39 | 2018 | F | N | CIS | 7 | 6.18
40 | 2018 | F | N | CIS | 7 | 7.09
41 | 2018 | F | Y | OTHER | 8 | 9.50
42 | 2018 | M | N | CIS | 7 | 7.79
43 | 2018 | F | N | LING | 7 | 6.25
44 | 2018 | M | N | IS | 6 | 6.14
45 | 2018 | F | N | LING | 7 | 6.77
46 | 2018 | F | N | LING | 9 | 6.71
47 | 2018 | M | N | IS | 9 | 9.05
48 | 2018 | M | N | IS | 7 | 7.06
49 | 2018 | M | N | IS | 6 | 6.74
50 | 2018 | F | N | LING | 7 | 7.27
51 | 2018 | F | N | LING | 6 | 4.36
52 | 2018 | F | N | LING | 8 | 7.72
53 | 2018 | M | N | IS | 8 | 8.15
54 | 2018 | M | N | IS | 7 | 6.73
55 | 2018 | F | N | LING | 7 | 7.36
56 | 2018 | F | Y | CIS | 8 | 8.28
57 | 2018 | F | N | IS | 8 | 7.43
58 | 2018 | F | N | CIS | 7 | 7.72
59 | 2018 | F | Y | CIS | 7 | 6.44
60 | 2018 | F | N | LING | 7 | 6.42
61 | 2018 | F | N | LING | 7 | 8.07
62 | 2018 | F | N | CIS | 6 | 6.33
63 | 2018 | M | N | IS | 8 | 7.56
64 | 2018 | F | Y | OTHER | 8 | 8.91
65 | 2018 | M | N | IS | 8 | 7.93

- We would like to make sense of (in this course: **your**) data
- For this we need to:
- Summarize the data (*descriptive statistics*)
- Assess relationships in our data (*inferential statistics*)
- (During other courses, some of you will have already encountered inferential statistical tests such as the \(t\)-test or chi-square test which can be used for this)

- The requirement of the data is that it is **variable** (there must be variation)
- Note that statistics is not mathematics (it's data analysis)!

- **Descriptive statistics** (describe data without conclusions)
- Measures of central tendency and spread
- E.g.: mean English grade (7.3), and range of English grades (5 - 9.5)

- Visualization
- E.g.: showing number of participants per study with a bar plot


- **Inferential statistics** (link findings based on *sample* to *population*)
- Comparing 2 groups (or 1 group with a value)
- E.g.: pronunciation of women better than men?

- Associations between 2 variables
- E.g.: are gender and handedness related?

- Internal consistency of questions in a survey
- E.g.: does a group of questions measure a single construct?

- **How to do statistics in `R`** (this lecture)
- And how to make reproducible lab reports in `R`

Why `R`?

- *Very nice* reproducible lab reports (no copy-paste necessary!)
- Other advantages of `R` compared to (e.g.,) SPSS
- **Free** for everybody
- **Customizable**: people can create their own statistical functions
- **State-of-the-art** statistical methods are integrated very quickly

- Also some disadvantages:
- No substantial graphical user interface: **typing** instead of clicking
- Takes more time to learn
- State-of-the-art statistical methods sometimes contain bugs


`R` as calculator

```
# Addition (this is a comment: preceded by '#')
5 + 5
```

```
# [1] 10
```

```
# Multiplication
5 * 3
```

```
# [1] 15
```

```
# Division
5/3
```

```
# [1] 1.6667
```

```
a <- 5 # store a single value; instead of '<-' you can also use '='
a # display the value
```

```
# [1] 5
```

```
b <- a * a # b contains the value of multiplying a with itself
b
```

```
# [1] 25
```

```
(d <- NA) # set value of d to missing (NA) and show value
```

```
# [1] NA
```

```
b <- c(2, 4, 6, 7, 8) # store a series of values in a vector (reusing variable b)
b
```

```
# [1] 2 4 6 7 8
```

```
b[4] <- a # assign value 5 (stored in 'a') to the 4th element of vector b
b
```

```
# [1] 2 4 6 5 8
```

```
b <- c(b, NA) # add element NA to b
b
```

```
# [1] 2 4 6 5 8 NA
```

```
b # show values in variable b (b contains a vector: a list of values)
```

```
# [1] 2 4 6 5 8 NA
```

```
mn <- mean(b) # calculating the mean and storing in variable mn
mn
```

```
# [1] NA
```

```
# mn is NA (missing) as one of the values is missing
mean(b, na.rm = TRUE) # we can use the function parameter na.rm to ignore NAs
```

```
# [1] 5
```

```
# But which parameters does a function have: use help!
help(mean) # alternatively: ?mean
```
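
The same pattern applies to other functions. The following is a minimal sketch, not part of the original slides: it re-creates the vector `b` so that it runs on its own, uses `args()` to list the parameters of a function without opening the help page, and passes `na.rm` by name to `sd()`.

```r
b <- c(2, 4, 6, 5, 8, NA)    # same values as the vector b above

args(sd)                     # prints the parameters of sd(): x and na.rm
sd(b)                        # NA, because one value is missing
sd_b <- sd(b, na.rm = TRUE)  # named parameter: ignore the missing value
sd_b
sum(is.na(b))                # number of missing values in b
```

Naming a parameter (`na.rm = TRUE`) rather than relying on its position keeps the call readable and avoids mistakes when a function has many parameters.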

`R`: exporting a data set

`R`: importing a data set

```
setwd("C:/Users/Martijn/Desktop/Statistiek-I/HC1/") # set working directory
dat <- read.csv("survey.csv", sep = ";", dec = ".") # reads csv file from work dir
str(dat) # shows structure of the data frame (i.e. table is 2-dimensional)
```

```
# 'data.frame': 315 obs. of 7 variables:
# $ participant : int 1 2 3 4 5 6 7 8 9 10 ...
# $ year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
# $ gender : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 1 1 2 ...
# $ bl_edu : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 2 1 1 ...
# $ study : Factor w/ 4 levels "CIS","IS","LING",..: 3 3 3 1 3 3 3 3 3 2 ...
# $ english_grade: num 6 7 8 7 7 8 7 8 6 8 ...
# $ english_score: num 4.31 6.42 8.17 6.99 6.12 ...
```

```
dim(dat) # number of rows and columns of data set
```

```
# [1] 315 7
```
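
To try the import step without access to `survey.csv`, the sketch below writes a tiny semicolon-separated file to a temporary location and reads it back. The file and the name `mini` are hypothetical stand-ins; the three rows are copied from the survey table shown earlier. Note that the `str()` output above lists `gender`, `bl_edu` and `study` as factors: since R 4.0.0, `read.csv()` keeps text columns as character vectors by default, so `stringsAsFactors = TRUE` is assumed here to get the same kind of result.

```r
# Write a tiny semicolon-separated file to a temporary location
# (hypothetical stand-in for survey.csv, first three survey rows)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("participant;year;gender;bl_edu;study;english_grade;english_score",
             "1;2017;F;N;LING;6;4.31",
             "2;2017;M;N;LING;7;6.42",
             "3;2017;M;N;LING;8;8.17"), tmp_file)

# stringsAsFactors = TRUE stores text columns as factors
# (since R 4.0.0 the default is FALSE, which gives character columns)
mini <- read.csv(tmp_file, sep = ";", dec = ".", stringsAsFactors = TRUE)
str(mini)   # structure: column names and types
dim(mini)   # 3 rows, 7 columns
```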

`head`

```
head(dat) # show first few rows of dat
```

```
# participant year gender bl_edu study english_grade english_score
# 1 1 2017 F N LING 6 4.3055
# 2 2 2017 M N LING 7 6.4152
# 3 3 2017 M N LING 8 8.1696
# 4 4 2017 F N CIS 7 6.9925
# 5 5 2017 F N LING 7 6.1165
# 6 6 2017 F N LING 8 7.3538
```

- Access parts of table by specifying row and/or column numbers
- `dat[a,b]`:
- `a` indicates the selected rows of `dat`
- `b` indicates the selected columns of `dat`

```
dat[1, ] # values in first row (dat[,1]: values in first column)
```

```
# participant year gender bl_edu study english_grade english_score
# 1 1 2017 F N LING 6 4.3055
```

```
dat[c(1, 5), c(1, 2, 3)] # values in rows 1 and 5 and columns 1, 2 and 3
```

```
# participant year gender
# 1 1 2017 F
# 5 5 2017 F
```

- Additionally, we can access parts of the table by specifying the names of the columns we want to look at

```
dat[c(1, 3, 5), c("participant", "study")] # rows 1, 3 and 5, and 2 named columns
```

```
# participant study
# 1 1 LING
# 3 3 LING
# 5 5 LING
```

- We may also select a single column by its name (not number) using the `$` operator
- E.g., `dat$gender` accesses the column `gender` of `dat`

```
dat$gender
```

```
# [1] F M M F F F F F F M M F M F F F F F F F F F F F M F M M F F F F F F F F M M F F
# [41] F M F M F F M M M F F F M M F F F F F F F F M F M F F F M F F M M M M F F F F F
# [81] F F F F F F M F F M M F M F M M F F F M F F F F M M F F F F M F F F M F F M M F
# [121] M M F F F F M F F F M F F M M F M F F M M F F M F F F F F F F M F F F F M F F F
# [161] F F F M M F F F M F F M F F F F F F F F F F F M F F M F F F F F F F F F F F F F
# [201] F F M F F F F F F F M F F F M M F F F F M F F F F M F F M F F M F M F F F F F M
# [241] F F M F F F F M M F F F M F F M F F M F M M M F F F M F F M M F F F F F F M F F
# [281] F M M F F F F F F M F F F F M F F M F M M F M M F F M M M F M F F M M
# Levels: F M
```

```
tmp <- dat[5:8, c(1, 3)] # store columns 1 and 3 for rows 5 to 8 in variable tmp
tmp # show what is stored in variable tmp
```

```
# participant gender
# 5 5 F
# 6 6 F
# 7 7 F
# 8 8 F
```

- Conditional indexing allows us to select parts of the data on the basis of **conditions**

```
tmp <- dat[dat$gender == "M", ] # only observations for male participants
head(tmp)
```

```
# participant year gender bl_edu study english_grade english_score
# 2 2 2017 M N LING 7 6.4152
# 3 3 2017 M N LING 8 8.1696
# 10 10 2017 M N IS 8 8.9375
# 11 11 2017 M N CIS 7 6.2686
# 13 13 2017 M N CIS 6 5.7744
# 25 25 2017 M N OTHER 9 8.3227
```

- Methods to combine conditions:
- **and**: `&`
- **or**: `|`

```
# only participants who study IS *and* are male
tmp <- dat[dat$gender == "M" & dat$study == "IS", ]
head(tmp)
```

```
# participant year gender bl_edu study english_grade english_score
# 10 10 2017 M N IS 8.0 8.9375
# 27 27 2017 M N IS 8.0 9.0563
# 28 28 2017 M N IS 7.0 7.9433
# 37 37 2017 M N IS 8.1 8.6273
# 44 44 2017 M N IS 6.0 6.1372
# 47 47 2017 M N IS 9.0 9.0493
```

- Invert a condition with `!` (not)
- **is not equal to**: `!=`

```
# only women (i.e. not men) *or* everybody with an English grade over 7
tmp <- dat[dat$gender != "M" | dat$english_grade > 7, ]
tail(tmp) # tail shows final 6 rows
```

```
# participant year gender bl_edu study english_grade english_score
# 308 308 2016 M N IS 8.0 8.0876
# 309 309 2016 M N IS 8.0 7.8982
# 310 310 2016 F N OTHER 8.0 8.4761
# 312 312 2016 F N OTHER 7.0 6.8564
# 313 313 2016 F N LING 6.0 6.9069
# 315 315 2016 M N IS 7.5 5.7257
```
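
The example above uses `!=` only. As a minimal sketch of the `!` operator itself, the code below builds a small data frame (the name `mini` is hypothetical; the values are the first six survey rows) and shows that negating a whole condition selects the same rows as rewriting the condition.

```r
# Small hypothetical data frame (values of the first six survey rows)
mini <- data.frame(participant = 1:6,
                   gender = c("F", "M", "M", "F", "F", "F"),
                   english_grade = c(6, 7, 8, 7, 7, 8))

not_m_a <- mini[mini$gender != "M", ]      # "is not equal to"
not_m_b <- mini[!(mini$gender == "M"), ]   # "not (is equal to)": same rows
identical(not_m_a, not_m_b)                # TRUE

# "not (male or grade over 7)" equals "not male and grade at most 7"
sel_a <- mini[!(mini$gender == "M" | mini$english_grade > 7), ]
sel_b <- mini[mini$gender != "M" & mini$english_grade <= 7, ]
identical(sel_a, sel_b)                    # TRUE
```

The parentheses after `!` matter: they make sure the complete condition is negated, not just its first part.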

- Frequently, you need to add columns to the data, for example by computing a new value on the basis of the values in two columns
- The operator `$` helps us to do that

```
# new column 'diff': English grade - English proficiency score
dat$diff <- dat$english_grade - dat$english_score
head(dat)
```

```
# participant year gender bl_edu study english_grade english_score diff
# 1 1 2017 F N LING 6 4.3055 1.6945488
# 2 2 2017 M N LING 7 6.4152 0.5847601
# 3 3 2017 M N LING 8 8.1696 -0.1695835
# 4 4 2017 F N CIS 7 6.9925 0.0075054
# 5 5 2017 F N LING 7 6.1165 0.8835361
# 6 6 2017 F N LING 8 7.3538 0.6461548
```

- Conditional indexing allows us additional flexibility

```
dat$pass_fail <- "PASS" # new column, initially PASS for everybody
dat[dat$english_grade < 5.5, ]$pass_fail <- "FAIL" # if grade too low, then FAIL
head(dat[dat$english_grade > 4 & dat$english_grade < 6, 2:9]) # show subset of data
```

```
# year gender bl_edu study english_grade english_score diff pass_fail
# 78 2017 F N LING 5.0 4.3000 0.70000 FAIL
# 122 2016 M N LING 5.0 5.5963 -0.59625 FAIL
# 209 2016 F N LING 5.5 4.3196 1.18042 PASS
# 284 2016 F N LING 5.0 5.3275 -0.32755 FAIL
```
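
An alternative way to create such a column is `ifelse()`, which evaluates the condition for every row and picks one of two values. The sketch below uses a small hypothetical data frame `mini`; the column `honours` is likewise made up for illustration. It also shows indexing the column directly (`mini$honours[condition]`), which keeps working when no row matches, whereas the `dat[condition, ]$column <- value` form is expected to stop with an error in that situation.

```r
mini <- data.frame(english_grade = c(5.0, 5.5, 7.0, 8.0))

# ifelse() checks the condition per row and picks one of two labels
mini$pass_fail <- ifelse(mini$english_grade < 5.5, "FAIL", "PASS")

# Indexing the column directly also works when no row matches the condition
mini$honours <- "NO"
mini$honours[mini$english_grade >= 9] <- "YES"   # no grade is 9 or higher here
mini
```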

Visualization in `R`

- Many basic visualization options are available in `R`
- In this course, we will learn how to use the following functions (list not exhaustive):
- `barplot()` (illustrated in the following)
- `plot()`
- `boxplot()`
- `hist()`
- `qqnorm()` and `qqline()`

- The bar plot is used to visualize frequencies of categorical variables

```
(counts <- table(dat$gender)) # first create frequency table
```

```
#
# F M
# 222 93
```

```
barplot(counts)
```

- There are various graphical parameters to allow you to customize your graphics

```
barplot(counts, col = c("pink", "lightblue"), ylim = c(0, 250), main = "My barplot",
xlab = "Gender", ylab = "Frequency")
```

```
(counts <- table(dat$gender, dat$study))
```

```
#
# CIS IS LING OTHER
# F 82 18 93 29
# M 19 56 9 9
```

```
barplot(counts, col = c("pink", "lightblue"), legend = c("F", "M"), ylim = c(0, 105))
```
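
The plot above stacks the two genders within each bar. As a sketch of one possible variation (not shown in the original slides), `beside = TRUE` draws the bars next to each other, which can make the groups easier to compare. The counts are typed in by hand from the gender-by-study table above so that the example runs without the survey file.

```r
# Counts copied from the gender-by-study table above
counts <- matrix(c(82, 19, 18, 56, 93, 9, 29, 9), nrow = 2,
                 dimnames = list(c("F", "M"), c("CIS", "IS", "LING", "OTHER")))

# beside = TRUE draws the bars of each study side by side instead of stacked;
# legend.text = TRUE takes the legend labels from the row names
barplot(counts, beside = TRUE, col = c("pink", "lightblue"),
        legend.text = TRUE, ylim = c(0, 105),
        xlab = "Study", ylab = "Frequency")
```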

Statistics in `R`

- The main purpose of `R` is to conduct statistical analyses
- Many different functions to obtain descriptive and inferential statistics are available in `R`
- The following examples illustrate **how** to conduct some statistical analyses in `R`
- The **why** and **what** are covered in the next lectures

```
mean(dat$english_score) # mean of all people for English score
```

```
# [1] 7.347
```

```
mean(dat[dat$gender == "F", ]$english_score) # mean of women for English score
```

```
# [1] 7.2036
```

```
median(dat$english_score) # median of all people for English score
```

```
# [1] 7.3538
```

```
min(dat$english_score) # minimum value
```

```
# [1] 4.3
```

```
max(dat$english_score) # maximum value
```

```
# [1] 10
```

```
var(dat$english_score) # variance: average squared deviation from mean
```

```
# [1] 1.287
```

```
sd(dat$english_score) # standard deviation (square root of variance)
```

```
# [1] 1.1345
```
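
To make the link between variance and standard deviation concrete, the sketch below recomputes both by hand for the first six English scores from the `head(dat)` output. One detail worth noting: `var()` computes the sample variance, so the sum of squared deviations is divided by \(n - 1\) rather than \(n\). The names `var_by_hand` and `sd_by_hand` are only illustrative.

```r
# First six English scores, copied from the head(dat) output above
x <- c(4.3055, 6.4152, 8.1696, 6.9925, 6.1165, 7.3538)

n <- length(x)
dev_sq <- (x - mean(x))^2             # squared deviations from the mean
var_by_hand <- sum(dev_sq) / (n - 1)  # var() divides by n - 1, not n
sd_by_hand <- sqrt(var_by_hand)       # standard deviation: square root of variance

all.equal(var_by_hand, var(x))        # TRUE
all.equal(sd_by_hand, sd(x))          # TRUE
```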

```
table(dat$gender)
```

```
#
# F M
# 222 93
```

```
table(dat$study)
```

```
#
# CIS IS LING OTHER
# 101 74 102 38
```

```
table(dat$gender, dat$study)
```

```
#
# CIS IS LING OTHER
# F 82 18 93 29
# M 19 56 9 9
```

```
table(dat$gender, dat$bl_edu)
```

```
#
# N Y
# F 202 20
# M 83 10
```

- A large number of statistical inference functions are available in `R`
- In this course we will cover the following functions:
- `t.test()` for a \(t\)-test (single sample, paired, independent)
- `wilcox.test()` for non-parametric alternatives to the \(t\)-test (Mann-Whitney U test, Wilcoxon signed-rank test)
- `binom.test()` for the sign test
- `chisq.test()` for the chi-square test
- `cor()` for the correlation
- `alpha()` (from package `psych`) for Cronbach's \(\alpha\)
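
As a rough illustration of how one of these functions is called, the sketch below applies `chisq.test()` to the gender-by-study counts shown earlier. The counts are entered by hand; with the full data set one would presumably pass `table(dat$gender, dat$study)` instead. The interpretation of the result is a topic for a later lecture.

```r
# Counts copied from the gender-by-study table shown earlier
counts <- matrix(c(82, 19, 18, 56, 93, 9, 29, 9), nrow = 2,
                 dimnames = list(gender = c("F", "M"),
                                 study = c("CIS", "IS", "LING", "OTHER")))

res <- chisq.test(counts)   # chi-square test of independence
res$expected                # counts expected if gender and study were unrelated
res$p.value                 # p-value of the test
```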

- Assessing average group differences for a numerical variable (lecture 4)

```
t.test(english_grade ~ bl_edu, data = dat)
```

```
#
# Welch Two Sample t-test
#
# data: english_grade by bl_edu
# t = -3.56, df = 37.4, p-value = 0.001
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.80665 -0.22212
# sample estimates:
# mean in group N mean in group Y
# 7.2323 7.7467
```
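
The printed output can also be accessed piece by piece, which is useful when reporting results. The sketch below is an assumption-laden miniature of the analysis above: it uses only fourteen grades copied from the survey table shown earlier (the seven participants with `bl_edu` equal to `Y` among the first 65, and the first seven with `N`), so its numbers will differ from the output for the full data set.

```r
# English grades copied from the survey table shown earlier:
# the seven bl_edu == "Y" participants and the first seven with bl_edu == "N"
grade_y <- c(8, 7, 8, 8, 8, 7, 8)
grade_n <- c(6, 7, 8, 7, 7, 8, 7)

res <- t.test(grade_n, grade_y)  # Welch two-sample t-test (the default)
res$statistic                    # t value
res$parameter                    # degrees of freedom
res$p.value                      # p-value
res$conf.int                     # 95% confidence interval for the difference
```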

- Assessing strength of the relationship between two numerical variables (lecture 6)

```
cor(dat$english_score, dat$english_grade)
```

```
# [1] 0.74346
```
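
For readers who want to see what `cor()` computes, the sketch below reproduces the (Pearson) correlation by hand as the covariance divided by the product of the two standard deviations. It uses only the first six rows from the `head(dat)` output, so the value differs from the one above; `r_by_hand` is an illustrative name.

```r
# First six rows, copied from the head(dat) output shown earlier
grade <- c(6, 7, 8, 7, 7, 8)
score <- c(4.3055, 6.4152, 8.1696, 6.9925, 6.1165, 7.3538)

# Pearson correlation: covariance scaled by both standard deviations
r_by_hand <- cov(grade, score) / (sd(grade) * sd(score))
all.equal(r_by_hand, cor(grade, score))   # TRUE
```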

- In this lecture, we've covered the basics of `R`
- `R` as calculator
- Variables
- Functions and help
- Importing data in `R`

- Viewing and modifying data
- Some examples of visualizations
- Some examples of descriptive and inferential statistics

- See Levshina (Ch. 2) for more information about the functionality of `R`
- In the lab session, you will experiment with using `R`
- Next lecture: **Descriptive statistics**
- Evaluation: if at least 50 people evaluate the lecture, exam-like question visible

Thank you for your attention!