Statistiek I

Introduction to R

Martijn Wieling

This lecture

General information about the course
Introduction to dataset used in the course
Why use statistics?
Introduction to RStudio and R
- R as calculator
- Variables
- Functions and help
- Importing data in R
- Viewing and modifying data
- Visualization in R
- Statistics in R

General information about the course setup (1)

Important information (including study guide and FAQ) on Brightspace!
Teacher of lectures:
- Martijn Wieling, m.b.wieling@rug.nl, 1311.434
Teachers for lab sessions:
- Janiek de Rijke (Wed. 9-11), j.de.rijke.1@student.rug.nl
- Mijke van Daal (Wed. 13-15), m.a.h.van.daal@student.rug.nl
- Manon Kooning (Fri. 9-11), m.r.kooning@student.rug.nl
- Lourens Visser (Fri. 13-15), l.j.visser@rug.nl

General information about the course setup (2)

7 weekly lectures:
- Slides will be made available via Brightspace
- Interactive questions during lecture
7 weekly lab sessions
- You should have registered via Brightspace for one of the groups
  - If problems, contact secretary: sec-MILLC@rug.nl
- No switching when groups are full
- Attendance obligatory!
- Finishing lab exercises results in at most 1 bonus point
  - Only when final test score $T$ at least 5.0
  - Calculation: $T + 0.25 \times P\, \textrm{(max 2)} \times H\, \textrm{(max 2)}$
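
The bonus rule can be worked through as a small calculation. This is only a sketch: the slide does not define $P$ and $H$, so the function below simply treats them as two scores that are each capped at 2 (an assumption), and the names `final_grade`, `test_score`, `p` and `h` are illustrative.

```r
# Sketch of the bonus rule: T + 0.25 * P * H, applied only when T is at least 5.0.
# P and H are not defined on the slide; here they are just two scores capped at 2.
final_grade <- function(test_score, p, h) {
  bonus <- 0.25 * min(p, 2) * min(h, 2)  # at most 0.25 * 2 * 2 = 1 point
  if (test_score >= 5.0) test_score + bonus else test_score
}

final_grade(6.0, 2, 2)  # bonus applies: 6.0 + 1
final_grade(4.8, 2, 2)  # test score below 5.0: no bonus is added
```

With both scores at their maximum the bonus is exactly 1 point, which matches the "at most 1 bonus point" stated above.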

Language policy in this course

Lecture slides, book and lab sessions are in English. Why?
- Most statistics terminology you encounter will be in English
- Statistiek II will be completely in English (English teacher)
You may choose the language of the lab reports (either Dutch or English)
The final exam is in Dutch
- (but with English-Dutch translation of statistical terminology when necessary)

Some remarks about the lab sessions

Lab reports due one week after lab
Late or sub-optimal lab report: (max.) 1 point (out of 2)
Lab report over 1 week late or insufficient: 0 points
Exam requirement: average lab score of 1 point, with at most two times 0 points
Lab attendance required:
- You can miss at most 1 lab session, but timely lab report ($\geq$ 1 point) required
- If you miss 2 or more lab sessions: course failed
Why required lab attendance?
- Actively participating in lab session increases chances of passing the exam!
- Lab score correlates significantly ($r = .4, p < .001$) with exam grade
Lab score of last year may not be used (due to change in contents)

Lab reports and plagiarism

Lab sessions are individual, and lab reports are to be made individually
Questions about lab exercises should be addressed to lab teachers
It is never allowed to copy from another student or external source (= fraud)
Allowing another student to copy from you is likewise not allowed (= fraud)
Using ChatGPT (etc.) to provide answers to exercises is not allowed (= fraud)
If fraud is suspected, the board of examiners is always notified
- Sanctions include a fraud registration and usually removal from the course
Better safe than sorry!

Goals of this course

Understand basics of descriptive and inferential statistics
- Emphasis on statistical reasoning
- Practical approach, but some mathematics to help understand the concepts
Understand and apply basic statistical analyses
Report on results of statistical analyses
Understand reports using statistical analyses
Conduct basic statistical analyses in R

Statistics in Language Studies

Many experiments are conducted in communication, information science and linguistics
- Effect of comics vs. normal text on understanding?
- Effect of algorithm on quality of an automatically generated summary?
- Influence of sex on learning a second language?
Availability of online data increases opportunities for statistical analysis (e.g., Digital Humanities)
In this course we will mainly work with the data collected in the survey
- But other examples will be given as well

The dataset used during this course

Data on the basis of your answers to the initial survey
- Age, sex, handedness, study, etc.
- Information about your language history
- Information about your English use and (subjective) proficiency
We are interested in investigating which factors are related to English proficiency
- Measured by your English grade in high school
- And via an automatically calculated approximate measure of proficiency (English score) based on your input
Data also includes a subset of survey results from earlier years

Some of our data (1)

participant	year	sex	bl_edu	study	english_grade	english_score
1	2020	F	N	LING	6	5.19
2	2020	M	N	LING	7	6.82
3	2020	M	N	LING	8	8.21
4	2020	F	N	CIS	7	7.34
5	2020	F	N	LING	7	6.59
6	2020	F	N	LING	8	7.55
7	2020	F	N	LING	7	7.19
8	2020	F	Y	LING	8	7.63
9	2020	F	N	LING	6	6.58
10	2020	M	N	IS	8	8.89
11	2020	M	N	CIS	7	6.76

Some of our data (2)

participant	year	sex	bl_edu	study	english_grade	english_score
123	2021	M	N	LING	5.0	6.10
124	2021	F	N	CIS	6.0	6.67
125	2021	F	N	CIS	7.0	7.42
126	2021	F	N	LING	8.0	9.10
127	2021	F	N	CIS	7.0	7.47
128	2021	M	N	LING	8.4	8.14
129	2021	F	N	LING	8.0	7.65
130	2021	F	N	CIS	6.0	7.35
131	2021	F	N	LING	8.0	8.54
132	2021	M	N	IS	8.0	8.39
133	2021	F	N	LING	7.0	7.98

Some of our data (3)

participant	year	sex	bl_edu	study	english_grade	english_score
225	2022	M	N	IS	8	7.10
226	2022	F	N	OTHER	9	7.76
227	2022	F	N	OTHER	7	5.68
228	2022	F	N	CIS	7	7.31
229	2022	F	N	LING	7	7.95
230	2022	M	N	OTHER	7	7.51
231	2022	F	N	IS	7	6.97
232	2022	F	N	CIS	6	6.22
233	2022	M	N	OTHER	8	8.71
234	2022	F	N	LING	7	6.78
235	2022	F	N	CIS	6	5.94

Some of our data (4)

participant	year	sex	bl_edu	study	english_grade	english_score
320	2023	M	N	LING	8.0	9.02
321	2023	F	N	LING	8.0	7.44
322	2023	F	N	CIS	9.0	9.74
323	2023	F	N	CIS	7.0	9.06
324	2023	F	N	CIS	8.0	8.35
325	2023	F	N	LING	7.3	8.55
326	2023	F	N	CIS	6.0	6.51
327	2023	F	N	LING	7.0	7.87
328	2023	M	N	CIS	6.0	7.22
329	2023	F	N	LING	7.0	7.08
330	2023	F	N	OTHER	8.0	8.69

Some of our data (5)

participant	year	sex	bl_edu	study	english_grade	english_score
427	2024	M	N	IS	8.0	7.87
428	2024	F	N	OTHER	8.0	8.99
429	2024	F	N	OTHER	7.3	7.75
430	2024	F	N	OTHER	7.0	8.37
431	2024	F	N	LING	7.0	6.93
432	2024	M	N	IS	8.0	8.21
433	2024	F	N	LING	7.0	8.12
434	2024	F	N	LING	8.0	8.28
435	2024	F	N	LING	7.0	8.81
436	2024	F	N	IS	5.8	5.73
437	2024	F	N	LING	7.1	7.31

Question 1

Question 2

We use statistics because…

We would like to make sense of (in this course: your) data
For this we need to:
- Summarize the data (descriptive statistics)
- Assess relationships in our data (inferential statistics)
  - (During other courses, some of you will have already encountered inferential statistical tests, such as the $t$-test, which can be used for this)
A requirement is that the data vary (there must be variation)
Note that statistics is not mathematics (it’s data analysis)!

What you will learn during this course (1)

Descriptive statistics (describe data without conclusions)
- Measures of central tendency and spread
  - E.g.: mean English grade (7.3), and range of English grades (5 - 9.5)
- Visualization
  - E.g.: showing number of participants per study with a bar plot

What you will learn during this course (2)

Inferential statistics (link findings based on sample to population)
- Comparing 2 groups
  - E.g.: is the pronunciation of females better than that of males?
- Associations between 2 numerical variables
  - E.g.: is the proficiency in English dependent on age?
- Internal consistency of questions in a survey
  - E.g.: does a group of questions measure a single construct?
How to do statistics in R (this lecture)
- And how to make reproducible lab reports in R

What your lab reports will look like

Why do we use `R`?

Very nice reproducible lab reports (no copy-paste necessary!)
Other advantages of R compared to (e.g.,) SPSS
- Free for everybody
- Customizable: people can create their own statistical functions
- State-of-the-art statistical methods are integrated very quickly
Also some disadvantages:
- No substantial graphical user interface: typing instead of clicking
- Takes more time to learn
- State-of-the-art statistical methods sometimes contain bugs

Our tool: RStudio (frontend to `R`)

RStudio: quick overview

Basic functionality: `R` as calculator

# Addition (this is a comment: preceded by '#')
5 + 5

[1] 10

# Multiplication
5 * 3

[1] 15

# Division
5 / 3

[1] 1.6667
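
A few more calculator-style operations, as a small sketch that goes slightly beyond the three examples above. All operators and functions used here are part of base R.

```r
# Exponentiation
2^3

# Parentheses control the order of operations
(5 + 5) / 2

# Square root (sqrt is a built-in function)
sqrt(16)

# Remainder after division
7 %% 2
```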

Basic functionality: using variables

a <- 5 # store a single value; instead of "<-" you can also use "="
a # display the value

[1] 5

b <- a * a # b contains the value of multiplying a with itself
b

[1] 25

(d <- NA) # set value of d to missing (NA) and show value

[1] NA

Storing multiple values in a variable

b <- c(2,4,6,7,8) # store a series of values in a vector (reusing variable b)
b

[1] 2 4 6 7 8

b[4] <- a # assign value 5 (stored in 'a') to the 4th element of vector b
b

[1] 2 4 6 5 8

b <- c(b,NA) # add element NA to b
b

[1]  2  4  6  5  8 NA
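
Vectors can be used as a whole, not only element by element. The following sketch starts from the vector as it was before NA was added and shows a few common operations; which operations to show is a choice made here, not part of the original slide.

```r
b <- c(2, 4, 6, 5, 8)  # the vector from the example above, before NA was added

length(b)  # number of elements
b[2:4]     # elements 2 to 4
b * 2      # arithmetic is applied to every element
b[b > 4]   # only the elements larger than 4
```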

Question 3

Basic functionality: using functions

b # show values in variable b (b contains a vector: a series of values)

[1]  2  4  6  5  8 NA

mn <- mean(b) # calculating the mean and storing in variable mn
mn

[1] NA

# mn is NA (missing) as one of the values is missing
mean(b, na.rm = TRUE) # we can use the function parameter na.rm to ignore NAs

[1] 5

# But which parameters does a function have: use help!
help(mean) # alternatively: ?mean
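
The parameter na.rm is not specific to mean(). As a hedged illustration, the sketch below applies a few other base R functions to the same vector; the selection of functions is an addition for illustration.

```r
b <- c(2, 4, 6, 5, 8, NA)  # same vector as above

is.na(b)              # TRUE for each missing value
sum(is.na(b))         # number of missing values
sum(b, na.rm = TRUE)  # sum of the non-missing values
max(b, na.rm = TRUE)  # largest non-missing value
```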

Basic functionality: a help file

Question 4

Getting data into `R`: exporting a data set

Getting data into `R`: importing a data set

setwd('C:/Users/Martijn/Desktop/Statistiek-I/HC1') # set working directory
dat <- read.csv('survey.csv',sep=',',dec='.') # reads csv file from work dir
str(dat) # shows structure of the data frame (i.e. table is 2-dimensional)

'data.frame':   500 obs. of  7 variables:
 $ participant  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ year         : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
 $ sex          : chr  "F" "M" "M" "F" ...
 $ bl_edu       : chr  "N" "N" "N" "N" ...
 $ study        : chr  "LING" "LING" "LING" "CIS" ...
 $ english_grade: num  6 7 8 7 7 8 7 8 6 8 ...
 $ english_score: num  5.19 6.82 8.21 7.34 6.59 ...

dim(dat) # number of rows and columns of data set

[1] 500   7
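
To try the import step without the course file, the sketch below writes a tiny csv file to a temporary location and reads it back. The three rows are copied from the data slides; the names `csv_path` and `mini` are illustrative. Passing the full path to read.csv() is an alternative to calling setwd() first.

```r
# Write a tiny csv file to a temporary location (rows copied from the data slide)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "participant,year,sex,bl_edu,study,english_grade,english_score",
  "1,2020,F,N,LING,6,5.19",
  "2,2020,M,N,LING,7,6.82",
  "3,2020,M,N,LING,8,8.21"
), csv_path)

# Read it back using the full path, so no setwd() is needed
mini <- read.csv(csv_path, sep = ",", dec = ".")
str(mini)  # structure of the imported data frame
dim(mini)  # number of rows and columns
```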

Investigating our data: using `head`

head(dat) # show first few rows of dat

  participant year sex bl_edu study english_grade english_score
1           1 2020   F      N  LING             6        5.1902
2           2 2020   M      N  LING             7        6.8208
3           3 2020   M      N  LING             8        8.2118
4           4 2020   F      N   CIS             7        7.3397
5           5 2020   F      N  LING             7        6.5873
6           6 2020   F      N  LING             8        7.5489

Investigating our data: RStudio viewer

Accessing table data using numbers

Access parts of table by specifying row and/or column numbers
dat[a,b]:
- a indicates the selected rows of dat
- b indicates the selected columns of dat

dat[1,] # values in first row (dat[,1]: values in first column)

  participant year sex bl_edu study english_grade english_score
1           1 2020   F      N  LING             6        5.1902

dat[c(1,5),c(1,2,3)] # values in rows 1 and 5 and columns 1, 2 and 3

  participant year sex
1           1 2020   F
5           5 2020   F

Accessing table data using names (1)

Additionally, we can access parts of the table by specifying the names of the columns we want to look at

dat[c(1,3,5),c("participant","study")] # rows 1, 3 and 5, and 2 named columns

  participant study
1           1  LING
3           3  LING
5           5  LING

Accessing table data using names (2)

We may also select a single column by its name (not nr.) using the $ operator
- E.g., dat$sex accesses the column sex of dat

head(dat$sex,200) # show sex of first 200 students

  [1] "F" "M" "M" "F" "F" "F" "F" "F" "F" "M" "M" "F" "M" "F" "F" "F" "F" "F"
 [19] "F" "F" "F" "F" "F" "F" "M" "F" "M" "M" "F" "F" "F" "F" "F" "F" "F" "F"
 [37] "M" "M" "F" "F" "F" "M" "M" "F" "F" "F" "M" "M" "M" "F" "F" "F" "M" "M"
 [55] "F" "F" "F" "F" "F" "F" "F" "F" "M" "F" "M" "F" "F" "F" "M" "F" "F" "M"
 [73] "M" "M" "M" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "M" "F" "F" "M"
 [91] "M" "F" "M" "F" "M" "M" "F" "F" "F" "M" "F" "F" "F" "F" "M" "M" "F" "F"
[109] "F" "F" "M" "F" "F" "F" "M" "F" "F" "M" "M" "M" "M" "M" "M" "F" "F" "F"
[127] "F" "M" "F" "F" "F" "M" "F" "M" "F" "M" "M" "F" "M" "F" "F" "M" "M" "F"
[145] "F" "M" "F" "F" "F" "F" "F" "F" "F" "M" "F" "F" "F" "F" "M" "M" "F" "F"
[163] "F" "F" "F" "F" "F" "M" "M" "F" "F" "F" "M" "F" "F" "M" "F" "F" "F" "F"
[181] "F" "F" "F" "F" "F" "F" "F" "M" "F" "F" "M" "F" "F" "F" "F" "F" "F" "F"
[199] "F" "F"

Storing accessed table data

tmp <- dat[5:8,c(1,3)] # store columns 1 and 3 for rows 5 to 8 in variable tmp 
tmp # show what is stored in variable tmp

  participant sex
5           5   F
6           6   F
7           7   F
8           8   F
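
The different ways of accessing table data can be tried on a small stand-in for dat. The data frame `mini` below is illustrative and contains the first six rows of the data slide, with fewer columns.

```r
# Small stand-in for dat (first rows of the data slide)
mini <- data.frame(
  participant = 1:6,
  sex = c("F", "M", "M", "F", "F", "F"),
  study = c("LING", "LING", "LING", "CIS", "LING", "LING"),
  english_grade = c(6, 7, 8, 7, 7, 8)
)

mini[2, ]                     # one row: still a data frame
mini[, 1]                     # one column: a plain vector
mini[2:3, c("sex", "study")]  # rows 2 and 3 of two named columns
mini$english_grade[4]         # 4th value of a column selected with $
```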

Question 5

Accessing table data using conditional indexing (1)

Conditional indexing allows us to select parts of the data on the basis of conditions

tmp <- dat[dat$sex == 'M',] # only observations for male participants
head(tmp)

   participant year sex bl_edu study english_grade english_score
2            2 2020   M      N  LING             7        6.8208
3            3 2020   M      N  LING             8        8.2118
10          10 2020   M      N    IS             8        8.8922
11          11 2020   M      N   CIS             7        6.7571
13          13 2020   M      N   CIS             6        6.3324
25          25 2020   M      N OTHER             9        8.3452

Accessing table data using conditional indexing (2)

Methods to combine conditions:
- and: &
- or: |

# only participants who study IS *and* are male
tmp <- dat[dat$sex == 'M' & dat$study == 'IS',] 
head(tmp)

   participant year sex bl_edu study english_grade english_score
10          10 2020   M      N    IS           8.0        8.8922
27          27 2020   M      N    IS           8.0        8.9217
28          28 2020   M      N    IS           7.0        8.0216
37          37 2020   M      N    IS           8.1        8.6534
43          43 2020   M      N    IS           6.0        6.6602
47          47 2020   M      N    IS           9.0        8.9312

Accessing table data using conditional indexing (3)

Invert a condition with ! (not)
- is not equal to: !=

# only females (i.e. not males) *or* everybody with an English grade over 7
tmp <- dat[dat$sex != 'M' | dat$english_grade > 7,] 
tail(tmp) # tail shows final 6 rows

    participant year sex bl_edu study english_grade english_score
494         494 2024   F      N  LING           5.8        5.1720
495         495 2024   F      N  LING           7.0        8.0231
496         496 2024   M      N    IS           8.0        7.5441
497         497 2024   F      N  LING           6.0        7.1884
498         498 2024   F      N  LING           6.5        6.4241
499         499 2024   M      N    IS           9.0        9.5693
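
It may help to look at what a condition actually produces. In the sketch below (with an illustrative data frame `mini` based on the first six rows of the data slide), the condition is stored in a separate variable `is_male` before it is used for indexing.

```r
mini <- data.frame(
  participant = 1:6,
  sex = c("F", "M", "M", "F", "F", "F"),
  english_grade = c(6, 7, 8, 7, 7, 8)
)

is_male <- mini$sex == "M"  # a logical vector: one TRUE/FALSE per row
is_male
sum(is_male)                # TRUE counts as 1, so this is the number of males

mini[is_male, ]                                 # rows where the condition is TRUE
mini[!is_male & mini$english_grade > 7, ]       # not male *and* grade over 7
nrow(mini[is_male | mini$english_grade > 7, ])  # number of rows: male *or* grade over 7
```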

Question 6

Supplementing the data: adding columns (1)

Frequently, you need to add columns to the data, for example by computing a new value on the basis of the values in two columns
- the operator $ helps us to do that

# new column 'diff': English grade - English proficiency score
dat$diff <- dat$english_grade - dat$english_score 
head(dat)

  participant year sex bl_edu study english_grade english_score     diff
1           1 2020   F      N  LING             6        5.1902  0.80976
2           2 2020   M      N  LING             7        6.8208  0.17917
3           3 2020   M      N  LING             8        8.2118 -0.21182
4           4 2020   F      N   CIS             7        7.3397 -0.33970
5           5 2020   F      N  LING             7        6.5873  0.41273
6           6 2020   F      N  LING             8        7.5489  0.45106

Supplementing the data: adding columns (2)

Conditional indexing allows us additional flexibility

dat$pass_fail <- 'PASS' # new column, initially PASS for everybody
dat[dat$english_grade < 5.5,]$pass_fail <- 'FAIL' # if grade too low, then FAIL
tail(dat[dat$english_grade > 4 & dat$english_grade < 6, 2:9]) # show subset of data

    year sex bl_edu study english_grade english_score      diff pass_fail
392 2023   F      N   CIS           5.6        5.9877 -0.387718      PASS
436 2024   F      N    IS           5.8        5.7252  0.074803      PASS
454 2024   F      N  LING           5.0        6.1166 -1.116598      FAIL
468 2024   F      Y   CIS           5.0        4.3000  0.700000      FAIL
490 2024   F      N  LING           5.8        6.0576 -0.257642      PASS
494 2024   F      N  LING           5.8        5.1720  0.627971      PASS
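
An alternative way to create such a column in one step is the base R function ifelse(). This is a sketch of a different technique than the one shown above, using an illustrative data frame `mini` with hypothetical grades.

```r
mini <- data.frame(
  participant = 1:4,
  english_grade = c(5.0, 5.6, 7.0, 5.4)  # hypothetical grades
)

# ifelse(condition, value if TRUE, value if FALSE) works on the whole column
mini$pass_fail <- ifelse(mini$english_grade < 5.5, "FAIL", "PASS")
mini
table(mini$pass_fail)  # how many participants in each category
```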

Question 7

Visualization in `R`

Many basic visualization options are available in R
In this course, we will learn how to use the following functions (list not exhaustive):
- barplot() (illustrated in the following)
- plot()
- boxplot()
- hist()
- qqnorm() and qqline()

Example of using a plotting function: barplot()

The bar plot is used to visualize frequencies of categorical variables

(counts <- table(dat$sex)) # first create frequency table


  F   M 
346 154

barplot(counts)

Graphical parameters

There are various graphical parameters to allow you to customize your graphics

barplot(counts, col = c("pink", "lightblue"), ylim = c(0, 350), main = "My barplot", 
    xlab = "Sex", ylab = "Frequency")

Question 8

Another example: segmented bar plot

(counts <- table(dat$sex,dat$study))

   
    CIS  IS LING OTHER
  F  99  26  179    42
  M  24  98   16    16

barplot(counts, col=c("pink","lightblue"), legend.text = c('F','M'), ylim=c(0,200)) # ylim covers the tallest stacked bar (LING: 195)
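
The same counts can also be shown with the bars next to each other instead of stacked. The sketch below enters the cross-table from this slide by hand, so it runs without the survey file; using `beside = TRUE` is a variation added here for illustration.

```r
# Cross-table from the slide above, entered by hand (filled column by column)
counts <- matrix(c(99, 24, 26, 98, 179, 16, 42, 16), nrow = 2,
                 dimnames = list(c("F", "M"), c("CIS", "IS", "LING", "OTHER")))

# beside = TRUE puts the bars for F and M next to each other per study
barplot(counts, beside = TRUE, col = c("pink", "lightblue"),
        legend.text = TRUE, ylim = c(0, 200),
        xlab = "Study", ylab = "Frequency")
```

With `legend.text = TRUE`, the row names of the matrix (F and M) are used as legend labels.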

Statistics in `R`

The main purpose of R is to conduct statistical analyses
Many different functions to obtain descriptive and inferential statistics are available in R
The following examples illustrate how to conduct some statistical analyses in R
- The why and what is covered in the next lectures

Descriptive statistics: central tendency

mean(dat$english_score) # mean of all people for English score

[1] 7.6178

mean(dat[dat$sex == 'F',]$english_score) # mean of females for English score

[1] 7.5342

median(dat$english_score) # median of all people for English score

[1] 7.6377
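
To obtain the mean for every group at once, base R offers tapply(). The sketch below uses an illustrative data frame `mini` with the first six rows of the data slide; tapply() is an addition to the functions shown above.

```r
mini <- data.frame(
  sex = c("F", "M", "M", "F", "F", "F"),
  english_score = c(5.19, 6.82, 8.21, 7.34, 6.59, 7.55)
)

mean(mini$english_score)                     # mean of all six participants
mean(mini[mini$sex == "F", ]$english_score)  # mean of the females only
tapply(mini$english_score, mini$sex, mean)   # mean per group in one call
```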

Descriptive statistics: spread

min(dat$english_score) # minimum value

[1] 4.3

max(dat$english_score) # maximum value

[1] 9.7421

var(dat$english_score) # variance: average squared deviation from mean (dividing by n - 1)

[1] 0.85021

sd(dat$english_score) # standard deviation (square root of variance)

[1] 0.92207
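
The relation between variance and standard deviation can be checked by computing both by hand. The sketch below uses the English grades of the first six participants from the data slide; the variable names `dev`, `my_var` and `my_sd` are illustrative.

```r
x <- c(6, 7, 8, 7, 7, 8)  # English grades of the first six participants

n <- length(x)
dev <- x - mean(x)              # deviation of each value from the mean
my_var <- sum(dev^2) / (n - 1)  # sample variance divides by n - 1, not n
my_sd <- sqrt(my_var)           # standard deviation: square root of the variance

my_var
var(x)  # same value as my_var
my_sd
sd(x)   # same value as my_sd
```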

Descriptive statistics: frequency tables

table(dat$sex)


  F   M 
346 154

table(dat$study)


  CIS    IS  LING OTHER 
  123   124   195    58

Question 9

Descriptive statistics: cross-tables

table(dat$sex,dat$study)

   
    CIS  IS LING OTHER
  F  99  26  179    42
  M  24  98   16    16

table(dat$sex,dat$bl_edu)
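
Cross-tables are often easier to read with totals or proportions added. The sketch below enters the sex by study table shown above by hand and uses the base R functions addmargins() and prop.table(); both are additions to what the slide covers.

```r
# Cross-table of sex by study, entered by hand (filled column by column)
counts <- as.table(matrix(c(99, 24, 26, 98, 179, 16, 42, 16), nrow = 2,
                          dimnames = list(sex = c("F", "M"),
                                          study = c("CIS", "IS", "LING", "OTHER"))))

addmargins(counts)               # add row and column totals
round(prop.table(counts, 2), 2)  # proportions within each study (columns sum to 1)
```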

Inferential statistics

A large number of statistical inference functions are available in R
In this course we will cover the following functions:
- cor() for the correlation
- lm() for linear regression
- glm() for logistic regression
- alpha() (from package psych) for Cronbach’s $\alpha$
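
As a first impression of one of these functions, the sketch below applies cor() to the English grade and English score of participants 1 to 11 from the first data slide. With so few observations the resulting value is only an illustration, not a result for the full dataset.

```r
# English grade and English score of participants 1-11 (from the first data slide)
english_grade <- c(6, 7, 8, 7, 7, 8, 7, 8, 6, 8, 7)
english_score <- c(5.19, 6.82, 8.21, 7.34, 6.59, 7.55, 7.19, 7.63, 6.58, 8.89, 6.76)

r <- cor(english_grade, english_score)  # Pearson correlation (the default)
r
plot(english_grade, english_score)      # scatter plot of the same two variables
```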

Example of inferential statistics: regression

Assessing average group differences for a numerical variable (lecture 4)

summary(lm(english_grade ~ bl_edu, data=dat))


Call:
lm(formula = english_grade ~ bl_edu, data = dat)

Residuals:
   Min     1Q Median     3Q    Max 
-2.640 -0.246 -0.246  0.754  2.154 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.2457     0.0404  179.28   <2e-16 ***
bl_eduY       0.3947     0.1318    2.99   0.0029 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.86 on 498 degrees of freedom
Multiple R-squared:  0.0177,    Adjusted R-squared:  0.0157 
F-statistic: 8.97 on 1 and 498 DF,  p-value: 0.00289
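
With a two-level predictor, the coefficients of lm() correspond to group means: the intercept is the mean of the reference group and the second coefficient is the difference between the two group means. The sketch below shows this on a hypothetical mini data set (the six grades are made up for illustration and are not from the survey).

```r
# Hypothetical mini data set: a grade and a two-level grouping variable (N/Y)
mini <- data.frame(
  bl_edu = c("N", "N", "N", "N", "Y", "Y"),
  english_grade = c(6, 7, 7, 8, 8, 9)
)

model <- lm(english_grade ~ bl_edu, data = mini)
coef(model)  # intercept: mean of group N; bl_eduY: mean of Y minus mean of N

# Compare with the group means
tapply(mini$english_grade, mini$bl_edu, mean)
```

Read in the same way, the output above gives an estimated mean grade of 7.2457 for group N and an estimated difference of 0.3947 for group Y.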

Recap

In this lecture, we’ve covered the basics of R
- R as calculator
- Variables
- Functions and help
- Importing data in R
- Viewing and modifying data
- Some examples of visualizations
- Some examples of descriptive and inferential statistics
See Winter (Ch. 1 and 2) for more information about the functionality of R
In the lab session, you will experiment with using R
Next lecture: Descriptive statistics
Evaluation: if at least 50 people evaluate the lecture, an exam-like question will be made visible

Please evaluate this lecture!

Exam question

Questions?

Thank you for your attention!

https://www.martijnwieling.nl

m.b.wieling@rug.nl
