Statistiek I

Introduction to R

Martijn Wieling

This lecture

  • General information about the course
  • Introduction to dataset used in the course
  • Why use statistics?
  • Introduction to RStudio and R
    • R as calculator
    • Variables
    • Functions and help
    • Importing data in R
    • Viewing and modifying data
    • Visualization in R
    • Statistics in R

General information about the course setup (1)

General information about the course setup (2)

  • 7 weekly lectures:
    • Slides will be made available via Brightspace
    • Interactive questions during lecture
  • 7 weekly lab sessions
    • You should have registered via Brightspace for one of the groups
    • No switching when groups are full
    • Attendance obligatory!
    • Finishing lab exercises results in at most 1 bonus point
      • Only when final test score \(T\) at least 5.0
      • Calculation: \(T + 0.25 \times P\, \textrm{(max 2)} \times H\, \textrm{(max 2)}\)

Language policy in this course

  • Lecture slides, book and lab sessions are in English. Why?
    • Most statistics terminology you encounter will be in English
    • Statistiek II will be completely in English (English teacher)
  • You may choose the language of the lab reports (either Dutch or English)
  • The final exam is in Dutch
    • (but with English-Dutch translation of statistical terminology when necessary)

Some remarks about the lab sessions

  • Lab reports due one week after lab
  • Late or sub-optimal lab report: (max.) 1 point (out of 2)
  • Lab report over 1 week late or insufficient: 0 points
  • Exam requirement: average lab score of 1 point, with at most two times 0 points
  • Lab attendance required:
    • You can miss at most 1 lab session, but timely lab report (\(\geq\) 1 point) required
    • If you miss 2 or more lab sessions: course failed
  • Why required lab attendance?
    • Actively participating in lab session increases chances of passing the exam!
    • Lab score correlates significantly (\(r = .4, p < .001\)) with exam grade
  • Lab score of last year may not be used (due to change in contents)

Lab reports and plagiarism

  • Lab sessions are individual, and lab reports are to be made individually
  • Questions about lab exercises should be addressed to lab teachers
  • It is never allowed to copy from another student or external source (= fraud)
  • Allowing another student to copy from you is likewise not allowed (= fraud)
  • Using ChatGPT (etc.) to provide answers to exercises is not allowed (= fraud)
  • If fraud is suspected, the board of examiners is always notified
    • Sanctions include a fraud registration and usually removal from the course
  • Better safe than sorry!

Goals of this course

  • Understand basics of descriptive and inferential statistics
    • Emphasis on statistical reasoning
    • Practical approach, but some mathematics to help understand the concepts
  • Understand and apply basic statistical analyses
  • Report on results of statistical analyses
  • Understand reports using statistical analyses
  • Conduct basic statistical analyses in R

Statistics in Language Studies

  • Many experiments are conducted in communication, information science and linguistics
    • Effect of comics vs. normal text on understanding?
    • Effect of algorithm on quality of an automatically generated summary?
    • Influence of sex on learning a second language?
  • Availability of online data increases opportunities for statistical analysis (e.g., Digital Humanities)
  • In this course we will mainly work with the data collected in the survey
    • But other examples will be given as well

The dataset used during this course

  • Data on the basis of your answers to the initial survey
    • Age, sex, handedness, study, etc.
    • Information about your language history
    • Information about your English use and (subjective) proficiency
  • We are interested in investigating which factors are related to English proficiency
    • Measured by your English grade in high school
    • And via an automatically calculated approximate measure of proficiency (English score) based on your input
  • Data also includes a subset of survey results from earlier years

Some of our data (1)

participant year sex bl_edu study english_grade english_score
1 2020 F N LING 6 5.19
2 2020 M N LING 7 6.82
3 2020 M N LING 8 8.21
4 2020 F N CIS 7 7.34
5 2020 F N LING 7 6.59
6 2020 F N LING 8 7.55
7 2020 F N LING 7 7.19
8 2020 F Y LING 8 7.63
9 2020 F N LING 6 6.58
10 2020 M N IS 8 8.89
11 2020 M N CIS 7 6.76

Some of our data (2)

participant year sex bl_edu study english_grade english_score
123 2021 M N LING 5.0 6.10
124 2021 F N CIS 6.0 6.67
125 2021 F N CIS 7.0 7.42
126 2021 F N LING 8.0 9.10
127 2021 F N CIS 7.0 7.47
128 2021 M N LING 8.4 8.14
129 2021 F N LING 8.0 7.65
130 2021 F N CIS 6.0 7.35
131 2021 F N LING 8.0 8.54
132 2021 M N IS 8.0 8.39
133 2021 F N LING 7.0 7.98

Some of our data (3)

participant year sex bl_edu study english_grade english_score
225 2022 M N IS 8 7.10
226 2022 F N OTHER 9 7.76
227 2022 F N OTHER 7 5.68
228 2022 F N CIS 7 7.31
229 2022 F N LING 7 7.95
230 2022 M N OTHER 7 7.51
231 2022 F N IS 7 6.97
232 2022 F N CIS 6 6.22
233 2022 M N OTHER 8 8.71
234 2022 F N LING 7 6.78
235 2022 F N CIS 6 5.94

Some of our data (4)

participant year sex bl_edu study english_grade english_score
320 2023 M N LING 8.0 9.02
321 2023 F N LING 8.0 7.44
322 2023 F N CIS 9.0 9.74
323 2023 F N CIS 7.0 9.06
324 2023 F N CIS 8.0 8.35
325 2023 F N LING 7.3 8.55
326 2023 F N CIS 6.0 6.51
327 2023 F N LING 7.0 7.87
328 2023 M N CIS 6.0 7.22
329 2023 F N LING 7.0 7.08
330 2023 F N OTHER 8.0 8.69

Some of our data (5)

participant year sex bl_edu study english_grade english_score
427 2024 M N IS 8.0 7.87
428 2024 F N OTHER 8.0 8.99
429 2024 F N OTHER 7.3 7.75
430 2024 F N OTHER 7.0 8.37
431 2024 F N LING 7.0 6.93
432 2024 M N IS 8.0 8.21
433 2024 F N LING 7.0 8.12
434 2024 F N LING 8.0 8.28
435 2024 F N LING 7.0 8.81
436 2024 F N IS 5.8 5.73
437 2024 F N LING 7.1 7.31

Question 1

Question 2

We use statistics because…

  • We would like to make sense of (in this course: your) data
  • For this we need to:
    • Summarize the data (descriptive statistics)
    • Assess relationships in our data (inferential statistics)
      • (During other courses, some of you will have already encountered inferential statistical tests, such as the \(t\)-test, which can be used for this)
  • A requirement is that the data vary (without variation there is nothing to analyze)
  • Note that statistics is not mathematics (it’s data analysis)!

What you will learn during this course (1)

  • Descriptive statistics (describe data without conclusions)
    • Measures of central tendency and spread

      • E.g.: mean English grade (7.3), and range of English grades (5 - 9.5)
    • Visualization

      • E.g.: showing number of participants per study with a bar plot

What you will learn during this course (2)

  • Inferential statistics (link findings based on sample to population)
    • Comparing 2 groups
      • E.g.: is the pronunciation of females better than that of males?
    • Associations between 2 numerical variables
      • E.g.: is the proficiency in English dependent on age?
    • Internal consistency of questions in a survey
      • E.g.: does a group of questions measure a single construct?
  • How to do statistics in R (this lecture)
    • And how to make reproducible lab reports in R

What your lab reports will look like

Why do we use R?

  • Very nice reproducible lab reports (no copy-paste necessary!)
  • Other advantages of R compared to (e.g.,) SPSS
    • Free for everybody
    • Customizable: people can create their own statistical functions
    • State-of-the-art statistical methods are integrated very quickly
  • Also some disadvantages:
    • No substantial graphical user interface: typing instead of clicking
    • Takes more time to learn
    • State-of-the-art statistical methods sometimes contain bugs

Our tool: RStudio (frontend to R)

RStudio: quick overview

Basic functionality: R as calculator

# Addition (this is a comment: preceded by '#')
5 + 5
[1] 10
# Multiplication
5 * 3
[1] 15
# Division
5 / 3
[1] 1.6667
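
A few more calculator operations, as a minimal sketch (the variable names are only illustrative and do not come from the slides). Note that multiplication is evaluated before addition unless parentheses say otherwise.

```r
# Exponentiation, integer division and remainder
power     <- 2 ^ 3      # 2 to the power 3
int_div   <- 7 %/% 2    # integer division
remainder <- 7 %% 2     # remainder after integer division

# Order of evaluation: '*' binds more strongly than '+'
no_parens   <- 5 + 5 * 3     # evaluated as 5 + (5 * 3)
with_parens <- (5 + 5) * 3   # parentheses are evaluated first

# Built-in mathematical functions are also available
root <- sqrt(16)

c(power, int_div, remainder, no_parens, with_parens, root)
```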

Basic functionality: using variables

a <- 5 # store a single value; instead of "<-" you can also use "="
a # display the value
[1] 5
b <- a * a # b contains the value of multiplying a with itself
b
[1] 25
(d <- NA) # set value of d to missing (NA) and show value
[1] NA

Storing multiple values in a variable

b <- c(2,4,6,7,8) # store a series of values in a vector (reusing variable b)
b
[1] 2 4 6 7 8
b[4] <- a # assign value 5 (stored in 'a') to the 4th element of vector b
b
[1] 2 4 6 5 8
b <- c(b,NA) # add element NA to b
b
[1]  2  4  6  5  8 NA
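
Operations on a vector are applied to every element at once. The sketch below reuses the values of b shown above under a new, illustrative name (v), so that it can be run on its own.

```r
v <- c(2, 4, 6, 5, 8, NA)    # same values as vector b above

n_elements <- length(v)      # number of elements, the NA included
doubled    <- v * 2          # arithmetic is applied element by element
missing    <- is.na(v)       # TRUE for every missing element

v_complete <- v[!is.na(v)]   # keep only the non-missing elements
total      <- sum(v_complete)

n_elements
doubled
missing
total
```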

Question 3

Basic functionality: using functions

b # show values in variable b (b contains a vector: a list of values)
[1]  2  4  6  5  8 NA
mn <- mean(b) # calculating the mean and storing in variable mn
mn 
[1] NA
# mn is NA (missing) as one of the values is missing
mean(b, na.rm = TRUE) # we can use the function parameter na.rm to ignore NAs
[1] 5
# But which parameters does a function have: use help!
help(mean) # alternatively: ?mean
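
The na.rm parameter is not specific to mean(): several other summary functions accept it as well. A short sketch, again using the values of vector b (the result names are illustrative):

```r
b <- c(2, 4, 6, 5, 8, NA)

sum_with_na <- sum(b)                   # NA, because one value is missing
sum_no_na   <- sum(b, na.rm = TRUE)     # missing value is ignored
max_no_na   <- max(b, na.rm = TRUE)     # largest non-missing value

# Parameters can be specified by name; 'digits' is a parameter of round()
rounded <- round(5 / 3, digits = 2)

sum_with_na
sum_no_na
max_no_na
rounded
```

The help page of each function (e.g., ?sum or ?round) lists which parameters it accepts.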

Basic functionality: a help file

Question 4

Getting data into R: exporting a data set

Getting data into R: importing a data set

setwd('C:/Users/Martijn/Desktop/Statistiek-I/HC1') # set working directory
dat <- read.csv('survey.csv',sep=',',dec='.') # reads csv file from work dir
str(dat) # shows structure of the data frame (i.e. table is 2-dimensional)
'data.frame':   500 obs. of  7 variables:
 $ participant  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ year         : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
 $ sex          : chr  "F" "M" "M" "F" ...
 $ bl_edu       : chr  "N" "N" "N" "N" ...
 $ study        : chr  "LING" "LING" "LING" "CIS" ...
 $ english_grade: num  6 7 8 7 7 8 7 8 6 8 ...
 $ english_score: num  5.19 6.82 8.21 7.34 6.59 ...
dim(dat) # number of rows and columns of data set 
[1] 500   7
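
To try read.csv() without the survey file, the sketch below first writes a tiny, made-up CSV file (three hypothetical rows, not the course data) to a temporary location and then imports it in the same way.

```r
# Write a small, made-up CSV file to a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("participant,sex,english_grade",
             "1,F,6.5",
             "2,M,7",
             "3,F,8"), csv_file)

# Import it: 'sep' is the column separator, 'dec' the decimal mark
toy <- read.csv(csv_file, sep = ",", dec = ".")

str(toy)    # structure: one line per column
dim(toy)    # number of rows and columns
names(toy)  # column names, taken from the first line of the file
```

Using a full path (as returned by tempfile()) makes the call independent of the working directory.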

Investigating our data: using head

head(dat) # show first few rows of dat 
  participant year sex bl_edu study english_grade english_score
1           1 2020   F      N  LING             6        5.1902
2           2 2020   M      N  LING             7        6.8208
3           3 2020   M      N  LING             8        8.2118
4           4 2020   F      N   CIS             7        7.3397
5           5 2020   F      N  LING             7        6.5873
6           6 2020   F      N  LING             8        7.5489

Investigating our data: RStudio viewer

Accessing table data using numbers

  • Access parts of table by specifying row and/or column numbers

  • dat[a,b]:

    • a indicates the selected rows of dat
    • b indicates the selected columns of dat
dat[1,] # values in first row (dat[,1]: values in first column)
  participant year sex bl_edu study english_grade english_score
1           1 2020   F      N  LING             6        5.1902
dat[c(1,5),c(1,2,3)] # values in rows 1 and 5 and columns 1, 2 and 3
  participant year sex
1           1 2020   F
5           5 2020   F
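
The same indexing rules can be tried on a small data frame. The sketch below builds a made-up table (called toy here; it is not the survey data) and also shows that a negative number excludes a row.

```r
# Made-up data frame for illustration only
toy <- data.frame(participant   = 1:5,
                  sex           = c("F", "M", "M", "F", "F"),
                  study         = c("LING", "LING", "IS", "CIS", "LING"),
                  english_grade = c(6, 7, 8, 7, 7))

toy[2, ]                    # row 2, all columns
toy[, 4]                    # column 4, returned as a vector
sub  <- toy[2:4, c(1, 4)]   # rows 2 to 4, columns 1 and 4
rest <- toy[-1, ]           # all rows except row 1
cell <- toy[3, 4]           # single value: row 3, column 4

sub
rest
cell
```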

Accessing table data using names (1)

  • Additionally, we can access parts of the table by specifying the names of the columns we want to look at
dat[c(1,3,5),c("participant","study")] # rows 1, 3 and 5, and 2 named columns
  participant study
1           1  LING
3           3  LING
5           5  LING

Accessing table data using names (2)

  • We may also select a single column by its name (not nr.) using the $ operator
    • E.g., dat$sex accesses the column sex of dat
head(dat$sex,200) # show sex of first 200 students
  [1] "F" "M" "M" "F" "F" "F" "F" "F" "F" "M" "M" "F" "M" "F" "F" "F" "F" "F"
 [19] "F" "F" "F" "F" "F" "F" "M" "F" "M" "M" "F" "F" "F" "F" "F" "F" "F" "F"
 [37] "M" "M" "F" "F" "F" "M" "M" "F" "F" "F" "M" "M" "M" "F" "F" "F" "M" "M"
 [55] "F" "F" "F" "F" "F" "F" "F" "F" "M" "F" "M" "F" "F" "F" "M" "F" "F" "M"
 [73] "M" "M" "M" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "M" "F" "F" "M"
 [91] "M" "F" "M" "F" "M" "M" "F" "F" "F" "M" "F" "F" "F" "F" "M" "M" "F" "F"
[109] "F" "F" "M" "F" "F" "F" "M" "F" "F" "M" "M" "M" "M" "M" "M" "F" "F" "F"
[127] "F" "M" "F" "F" "F" "M" "F" "M" "F" "M" "M" "F" "M" "F" "F" "M" "M" "F"
[145] "F" "M" "F" "F" "F" "F" "F" "F" "F" "M" "F" "F" "F" "F" "M" "M" "F" "F"
[163] "F" "F" "F" "F" "F" "M" "M" "F" "F" "F" "M" "F" "F" "M" "F" "F" "F" "F"
[181] "F" "F" "F" "F" "F" "F" "F" "M" "F" "F" "M" "F" "F" "F" "F" "F" "F" "F"
[199] "F" "F"
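
There are several equivalent ways to extract one column by name. A minimal sketch on a made-up data frame (toy and the result names are illustrative):

```r
# Made-up data frame for illustration only
toy <- data.frame(participant   = 1:3,
                  study         = c("LING", "CIS", "IS"),
                  english_grade = c(6, 7, 8))

names(toy)                     # available column names

by_dollar   <- toy$study       # using the $ operator
by_name     <- toy[, "study"]  # using the column name between brackets
by_brackets <- toy[["study"]]  # using double brackets

identical(by_dollar, by_name)      # same result
identical(by_dollar, by_brackets)  # same result
```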

Storing accessed table data

tmp <- dat[5:8,c(1,3)] # store columns 1 and 3 for rows 5 to 8 in variable tmp 
tmp # show what is stored in variable tmp
  participant sex
5           5   F
6           6   F
7           7   F
8           8   F

Question 5

Accessing table data using conditional indexing (1)

  • Conditional indexing allows us to select parts of the data on the basis of conditions
tmp <- dat[dat$sex == 'M',] # only observations for male participants
head(tmp)
   participant year sex bl_edu study english_grade english_score
2            2 2020   M      N  LING             7        6.8208
3            3 2020   M      N  LING             8        8.2118
10          10 2020   M      N    IS             8        8.8922
11          11 2020   M      N   CIS             7        6.7571
13          13 2020   M      N   CIS             6        6.3324
25          25 2020   M      N OTHER             9        8.3452

Accessing table data using conditional indexing (2)

  • Methods to combine conditions:
    • and: &
    • or: |
# only participants who study IS *and* are male
tmp <- dat[dat$sex == 'M' & dat$study == 'IS',] 
head(tmp)
   participant year sex bl_edu study english_grade english_score
10          10 2020   M      N    IS           8.0        8.8922
27          27 2020   M      N    IS           8.0        8.9217
28          28 2020   M      N    IS           7.0        8.0216
37          37 2020   M      N    IS           8.1        8.6534
43          43 2020   M      N    IS           6.0        6.6602
47          47 2020   M      N    IS           9.0        8.9312
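
Behind the scenes, each condition produces a vector of TRUE and FALSE values (one per row), and only the rows with TRUE are kept. The sketch below makes these intermediate vectors visible on a made-up data frame (not the survey data).

```r
# Made-up data frame for illustration only
toy <- data.frame(sex           = c("F", "M", "M", "F"),
                  study         = c("LING", "IS", "LING", "IS"),
                  english_grade = c(6, 8, 7, 9))

is_male <- toy$sex == "M"      # one TRUE/FALSE value per row
is_is   <- toy$study == "IS"
both    <- is_male & is_is     # TRUE only if both conditions are TRUE

is_male
is_is
both

male_is <- toy[both, ]         # keep the rows for which 'both' is TRUE
male_is

n_male <- sum(is_male)         # TRUE counts as 1, so this counts the males
n_male
```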

Accessing table data using conditional indexing (3)

  • Invert a condition with ! (not)
    • is not equal to: !=
# only females (i.e. not males) *or* everybody with an English grade over 7
tmp <- dat[dat$sex != 'M' | dat$english_grade > 7,] 
tail(tmp) # tail shows final 6 rows 
    participant year sex bl_edu study english_grade english_score
494         494 2024   F      N  LING           5.8        5.1720
495         495 2024   F      N  LING           7.0        8.0231
496         496 2024   M      N    IS           8.0        7.5441
497         497 2024   F      N  LING           6.0        7.1884
498         498 2024   F      N  LING           6.5        6.4241
499         499 2024   M      N    IS           9.0        9.5693
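
A small sketch of | (or) and ! (not) on a made-up data frame. Only the third row (a male with a grade of 7) fails both conditions, so it is the only row that is dropped.

```r
# Made-up data frame for illustration only
toy <- data.frame(sex           = c("F", "M", "M", "F"),
                  english_grade = c(6, 8, 7, 9))

# TRUE if not male, or if the grade is over 7 (or both)
sel <- toy$sex != "M" | toy$english_grade > 7
sel

kept    <- toy[sel, ]    # rows satisfying at least one condition
dropped <- toy[!sel, ]   # '!' inverts the selection

kept
dropped
```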

Question 6

Supplementing the data: adding columns (1)

  • Frequently, you need to add columns to the data, for example by computing a new value on the basis of the values in two columns
    • the operator $ helps us to do that
# new column 'diff': English grade - English proficiency score
dat$diff <- dat$english_grade - dat$english_score 
head(dat)
  participant year sex bl_edu study english_grade english_score     diff
1           1 2020   F      N  LING             6        5.1902  0.80976
2           2 2020   M      N  LING             7        6.8208  0.17917
3           3 2020   M      N  LING             8        8.2118 -0.21182
4           4 2020   F      N   CIS             7        7.3397 -0.33970
5           5 2020   F      N  LING             7        6.5873  0.41273
6           6 2020   F      N  LING             8        7.5489  0.45106

Supplementing the data: adding columns (2)

  • Conditional indexing allows us additional flexibility
dat$pass_fail <- 'PASS' # new column, initially PASS for everybody
dat[dat$english_grade < 5.5,]$pass_fail <- 'FAIL' # if grade too low, then FAIL
tail(dat[dat$english_grade > 4 & dat$english_grade < 6, 2:9]) # show subset of data
    year sex bl_edu study english_grade english_score      diff pass_fail
392 2023   F      N   CIS           5.6        5.9877 -0.387718      PASS
436 2024   F      N    IS           5.8        5.7252  0.074803      PASS
454 2024   F      N  LING           5.0        6.1166 -1.116598      FAIL
468 2024   F      Y   CIS           5.0        4.3000  0.700000      FAIL
490 2024   F      N  LING           5.8        6.0576 -0.257642      PASS
494 2024   F      N  LING           5.8        5.1720  0.627971      PASS
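
As an alternative to the two-step approach above, the base R function ifelse() can create such a column in one step. This is a sketch on made-up grades, not the approach used in the slides.

```r
# Made-up grades for illustration only
toy <- data.frame(english_grade = c(5.0, 5.6, 7.0, 5.4))

# ifelse(condition, value if TRUE, value if FALSE), evaluated per row
toy$pass_fail <- ifelse(toy$english_grade < 5.5, "FAIL", "PASS")

toy
table(toy$pass_fail)   # number of rows per category
```

Both approaches give the same result; ifelse() is compact, whereas the two-step approach extends more easily to more than two categories.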

Question 7

Visualization in R

  • Many basic visualization options are available in R
  • In this course, we will learn how to use the following functions (list not exhaustive):
    • barplot() (illustrated in the following)
    • plot()
    • boxplot()
    • hist()
    • qqnorm() and qqline()

Example of using a plotting function: barplot()

  • The bar plot is used to visualize frequencies of categorical variables
(counts <- table(dat$sex)) # first create frequency table

  F   M 
346 154 
barplot(counts)

Graphical parameters

  • There are various graphical parameters to allow you to customize your graphics
barplot(counts, col = c("pink", "lightblue"), ylim = c(0, 350), main = "My barplot", 
    xlab = "Sex", ylab = "Frequency")

Question 8

Another example: segmented bar plot

(counts <- table(dat$sex,dat$study))
   
    CIS  IS LING OTHER
  F  99  26  179    42
  M  24  98   16    16
barplot(counts, col=c("pink","lightblue"), legend = c('F','M'), ylim=c(0,185))
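
The segmented bar plot can also be drawn with the bars next to each other. The sketch below rebuilds the cross-table shown above by hand (so it runs without the survey file) and uses the barplot() parameter beside; the ylim value is an arbitrary choice.

```r
# Cross-table of sex by study, entered by hand from the output above
counts <- matrix(c(99, 24,    # CIS:   F, M
                   26, 98,    # IS:    F, M
                   179, 16,   # LING:  F, M
                   42, 16),   # OTHER: F, M
                 nrow = 2,
                 dimnames = list(c("F", "M"),
                                 c("CIS", "IS", "LING", "OTHER")))
counts

# beside = TRUE: one bar per sex within each study, instead of stacked bars
# legend.text = TRUE: use the row names (F, M) for the legend
barplot(counts, beside = TRUE, col = c("pink", "lightblue"),
        legend.text = TRUE, ylim = c(0, 200),
        xlab = "Study", ylab = "Frequency")
```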

Statistics in R

  • The main purpose of R is to conduct statistical analyses
  • Many different functions to obtain descriptive and inferential statistics are available in R
  • The following examples illustrate how to conduct some statistical analyses in R
    • The why and what is covered in the next lectures

Descriptive statistics: central tendency

mean(dat$english_score) # mean of all people for English score
[1] 7.6178
mean(dat[dat$sex == 'F',]$english_score) # mean of females for English score
[1] 7.5342
median(dat$english_score) # median of all people for English score
[1] 7.6377

Descriptive statistics: spread

min(dat$english_score) # minimum value
[1] 4.3
max(dat$english_score) # maximum value
[1] 9.7421
var(dat$english_score) # variance: sum of squared deviations from mean / (n - 1)
[1] 0.85021
sd(dat$english_score) # standard deviation (square root of variance)
[1] 0.92207
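
To see what var() and sd() compute, the sketch below reproduces them step by step for five made-up values. Note that R divides the sum of squared deviations by n - 1 (the sample variance), not by n.

```r
x <- c(6, 7, 8, 7, 7)           # made-up values for illustration only
n <- length(x)

dev        <- x - mean(x)               # deviation of each value from the mean
var_manual <- sum(dev^2) / (n - 1)      # sample variance
sd_manual  <- sqrt(var_manual)          # standard deviation

var_manual
sd_manual

all.equal(var_manual, var(x))   # same as the built-in function
all.equal(sd_manual, sd(x))

range(x)                        # minimum and maximum in one call
```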

Descriptive statistics: frequency tables

table(dat$sex)

  F   M 
346 154 
table(dat$study)

  CIS    IS  LING OTHER 
  123   124   195    58 

Question 9

Descriptive statistics: cross-tables

table(dat$sex,dat$study)
   
    CIS  IS LING OTHER
  F  99  26  179    42
  M  24  98   16    16
table(dat$sex,dat$bl_edu)
   
      N   Y
  F 313  33
  M 140  14
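
Counts are often easier to compare as proportions. A sketch using the base R function prop.table() on the sex by study cross-table, entered by hand from the output above so that the example runs on its own:

```r
# Cross-table of sex by study, entered by hand from the output above
counts <- matrix(c(99, 24, 26, 98, 179, 16, 42, 16), nrow = 2,
                 dimnames = list(c("F", "M"),
                                 c("CIS", "IS", "LING", "OTHER")))

# margin = 2: proportions within each column (i.e. within each study)
prop_by_study <- prop.table(counts, margin = 2)
round(prop_by_study, 2)

colSums(prop_by_study)   # each column sums to 1
```

With margin = 1 the proportions are computed within each row (i.e. within each sex) instead.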

Inferential statistics

  • A large number of statistical inference functions are available in R
  • In this course we will cover the following functions:
    • cor() for the correlation
    • lm() for linear regression
    • glm() for logistic regression
    • alpha() (from package psych) for Cronbach’s \(\alpha\)

Example of inferential statistics: regression

  • Assessing average group differences for a numerical variable (lecture 4)
summary(lm(english_grade ~ bl_edu, data=dat))

Call:
lm(formula = english_grade ~ bl_edu, data = dat)

Residuals:
   Min     1Q Median     3Q    Max 
-2.640 -0.246 -0.246  0.754  2.154 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.2457     0.0404  179.28   <2e-16 ***
bl_eduY       0.3947     0.1318    2.99   0.0029 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.86 on 498 degrees of freedom
Multiple R-squared:  0.0177,    Adjusted R-squared:  0.0157 
F-statistic: 8.97 on 1 and 498 DF,  p-value: 0.00289
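
With a single two-level predictor, the estimates have a simple reading: the intercept is the mean of the reference group, and the other coefficient is the difference between the two group means. In the output above, 7.2457 is therefore the mean English grade for bl_edu N, and 7.2457 + 0.3947 = 7.6404 is the mean for bl_edu Y. The sketch below checks this on six made-up observations (not the survey data).

```r
# Made-up data for illustration only
toy <- data.frame(english_grade = c(6, 7, 8, 7, 8, 9),
                  bl_edu        = c("N", "N", "N", "N", "Y", "Y"))

fit <- lm(english_grade ~ bl_edu, data = toy)
coef(fit)      # intercept and difference between the groups

mean_n <- mean(toy[toy$bl_edu == "N", ]$english_grade)   # mean of group N
mean_y <- mean(toy[toy$bl_edu == "Y", ]$english_grade)   # mean of group Y

mean_n            # equals the intercept
mean_y - mean_n   # equals the coefficient bl_eduY
```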

Recap

  • In this lecture, we’ve covered the basics of R
    • R as calculator
    • Variables
    • Functions and help
    • Importing data in R
    • Viewing and modifying data
    • Some examples of visualizations
    • Some examples of descriptive and inferential statistics
  • See Winter (Ch. 1 and 2) for more information about the functionality of R
  • In the lab session, you will experiment with using R
  • Next lecture: Descriptive statistics
  • Evaluation: if at least 50 people evaluate the lecture, an exam-like question will be made visible

Please evaluate this lecture!

Exam question

Questions?

Thank you for your attention!

 

https://www.martijnwieling.nl

m.b.wieling@rug.nl