Statistiek I

Introduction to R

Martijn Wieling
University of Groningen

This lecture

  • General information about the course
  • Introduction to dataset used in the course
  • Why use statistics?
  • Introduction to RStudio and R
    • R as calculator
    • Variables
    • Functions and help
    • Importing data in R
    • Viewing and modifying data
    • Visualization in R
    • Statistics in R

General information about the course setup (1)

General information about the course setup (2)

  • 7 weekly lectures
    • Slides (and 2017-2018 recordings) will be made available via Nestor
    • Interactive Mentimeter questions during lecture
  • 7 weekly lab sessions
    • You should have registered via Nestor for a group associated with your study
    • Attendance obligatory!
    • Finishing lab exercises results in at most 1 bonus point
      • Only when final test score \(T\) at least 5.0
      • Calculation: \(T + 0.25 \times P \textrm{(max 2)} \times H \textrm{(max 2)}\)
  • 3 optional lectures
    • Feb. 19, March 5, March 19, 17:00 - 19:00, Zernikezaal, Academy building
    • All exercises have to be completed in advance (see Nestor)!
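
The bonus calculation above can be sketched as a small R function. This is only an illustration: the slide does not say what P and H stand for, so the sketch treats them as two lab-related components that are each capped at 2. The function name final_grade and its argument names are made up for this example.

```r
# Hypothetical helper (not part of the course material).
# test_score is T; p and h are the components P and H from the slide,
# whose exact meaning is not given there; both are capped at 2.
final_grade <- function(test_score, p, h) {
  bonus <- 0.25 * min(p, 2) * min(h, 2)  # at most 0.25 * 2 * 2 = 1 point
  if (test_score >= 5.0) test_score + bonus else test_score
}
final_grade(6.5, 2, 2)  # full bonus point
# [1] 7.5
final_grade(4.8, 2, 2)  # test score below 5.0: no bonus
# [1] 4.8
```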

Language policy in this course

  • Lecture slides, book and lab sessions are in English. Why?
    • Most statistics terminology you encounter will be in English
    • Statistiek II will be taught completely in English (English-speaking teacher)
  • You may choose the language of the lab reports (either Dutch or English)
  • The final exam is in Dutch
    • (but with English-Dutch translation of statistical terminology when necessary)

Some remarks about the lab sessions

  • Lab reports due one week after lab
  • Late or sub-optimal lab report: (max.) 1 point (out of 2)
  • Lab report over 1 week late or insufficient: 0 points
  • Exam requirement: average lab score of 1 point, with at most three times 0 points
  • Lab attendance required:
    • You can miss at most 1 lab session, but timely lab report (\(\geq\) 1 point) required
    • If you miss 2 or more lab sessions: course failed
  • Why required lab attendance?
    • Actively participating in lab session increases chances of passing the exam!
  • Last year's lab score may be reused, but only as a whole (same restrictions apply)

Goals of this course

  • Understand basics of descriptive and inferential statistics
    • Emphasis on statistical reasoning
    • Practical approach, but some mathematics to help understand the concepts
  • Understand and apply basic statistical analyses
  • Report on results of statistical analyses
  • Understand reports using statistical analyses
  • Conduct basic statistical analyses in R

Statistics in Language Studies

  • Many experiments are conducted in communication, information science and linguistics
    • Effect of comics vs. normal text on understanding?
    • Effect of algorithm on quality of automatically generated summary?
    • Influence of gender on learning a second language?
  • Availability of online data increases opportunities for statistical analysis (e.g., Digital Humanities)
  • In this course we will mainly work with the data collected in the survey
    • But other examples will be given as well

The dataset used during this course

  • Data on the basis of your answers to the initial survey
    • Age, gender, handedness, study, etc.
    • Information about your language history
    • Information about your English use and (subjective) proficiency
  • We are interested in investigating which factors are related to English proficiency
    • Measured by your English grade in high school
    • And via an automatically calculated approximate measure of proficiency (English score) based on your input
  • Data also includes a subset of survey results from earlier years

Some of our data (1)

   participant year gender bl_edu study english_grade english_score
1            1 2017      F      N  LING             6          4.31
2            2 2017      M      N  LING             7          6.42
3            3 2017      M      N  LING             8          8.17
4            4 2017      F      N   CIS             7          6.99
5            5 2017      F      N  LING             7          6.12
6            6 2017      F      N  LING             8          7.35
7            7 2017      F      N  LING             7          6.78
8            8 2017      F      Y  LING             8          7.43
9            9 2017      F      N  LING             6          6.04
10          10 2017      M      N    IS             8          8.94
11          11 2017      M      N   CIS             7          6.27
12          12 2017      F      N OTHER             8          7.99
13          13 2017      M      N   CIS             6          5.77

Some of our data (2)

   participant year gender bl_edu study english_grade english_score
14          14 2017      F      N OTHER             7          6.78
15          15 2017      F      N OTHER             8          8.45
16          16 2017      F      N   CIS             7          7.11
17          17 2017      F      N  LING             8          8.47
18          18 2017      F      N    IS             8          7.48
19          19 2017      F      N  LING             8          7.52
20          20 2017      F      N  LING             8          7.58
21          21 2017      F      Y OTHER             7          8.99
22          22 2017      F      N   CIS             8          8.70
23          23 2017      F      N    IS             6          4.52
24          24 2017      F      N  LING             7          6.83
25          25 2017      M      N OTHER             9          8.32
26          26 2017      F      N  LING             7          6.80

Some of our data (3)

   participant year gender bl_edu study english_grade english_score
27          27 2018      M      N    IS             8          9.06
28          28 2018      M      N    IS             7          7.94
29          29 2018      F      N   CIS             7          7.23
30          30 2018      F      N   CIS             7          7.99
31          31 2018      F      N    IS             7          7.42
32          32 2018      F      Y    IS             8          6.50
33          33 2018      F      N  LING             7          7.34
34          34 2018      F      N OTHER             8          8.82
35          35 2018      F      N  LING             9          8.63
36          36 2018      F      N  LING             8          7.94
37          37 2018      M      N    IS             8          8.63
38          38 2018      M      N  LING             9          7.89
39          39 2018      F      N   CIS             7          6.18

Some of our data (4)

   participant year gender bl_edu study english_grade english_score
40          40 2018      F      N   CIS             7          7.09
41          41 2018      F      Y OTHER             8          9.50
42          42 2018      M      N   CIS             7          7.79
43          43 2018      F      N  LING             7          6.25
44          44 2018      M      N    IS             6          6.14
45          45 2018      F      N  LING             7          6.77
46          46 2018      F      N  LING             9          6.71
47          47 2018      M      N    IS             9          9.05
48          48 2018      M      N    IS             7          7.06
49          49 2018      M      N    IS             6          6.74
50          50 2018      F      N  LING             7          7.27
51          51 2018      F      N  LING             6          4.36
52          52 2018      F      N  LING             8          7.72

Some of our data (5)

   participant year gender bl_edu study english_grade english_score
53          53 2018      M      N    IS             8          8.15
54          54 2018      M      N    IS             7          6.73
55          55 2018      F      N  LING             7          7.36
56          56 2018      F      Y   CIS             8          8.28
57          57 2018      F      N    IS             8          7.43
58          58 2018      F      N   CIS             7          7.72
59          59 2018      F      Y   CIS             7          6.44
60          60 2018      F      N  LING             7          6.42
61          61 2018      F      N  LING             7          8.07
62          62 2018      F      N   CIS             6          6.33
63          63 2018      M      N    IS             8          7.56
64          64 2018      F      Y OTHER             8          8.91
65          65 2018      M      N    IS             8          7.93

Question 1

Question 2

We use statistics because...

  • We would like to make sense of (in this course: your) data
  • For this we need to:
    • Summarize the data (descriptive statistics)
    • Assess relationships in our data (inferential statistics)
      • (During other courses, some of you will have already encountered inferential statistical tests such as the \(t\)-test or chi-square test which can be used for this)
  • A requirement is that the data are variable (i.e., there must be variation)
  • Note that statistics is not mathematics (it's data analysis)!

What you will learn during this course (1)

  • Descriptive statistics (describe data without conclusions)
    • Measures of central tendency and spread
      • E.g.: mean English grade (7.3), and range of English grades (5 - 9.5)
    • Visualization
      • E.g.: showing number of participants per study with a bar plot

What you will learn during this course (2)

  • Inferential statistics (link findings based on sample to population)
    • Comparing 2 groups (or 1 group with a value)
      • E.g.: is the pronunciation of women better than that of men?
    • Associations between 2 variables
      • E.g.: are gender and handedness related?
    • Internal consistency of questions in a survey
      • E.g.: does a group of questions measure a single construct?
  • How to do statistics in R (this lecture)
    • And how to make reproducible lab reports in R

What your lab reports will look like

Why do we use R?

  • Very nice reproducible lab reports (no copy-paste necessary!)
  • Other advantages of R compared to (e.g.,) SPSS
    • Free for everybody
    • Customizable: people can create their own statistical functions
    • State-of-the-art statistical methods are integrated very quickly
  • Also some disadvantages:
    • No substantial graphical user interface: typing instead of clicking
    • Takes more time to learn
    • State-of-the-art statistical methods sometimes contain bugs

Our tool: RStudio (frontend to R)

RStudio: quick overview

Basic functionality: R as calculator

# Addition (this is a comment: preceded by '#')
5 + 5
# [1] 10
# Multiplication
5 * 3
# [1] 15
# Division
5/3
# [1] 1.6667
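
A few more calculator-style operations, as a small sketch that goes beyond the three operators shown above. All of these are standard R operators and functions.

```r
2^3          # exponentiation
# [1] 8
(5 + 5) * 3  # parentheses control the order of operations
# [1] 30
5 + 5 * 3    # multiplication is done before addition
# [1] 20
7 %/% 2      # integer division
# [1] 3
7 %% 2       # remainder (modulo)
# [1] 1
sqrt(16)     # square root, using a built-in function
# [1] 4
```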

Basic functionality: using variables

a <- 5  # store a single value; instead of '<-' you can also use '='
a  # display the value
# [1] 5
b <- a * a  # b contains the value of multiplying a with itself
b
# [1] 25
(d <- NA)  # set value of d to missing (NA) and show value
# [1] NA

Storing multiple values in a variable

b <- c(2, 4, 6, 7, 8)  # store a series of values in a vector (reusing variable b)
b
# [1] 2 4 6 7 8
b[4] <- a  # assign value 5 (stored in 'a') to the 4th element of vector b
b
# [1] 2 4 6 5 8
b <- c(b, NA)  # add element NA to b
b
# [1]  2  4  6  5  8 NA
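
Vectors can also be used as a whole. The sketch below reuses the values of b from before NA was added and shows some common operations on it.

```r
b <- c(2, 4, 6, 5, 8)  # same values as vector b before NA was added
length(b)              # number of elements
# [1] 5
b * 2                  # arithmetic is applied to every element
# [1]  4  8 12 10 16
b[c(1, 3)]             # select the 1st and 3rd element
# [1] 2 6
b[b > 4]               # select the elements larger than 4
# [1] 6 5 8
```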

Question 3

Basic functionality: using functions

b  # show values in variable b (b contains a vector: a series of values)
# [1]  2  4  6  5  8 NA
mn <- mean(b)  # calculating the mean and storing in variable mn
mn
# [1] NA
# mn is NA (missing) as one of the values is missing
mean(b, na.rm = TRUE)  # we can use the function parameter na.rm to ignore NAs
# [1] 5
# To find out which parameters a function has, use help!
help(mean)  # alternatively: ?mean
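
The parameter na.rm is not specific to mean(). As a small sketch, the same vector b is used below with a few other built-in functions, and is.na() is used to find the missing value.

```r
b <- c(2, 4, 6, 5, 8, NA)  # vector b from the previous slides
sum(b)                     # NA, because one value is missing
# [1] NA
sum(b, na.rm = TRUE)       # 2 + 4 + 6 + 5 + 8
# [1] 25
max(b, na.rm = TRUE)
# [1] 8
is.na(b)                   # which elements are missing?
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE
sum(is.na(b))              # number of missing values (TRUE counts as 1)
# [1] 1
```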

Basic functionality: a help file

Question 4

Getting data into R: exporting a data set

Getting data into R: importing a data set

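setwd("C:/Users/Martijn/Desktop/Statistiek-I/HC1/")  # set working directory
# read csv file from working directory; stringsAsFactors = TRUE is needed in
# R >= 4.0.0 to import text columns as factors (as shown in the output below)
dat <- read.csv("survey.csv", sep = ";", dec = ".", stringsAsFactors = TRUE)
str(dat)  # shows structure of the data frame (a 2-dimensional table)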
# 'data.frame': 315 obs. of  7 variables:
#  $ participant  : int  1 2 3 4 5 6 7 8 9 10 ...
#  $ year         : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
#  $ gender       : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 1 1 2 ...
#  $ bl_edu       : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 2 1 1 ...
#  $ study        : Factor w/ 4 levels "CIS","IS","LING",..: 3 3 3 1 3 3 3 3 3 2 ...
#  $ english_grade: num  6 7 8 7 7 8 7 8 6 8 ...
#  $ english_score: num  4.31 6.42 8.17 6.99 6.12 ...
dim(dat)  # number of rows and columns of data set 
# [1] 315   7
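
To try out the import step without the course file, the sketch below first writes a tiny semicolon-separated file to a temporary location and then reads it back. The file content and the variable names tmpfile and small are made up for this illustration; the three rows are copied from the first rows of the course data.

```r
# Write a tiny semicolon-separated file to a temporary location
tmpfile <- tempfile(fileext = ".csv")
writeLines(c("participant;gender;english_grade",
             "1;F;6",
             "2;M;7",
             "3;M;8"), tmpfile)
# stringsAsFactors = TRUE turns text columns into factors;
# since R 4.0.0 the default is FALSE, which keeps them as character strings
small <- read.csv(tmpfile, sep = ";", dec = ".", stringsAsFactors = TRUE)
str(small)  # structure: one line per column
dim(small)  # 3 rows, 3 columns
# [1] 3 3
```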

Investigating imported data set: using head

head(dat)  # show first few rows of dat 
#   participant year gender bl_edu study english_grade english_score
# 1           1 2017      F      N  LING             6        4.3055
# 2           2 2017      M      N  LING             7        6.4152
# 3           3 2017      M      N  LING             8        8.1696
# 4           4 2017      F      N   CIS             7        6.9925
# 5           5 2017      F      N  LING             7        6.1165
# 6           6 2017      F      N  LING             8        7.3538

Investigating imported data set: RStudio viewer

Accessing table data using numbers

  • Access parts of table by specifying row and/or column numbers
  • dat[a,b]:
    • a indicates the selected rows of dat
    • b indicates the selected columns of dat
dat[1, ]  # values in first row (dat[,1]: values in first column)
#   participant year gender bl_edu study english_grade english_score
# 1           1 2017      F      N  LING             6        4.3055
dat[c(1, 5), c(1, 2, 3)]  # values in rows 1 and 5 and columns 1, 2 and 3
#   participant year gender
# 1           1 2017      F
# 5           5 2017      F
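
Some further ways of selecting by number, sketched on a small data frame. The data frame small is typed in by hand from the head(dat) output shown earlier, so the example runs without the course file.

```r
# First six rows of three columns, typed in by hand from the head(dat) output
small <- data.frame(participant = 1:6,
                    gender = c("F", "M", "M", "F", "F", "F"),
                    study = c("LING", "LING", "LING", "CIS", "LING", "LING"))
small[2:4, ]       # rows 2 to 4, all columns
small[, 2]         # second column only
small[-1, ]        # negative numbers drop rows: everything except row 1
small[c(1, 6), 3]  # column 3 for rows 1 and 6
```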

Accessing table data using names (1)

  • Additionally, we can access parts of the table by specifying the names of the columns we want to look at
dat[c(1, 3, 5), c("participant", "study")]  # rows 1, 3 and 5, and 2 named columns
#   participant study
# 1           1  LING
# 3           3  LING
# 5           5  LING

Accessing table data using names (2)

  • We may also select a single column by its name (not number) using the $ operator
    • E.g., dat$gender accesses the column gender of dat
dat$gender
#   [1] F M M F F F F F F M M F M F F F F F F F F F F F M F M M F F F F F F F F M M F F
#  [41] F M F M F F M M M F F F M M F F F F F F F F M F M F F F M F F M M M M F F F F F
#  [81] F F F F F F M F F M M F M F M M F F F M F F F F M M F F F F M F F F M F F M M F
# [121] M M F F F F M F F F M F F M M F M F F M M F F M F F F F F F F M F F F F M F F F
# [161] F F F M M F F F M F F M F F F F F F F F F F F M F F M F F F F F F F F F F F F F
# [201] F F M F F F F F F F M F F F M M F F F F M F F F F M F F M F F M F M F F F F F M
# [241] F F M F F F F M M F F F M F F M F F M F M M M F F F M F F M M F F F F F F M F F
# [281] F M M F F F F F F M F F F F M F F M F M M F M M F F M M M F M F F M M
# Levels: F M

Storing accessed table data

tmp <- dat[5:8, c(1, 3)]  # store columns 1 and 3 for rows 5 to 8 in variable tmp 
tmp  # show what is stored in variable tmp
#   participant gender
# 5           5      F
# 6           6      F
# 7           7      F
# 8           8      F

Question 5

Accessing table data using conditional indexing (1)

  • Conditional indexing allows us to select parts of the data on the basis of conditions
tmp <- dat[dat$gender == "M", ]  # only observations for male participants
head(tmp)
#    participant year gender bl_edu study english_grade english_score
# 2            2 2017      M      N  LING             7        6.4152
# 3            3 2017      M      N  LING             8        8.1696
# 10          10 2017      M      N    IS             8        8.9375
# 11          11 2017      M      N   CIS             7        6.2686
# 13          13 2017      M      N   CIS             6        5.7744
# 25          25 2017      M      N OTHER             9        8.3227

Accessing table data using conditional indexing (2)

  • Methods to combine conditions:
    • and: &
    • or: |
# only participants who study IS *and* are male
tmp <- dat[dat$gender == "M" & dat$study == "IS", ]
head(tmp)
#    participant year gender bl_edu study english_grade english_score
# 10          10 2017      M      N    IS           8.0        8.9375
# 27          27 2017      M      N    IS           8.0        9.0563
# 28          28 2017      M      N    IS           7.0        7.9433
# 37          37 2017      M      N    IS           8.1        8.6273
# 44          44 2017      M      N    IS           6.0        6.1372
# 47          47 2017      M      N    IS           9.0        9.0493
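
What happens inside the square brackets becomes clearer when the conditions are shown on their own. In the sketch below, the data frame small and the variable is_male are made up for illustration. Each condition yields a logical vector with one TRUE or FALSE per row.

```r
small <- data.frame(gender = c("F", "M", "M", "F", "M"),
                    study  = c("LING", "LING", "IS", "CIS", "IS"))  # made-up rows
is_male <- small$gender == "M"  # logical vector: TRUE for each matching row
is_male
# [1] FALSE  TRUE  TRUE FALSE  TRUE
is_male & small$study == "IS"   # both conditions must hold
# [1] FALSE FALSE  TRUE FALSE  TRUE
is_male | small$study == "IS"   # at least one condition must hold
# [1] FALSE  TRUE  TRUE FALSE  TRUE
sum(is_male & small$study == "IS")  # number of rows meeting both conditions
# [1] 2
```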

Accessing table data using conditional indexing (3)

  • Invert a condition with ! (not)
    • is not equal to: !=
# only women (i.e. not men) *or* everybody with an English grade over 7
tmp <- dat[dat$gender != "M" | dat$english_grade > 7, ]
tail(tmp)  # tail shows final 6 rows 
#     participant year gender bl_edu study english_grade english_score
# 308         308 2016      M      N    IS           8.0        8.0876
# 309         309 2016      M      N    IS           8.0        7.8982
# 310         310 2016      F      N OTHER           8.0        8.4761
# 312         312 2016      F      N OTHER           7.0        6.8564
# 313         313 2016      F      N  LING           6.0        6.9069
# 315         315 2016      M      N    IS           7.5        5.7257
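
A short sketch of how ! behaves, using a made-up vector of grades. Note the parentheses: they make sure that the whole condition is inverted.

```r
grade <- c(6, 7, 8, 9)    # made-up English grades
grade > 7
# [1] FALSE FALSE  TRUE  TRUE
!(grade > 7)              # inverted: the same as grade <= 7
# [1]  TRUE  TRUE FALSE FALSE
grade != 7                # everything except 7
# [1]  TRUE FALSE  TRUE  TRUE
!(grade > 6 & grade < 9)  # the same as grade <= 6 | grade >= 9
# [1]  TRUE FALSE FALSE  TRUE
```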

Question 6

Supplementing the data: adding columns (1)

  • Frequently, you need to add columns to the data, for example by computing a new value on the basis of the values in two columns
    • the operator $ helps us to do that
# new column 'diff': English grade - English proficiency score
dat$diff <- dat$english_grade - dat$english_score
head(dat)
#   participant year gender bl_edu study english_grade english_score       diff
# 1           1 2017      F      N  LING             6        4.3055  1.6945488
# 2           2 2017      M      N  LING             7        6.4152  0.5847601
# 3           3 2017      M      N  LING             8        8.1696 -0.1695835
# 4           4 2017      F      N   CIS             7        6.9925  0.0075054
# 5           5 2017      F      N  LING             7        6.1165  0.8835361
# 6           6 2017      F      N  LING             8        7.3538  0.6461548

Supplementing the data: adding columns (2)

  • Conditional indexing allows us additional flexibility
dat$pass_fail <- "PASS"  # new column, initially PASS for everybody
dat$pass_fail[dat$english_grade < 5.5] <- "FAIL"  # if grade too low, then FAIL
head(dat[dat$english_grade > 4 & dat$english_grade < 6, 2:9])  # show subset of data
#     year gender bl_edu study english_grade english_score     diff pass_fail
# 78  2017      F      N  LING           5.0        4.3000  0.70000      FAIL
# 122 2016      M      N  LING           5.0        5.5963 -0.59625      FAIL
# 209 2016      F      N  LING           5.5        4.3196  1.18042      PASS
# 284 2016      F      N  LING           5.0        5.3275 -0.32755      FAIL
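
An alternative way to create such a column in one step is the built-in function ifelse(). The sketch below is not the approach used on the slide; it uses a made-up vector of grades rather than the course data.

```r
grade <- c(5.0, 5.5, 7.0, 4.5)  # made-up English grades
# ifelse(condition, value if TRUE, value if FALSE), applied to every element
pass_fail <- ifelse(grade < 5.5, "FAIL", "PASS")
pass_fail
# [1] "FAIL" "PASS" "PASS" "FAIL"
```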

Question 7

Visualization in R

  • Many basic visualization options are available in R
  • In this course, we will learn how to use the following functions (list not exhaustive):
    • barplot() (illustrated in the following)
    • plot()
    • boxplot()
    • hist()
    • qqnorm() and qqline()
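
To get a first impression of the other functions in this list, the sketch below calls each of them on simulated data. The variables score and grade are randomly generated for illustration and are not the course data.

```r
set.seed(1)                                   # make the random numbers reproducible
score <- rnorm(100, mean = 7, sd = 1)         # 100 simulated scores (not course data)
grade <- round(score + rnorm(100, sd = 0.5))  # simulated grades related to the scores
hist(score, main = "Histogram", xlab = "Score")         # distribution of one variable
boxplot(score ~ grade, xlab = "Grade", ylab = "Score")  # distribution per group
plot(grade, score, xlab = "Grade", ylab = "Score")      # scatter plot of two variables
qqnorm(score)                                 # normal quantile-quantile plot
qqline(score)                                 # adds a reference line to the Q-Q plot
```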

Example of using a plotting function: barplot()

  • The bar plot is used to visualize frequencies of categorical variables
(counts <- table(dat$gender))  # first create frequency table
# 
#   F   M 
# 222  93
barplot(counts)

[Plot: bar plot of the number of participants per gender]

Graphical parameters

  • There are various graphical parameters to allow you to customize your graphics
barplot(counts, col = c("pink", "lightblue"), ylim = c(0, 250), main = "My barplot", 
    xlab = "Gender", ylab = "Frequency")

[Plot: bar plot per gender with custom colors, title and axis labels]

Question 8

Another example: segmented bar plot

(counts <- table(dat$gender, dat$study))
#    
#     CIS IS LING OTHER
#   F  82 18   93    29
#   M  19 56    9     9
barplot(counts, col = c("pink", "lightblue"), legend.text = c("F", "M"), ylim = c(0, 105))

[Plot: segmented bar plot of gender per study]

Statistics in R

  • The main purpose of R is to conduct statistical analyses
  • Many different functions to obtain descriptive and inferential statistics are available in R
  • The following examples illustrate how to conduct some statistical analyses in R
    • The why and what are covered in the following lectures

Descriptive statistics: central tendency

mean(dat$english_score)  # mean of all people for English score
# [1] 7.347
mean(dat[dat$gender == "F", ]$english_score)  # mean of women for English score
# [1] 7.2036
median(dat$english_score)  # median of all people for English score
# [1] 7.3538

Descriptive statistics: spread

min(dat$english_score)  # minimum value
# [1] 4.3
max(dat$english_score)  # maximum value
# [1] 10
var(dat$english_score)  # sample variance: sum of squared deviations from mean / (n - 1)
# [1] 1.287
sd(dat$english_score)  # standard deviation (square root of variance)
# [1] 1.1345
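
How var() and sd() relate can be checked by computing the variance step by step. The sketch below uses a small made-up vector x, chosen so that the results are round numbers. Note that R divides by n - 1 (not n).

```r
x <- c(3, 3, 5, 7, 7)           # made-up values with mean 5
n <- length(x)
sum((x - mean(x))^2) / (n - 1)  # variance computed step by step
# [1] 4
var(x)                          # same result with the built-in function
# [1] 4
sqrt(var(x))                    # standard deviation: identical to sd(x)
# [1] 2
```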

Descriptive statistics: frequency tables

table(dat$gender)
# 
#   F   M 
# 222  93
table(dat$study)
# 
#   CIS    IS  LING OTHER 
#   101    74   102    38
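
Frequencies are often easier to interpret as proportions or percentages. As a small sketch, the gender counts from the table above are typed in by hand here, so that the example runs without the course file.

```r
counts <- c(F = 222, M = 93)        # gender counts copied from the table above
counts / sum(counts)                # proportions
prop.table(counts)                  # the same, using a built-in function
round(100 * prop.table(counts), 1)  # percentages, rounded to 1 decimal
#    F    M 
# 70.5 29.5 
```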

Question 9

Descriptive statistics: cross-tables

table(dat$gender, dat$study)
#    
#     CIS IS LING OTHER
#   F  82 18   93    29
#   M  19 56    9     9
table(dat$gender, dat$bl_edu)
#    
#       N   Y
#   F 202  20
#   M  83  10
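
Cross-tables can be extended with totals and converted to proportions. In the sketch below the gender by study counts are typed in by hand from the output above (so no course file is needed) and stored in a variable counts.

```r
# Cross-table of gender by study, typed in from the counts shown above
counts <- as.table(matrix(c(82, 18, 93, 29,
                            19, 56,  9,  9),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(c("F", "M"),
                                          c("CIS", "IS", "LING", "OTHER"))))
addmargins(counts)                        # adds row and column totals
round(prop.table(counts, margin = 1), 2)  # proportions within each row (gender)
```

With margin = 1 each row sums to 1, so the table shows how the women and the men are each distributed over the studies. With margin = 2 the proportions would be calculated per column (study) instead.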

Inferential statistics

  • A large number of statistical inference functions are available in R
  • In this course we will cover the following functions:
    • t.test() for a \(t\)-test (single sample, paired, independent)
    • wilcox.test() for non-parametric alternatives to the \(t\)-test
      (Mann Whitney U test, Wilcoxon signed-rank test)
    • binom.test() for the sign test
    • chisq.test() for the chi-square test
    • cor() for the correlation
    • alpha() (from package psych) for Cronbach's \(\alpha\)

Example of inferential statistics: \(t\)-test

  • Assessing average group differences for a numerical variable (lecture 4)
t.test(english_grade ~ bl_edu, data = dat)
# 
#   Welch Two Sample t-test
# 
# data:  english_grade by bl_edu
# t = -3.56, df = 37.4, p-value = 0.001
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.80665 -0.22212
# sample estimates:
# mean in group N mean in group Y 
#          7.2323          7.7467
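
The result of t.test() can also be stored in a variable, after which its parts can be accessed with $. The sketch below uses simulated data: the data frame sim and its columns are made up for illustration, so the values will differ from the output above.

```r
set.seed(42)                                   # reproducible simulated data
sim <- data.frame(group = rep(c("N", "Y"), each = 30),
                  grade = c(rnorm(30, mean = 7.2), rnorm(30, mean = 7.7)))
res <- t.test(grade ~ group, data = sim)       # Welch two-sample t-test (default)
res$statistic  # the t-value
res$parameter  # the degrees of freedom
res$p.value    # the p-value
res$estimate   # the two group means
res$conf.int   # the 95% confidence interval of the difference
```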

Example of inferential statistics: correlation

  • Assessing strength of the relationship between two numerical variables (lecture 6)
cor(dat$english_score, dat$english_grade)
# [1] 0.74346
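
By default, cor() calculates the Pearson correlation: the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on two small made-up vectors x and y.

```r
x <- c(1, 2, 3, 4, 5)        # made-up values
y <- c(2, 4, 5, 4, 5)
cov(x, y) / (sd(x) * sd(y))  # covariance divided by both standard deviations
cor(x, y)                    # same result with the built-in function
```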

Recap

  • In this lecture, we've covered the basics of R
    • R as calculator
    • Variables
    • Functions and help
    • Importing data in R
    • Viewing and modifying data
    • Some examples of visualizations
    • Some examples of descriptive and inferential statistics
  • See Levshina (Ch. 2) for more information about the functionality of R
  • In the lab session, you will experiment with using R
  • Next lecture: Descriptive statistics
  • Evaluation: if at least 50 people evaluate the lecture, an exam-like question will be made visible

Please evaluate this lecture!

Exam question

Questions?

Thank you for your attention!

http://www.martijnwieling.nl
m.b.wieling@rug.nl