# Statistiek I

## Introduction to R

Martijn Wieling
University of Groningen

## This lecture

• General information about the course
• Introduction to dataset used in the course
• Why use statistics?
• Introduction to RStudio and R
• R as calculator
• Variables
• Functions and help
• Importing data in R
• Viewing and modifying data
• Visualization in R
• Statistics in R

## General information about the course setup (2)

• 7 weekly lectures
• Slides (and 2017-2018 recordings) will be made available via Nestor
• Interactive Mentimeter questions during lecture
• 7 weekly lab sessions
• You should have registered via Nestor for a group associated with your study
• Attendance obligatory!
• Finishing lab exercises results in at most 1 bonus point
• Only when final test score $T$ at least 5.0
• Calculation: $T + 0.25 \times P \textrm{(max 2)} \times H \textrm{(max 2)}$
• 3 optional lectures
• Feb. 19, March 5, March 19, 17:00 - 19:00, Zernikezaal, Academy building
• All exercises have to be completed in advance (see Nestor)!

## Language policy in this course

• Lecture slides, book and lab sessions are in English. Why?
• Most statistics terminology you encounter will be in English
• Statistiek II will be completely in English (English teacher)
• You may choose the language of the lab reports (either Dutch or English)
• The final exam is in Dutch
• (but with English-Dutch translation of statistical terminology when necessary)

## Some remarks about the lab sessions

• Lab reports due one week after lab
• Late or sub-optimal lab report: (max.) 1 point (out of 2)
• Lab report over 1 week late or insufficient: 0 points
• Exam requirement: average lab score of 1 point, with at most three times 0 points
• Lab attendance required:
• You can miss at most 1 lab session, but timely lab report ($\geq$ 1 point) required
• If you miss 2 or more lab sessions: course failed
• Why required lab attendance?
• Actively participating in lab session increases chances of passing the exam!
• Lab score of last year may be used, but only as a whole (same restrictions)

## Goals of this course

• Understand basics of descriptive and inferential statistics
• Emphasis on statistical reasoning
• Practical approach, but some mathematics to help understand the concepts
• Understand and apply basic statistical analyses
• Report on results of statistical analyses
• Understand reports using statistical analyses
• Conduct basic statistical analyses in R

## Statistics in Language Studies

• Many experiments are conducted in communication, information science and linguistics
• Effect of comics vs. normal text on understanding?
• Effect of algorithm on quality of automatically generated summary?
• Influence of gender on learning a second language?
• Availability of online data increases opportunities for statistical analysis (e.g., Digital Humanities)
• In this course we will mainly work with the data collected in the survey
• But other examples will be given as well

## The dataset used during this course

• Data on the basis of your answers to the initial survey
• Age, gender, handedness, study, etc.
• Information about your language history
• Information about your English use and (subjective) proficiency
• We are interested in investigating which factors are related to English proficiency
• Measured by your English grade in high school
• And via an automatically calculated approximate measure of proficiency (English score) based on your input
• Data also includes a subset of survey results from earlier years

## Some of our data (1)

   participant year gender bl_edu study english_grade english_score
1            1 2017      F      N  LING             6          4.31
2            2 2017      M      N  LING             7          6.42
3            3 2017      M      N  LING             8          8.17
4            4 2017      F      N   CIS             7          6.99
5            5 2017      F      N  LING             7          6.12
6            6 2017      F      N  LING             8          7.35
7            7 2017      F      N  LING             7          6.78
8            8 2017      F      Y  LING             8          7.43
9            9 2017      F      N  LING             6          6.04
10          10 2017      M      N    IS             8          8.94
11          11 2017      M      N   CIS             7          6.27
12          12 2017      F      N OTHER             8          7.99
13          13 2017      M      N   CIS             6          5.77

## Some of our data (2)

   participant year gender bl_edu study english_grade english_score
14          14 2017      F      N OTHER             7          6.78
15          15 2017      F      N OTHER             8          8.45
16          16 2017      F      N   CIS             7          7.11
17          17 2017      F      N  LING             8          8.47
18          18 2017      F      N    IS             8          7.48
19          19 2017      F      N  LING             8          7.52
20          20 2017      F      N  LING             8          7.58
21          21 2017      F      Y OTHER             7          8.99
22          22 2017      F      N   CIS             8          8.70
23          23 2017      F      N    IS             6          4.52
24          24 2017      F      N  LING             7          6.83
25          25 2017      M      N OTHER             9          8.32
26          26 2017      F      N  LING             7          6.80

## Some of our data (3)

   participant year gender bl_edu study english_grade english_score
27          27 2018      M      N    IS             8          9.06
28          28 2018      M      N    IS             7          7.94
29          29 2018      F      N   CIS             7          7.23
30          30 2018      F      N   CIS             7          7.99
31          31 2018      F      N    IS             7          7.42
32          32 2018      F      Y    IS             8          6.50
33          33 2018      F      N  LING             7          7.34
34          34 2018      F      N OTHER             8          8.82
35          35 2018      F      N  LING             9          8.63
36          36 2018      F      N  LING             8          7.94
37          37 2018      M      N    IS             8          8.63
38          38 2018      M      N  LING             9          7.89
39          39 2018      F      N   CIS             7          6.18

## Some of our data (4)

   participant year gender bl_edu study english_grade english_score
40          40 2018      F      N   CIS             7          7.09
41          41 2018      F      Y OTHER             8          9.50
42          42 2018      M      N   CIS             7          7.79
43          43 2018      F      N  LING             7          6.25
44          44 2018      M      N    IS             6          6.14
45          45 2018      F      N  LING             7          6.77
46          46 2018      F      N  LING             9          6.71
47          47 2018      M      N    IS             9          9.05
48          48 2018      M      N    IS             7          7.06
49          49 2018      M      N    IS             6          6.74
50          50 2018      F      N  LING             7          7.27
51          51 2018      F      N  LING             6          4.36
52          52 2018      F      N  LING             8          7.72

## Some of our data (5)

   participant year gender bl_edu study english_grade english_score
53          53 2018      M      N    IS             8          8.15
54          54 2018      M      N    IS             7          6.73
55          55 2018      F      N  LING             7          7.36
56          56 2018      F      Y   CIS             8          8.28
57          57 2018      F      N    IS             8          7.43
58          58 2018      F      N   CIS             7          7.72
59          59 2018      F      Y   CIS             7          6.44
60          60 2018      F      N  LING             7          6.42
61          61 2018      F      N  LING             7          8.07
62          62 2018      F      N   CIS             6          6.33
63          63 2018      M      N    IS             8          7.56
64          64 2018      F      Y OTHER             8          8.91
65          65 2018      M      N    IS             8          7.93

## We use statistics because...

• We would like to make sense of (in this course: your) data
• For this we need to:
• Summarize the data (descriptive statistics)
• Assess relationships in our data (inferential statistics)
• (During other courses, some of you will have already encountered inferential statistical tests such as the $t$-test or chi-square test which can be used for this)
• This requires that the data vary (there must be variation)
• Note that statistics is not mathematics (it's data analysis)!

## What you will learn during this course (1)

• Descriptive statistics (describe data without conclusions)
• Measures of central tendency and spread
• E.g.: mean English grade (7.3), and range of English grades (5 - 9.5)
• Visualization
• E.g.: showing number of participants per study with a bar plot

## What you will learn during this course (2)

• Inferential statistics (link findings based on sample to population)
• Comparing 2 groups (or 1 group with a value)
• E.g.: pronunciation of women better than men?
• Associations between 2 variables
• E.g.: are gender and handedness related?
• Internal consistency of questions in a survey
• E.g.: does a group of questions measure a single construct?
• How to do statistics in R (this lecture)
• And how to make reproducible lab reports in R

## Why do we use R?

• Very nice reproducible lab reports (no copy-paste necessary!)
• Other advantages of R compared to (e.g.,) SPSS
• Free for everybody
• Customizable: people can create their own statistical functions
• State-of-the-art statistical methods are integrated very quickly
• Also some disadvantages:
• No substantial graphical user interface: typing instead of clicking
• Takes more time to learn
• State-of-the-art statistical methods sometimes contain bugs

## Basic functionality: R as calculator

# Addition (this is a comment: preceded by '#')
5 + 5

# [1] 10

# Multiplication
5 * 3

# [1] 15

# Division
5/3

# [1] 1.6667
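
A few more arithmetic operators follow the same pattern as the three shown above. The lines below are a small sketch added for illustration; they are not taken from the slides.

```r
# Exponentiation
2^3

# [1] 8

# Parentheses control the order of evaluation
(5 + 5) * 3

# [1] 30

5 + 5 * 3  # multiplication is evaluated before addition

# [1] 20

# Integer division and remainder
5 %/% 3

# [1] 1

5 %% 3

# [1] 2
```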


## Basic functionality: using variables

a <- 5  # store a single value; instead of '<-' you can also use '='
a  # display the value

# [1] 5

b <- a * a  # b contains the value of multiplying a with itself
b

# [1] 25

(d <- NA)  # set value of d to missing (NA) and show value

# [1] NA


## Storing multiple values in a variable

b <- c(2, 4, 6, 7, 8)  # store a series of values in a vector (reusing variable b)
b

# [1] 2 4 6 7 8

b[4] <- a  # assign value 5 (stored in 'a') to the 4th element of vector b
b

# [1] 2 4 6 5 8

b <- c(b, NA)  # add element NA to b
b

# [1]  2  4  6  5  8 NA
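
Square brackets can also be used to read values from a vector, not only to assign them. A minimal sketch, re-creating the vector b in its final state from this slide so that the example runs on its own:

```r
b <- c(2, 4, 6, 5, 8, NA)  # the vector b as it stands at the end of the slide

length(b)  # number of elements, including the NA

# [1] 6

b[2:4]  # elements 2 to 4

# [1] 4 6 5

b[-1]  # a negative index drops that element: everything except the first

# [1]  4  6  5  8 NA

is.na(b)  # TRUE where a value is missing

# [1] FALSE FALSE FALSE FALSE FALSE  TRUE
```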


## Basic functionality: using functions

b  # show values in variable b (b contains a vector: a list of values)

# [1]  2  4  6  5  8 NA

mn <- mean(b)  # calculating the mean and storing in variable mn
mn

# [1] NA

# mn is NA (missing) as one of the values is missing
mean(b, na.rm = TRUE)  # we can use the function parameter na.rm to ignore NAs

# [1] 5

# But which parameters does a function have: use help!
help(mean)  # alternatively: ?mean
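
The parameter na.rm is not specific to mean(): several other base R summary functions accept it as well. A short sketch on the same vector b (re-created here so the example is self-contained):

```r
b <- c(2, 4, 6, 5, 8, NA)  # same values as the vector b used above

sum(b)                    # NA: the missing value propagates
sum(b, na.rm = TRUE)      # 25
max(b, na.rm = TRUE)      # 8
median(b, na.rm = TRUE)   # 5
sum(is.na(b))             # number of missing values: 1

args(sd)  # lists the parameters of sd() without opening the help page
```

Whether a given function supports na.rm is best checked on its help page.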


## Getting data into R: importing a data set

setwd("C:/Users/Martijn/Desktop/Statistiek-I/HC1/")  # set working directory
dat <- read.csv("survey.csv", sep = ";", dec = ".")  # reads csv file from work dir
str(dat)  # shows structure of the data frame (i.e. table is 2-dimensional)

# 'data.frame': 315 obs. of  7 variables:
#  $ participant  : int  1 2 3 4 5 6 7 8 9 10 ...
#  $ year         : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
#  $ gender       : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 1 1 2 ...
#  $ bl_edu       : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 2 1 1 ...
#  $ study        : Factor w/ 4 levels "CIS","IS","LING",..: 3 3 3 1 3 3 3 3 3 2 ...
#  $ english_grade: num  6 7 8 7 7 8 7 8 6 8 ...
#  $ english_score: num  4.31 6.42 8.17 6.99 6.12 ...

dim(dat)  # number of rows and columns of data set

# [1] 315   7


## Investigating imported data set: using head

head(dat)  # show first few rows of dat

#   participant year gender bl_edu study english_grade english_score
# 1           1 2017      F      N  LING             6        4.3055
# 2           2 2017      M      N  LING             7        6.4152
# 3           3 2017      M      N  LING             8        8.1696
# 4           4 2017      F      N   CIS             7        6.9925
# 5           5 2017      F      N  LING             7        6.1165
# 6           6 2017      F      N  LING             8        7.3538


## Investigating imported data set: RStudio viewer

## Accessing table data using numbers

• Access parts of table by specifying row and/or column numbers
• dat[a,b]:
• a indicates the selected rows of dat
• b indicates the selected columns of dat

dat[1, ]  # values in first row (dat[,1]: values in first column)

#   participant year gender bl_edu study english_grade english_score
# 1           1 2017      F      N  LING             6        4.3055

dat[c(1, 5), c(1, 2, 3)]  # values in rows 1 and 5 and columns 1, 2 and 3

#   participant year gender
# 1           1 2017      F
# 5           5 2017      F


## Accessing table data using names (1)

• Additionally, we can access parts of the table by specifying the names of the columns we want to look at

dat[c(1, 3, 5), c("participant", "study")]  # rows 1, 3 and 5, and 2 named columns

#   participant study
# 1           1  LING
# 3           3  LING
# 5           5  LING


## Accessing table data using names (2)

• We may also select a single column by its name (not number) using the $ operator
• E.g., dat$gender accesses the column gender of dat

dat$gender

#   [1] F M M F F F F F F M M F M F F F F F F F F F F F M F M M F F F F F F F F M M F F
#  [41] F M F M F F M M M F F F M M F F F F F F F F M F M F F F M F F M M M M F F F F F
#  [81] F F F F F F M F F M M F M F M M F F F M F F F F M M F F F F M F F F M F F M M F
# [121] M M F F F F M F F F M F F M M F M F F M M F F M F F F F F F F M F F F F M F F F
# [161] F F F M M F F F M F F M F F F F F F F F F F F M F F M F F F F F F F F F F F F F
# [201] F F M F F F F F F F M F F F M M F F F F M F F F F M F F M F F M F M F F F F F M
# [241] F F M F F F F M M F F F M F F M F F M F M M M F F F M F F M M F F F F F F M F F
# [281] F M M F F F F F F M F F F F M F F M F M M F M M F F M M M F M F F M M
# Levels: F M
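
The import and indexing steps of the preceding slides can be tried without the course file. The sketch below writes the six rows shown by head(dat) to a temporary file and reads them back; the object name toy and the temporary file are assumptions made for this example, not part of the course material. Note that since R 4.0.0 read.csv() no longer converts text columns to factors by default, so stringsAsFactors = TRUE is set explicitly to obtain Factor columns like those in the str() output.

```r
# Toy data: the six rows shown by head(dat), in the same ';'-separated layout.
# The file is created by tempfile(); it is not the course's survey.csv.
csv_lines <- c(
  "participant;year;gender;bl_edu;study;english_grade;english_score",
  "1;2017;F;N;LING;6;4.3055",
  "2;2017;M;N;LING;7;6.4152",
  "3;2017;M;N;LING;8;8.1696",
  "4;2017;F;N;CIS;7;6.9925",
  "5;2017;F;N;LING;7;6.1165",
  "6;2017;F;N;LING;8;7.3538"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# stringsAsFactors = TRUE gives factor columns; the default is FALSE
# from R 4.0.0 onwards, which would give character columns instead
toy <- read.csv(path, sep = ";", dec = ".", stringsAsFactors = TRUE)

dim(toy)                                    # 6 rows, 7 columns
toy[1, ]                                    # first row
toy[c(1, 5), c(1, 2, 3)]                    # rows 1 and 5, columns 1 to 3
toy[c(1, 3, 5), c("participant", "study")]  # rows by number, columns by name
toy$gender                                  # a single column (here: a factor)
levels(toy$gender)                          # the factor levels: "F" "M"

unlink(path)  # remove the temporary file again
```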


## Storing accessed table data

tmp <- dat[5:8, c(1, 3)]  # store columns 1 and 3 for rows 5 to 8 in variable tmp
tmp  # show what is stored in variable tmp

#   participant gender
# 5           5      F
# 6           6      F
# 7           7      F
# 8           8      F


## Accessing table data using conditional indexing (1)

• Conditional indexing allows us to select parts of the data on the basis of conditions

tmp <- dat[dat$gender == "M", ]  # only observations for male participants
head(tmp)

#    participant year gender bl_edu study english_grade english_score
# 2            2 2017      M      N  LING             7        6.4152
# 3            3 2017      M      N  LING             8        8.1696
# 10          10 2017      M      N    IS             8        8.9375
# 11          11 2017      M      N   CIS             7        6.2686
# 13          13 2017      M      N   CIS             6        5.7744
# 25          25 2017      M      N OTHER             9        8.3227


## Accessing table data using conditional indexing (2)

• Methods to combine conditions:
• and: &
• or: |

# only participants who study IS *and* are male
tmp <- dat[dat$gender == "M" & dat$study == "IS", ]
head(tmp)

#    participant year gender bl_edu study english_grade english_score
# 10          10 2017      M      N    IS           8.0        8.9375
# 27          27 2017      M      N    IS           8.0        9.0563
# 28          28 2017      M      N    IS           7.0        7.9433
# 37          37 2017      M      N    IS           8.1        8.6273
# 44          44 2017      M      N    IS           6.0        6.1372
# 47          47 2017      M      N    IS           9.0        9.0493


## Accessing table data using conditional indexing (3)

• Invert a condition with ! (not)
• is not equal to: !=

# only women (i.e. not men) *or* everybody with an English grade over 7
tmp <- dat[dat$gender != "M" | dat$english_grade > 7, ]
tail(tmp)  # tail shows final 6 rows

#     participant year gender bl_edu study english_grade english_score
# 308         308 2016      M      N    IS           8.0        8.0876
# 309         309 2016      M      N    IS           8.0        7.8982
# 310         310 2016      F      N OTHER           8.0        8.4761
# 312         312 2016      F      N OTHER           7.0        6.8564
# 313         313 2016      F      N  LING           6.0        6.9069
# 315         315 2016      M      N    IS           7.5        5.7257


## Question 6

## Supplementing the data: adding columns (1)

• Frequently, you need to add columns to the data, for example by computing a new value on the basis of the values in two columns
• The operator $ helps us to do that

# new column 'diff': English grade - English proficiency score
dat$diff <- dat$english_grade - dat$english_score
head(dat)

#   participant year gender bl_edu study english_grade english_score       diff
# 1           1 2017      F      N  LING             6        4.3055  1.6945488
# 2           2 2017      M      N  LING             7        6.4152  0.5847601
# 3           3 2017      M      N  LING             8        8.1696 -0.1695835
# 4           4 2017      F      N   CIS             7        6.9925  0.0075054
# 5           5 2017      F      N  LING             7        6.1165  0.8835361
# 6           6 2017      F      N  LING             8        7.3538  0.6461548


## Supplementing the data: adding columns (2)

• Conditional indexing allows us additional flexibility

dat$pass_fail <- "PASS"  # new column, initially PASS for everybody
dat[dat$english_grade < 5.5, ]$pass_fail <- "FAIL"  # if grade too low, then FAIL
head(dat[dat$english_grade > 4 & dat$english_grade < 6, 2:9])  # show subset of data

#     year gender bl_edu study english_grade english_score     diff pass_fail
# 78  2017      F      N  LING           5.0        4.3000  0.70000      FAIL
# 122 2016      M      N  LING           5.0        5.5963 -0.59625      FAIL
# 209 2016      F      N  LING           5.5        4.3196  1.18042      PASS
# 284 2016      F      N  LING           5.0        5.3275 -0.32755      FAIL
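
A condition used for indexing is itself a vector of TRUE/FALSE values, one per row, which can be stored and inspected like any other variable. The sketch below illustrates this on a small data frame with hypothetical values (the object toy and its contents are invented for this example and are not the survey data). It also shows ifelse(), a base R function that assigns PASS or FAIL in a single step instead of the two steps used above.

```r
# Toy data: hypothetical values chosen for illustration only
toy <- data.frame(
  participant   = 1:6,
  gender        = c("F", "M", "M", "F", "F", "M"),
  study         = c("LING", "IS", "CIS", "IS", "LING", "IS"),
  english_grade = c(6, 8, 5, 7.5, 5.5, 9)
)

# A condition is a logical vector with one value per row
is_male <- toy$gender == "M"
is_male

# [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE

sum(is_male)  # TRUE counts as 1, so this is the number of matching rows

# [1] 3

toy[is_male & toy$study == "IS", ]       # male *and* IS: participants 2 and 6
toy[!is_male | toy$english_grade > 7, ]  # not male *or* grade over 7

# ifelse(condition, value if TRUE, value if FALSE)
toy$pass_fail <- ifelse(toy$english_grade < 5.5, "FAIL", "PASS")
toy[, c("participant", "english_grade", "pass_fail")]
```

As on the slide, a grade of exactly 5.5 counts as PASS because the condition uses < rather than <=.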


## Visualization in R

• Many basic visualization options are available in R
• In this course, we will learn how to use the following functions (list not exhaustive):
• barplot() (illustrated in the following)
• plot()
• boxplot()
• hist()
• qqnorm() and qqline()
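
The other functions in this list are called in much the same way as barplot(). A minimal sketch, using the grades, scores and genders of the first 13 participants listed under "Some of our data (1)"; the vector names and axis labels are choices made for this example:

```r
# Values of the first 13 participants (see "Some of our data (1)")
grade  <- c(6, 7, 8, 7, 7, 8, 7, 8, 6, 8, 7, 8, 6)
score  <- c(4.31, 6.42, 8.17, 6.99, 6.12, 7.35, 6.78,
            7.43, 6.04, 8.94, 6.27, 7.99, 5.77)
gender <- c("F", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "F", "M")

plot(grade, score, xlab = "English grade", ylab = "English score")  # scatter plot
boxplot(score ~ gender, xlab = "Gender", ylab = "English score")    # one box per group
hist(score, main = "English score", xlab = "Score")                 # distribution of one variable
qqnorm(score)  # sample quantiles against those of a normal distribution
qqline(score)  # adds a reference line to the existing Q-Q plot
```

With only 13 values the plots are merely a way to see what each function draws; when and why to use them is covered in later lectures.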

## Example of using a plotting function: barplot()

• The bar plot is used to visualize frequencies of categorical variables

(counts <- table(dat$gender))  # first create frequency table

#
#   F   M
# 222  93

barplot(counts)


## Graphical parameters

• There are various graphical parameters to allow you to customize your graphics

barplot(counts, col = c("pink", "lightblue"), ylim = c(0, 250), main = "My barplot",
    xlab = "Gender", ylab = "Frequency")


## Question 8

## Another example: segmented bar plot

(counts <- table(dat$gender, dat$study))

#
#     CIS IS LING OTHER
#   F  82 18   93    29
#   M  19 56    9     9

barplot(counts, col = c("pink", "lightblue"), legend = c("F", "M"), ylim = c(0, 105))


## Statistics in R

• The main purpose of R is to conduct statistical analyses
• Many different functions to obtain descriptive and inferential statistics are available in R
• The following examples illustrate how to conduct some statistical analyses in R
• The why and what is covered in the next lectures

## Descriptive statistics: central tendency

mean(dat$english_score)  # mean of all people for English score

# [1] 7.347

mean(dat[dat$gender == "F", ]$english_score)  # mean of women for English score

# [1] 7.2036

median(dat$english_score)  # median of all people for English score

# [1] 7.3538


## Descriptive statistics: spread

min(dat$english_score)  # minimum value

# [1] 4.3

max(dat$english_score)  # maximum value

# [1] 10

var(dat$english_score)  # variance: average squared deviation from mean (denominator n - 1)

# [1] 1.287
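
R's var() computes the sample variance: the sum of squared deviations from the mean is divided by n - 1 rather than n. A small worked sketch on five toy values (the English grades of the first five participants) makes the difference visible:

```r
x <- c(6, 7, 8, 7, 7)  # toy values: grades of participants 1 to 5

n   <- length(x)
dev <- x - mean(x)    # deviations from the mean (the mean is 7)

sum(dev^2) / (n - 1)  # what var() computes: 0.5
var(x)                # 0.5
sum(dev^2) / n        # dividing by n instead gives 0.4
sqrt(var(x))          # the standard deviation, identical to sd(x)
```

For a data set of 315 observations the two denominators give nearly the same value.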

sd(dat$english_score)  # standard deviation (square root of variance)

# [1] 1.1345


## Descriptive statistics: frequency tables

table(dat$gender)

#
#   F   M
# 222  93

table(dat$study)

#
#   CIS    IS  LING OTHER
#   101    74   102    38


## Question 9

## Descriptive statistics: cross-tables

table(dat$gender, dat$study)

#
#     CIS IS LING OTHER
#   F  82 18   93    29
#   M  19 56    9     9

table(dat$gender, dat$bl_edu)

#
#       N   Y
#   F 202  20
#   M  83  10


## Inferential statistics

• A large number of statistical inference functions are available in R
• In this course we will cover the following functions:
• t.test() for a $t$-test (single sample, paired, independent)
• wilcox.test() for non-parametric alternatives to the $t$-test (Mann Whitney U test, Wilcoxon signed-rank test)
• binom.test() for the sign test
• chisq.test() for the chi-square test
• cor() for the correlation
• alpha() (from package psych) for Cronbach's $\alpha$

## Example of inferential statistics: $t$-test

• Assessing average group differences for a numerical variable (lecture 4)

t.test(english_grade ~ bl_edu, data = dat)

#
#  Welch Two Sample t-test
#
# data:  english_grade by bl_edu
# t = -3.56, df = 37.4, p-value = 0.001
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.80665 -0.22212
# sample estimates:
# mean in group N mean in group Y
#          7.2323          7.7467


## Example of inferential statistics: correlation

• Assessing strength of the relationship between two numerical variables (lecture 6)

cor(dat$english_score, dat$english_grade)

# [1] 0.74346
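
Both functions return values that can be stored and inspected piece by piece. The sketch below uses the first 13 participants listed under "Some of our data (1)"; the data frame name dat13 is an assumption made for this example. Groups are formed by gender rather than bl_edu, because this 13-row excerpt contains only one participant with bl_edu equal to Y. With so few observations the numbers are not meaningful; the point is only how the results are accessed.

```r
# First 13 participants (see "Some of our data (1)")
dat13 <- data.frame(
  gender        = c("F", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "F", "M"),
  english_grade = c(6, 7, 8, 7, 7, 8, 7, 8, 6, 8, 7, 8, 6),
  english_score = c(4.31, 6.42, 8.17, 6.99, 6.12, 7.35, 6.78,
                    7.43, 6.04, 8.94, 6.27, 7.99, 5.77)
)

# t.test() returns a list; by default it runs Welch's two-sample test
tt <- t.test(english_score ~ gender, data = dat13)
tt$statistic  # t value
tt$parameter  # degrees of freedom
tt$p.value    # p-value
tt$conf.int   # 95% confidence interval for the difference in means
tt$estimate   # the two group means

# cor() returns a single number (Pearson correlation by default)
r <- cor(dat13$english_score, dat13$english_grade)
r
```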


## Recap

• In this lecture, we've covered the basics of R
• R as calculator
• Variables
• Functions and help
• Importing data in R
• Viewing and modifying data
• Some examples of visualizations
• Some examples of descriptive and inferential statistics
• See Levshina (Ch. 2) for more information about the functionality of R
• In the lab session, you will experiment with using R
• Next lecture: Descriptive statistics
• Evaluation: if at least 50 people evaluate the lecture, an exam-like question will be made visible

## Questions?

Thank you for your attention!