Martijn Wieling

University of Groningen

- Why use statistics?
  - Summarize data (*descriptive statistics*)
  - Assess relationships in data (*inferential statistics*)
- Introduction to RStudio and `R`
- Variables and functions
- Import, view and modify data in `R`
- Some visualization and statistics in `R`

- Descriptive statistics vs. inferential statistics
- Sample vs. population
- (Types of) variables
- Distribution of a variable
- Measures of central tendency
- Measures of variation
- Standardized scores
- Checking for a normal distribution

- Descriptive statistics:
  - Statistics used to **describe** (sample) data without further conclusions
    - Measures of **central tendency**: mean, median, mode
    - Measures of **variation** (or spread): range, IQR, variance, standard deviation
- Inferential statistics:
  - Investigate data of a **sample** in order to infer patterns in the **population**
    - Statistical tests: \(t\)-test, \(\chi^2\)-test, etc.

- Studying the whole population is (almost always) practically impossible
- Sample is a (selected) subset of population and thus more accessible
  - Selection of a **representative** sample is very important!

- Statistics always involves variables
- Relations: involve two variables (e.g., English grade vs. English score)

- The values of the variables indicate properties of the **cases** (i.e. the individuals or entities you study)

| Variable | Example values |
|---|---|
| Gender | male, female, ... |
| English grade | 4.6, 5.5, 6.3, 7.2, ... |
| Year of birth | 1990, 1991, 1993, ... |
| Native language | Dutch, German, English, ... |

- Each case is shown in a row
- Each variable in a column
- For part of our data:

| participant | year | gender | bl_edu | study | english_grade | english_score |
|---|---|---|---|---|---|---|
| 405 | 2021 | F | N | CIS | 8 | 8.21 |
| 406 | 2021 | F | N | LING | 7 | 7.78 |
| 407 | 2021 | F | N | OTHER | 7 | 8.13 |
| 408 | 2021 | F | N | OTHER | 8 | 9.36 |
| 409 | 2021 | M | N | IS | 8 | 7.98 |
| 410 | 2021 | F | N | LING | 7 | 7.66 |
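As a minimal sketch, the six rows shown above can be entered by hand as a data frame to see the row/column layout in `R`. The name `dat_part` is hypothetical and only used for illustration (the full data set is normally imported from a file).

```r
# Hypothetical object holding only the six example rows shown above
dat_part <- data.frame(
  participant   = 405:410,
  year          = rep(2021, 6),
  gender        = c("F", "F", "F", "F", "M", "F"),
  bl_edu        = rep("N", 6),
  study         = c("CIS", "LING", "OTHER", "OTHER", "IS", "LING"),
  english_grade = c(8, 7, 7, 8, 8, 7),
  english_score = c(8.21, 7.78, 8.13, 9.36, 7.98, 7.66)
)

nrow(dat_part)  # number of cases (rows)
ncol(dat_part)  # number of variables (columns)
str(dat_part)   # type of each variable
```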

- **Nominal** (categorical) scale: unordered categories
  - Gender (frequently binary: two categories), native language, etc.
- **Ordinal** scale: ordered (ranked), but amount of difference unclear
  - Rank of English proficiency (in class), *Likert scale* (rate on a scale from 1 to 5...)
- **Interval** scale: numerical with meaningful difference but no true 0
  - Year of birth, temperature in Celsius
- **Ratio** scale: numerical with meaningful difference and true 0
  - Number of questions correct, age

- Scale of variable determines possible statistics
- E.g., mean age is possible, but not mean native language
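A minimal sketch of how the scale restricts the available statistics in `R`. The vectors `age` and `native` contain made-up values and are for illustration only.

```r
# Made-up example values (not taken from the course data)
age    <- c(19, 21, 20, 23, 20)               # ratio scale
native <- factor(c("Dutch", "German", "Dutch",
                   "English", "Dutch"))       # nominal scale

mean(age)      # meaningful: numerical variable
table(native)  # for nominal data, frequencies are what can be reported

# The mean of a nominal variable is not meaningful:
# R returns NA (and normally prints a warning)
suppressWarnings(mean(native))
```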

```
table(dat$gender) # absolute frequencies
```

```
#
# F M
# 346 154
```

```
prop.table(table(dat$gender)) # relative frequencies
```

```
#
# F M
# 0.692 0.308
```
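Relative frequencies are the absolute frequencies divided by the total number of cases. A minimal sketch, using a hypothetical vector `gender` rebuilt from the counts shown above rather than the original `dat`:

```r
# Hypothetical vector rebuilt from the counts above (346 F, 154 M)
gender <- rep(c("F", "M"), times = c(346, 154))

abs_freq <- table(gender)             # absolute frequencies
rel_freq <- abs_freq / sum(abs_freq)  # same result as prop.table(abs_freq)

rel_freq
round(100 * rel_freq, 1)              # expressed as percentages
```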

```
par(mfrow = c(1, 2))
barplot(table(dat$gender), col = c("pink", "lightblue"), main = "abs.frq.")
barplot(prop.table(table(dat$gender)), col = c("pink", "lightblue"), main = "rel.frq.")
```

```
pie(table(dat$study), col = c("red", "green", "blue", "yellow"))
```

- We are generally not interested in individual values of a variable, but rather all values and their frequency
- This is captured by a **distribution**
  - Famous distribution: **Normal distribution** ("bell-shaped" curve): e.g., IQ scores
- Distribution of a variable shows variability of variable (i.e. frequency of values)

```
table(dat$english_grade)
```

```
#
# 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5
# 6 2 76 12 191 21 148 15 26 3
```

- **Histogram** shows frequency of all values in groups: \((a,b]\)
  - Look for general pattern, symmetry, outliers

```
hist(dat$english_grade, xlab = "English grade", main = "")
```

- **Density curve** shows area proportional to the relative frequency

```
plot(density(dat$english_grade), main = "", xlab = "English grade")
```

- The total area under a density curve is equal to 1
- A density curve does not provide information about the frequency of one value
  - E.g., there might be no one who has scored a grade of exactly 6.1
- It only provides information about an **interval**
  - E.g., more than 50% of the grades lie between 5.5 and 7.5
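The interval statement can be checked on the data by counting. A minimal sketch, assuming a hypothetical vector `grades` rebuilt from the frequency table of `dat$english_grade` shown earlier (so not the original data object):

```r
# Hypothetical vector rebuilt from the frequency table shown earlier
grades <- rep(c(5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5),
              times = c(6, 2, 76, 12, 191, 21, 148, 15, 26, 3))

# Proportion of grades in the interval (5.5, 7.5]
mean(grades > 5.5 & grades <= 7.5)

# Number of cases with a grade of exactly 6.1
sum(grades == 6.1)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is why `mean()` can be used for counting here.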

- The normal distribution has convenient characteristics
- Completely symmetric
- Red area: (about) 68%
- Red and green area: (about) 95%
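These percentages can be verified with `pnorm()`, which gives the area under the normal curve to the left of a value. The sketch assumes that the red area covers values within one standard deviation of the mean, and that the red and green areas together cover values within two standard deviations (the usual reading of such a figure).

```r
# Area within 1 standard deviation of the mean (standard normal distribution)
within_1sd <- pnorm(1) - pnorm(-1)

# Area within 2 standard deviations of the mean
within_2sd <- pnorm(2) - pnorm(-2)

round(within_1sd, 3)  # about 0.683
round(within_2sd, 3)  # about 0.954
```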

- A distribution can also be characterized by measures of **center** and **variation**
  - (*skewness* measures the symmetry of the distribution; not covered in this course)

- **Mode**: most frequent element (for nominal data: *only* meaningful measure)
- **Median**: when data is sorted from small to large, it is the middle value
- **Mean**: arithmetical average

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i\]

- (You need to be able to calculate the mean of a series of values by hand, so it is useful to remember the formula)
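A minimal sketch of the formula applied step by step, using a made-up vector `x` of five grades (illustrative values only):

```r
# Made-up example values
x <- c(6, 7, 7, 8, 9.5)

n     <- length(x)  # number of values
total <- sum(x)     # x_1 + x_2 + ... + x_n
total / n           # the mean according to the formula

mean(x)             # the built-in function gives the same result
```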

```
mean(dat$english_grade) # arithmetic average
```

```
# [1] 7.2828
```

```
median(dat$english_grade)
```

```
# [1] 7
```

```
# no built-in function to get mode: new function
my_mode <- function(x) {
  counts <- table(x)                   # frequency of each value
  names(which(counts == max(counts)))  # value(s) with highest frequency
}
my_mode(dat$english_grade)
```

```
# [1] "7"
```

- **Quantiles**: cutpoints to divide the sorted data in subsets of equal size
  - Quartiles: three cutpoints to divide the data in **four** equal-sized sets
    - \(q_1\) (1st quartile): cutpoint between 1st and 2nd group
    - \(q_2\) (2nd quartile): cutpoint between 2nd and 3rd group (= **median**!)
    - \(q_3\) (3rd quartile): cutpoint between 3rd and 4th group
  - Percentiles: divide data in **hundred** equal-sized subsets
    - \(q_1\) = 25th percentile
    - \(q_2\) (= median) = 50th percentile
    - Score at \(n\)th percentile is better than \(n\)% of scores

```
quantile(dat$english_grade) # default: quartiles
```

```
# 0% 25% 50% 75% 100%
# 5.0 7.0 7.0 8.0 9.5
```

- **Minimum, maximum**: lowest and highest value
- **Range**: difference between minimum and maximum
- **Interquartile range** (IQR): \(q_3\) - \(q_1\)
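These measures are all available in `R`. A minimal sketch, assuming a hypothetical vector `grades` rebuilt from the frequency table of `dat$english_grade` shown earlier (so not the original data object):

```r
# Hypothetical vector rebuilt from the frequency table shown earlier
grades <- rep(c(5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5),
              times = c(6, 2, 76, 12, 191, 21, 148, 15, 26, 3))

min(grades)          # lowest value
max(grades)          # highest value
range(grades)        # returns minimum and maximum (not their difference)
diff(range(grades))  # range as defined above: maximum - minimum
IQR(grades)          # interquartile range: q3 - q1
```

Note that `range()` returns two values, so `diff()` is needed to obtain the range as a single number.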

- A **box plot** is used to visualize variation of a variable
  - Box (IQR): \(q_1\) (bottom), median (thickest line), \(q_3\) (top)
    - (In example below, \(q_1\) and median have the same value)
  - Whiskers: maximum (top) and minimum (bottom) non-outlier value
  - Circle(s): outliers (> 1.5 IQR distance from box)

```
boxplot(dat$english_grade, col = "red")
```
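The outlier rule can be worked out by hand. A minimal sketch, assuming a hypothetical vector `grades` rebuilt from the frequency table of `dat$english_grade` shown earlier; the names `lower_fence` and `upper_fence` are introduced here for illustration only:

```r
# Hypothetical vector rebuilt from the frequency table shown earlier
grades <- rep(c(5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5),
              times = c(6, 2, 76, 12, 191, 21, 148, 15, 26, 3))

q   <- quantile(grades, probs = c(0.25, 0.75))  # q1 and q3
iqr <- q[[2]] - q[[1]]                          # interquartile range

lower_fence <- q[[1]] - 1.5 * iqr  # below this value: outlier
upper_fence <- q[[2]] + 1.5 * iqr  # above this value: outlier

# Values outside the fences are the ones drawn as circles
outliers <- grades[grades < lower_fence | grades > upper_fence]
table(outliers)
```

`boxplot()` itself determines the box via `boxplot.stats()`, which uses hinges rather than `quantile()`; for some data sets these can differ slightly from the quartiles, but for this vector they coincide.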