Martijn Wieling

University of Groningen

- Dataset for this lecture
- Comparing one or two groups: \(t\)-test
- Non-parametric alternatives: Mann-Whitney \(U\) and Wilcoxon signed rank
- Assessing the dependency between two categorical variables: \(\chi^2\) test
- Comparing more than two groups: ANOVA

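- This lecture focuses on *how-to-use* and *when-to-use*, rather than on the underlying calculations
- Make sure to report the **effect size**, as significance depends on sample size: the same \(p\)-value can result from a tiny difference in a large sample or a large difference in a small sample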

Difference (in \(s\)) | \(n\) | \(p\) |
---|---|---|
0.01 | 40,000 | 0.05 |
0.10 | 400 | 0.05 |
0.25 | 64 | 0.05 |
0.54 | 16 | 0.05 |
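
The table can be approximately reproduced in base R, assuming it refers to a one-sample \(t\)-test in which the difference is expressed in standard deviations, so that \(t = d\sqrt{n}\) with \(n - 1\) degrees of freedom. This is a sketch under that assumption; the helper `p_for_effect` is hypothetical.

```r
# Hypothetical helper (not from the lecture): two-sided p-value of a one-sample
# t-test when the observed difference is d standard deviations and the sample size is n
p_for_effect <- function(d, n) {
  t_stat <- d * sqrt(n)              # t-value implied by the effect size
  2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value
}

d <- c(0.01, 0.10, 0.25, 0.54)
n <- c(40000, 400, 64, 16)
p <- mapply(p_for_effect, d, n)
round(p, 3)  # all values are close to 0.05
```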

```
load("dat.rda")
head(dat)
```

```
# Speaker Language PronDist PronDistCat LangDist LangDistAlt Age Gender AEO LR NrLang
# 1 arabic1 arabic 0.185727 Different 0.63699 0.44864 38 F 12 4 0
# 2 arabic10 arabic -0.172175 Similar 0.63699 0.44864 26 M 5 2 2
# 3 arabic13 arabic -0.035423 Similar 0.63699 0.44864 25 M 15 1 2
# 4 arabic12 arabic 0.372547 Different 0.63699 0.44864 32 M 11 8 0
# 5 arabic17 arabic -0.175237 Similar 0.63699 0.44864 35 M 15 0 1
# 6 arabic18 arabic 0.168120 Different 0.63699 0.44864 18 M 6 0 1
```

```
str(dat)
```

```
# 'data.frame': 712 obs. of 11 variables:
# $ Speaker : Factor w/ 712 levels "afrikaans1","afrikaans2",..: 21 22 25 24 27 28 26 30 31 23 ...
# $ Language : Factor w/ 159 levels "afrikaans","agni",..: 7 7 7 7 7 7 7 7 7 7 ...
# $ PronDist : num 0.1857 -0.1722 -0.0354 0.3725 -0.1752 ...
# $ PronDistCat: Factor w/ 2 levels "Different","Similar": 1 2 2 1 2 1 1 2 2 2 ...
# $ LangDist : num 0.637 0.637 0.637 0.637 0.637 ...
# $ LangDistAlt: num 0.449 0.449 0.449 0.449 0.449 ...
# $ Age : num 38 26 25 32 35 18 22 36 23 30 ...
# $ Gender : Factor w/ 2 levels "F","M": 1 2 2 2 2 2 2 2 1 1 ...
# $ AEO : num 12 5 15 11 15 6 16 12 10 14 ...
# $ LR : num 4 2 1 8 0 0 0 1 0 4 ...
# $ NrLang : int 0 2 2 0 1 1 2 2 2 1 ...
```

- Values of two groups (or of one group vs. a fixed value) can be compared using the \(t\)-test
- Assumptions:
  - Randomly selected sample(s)
  - Independent observations (except for paired data)
  - Data has an interval scale (the difference between two values is meaningful) or a ratio scale (meaningful differences and a true zero)
    - E.g., interval scale: temperature in °C; ratio scale: length in cm
  - Data in sample(s) normally distributed (mainly important for small samples, \(N \leq 30\))
  - Variances in samples homogeneous (Welch's adjustment, the default in `R`, corrects for this)
- Note: a *Likert scale* yields ordinal data, so the \(t\)-test is in principle not adequate
  - But in practice this is not problematic (De Winter & Dodou, 2011)

**Visualize** the data if possible (this facilitates interpretation)

- The result of a \(t\)-test is a \(t\)-value, which is compared to the appropriate \(t\)-distribution to obtain a \(p\)-value
- The shape of the \(t\)-distribution depends on the degrees of freedom (therefore: report the df!)
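
To see why the degrees of freedom matter, the sketch below computes the two-sided \(p\)-value for the same, arbitrarily chosen, \(t\)-value of 2.1 under different degrees of freedom, using base R's `pt()` and `qt()`:

```r
t_value <- 2.1  # arbitrary t-value for illustration

# Two-sided p-values for the same t-value with different degrees of freedom
p_df5   <- 2 * pt(-t_value, df = 5)
p_df100 <- 2 * pt(-t_value, df = 100)
c(df5 = p_df5, df100 = p_df100)  # only the second is below 0.05

# Critical t-values (two-sided, alpha = 0.05) shrink towards 1.96 as df grows
crit <- qt(0.975, df = c(5, 21, 158, Inf))
round(crit, 2)
```

With few degrees of freedom the \(t\)-distribution has heavier tails, so a larger \(t\)-value is needed to reach significance.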

```
german <- droplevels(dat[dat$Language == "german", ])
boxplot(german$PronDist)
abline(h = 0, col = "red", lty = 2)
```

```
t.test(german$PronDist, mu = 0)
```

```
#
# One Sample t-test
#
# data: german$PronDist
# t = -5.33, df = 21, p-value = 2.7e-05
# alternative hypothesis: true mean is not equal to 0
# 95 percent confidence interval:
# -0.208787 -0.091657
# sample estimates:
# mean of x
# -0.15022
```

```
library(lsr)
cohensD(german$PronDist, mu = 0)
```

```
# [1] 1.1373
```

- Cohen's \(d\) measures the difference in terms of the number of standard deviations
- Rough guideline: Cohen's \(d\) < 0.3: small effect size; 0.3 - 0.8: medium; > 0.8: large
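
For a one-sample test, Cohen's \(d\) and the \(t\)-value express the same difference, once in standard deviations and once in standard errors. A sketch on hypothetical toy data (not the lecture dataset), computed by hand in base R:

```r
# Hypothetical toy sample (not the lecture data)
x  <- c(-0.31, -0.12, -0.05, -0.22, 0.04, -0.18, -0.09, -0.27)
mu <- 0
n  <- length(x)

# Cohen's d: difference from mu expressed in standard deviations
d <- abs(mean(x) - mu) / sd(x)

# t-value: the same difference expressed in standard errors (sd / sqrt(n))
t_stat <- (mean(x) - mu) / (sd(x) / sqrt(n))

# Both are linked via d = |t| / sqrt(n)
all.equal(d, abs(t_stat) / sqrt(n))
all.equal(t_stat, unname(t.test(x, mu = mu)$statistic))
```

For the German speakers (\(n = 22\), since df = 21), this relation gives \(5.33 / \sqrt{22} \approx 1.14\), in line with the `cohensD()` output above.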

- Install the *Mathematical Biostatistics Boot Camp* swirl course:

```
library(swirl)
install_from_swirl("Mathematical_Biostatistics_Boot_Camp")
```

- Run `swirl()` in RStudio and finish the following lesson of the *Mathematical Biostatistics Boot Camp* course: *Lesson 1*: One Sample t-test

```
# aggregate data per language (159 languages)
lang <- aggregate(cbind(LangDist, LangDistAlt) ~ Language, data = dat, FUN = mean)
par(mfrow = c(1, 2))
boxplot(lang[, c("LangDist", "LangDistAlt")])
boxplot(lang$LangDist - lang$LangDistAlt, main = "Pairwise differences")
```

```
t.test(lang$LangDist, lang$LangDistAlt, paired = TRUE)
```

```
#
# Paired t-test
#
# data: lang$LangDist and lang$LangDistAlt
# t = -3.73, df = 158, p-value = 0.00027
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.085703 -0.026367
# sample estimates:
# mean of the differences
# -0.056035
```

```
t.test(lang$LangDist - lang$LangDistAlt, mu = 0) # identical to one-sample test of differences
```

```
#
# One Sample t-test
#
# data: lang$LangDist - lang$LangDistAlt
# t = -3.73, df = 158, p-value = 0.00027
# alternative hypothesis: true mean is not equal to 0
# 95 percent confidence interval:
# -0.085703 -0.026367
# sample estimates:
# mean of x
# -0.056035
```

```
cohensD(lang$LangDist, lang$LangDistAlt, method = "paired") # effect size
```

```
# [1] 0.29585
```
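
The equivalence between a paired \(t\)-test and a one-sample \(t\)-test on the differences can be checked on hypothetical toy data. The sketch also computes the paired effect size as the mean difference divided by the standard deviation of the differences; we assume this matches what `cohensD(..., method = "paired")` computes, so treat that link as an assumption.

```r
# Hypothetical paired toy data (two measurements per unit)
a <- c(0.62, 0.55, 0.71, 0.48, 0.66, 0.59)
b <- c(0.60, 0.58, 0.64, 0.50, 0.61, 0.57)
diffs <- a - b

paired_test  <- t.test(a, b, paired = TRUE)
onesamp_test <- t.test(diffs, mu = 0)

# Same t-value and degrees of freedom for both tests
c(paired = unname(paired_test$statistic), one_sample = unname(onesamp_test$statistic))

# Paired Cohen's d: mean of the differences in units of their standard deviation
d_paired <- abs(mean(diffs)) / sd(diffs)
d_paired
```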

```
rusger <- droplevels(dat[dat$Language %in% c("russian", "german"), ])
boxplot(PronDist ~ Language, data = rusger)
```

```
t.test(PronDist ~ Language, data = rusger, alternative = "two.sided")
```

```
#
# Welch Two Sample t-test
#
# data: PronDist by Language
# t = -3.56, df = 42.5, p-value = 0.00092
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.267719 -0.074108
# sample estimates:
# mean in group german mean in group russian
# -0.150222 0.020691
```

```
cohensD(PronDist ~ Language, data = rusger)
```

```
# [1] 1.0166
```

- Pronunciation difference from native English was smaller for the German speakers (mean: \(-0.15\), sd: \(0.132\)) than for the Russian speakers (mean: \(0.02\), sd: \(0.194\)). The difference was \(-0.17\) (Cohen's \(d\): \(1.02\), large effect) and reached significance using an independent samples Welch's unequal variances \(t\)-test at an \(\alpha\)-level of \(0.05\), \(t(42.5) = -3.56, p < 0.001\).
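
As a sanity check, the reported effect size can be approximately recomputed from the summary statistics above, assuming `cohensD()` uses its default pooled standard deviation. The group sizes are inferred rather than stated: 22 German speakers (one-sample df = 21) and 25 Russian speakers (Levene's test df = 1 and 45, so 47 speakers in total). Because the standard deviations are rounded, the result only approximately matches 1.0166.

```r
# Summary statistics as reported above (sds rounded to 3 decimals)
m_ger <- -0.150222; s_ger <- 0.132; n_ger <- 22  # n inferred from df = 21
m_rus <-  0.020691; s_rus <- 0.194; n_rus <- 25  # n inferred from Levene's test df

# Pooled standard deviation of both groups
s_pooled <- sqrt(((n_ger - 1) * s_ger^2 + (n_rus - 1) * s_rus^2) /
                 (n_ger + n_rus - 2))

# Cohen's d: difference in means in units of the pooled standard deviation
d <- abs(m_ger - m_rus) / s_pooled
round(d, 2)
```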

- ✓ Randomly selected sample(s)
- ✓ Independent observations (except for pairs)
- ✓ Data has interval or ratio scale
- ? Variance in samples homogeneous (corrected with Welch's adjustment)
- ? Data in compared samples are **normally distributed** (for \(N \leq 30\))

- Testing homoscedasticity using Levene's test

```
library(car)
leveneTest(PronDist ~ Language, data = rusger)
```

```
# Levene's Test for Homogeneity of Variance (center = median)
# Df F value Pr(>F)
# group 1 5 0.03 *
# 45
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
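
`leveneTest()` with `center = median` (as shown in the output header) is, in essence, a one-way ANOVA on the absolute deviations of each value from its group median. A base-R sketch on hypothetical toy data:

```r
# Hypothetical toy data: group b is more spread out than group a
g <- factor(rep(c("a", "b"), each = 5))
y <- c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)

# Absolute deviation of each value from its own group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on these deviations gives Levene's F-value
levene_F <- anova(lm(abs_dev ~ g))[["F value"]][1]
levene_F
```

With `car` loaded, `leveneTest(y ~ g)` should return the same \(F\)-value.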

- Levene's test indicates that the variances differ significantly (\(p = 0.03\)), so the default Welch's adjustment is warranted
- Note, however, that Welch's \(t\)-test can always be used: it is more robust, and its power is comparable to that of the standard (Student's) \(t\)-test
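
The difference between both variants shows up in the degrees of freedom. A sketch on hypothetical toy samples with unequal variances, using the `var.equal` argument of `t.test()`:

```r
# Hypothetical toy samples with clearly different variances
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10, 12)

welch   <- t.test(x, y)                    # Welch's t-test (R default)
student <- t.test(x, y, var.equal = TRUE)  # Student's t-test

# Student's df is n1 + n2 - 2; Welch's (Welch-Satterthwaite) df is smaller here
c(welch = unname(welch$parameter), student = unname(student$parameter))
```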

- For investigating normality, a normal quantile plot can be used

```
russian <- droplevels(dat[dat$Language == "russian", ])
qqnorm(russian$PronDist) # plot actual values vs. theoretical quantiles
qqline(russian$PronDist) # plot reference line of normal distribution
```
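
What `qqnorm()` plots can be reconstructed by hand: each observed value is paired with the theoretical normal quantile at the plotting position matching its rank. A sketch on a hypothetical toy sample without ties:

```r
# Hypothetical toy sample (no ties)
y <- c(0.12, -0.30, 0.05, 0.41, -0.08, 0.22, -0.15, 0.33)

# Theoretical normal quantiles at the plotting positions used by qqnorm()
theo <- qnorm(ppoints(length(y)))

# Pair each observed value with the quantile matching its rank
qq <- qqnorm(y, plot.it = FALSE)
all.equal(qq$x, theo[rank(y)])
```

If the data are normally distributed, the points lie close to a straight line, which is what `qqline()` draws as a reference.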

- Alternatively, one can use the Shapiro-Wilk test of normality

```
shapiro.test(russian$PronDist)
```

```
#
# Shapiro-Wilk normality test
#
# data: russian$PronDist
# W = 0.958, p-value = 0.38
```
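
A non-significant Shapiro-Wilk result, such as \(p = 0.38\) for the Russian speakers, only means that there is no evidence against normality; with small samples the test has little power. For contrast, a sketch on simulated, clearly right-skewed data:

```r
set.seed(1)
skewed <- rexp(200)   # simulated right-skewed (exponential) data
sw <- shapiro.test(skewed)
sw$statistic          # W: values well below 1 indicate a departure from normality
sw$p.value < 0.05     # normality is rejected for these data
```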

```
qqnorm(german$PronDist)
qqline(german$PronDist)
```