# Basic statistical tests

Martijn Wieling
University of Groningen

## This lecture

• Dataset for this lecture
• Comparing one or two groups: $$t$$-test
• Non-parametric alternatives: Mann-Whitney U and Wilcoxon signed rank
• Assessing the dependency between two categorical variables: $$\chi^2$$ test
• Comparing more than two groups: ANOVA

## Some basic points

• This lecture focuses on how-to-use and when-to-use, rather than on the underlying calculations
• If you want more information about the tests and concepts illustrated in this lecture, I recommend the books by Levshina, Winter or (free) Navarro
• Make sure to report effect size as significance is dependent on sample size

| Difference (in standard deviations $$s$$) | $$n$$ | $$p$$ |
|---|---|---|
| 0.01 | 40,000 | 0.05 |
| 0.10 | 400 | 0.05 |
| 0.25 | 64 | 0.05 |
| 0.54 | 16 | 0.05 |
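
The figures above can be reproduced with a few lines of R. This is a minimal sketch that assumes a two-sided one-sample $$t$$-test (the slide does not state which test was used), so that $$t = d \cdot \sqrt{n}$$ with $$n - 1$$ degrees of freedom; the variable names are illustrative.

```r
# Differences (in standard deviations) and sample sizes from the table
d <- c(0.01, 0.10, 0.25, 0.54)
n <- c(40000, 400, 64, 16)

# Assumption: one-sample t-test, so t = d * sqrt(n) with n - 1 degrees of freedom
t_stat <- d * sqrt(n)
p <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value

# All four combinations end up close to p = 0.05
data.frame(difference = d, n = n, t = round(t_stat, 2), p = round(p, 2))
```

The same $$p$$-value therefore corresponds to very different effect sizes, which is why the effect size should be reported alongside it.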

## Dataset for this lecture

load("dat.rda")

#    Speaker Language  PronDist PronDistCat LangDist LangDistAlt Age Gender AEO LR NrLang
# 1  arabic1   arabic  0.185727   Different  0.63699     0.44864  38      F  12  4      0
# 2 arabic10   arabic -0.172175     Similar  0.63699     0.44864  26      M   5  2      2
# 3 arabic13   arabic -0.035423     Similar  0.63699     0.44864  25      M  15  1      2
# 4 arabic12   arabic  0.372547   Different  0.63699     0.44864  32      M  11  8      0
# 5 arabic17   arabic -0.175237     Similar  0.63699     0.44864  35      M  15  0      1
# 6 arabic18   arabic  0.168120   Different  0.63699     0.44864  18      M   6  0      1


## Dataset structure

str(dat)

# 'data.frame': 712 obs. of  11 variables:
#  $ Speaker    : Factor w/ 712 levels "afrikaans1","afrikaans2",..: 21 22 25 24 27 28 26 30 31 23 ...
#  $ Language   : Factor w/ 159 levels "afrikaans","agni",..: 7 7 7 7 7 7 7 7 7 7 ...
#  $ PronDist   : num  0.1857 -0.1722 -0.0354 0.3725 -0.1752 ...
#  $ PronDistCat: Factor w/ 2 levels "Different","Similar": 1 2 2 1 2 1 1 2 2 2 ...
#  $ LangDist   : num  0.637 0.637 0.637 0.637 0.637 ...
#  $ LangDistAlt: num  0.449 0.449 0.449 0.449 0.449 ...
#  $ Age        : num  38 26 25 32 35 18 22 36 23 30 ...
#  $ Gender     : Factor w/ 2 levels "F","M": 1 2 2 2 2 2 2 2 1 1 ...
#  $ AEO        : num  12 5 15 11 15 6 16 12 10 14 ...
#  $ LR         : num  4 2 1 8 0 0 0 1 0 4 ...

# German speakers only (subset assumed here; defined analogously to 'russian' later on)
german <- droplevels(dat[dat$Language == "german", ])
boxplot(german$PronDist)
abline(h = 0, col = "red", lty = 2)

## Group mean vs. value: one sample $$t$$-test

t.test(german$PronDist, mu = 0)

#
#   One Sample t-test
#
# data:  german$PronDist
# t = -5.33, df = 21, p-value = 2.7e-05
# alternative hypothesis: true mean is not equal to 0
# 95 percent confidence interval:
#  -0.208787 -0.091657
# sample estimates:
# mean of x
#  -0.15022

## One sample $$t$$-test: effect size

library(lsr)
cohensD(german$PronDist, mu = 0)

# [1] 1.1373

• Cohen's $$d$$ measures the difference in terms of the number of standard deviations
• Rough guideline: Cohen's $$d$$ < 0.3: small effect size; 0.3 - 0.8: medium; > 0.8: large
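
To make the definition concrete, the one-sample Cohen's $$d$$ can be computed by hand as the absolute distance between the sample mean and the reference value, divided by the sample standard deviation. The sketch below uses simulated values as a stand-in for `german$PronDist` (the data and variable names are hypothetical) and also shows the link with the $$t$$-value: $$d = |t| / \sqrt{n}$$.

```r
set.seed(1)
# Simulated stand-in for the 22 German speakers (hypothetical values)
x <- rnorm(22, mean = -0.15, sd = 0.13)

# Cohen's d by hand: distance from mu in units of the sample standard deviation
mu <- 0
d_manual <- abs(mean(x) - mu) / sd(x)

# The same value follows from the t-statistic: d = |t| / sqrt(n)
t_res <- t.test(x, mu = mu)
d_from_t <- abs(unname(t_res$statistic)) / sqrt(length(x))

all.equal(d_manual, d_from_t)
```

Applied to the output reported above, $$5.33 / \sqrt{22} \approx 1.14$$, which agrees with the value returned by `cohensD()`.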

## Try it yourself!

• Install the Mathematical Biostatistics Boot Camp swirl course:
library(swirl)
install_from_swirl("Mathematical_Biostatistics_Boot_Camp")

• Run swirl() in RStudio and finish the following lesson of the Mathematical Biostatistics Boot Camp course:
• Lesson 1: One Sample t-test

## Comparing paired data: visualization

# aggregate data per language (159 languages)
lang <- aggregate(cbind(LangDist, LangDistAlt) ~ Language, data = dat, FUN = mean)
par(mfrow = c(1, 2))
boxplot(lang[, c("LangDist", "LangDistAlt")])
boxplot(lang$LangDist - lang$LangDistAlt, main = "Pairwise differences")


## Paired samples $$t$$-test

t.test(lang$LangDist, lang$LangDistAlt, paired = TRUE)

#
#   Paired t-test
#
# data:  lang$LangDist and lang$LangDistAlt
# t = -3.73, df = 158, p-value = 0.00027
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.085703 -0.026367
# sample estimates:
# mean of the differences
#               -0.056035


## Paired samples $$t$$-test = one sample $$t$$-test

t.test(lang$LangDist - lang$LangDistAlt, mu = 0)  # identical to one-sample test of differences

#
#   One Sample t-test
#
# data:  lang$LangDist - lang$LangDistAlt
# t = -3.73, df = 158, p-value = 0.00027
# alternative hypothesis: true mean is not equal to 0
# 95 percent confidence interval:
#  -0.085703 -0.026367
# sample estimates:
# mean of x
# -0.056035

cohensD(lang$LangDist, lang$LangDistAlt, method = "paired")  # effect size

# [1] 0.29585
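
The equivalence and the paired effect size can be checked on any paired data. The following is a minimal sketch with simulated measurements (the vectors `a` and `b` are hypothetical and unrelated to the lecture dataset); it computes the paired effect size as the absolute mean difference divided by the standard deviation of the differences, which is consistent with the output above since $$3.73 / \sqrt{159} \approx 0.296$$.

```r
set.seed(42)
# Two hypothetical paired measurements for 159 units
a <- rnorm(159, mean = 0.60, sd = 0.20)
b <- a + rnorm(159, mean = 0.05, sd = 0.19)
diffs <- a - b

# A paired t-test is a one-sample t-test on the pairwise differences
paired <- t.test(a, b, paired = TRUE)
one_sample <- t.test(diffs, mu = 0)
all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)

# Paired effect size: mean difference in units of the sd of the differences
d_paired <- abs(mean(diffs)) / sd(diffs)
d_from_t <- abs(unname(paired$statistic)) / sqrt(length(diffs))
all.equal(d_paired, d_from_t)
```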


rusger <- droplevels(dat[dat$Language %in% c("russian", "german"), ])
boxplot(PronDist ~ Language, data = rusger)

## Comparing two groups: independent samples $$t$$-test

t.test(PronDist ~ Language, data = rusger, alternative = "two.sided")

#
#   Welch Two Sample t-test
#
# data:  PronDist by Language
# t = -3.56, df = 42.5, p-value = 0.00092
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.267719 -0.074108
# sample estimates:
#  mean in group german mean in group russian
#             -0.150222              0.020691

cohensD(PronDist ~ Language, data = rusger)

# [1] 1.0166

## Reporting results of a $$t$$-test

• Pronunciation difference from native English was smaller for the German speakers (mean: $$-0.15$$, sd: $$0.132$$) than for the Russian speakers (mean: $$0.02$$, sd: $$0.194$$). The difference was $$-0.17$$ (Cohen's $$d$$: $$1.02$$, large effect) and reached significance using an independent samples Welch's unequal variances $$t$$-test at an $$\alpha$$-level of $$0.05$$, $$t(42.5) = -3.56, p < 0.001$$.

## Assumptions met?

• ✓ Randomly selected sample(s)
• ✓ Independent observations (except for pairs)
• ✓ Data has interval or ratio scale
• ? Variance in samples homogeneous (corrected with Welch's adjustment)
• ? Data in compared samples are normally distributed (for $$N \leq 30$$)

## Testing if variances are equal (homoscedasticity)

• Testing homoscedasticity using Levene's test
library(car)
leveneTest(PronDist ~ Language, data = rusger)

# Levene's Test for Homogeneity of Variance (center = median)
#       Df F value Pr(>F)
# group  1       5   0.03 *
#       45
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

• Levene's test shows that the variances are different, so the default Welch's adjustment is warranted
• Note that Welch's $$t$$-test can always be used, as it is more robust and its power is comparable to that of the normal $$t$$-test

## Assessing normality: Russian data (1)

• For investigating normality, a normal quantile plot can be used
russian <- droplevels(dat[dat$Language == "russian", ])
qqnorm(russian$PronDist)  # plot actual values vs. theoretical quantiles
qqline(russian$PronDist)  # plot reference line of normal distribution
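
What the normal quantile plot shows can be made explicit by computing its ingredients by hand. The sketch below uses a simulated sample as a stand-in for `russian$PronDist` (the data and variable names are hypothetical). It relies on the defaults of `qqnorm()` and `qqline()`: theoretical quantiles taken from `qnorm(ppoints(n))`, and a reference line drawn through the first and third quartiles.

```r
set.seed(7)
# Simulated stand-in for the pronunciation distances (hypothetical values)
x <- rnorm(25, mean = 0.02, sd = 0.19)
n <- length(x)

# Theoretical quantiles: normal quantiles at n evenly spread probabilities
theoretical <- qnorm(ppoints(n))

# qqnorm() pairs these quantiles with the sorted data
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)

# Reference line of qqline(): through the first and third quartiles
y_q <- quantile(x, c(0.25, 0.75))
x_q <- qnorm(c(0.25, 0.75))
slope <- unname(diff(y_q) / diff(x_q))
intercept <- unname(y_q[1] - slope * x_q[1])
c(intercept = intercept, slope = slope)
```

Calling `plot(theoretical, sort(x))` followed by `abline(intercept, slope)` should then give the same picture as `qqnorm(x)` and `qqline(x)`; points close to the line suggest approximately normal data.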


## Assessing normality: Russian data (2)

• Alternatively, one can use the Shapiro-Wilk test of normality
shapiro.test(russian$PronDist)

#
#   Shapiro-Wilk normality test
#
# data:  russian$PronDist
# W = 0.958, p-value = 0.38
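
To see how the Shapiro-Wilk test behaves, it can help to compare a sample drawn from a normal distribution with a clearly skewed one. This is only an illustrative sketch on simulated data (both samples are hypothetical): a $$p$$-value above 0.05, such as the 0.38 reported above, gives no evidence against normality, whereas a strongly skewed sample typically yields a very small $$p$$-value and a lower $$W$$.

```r
set.seed(123)
# Two simulated samples: one normal, one strongly right-skewed
normal_sample <- rnorm(100)
skewed_sample <- rexp(100)

res_normal <- shapiro.test(normal_sample)
res_skewed <- shapiro.test(skewed_sample)

# W close to 1 indicates data that look normal; compare W and p for both samples
data.frame(
  sample = c("normal", "skewed"),
  W = c(unname(res_normal$statistic), unname(res_skewed$statistic)),
  p = c(res_normal$p.value, res_skewed$p.value)
)
```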


## Assessing normality: German data (1)

qqnorm(german$PronDist)
qqline(german$PronDist)