Basic statistical tests

Martijn Wieling
University of Groningen

This lecture

  • Dataset for this lecture
  • Comparing one or two groups: \(t\)-test
    • Non-parametric alternatives: Mann-Whitney U and Wilcoxon signed rank
  • Assessing the dependency between two categorical variables: \(\chi^2\) test
  • Comparing more than two groups: ANOVA

Some basic points

  • This lecture focuses on how-to-use and when-to-use, rather than on the underlying calculations
    • If you want more information about the tests and concepts illustrated in this lecture, I recommend the books by Levshina, Winter or (free) Navarro
  • Make sure to report effect size as significance is dependent on sample size
difference (in \(s\))    \(n\)     \(p\)
0.01                     40,000    0.05
0.10                     400       0.05
0.25                     64        0.05
0.54                     16        0.05
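
The pattern in this table can be reproduced approximately with a few lines of R. This is only a sketch: it assumes (the slide does not say so) that the table refers to a two-sided one-sample \(t\)-test, in which case \(t = d \sqrt{n}\) for a difference of \(d\) standard deviations. The variable names are illustrative.

```r
# Assumption: two-sided one-sample t-test; d is the difference in standard deviations
d <- c(0.01, 0.10, 0.25, 0.54)
n <- c(40000, 400, 64, 16)

t_val <- d * sqrt(n)                      # t = (mean difference / s) * sqrt(n)
p_val <- 2 * pt(-abs(t_val), df = n - 1)  # two-sided p-value

print(data.frame(d, n, t = t_val, p = round(p_val, 3)))
```

Under this assumption every row lands close to \(p = 0.05\), even though the effect sizes differ by a factor of more than 50.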

Dataset for this lecture

load("dat.rda")
head(dat)
#    Speaker Language  PronDist PronDistCat LangDist LangDistAlt Age Gender AEO LR NrLang
# 1  arabic1   arabic  0.185727   Different  0.63699     0.44864  38      F  12  4      0
# 2 arabic10   arabic -0.172175     Similar  0.63699     0.44864  26      M   5  2      2
# 3 arabic13   arabic -0.035423     Similar  0.63699     0.44864  25      M  15  1      2
# 4 arabic12   arabic  0.372547   Different  0.63699     0.44864  32      M  11  8      0
# 5 arabic17   arabic -0.175237     Similar  0.63699     0.44864  35      M  15  0      1
# 6 arabic18   arabic  0.168120   Different  0.63699     0.44864  18      M   6  0      1

Dataset structure

str(dat)
# 'data.frame': 712 obs. of  11 variables:
#  $ Speaker    : Factor w/ 712 levels "afrikaans1","afrikaans2",..: 21 22 25 24 27 28 26 30 31 23 ...
#  $ Language   : Factor w/ 159 levels "afrikaans","agni",..: 7 7 7 7 7 7 7 7 7 7 ...
#  $ PronDist   : num  0.1857 -0.1722 -0.0354 0.3725 -0.1752 ...
#  $ PronDistCat: Factor w/ 2 levels "Different","Similar": 1 2 2 1 2 1 1 2 2 2 ...
#  $ LangDist   : num  0.637 0.637 0.637 0.637 0.637 ...
#  $ LangDistAlt: num  0.449 0.449 0.449 0.449 0.449 ...
#  $ Age        : num  38 26 25 32 35 18 22 36 23 30 ...
#  $ Gender     : Factor w/ 2 levels "F","M": 1 2 2 2 2 2 2 2 1 1 ...
#  $ AEO        : num  12 5 15 11 15 6 16 12 10 14 ...
#  $ LR         : num  4 2 1 8 0 0 0 1 0 4 ...
#  $ NrLang     : int  0 2 2 0 1 1 2 2 2 1 ...

Comparing one or two groups: \(t\)-test

  • The means of two groups (or one group mean vs. a fixed value) can be compared using the \(t\)-test
  • Assumptions:
    • Randomly selected sample(s)
    • Independent observations (except for paired data)
    • Data has interval scale (difference between two values is meaningful) or ratio scale (meaningful difference and true 0)
      • E.g., interval scale: temperature in C; ratio scale: length in cm.
    • Data in sample(s) normally distributed (for \(N \leq 30\))
    • Variances in samples homogeneous (Welch's adjustment, default in R, corrects for this)
    • Note: Likert scale is ordinal data, so \(t\)-test in principle not adequate
  • Visualize the data if possible (facilitates interpretation)

\(t\)-test

  • Result of \(t\)-test is a \(t\)-value, which is compared to the appropriate \(t\)-distribution
  • \(t\)-distribution depends on degrees of freedom (therefore: report df!)
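
To illustrate how much the degrees of freedom matter, the sketch below prints the two-sided critical \(t\)-value at \(\alpha = 0.05\) for a few values of df. The values 21 and 158 are the df of the tests in this lecture; the others are arbitrary choices for illustration.

```r
# Two-sided critical t-values at alpha = 0.05 for several degrees of freedom
dfs <- c(2, 5, 21, 158, Inf)   # df = Inf corresponds to the normal distribution
crit <- qt(0.975, df = dfs)

print(round(setNames(crit, dfs), 3))
```

The critical value shrinks towards the normal-distribution value as df increases, so the same \(t\)-value can be significant for a large sample but not for a small one.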

Group mean vs. value: visualization

german <- droplevels(dat[dat$Language == "german", ])
boxplot(german$PronDist)
abline(h = 0, col = "red", lty = 2)

Group mean vs. value: one sample \(t\)-test

t.test(german$PronDist, mu = 0)
# 
#   One Sample t-test
# 
# data:  german$PronDist
# t = -5.33, df = 21, p-value = 2.7e-05
# alternative hypothesis: true mean is not equal to 0
# 95 percent confidence interval:
#  -0.208787 -0.091657
# sample estimates:
# mean of x 
#  -0.15022
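
For reference, the \(t\)-value of a one-sample test is the difference between the sample mean and the reference value, divided by the standard error of the mean. The sketch below checks this against t.test() on simulated data (not the lecture's dataset); the sample size, mean and standard deviation passed to rnorm() are hypothetical.

```r
set.seed(1)                              # simulated data, not the lecture's dataset
x <- rnorm(22, mean = -0.15, sd = 0.13)  # hypothetical sample of 22 values

mu0 <- 0
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))   # (mean - mu) / standard error
p_manual <- 2 * pt(-abs(t_manual), df = length(x) - 1)    # two-sided p-value

res <- t.test(x, mu = mu0)
print(c(manual = t_manual, t.test = unname(res$statistic)))
```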

One sample \(t\)-test: effect size

library(lsr)
cohensD(german$PronDist, mu = 0)
# [1] 1.1373
  • Cohen's \(d\) measures the difference in terms of the number of standard deviations
    • Rough guideline: Cohen's \(d\) < 0.3: small effect size; 0.3 - 0.8: medium; > 0.8: large
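
For a one-sample test, Cohen's \(d\) is \(|\bar{x} - \mu| / s\), which equals \(|t| / \sqrt{n}\). As a rough check, the sketch below recovers the effect size from the rounded \(t\)-value and df reported for the German speakers; small deviations from the cohensD() result are due to rounding.

```r
# Values reported in the one-sample t-test output (rounded)
t_val <- -5.33
n <- 21 + 1                 # df = n - 1 for a one-sample test

d <- abs(t_val) / sqrt(n)   # equals |mean - mu| / sd
print(round(d, 2))
```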

Try it yourself!

  • Install the Mathematical Biostatistics Boot Camp swirl course:
library(swirl)
install_from_swirl("Mathematical_Biostatistics_Boot_Camp")
  • Run swirl() in RStudio and finish the following lesson of the Mathematical Biostatistics Boot Camp course:
    • Lesson 1: One Sample t-test

Comparing paired data: visualization

# aggregate data per language (159 languages)
lang <- aggregate(cbind(LangDist, LangDistAlt) ~ Language, data = dat, FUN = mean)
par(mfrow = c(1, 2))
boxplot(lang[, c("LangDist", "LangDistAlt")])
boxplot(lang$LangDist - lang$LangDistAlt, main = "Pairwise differences")

Paired samples \(t\)-test

t.test(lang$LangDist, lang$LangDistAlt, paired = TRUE)
# 
#   Paired t-test
# 
# data:  lang$LangDist and lang$LangDistAlt
# t = -3.73, df = 158, p-value = 0.00027
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.085703 -0.026367
# sample estimates:
# mean of the differences 
#               -0.056035

Paired samples \(t\)-test = one sample \(t\)-test

t.test(lang$LangDist - lang$LangDistAlt, mu = 0)  # identical to one-sample test of differences
# 
#   One Sample t-test
# 
# data:  lang$LangDist - lang$LangDistAlt
# t = -3.73, df = 158, p-value = 0.00027
# alternative hypothesis: true mean is not equal to 0
# 95 percent confidence interval:
#  -0.085703 -0.026367
# sample estimates:
# mean of x 
# -0.056035
cohensD(lang$LangDist, lang$LangDistAlt, method = "paired")  # effect size
# [1] 0.29585
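
The equivalence holds for any paired data, which the sketch below illustrates on simulated measurements (not the lecture's dataset; all parameters passed to rnorm() are hypothetical). It also computes the effect size on the pairwise differences, which appears to be what method = "paired" reports, given that \(3.73 / \sqrt{159} \approx 0.296\).

```r
set.seed(2)                                   # simulated data, not the lecture's dataset
a <- rnorm(50, mean = 0.60, sd = 0.10)
b <- a + rnorm(50, mean = 0.05, sd = 0.15)    # second measurement of the same items

paired <- t.test(a, b, paired = TRUE)
onesmp <- t.test(a - b, mu = 0)

d_paired <- abs(mean(a - b)) / sd(a - b)      # effect size on the pairwise differences
print(c(paired = unname(paired$statistic), one_sample = unname(onesmp$statistic)))
```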

Comparing two groups: visualization

rusger <- droplevels(dat[dat$Language %in% c("russian", "german"), ])
boxplot(PronDist ~ Language, data = rusger)

Comparing two groups: independent samples \(t\)-test

t.test(PronDist ~ Language, data = rusger, alternative = "two.sided")
# 
#   Welch Two Sample t-test
# 
# data:  PronDist by Language
# t = -3.56, df = 42.5, p-value = 0.00092
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.267719 -0.074108
# sample estimates:
#  mean in group german mean in group russian 
#             -0.150222              0.020691
cohensD(PronDist ~ Language, data = rusger)
# [1] 1.0166

Reporting results of a \(t\)-test

  • Pronunciation difference from native English was smaller for the German speakers (mean: \(-0.15\), sd: \(0.132\)) than for the Russian speakers (mean: \(0.02\), sd: \(0.194\)). The difference was \(-0.17\) (Cohen's \(d\): \(1.02\), large effect) and reached significance using an independent samples Welch's unequal variances \(t\)-test at an \(\alpha\)-level of \(0.05\), \(t(42.5) = -3.56, p < 0.001\).
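
The reported numbers can be cross-checked from the summary statistics alone. The sketch below recomputes Welch's \(t\), the Welch-Satterthwaite degrees of freedom, and Cohen's \(d\) with a pooled standard deviation. The group sizes are not stated explicitly and are inferred here: 22 German speakers (df = 21 in the one-sample test) and 25 Russian speakers (residual df = 45 in the Levene's test output of this lecture). Because the inputs are rounded, the results only match approximately.

```r
# Summary statistics from the lecture (rounded); group sizes are inferred (an assumption)
m1 <- -0.150222; s1 <- 0.132; n1 <- 22   # German
m2 <-  0.020691; s2 <- 0.194; n2 <- 25   # Russian

v1 <- s1^2 / n1                          # squared standard error per group
v2 <- s2^2 / n2
t_welch  <- (m1 - m2) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))   # Welch-Satterthwaite

s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
d <- abs(m1 - m2) / s_pooled             # Cohen's d with pooled standard deviation

print(round(c(t = t_welch, df = df_welch, d = d), 2))
```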

Assumptions met?

  • ✓ Randomly selected sample(s)
  • ✓ Independent observations (except for pairs)
  • ✓ Data has interval or ratio scale
  • ? Variance in samples homogeneous (corrected with Welch's adjustment)
  • ? Data in compared samples are normally distributed (for \(N \leq 30\))

Testing if variances are equal (homoscedasticity)

  • Testing homoscedasticity using Levene's test
library(car)
leveneTest(PronDist ~ Language, data = rusger)
# Levene's Test for Homogeneity of Variance (center = median)
#       Df F value Pr(>F)  
# group  1       5   0.03 *
#       45                 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Levene's test shows that the variances differ, so the default Welch's adjustment is warranted
    • But note that Welch's \(t\)-test can always be used, as it is more robust and its power is comparable to that of the standard (Student's) \(t\)-test
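
To make the test less of a black box: with center = median, Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below does this in base R on simulated data (not the lecture's dataset); the group labels, sizes and standard deviations are hypothetical.

```r
set.seed(3)                                  # simulated data, not the lecture's dataset
g <- factor(rep(c("A", "B"), each = 200))
y <- c(rnorm(200, sd = 1), rnorm(200, sd = 3))

abs_dev <- abs(y - ave(y, g, FUN = median))  # absolute deviation from each group's median
lev <- anova(lm(abs_dev ~ g))                # one-way ANOVA on the deviations
p_lev <- lev[["Pr(>F)"]][1]

print(lev)
```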

Assessing normality: Russian data (1)

  • For investigating normality, a normal quantile plot can be used
russian <- droplevels(dat[dat$Language == "russian", ])
qqnorm(russian$PronDist)  # plot actual values vs. theoretical quantiles
qqline(russian$PronDist)  # plot reference line of normal distribution

Assessing normality: Russian data (2)

  • Alternatively, one can use the Shapiro-Wilk test of normality
shapiro.test(russian$PronDist)
# 
#   Shapiro-Wilk normality test
# 
# data:  russian$PronDist
# W = 0.958, p-value = 0.38
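
A non-significant result, as for the Russian data, means there is no evidence against normality. For contrast, the sketch below applies the same test to two simulated samples (not the lecture's dataset): one drawn from a normal distribution and one from a strongly right-skewed exponential distribution. The sample size of 100 is an arbitrary choice.

```r
set.seed(4)                     # simulated data, not the lecture's dataset
normal_x <- rnorm(100)          # drawn from a normal distribution
skewed_x <- rexp(100)           # strongly right-skewed

p_normal <- shapiro.test(normal_x)$p.value
p_skewed <- shapiro.test(skewed_x)$p.value
print(c(normal = p_normal, skewed = p_skewed))
```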

Assessing normality: German data (1)

qqnorm(german$PronDist)
qqline(german$PronDist)