Statistiek I

Nonparametric tests

Martijn Wieling
University of Groningen

Question 1: last lecture

Last lecture

  • Three variants of the \(t\)-test
  • How to calculate the effect size (Cohen's \(d\))
  • How to report results of a statistical test

This lecture

  • Nonparametric tests:
    • Mann-Whitney U test: alternative to independent samples \(t\)-test
    • Wilcoxon signed-rank test: alternative to paired and single sample \(t\)-test
    • Sign test: alternative to Wilcoxon signed-rank test
  • Reporting statistical analyses (again)

Nonparametric tests

  • Nonparametric tests do not assume an underlying distribution and therefore have no parameters
    • In contrast to e.g., \(N(0,1)\) and \(t(18)\)
  • Nonparametric tests are applied when the distribution is unknown or the required assumptions of the parametric test are violated
    • They can also be applied to data assumed to be normally distributed
  • Often best option for nonnumeric data (next lecture: \(\chi^2\))
  • Less sensitive than parametric tests (i.e. less power)!

Popular nonparametric tests

  • Mann-Whitney U test: alternative to independent samples \(t\)-test
    • When data normally distributed: 95% of power of \(t\)-test
  • Wilcoxon signed-rank test: alternative to paired \(t\)-test
    • Requirement: distribution symmetrical
    • When data normally distributed: 95% of power of \(t\)-test
  • Sign test: alternative to Wilcoxon signed-rank test when data not symmetrical

Question 2

Mann-Whitney U test

  • Alternative to independent samples \(t\)-test (i.e. comparing two independent samples)
    • Applicable to ordinal data (there is an ordering, but no exact scale) and numerical data
    • Also when \(n\) < 30 and data in (at least) one group not normally distributed
    • \(H_0\): \(P(X > Y) = P(Y > X)\), \(H_a\): \(P(X > Y) \neq P(Y > X)\)
      • If the two distributions have the same shape (differing only in location), this also means:
        \(H_0\): medians of both groups equal, \(H_a\): medians of both groups differ
  • Frequently applied to Likert data: on a scale from 1 (easiest) to 5 (hardest) ...
  • (Identical to: Wilcoxon's rank sum test)

Question 3

Mann-Whitney U test: idea

  • Idea: combine the two sets of values, order them from low to high and count how often the items in one set come after items in the other set
  • Group A: (2, 4, 6, 10, 20), Group B: (8, 12, 14, 16, 18)
    • Ordered: A A A B A B B B B A (values: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
    • \(U_A = 0 + 0 + 0 + 1 + 5 = 6\)  (A)
    • \(U_B = 3 + 4 + 4 + 4 + 4 = 19\)  (B)
  • Mann-Whitney U \(= \min(U_A,U_B) = 6\)
    • The lower U, the stronger the evidence that the two groups differ
  • \(p\)-value: obtained by assessing where \(U\) is located in the distribution of all possible \(U\)-values for given sample sizes \(n_A\) and \(n_B\)
    • Distribution of all \(U\)-values resembles a normal distribution (for larger \(n\))
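
The counting idea above can be sketched in R for the two example groups. This is an illustrative sketch and not necessarily how wilcox.test() computes the statistic internally. The variable names (U_A, U_all, p_exact) are chosen here for illustration, and the exact \(p\)-value is obtained by enumerating every way of dividing the 10 ranks over the two groups, which is only feasible for small samples.

```r
A <- c(2, 4, 6, 10, 20)
B <- c(8, 12, 14, 16, 18)

# U_A: for each value in A, count the values of B that come before it
U_A <- sum(outer(A, B, ">"))
# U_B: for each value in B, count the values of A that come before it
U_B <- sum(outer(B, A, ">"))
U <- min(U_A, U_B)
c(U_A = U_A, U_B = U_B, U = U)

# Distribution of all possible U-values under H0:
# every way of assigning 5 of the 10 ranks to group A is equally likely
n_A <- length(A)
n_B <- length(B)
rank_sets <- combn(n_A + n_B, n_A)                    # one column per assignment
U_all <- colSums(rank_sets) - n_A * (n_A + 1) / 2     # U_A for each assignment

# Two-sided exact p-value: how often is min(U_A, U_B) as low as observed?
p_exact <- mean(pmin(U_all, n_A * n_B - U_all) <= U)
p_exact
```

For these groups, wilcox.test(A, B) reports the statistic as W, which corresponds to U_A here.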

Distribution of \(U\)-values

[Figure: distribution of \(U\)-values]

Mann-Whitney U test: additional information

  • In R: wilcox.test()
    • Same usage as t.test()
  • With Mann-Whitney U test: data is converted to ranks
    • Actual values are ignored: loss of information!

Example: tongue difference between /θ/ and /t/

Study: native vs. non-native English

  • Research question: Do native English speakers show a stronger distinction of /t/ from /θ/ ("th") with their tongue than non-native (Dutch) speakers of English?
  • Hypothesis: The tongue position difference between /t/ and /θ/ is larger for English native speakers than for non-native Dutch speakers of English
    • \(H_0\): same frontal /t/-/θ/ position difference for Dutch and English speakers
    • \(H_a\): larger frontal /t/-/θ/ position difference for English (versus Dutch) speakers

Study: native vs. non-native English

  • Data: 22 English and 19 Dutch participants who pronounced 10 minimal pairs
    /t/:/θ/ while connected to the articulography device:
    • 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth', 'tent'-'tenth'
    • 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
  • For each speaker, we calculated the average difference in frontal tongue position between /t/-words and /θ/-words

Data visualization

[Figure: visualization of the data]

Distributions

[Figure: distributions of the data]

Question 4

Which analysis?

  • As the values in one group are not normally distributed, we used the Mann-Whitney U test to analyze the difference between the two groups
  • Our \(\alpha\)-level is set at 0.05 (one-tailed)

Analysis in R: Mann-Whitney U test

wilcox.test(diffEN$Diff, diffNL$Diff, alternative = "greater")  # 1st > 2nd?
# 
#   Wilcoxon rank sum test
# 
# data:  diffEN$Diff and diffNL$Diff
# W = 315, p-value = 0.0025
# alternative hypothesis: true location shift is greater than 0

Conclusion of analysis

  • We reject the null hypothesis, and accept the alternative hypothesis that the native speakers show a greater tongue distinction between /t/ and /θ/ than non-native speakers
  • Had we (incorrectly) analyzed the data using the independent samples \(t\)-test, we would also have rejected the null hypothesis
    • But with \(p = 0.004\)

Effect size of Mann-Whitney U test

  • Cliff's delta (or \(d\)) measures effect size of the Mann-Whitney U test
    • \(|d| < 0.147\): negligible; \(|d| < 0.33\): small; \(|d| < 0.474\): medium; \(|d| \geq 0.474\): large
library(effsize)
cliff.delta(diffEN$Diff, diffNL$Diff)
# 
# Cliff's Delta
# 
# delta estimate: 0.50718 (large)
# 95 percent confidence interval:
#     inf     sup 
# 0.13076 0.75579
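
Cliff's delta is the proportion of pairs in which the value from the first group is higher, minus the proportion of pairs in which it is lower. It can therefore also be obtained from the test statistic as \(2W/(n_1 n_2) - 1\). A small sketch using the values reported above (W = 315, 22 English and 19 Dutch speakers); the function name cliff_from_W is hypothetical and only used for illustration:

```r
# Cliff's delta from the Mann-Whitney statistic W
# (W = number of pairs in which group 1 is higher than group 2)
cliff_from_W <- function(W, n1, n2) 2 * W / (n1 * n2) - 1

delta <- cliff_from_W(W = 315, n1 = 22, n2 = 19)
round(delta, 5)
# [1] 0.50718

# The same quantity computed directly by comparing all pairs,
# here for the small example groups A and B used earlier
A <- c(2, 4, 6, 10, 20)
B <- c(8, 12, 14, 16, 18)
delta_toy <- (sum(outer(A, B, ">")) - sum(outer(A, B, "<"))) /
  (length(A) * length(B))
delta_toy   # negative: values in A tend to be lower than those in B
```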

Some remarks about the Mann-Whitney U test

  • An independent samples \(t\)-test on the ranks (instead of the actual values) gives a \(p\)-value close to that of the Mann-Whitney U test
    • E.g., A A A B A B B B B A: ranks group A = (1,2,3,5,10), ranks group B: (4,6,7,8,9)
    • For our example: \(p = 0.0024\) (Mann-Whitney U test: \(p = 0.0025\))
  • Mann-Whitney U test cannot be applied to single samples, nor paired data
    • For that we use the Wilcoxon signed-rank test
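
The first remark can be tried out on the small example groups A and B. This is only a sketch: with five values per group the two \(p\)-values need not be as close as in the lecture example, and the variable names are chosen for illustration.

```r
A <- c(2, 4, 6, 10, 20)
B <- c(8, 12, 14, 16, 18)

# Rank all values jointly, then split the ranks by group
r <- rank(c(A, B))
ranks_A <- r[seq_along(A)]                # 1 2 3 5 10
ranks_B <- r[length(A) + seq_along(B)]    # 4 6 7 8 9

p_ranks <- t.test(ranks_A, ranks_B)$p.value   # t-test on the ranks
p_mwu   <- wilcox.test(A, B)$p.value          # Mann-Whitney U test
c(t_on_ranks = p_ranks, mann_whitney = p_mwu)
```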

Wilcoxon signed-rank test

  • Alternative to single sample or paired \(t\)-test
    • Applied when data is non-normal
      • However, distribution should be roughly symmetric, not skewed
      • If distribution is skewed, sign test should be used
    • Applicable to ordinal and scaled data

Wilcoxon signed-rank test: hypotheses

  • For paired samples:
    • \(H_0\): median of the differences \(=\) 0
    • \(H_a\): median of the differences \(\neq\) 0 (for two-tailed hypothesis)
  • For single sample:
    • \(H_0\): distribution symmetric around \(x\) (\(\approx \mu = x\), due to symmetry)
    • \(H_a\): distribution non-symmetric around \(x\) (\(\approx \mu \neq x\))
    • If \(H_0\) rejected: results may be reported as being significantly different from \(x\)

Wilcoxon signed-rank test: idea

  • Calculate pairwise differences (single sample: with respect to single value)
  • Rank the absolute differences from low to high (excluding differences of 0)
  • Add the signs of the differences to the ranks
  • Sum the positive ranks: \(W\)
    • If \(H_0\) is true, then \(W\) is close to half of the total sum of all unsigned ranks
  • \(p\)-value: obtained by assessing where \(W\) is located in the distribution of all possible \(W\)-values for a given sample size \(n\)
    • Distribution of all \(W\)-values resembles a normal distribution (for larger \(n\))

Distribution of \(W\)-values

[Figure: distribution of \(W\)-values]

Wilcoxon signed-rank test: idea (calculations)

  • Example: comparing English scores to 7.5 (only 6 cases)
   english_score  diff abs_diff rank signed_rank
10          8.94  1.44     1.44    5           5
11          6.27 -1.23     1.23    4          -4
12          7.99  0.49     0.49    1           1
13          5.77 -1.73     1.73    6          -6
14          6.78 -0.72     0.72    2          -2
15          8.45  0.95     0.95    3           3
  • \(W = 5 + 1 + 3 = 9\)
    • Compared to half of total sum of ranks (\(21 / 2 = 10.5\), so quite close)
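
The calculation above can be reproduced step by step in R. This is a sketch for the six example scores only; the variable names are chosen for illustration.

```r
score <- c(8.94, 6.27, 7.99, 5.77, 6.78, 8.45)
mu0 <- 7.5

d <- score - mu0                        # differences with respect to 7.5
d <- d[d != 0]                          # differences of 0 are excluded
signed_rank <- sign(d) * rank(abs(d))   # rank absolute differences, add signs
signed_rank
# [1]  5 -4  1 -6 -2  3

W <- sum(signed_rank[signed_rank > 0])  # sum of the positive ranks
W
# [1] 9

half_total <- sum(rank(abs(d))) / 2     # half of the total sum of ranks
half_total
# [1] 10.5

# wilcox.test() reports the same statistic (labelled V)
wilcox.test(score, mu = mu0)$statistic
```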

Wilcoxon signed-rank test: additional information

  • Fortunately we don't have to do this manually!
  • In R: wilcox.test() (same as for Mann-Whitney U)
  • Data is converted to ranks: actual values are ignored (i.e. information loss)

Wilcoxon signed-rank test: single sample example

  • Given our English proficiency data, we'd like to assess if the average English score is different from 7.5 (with \(\alpha = 0.05\))
    • \(H_0\): \(\mu = 7.5\)
    • \(H_a\): \(\mu \neq 7.5\)
  • Visualization:

[Figure: visualization of the English scores]

Wilcoxon signed-rank test: not necessary!

[Figure: distribution of the English scores]

  • Normally distributed and also more than 30 values, so \(t\)-test is more appropriate (i.e. more powerful)

Wilcoxon signed-rank test: R code

  • As \(n > 30\), and the data is normally distributed, we should use a \(t\)-test
    • But here we illustrate how to conduct the Wilcoxon signed-rank test
wilcox.test(dat$english_score, alternative = "two.sided", mu = 7.5)
# 
#   Wilcoxon signed rank test with continuity correction
# 
# data:  dat$english_score
# V = 21400, p-value = 0.032
# alternative hypothesis: true location is not equal to 7.5

Question 5

Wilcoxon signed-rank test for single sample: effect size

  • Effect size for single sample Wilcoxon signed-rank test: \(r = z/\sqrt{n}\)
    • \(|r| < 0.3\) (small), \(0.3 \leq |r| < 0.5\) (medium), \(|r| \geq 0.5\) (large)
    • \(z\) can be found via the \(p\)-value
pval <- wilcox.test(dat$english_score, alternative = "two.sided", mu = 7.5)$p.value
zval <- qnorm(pval/2, lower.tail = FALSE)  # pval/2 because of two-tailed test
n <- nrow(dat)
(effectsize <- zval/sqrt(n))
# [1] 0.12052

Wilcoxon signed-rank test: paired data example

  • Research question: Do Dutch speakers of English distinguish /t/ from /θ/ ("th")?
  • Hypothesis: The tongue position of Dutch speakers of English is more frontal when pronouncing /θ/ than /t/.
    • \(H_0\): no (median) difference in frontal position between /t/ and /θ/
    • \(H_a\): more frontal (median) position for /θ/ than for /t/

Wilcoxon signed-rank test: paired data example

  • Data: we randomly selected 19 Dutch participants who pronounced 10 minimal pairs /t/:/θ/, when connected to the articulography device:
    • 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth', 'tent'-'tenth'
    • 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
    • For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words) and their difference

Distribution of differences: non-normal

[Figure: distribution of the differences]

Question 6

Which analysis?

  • \(t\)-test is not appropriate as \(n < 30\) and distribution is not normal
  • Wilcoxon signed-rank test is also not appropriate as the data is not symmetric
    • Sign test is needed
  • But first we illustrate the analysis using Wilcoxon signed-rank test for paired data
  • Our \(\alpha\)-level is set at 0.05 (one-tailed)

Visualization: Dutch speakers' /t/ and /θ/

[Figure: Dutch speakers' /t/ and /θ/]

Wilcoxon signed-rank test for paired data: R code

levels(datNL$Sound)  # shows which level is first, and which is second
# [1] "T"  "TH"
# formula interface with alternative='less': first level < second level?
wilcox.test(Frontness ~ Sound, data = datNL, paired = TRUE, alternative = "less")
# 
#   Wilcoxon signed rank test
# 
# data:  Frontness by Sound
# V = 57, p-value = 0.067
# alternative hypothesis: true location shift is less than 0
  • Using a \(t\)-test instead would show a significant result (\(p =\) 0.04; see last lecture)!
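
A paired Wilcoxon signed-rank test is the same as a single sample test on the pairwise differences. The sketch below shows this with the vector interface of wilcox.test(), using hypothetical frontness values (not the lecture data). The vector interface may also be the safer choice in general: recent R releases (reportedly 4.4.0 and later) no longer accept paired = TRUE in the formula interface.

```r
# Hypothetical frontness values for 8 speakers (illustration only)
t_pos  <- c(0.70, 0.75, 0.80, 0.72, 0.78, 0.74, 0.69, 0.77)
th_pos <- c(0.73, 0.74, 0.86, 0.77, 0.80, 0.81, 0.65, 0.85)

# Paired test: is the position for /t/ less frontal than for /th/?
paired <- wilcox.test(t_pos, th_pos, paired = TRUE, alternative = "less")

# Equivalent: single sample test on the pairwise differences
on_diff <- wilcox.test(t_pos - th_pos, mu = 0, alternative = "less")

c(paired = paired$p.value, on_differences = on_diff$p.value)
```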

Wilcoxon signed-rank test for paired data: effect size

cliff.delta(Frontness ~ Sound, data = datNL)  # effect size for Wilcoxon test
# 
# Cliff's Delta
# 
# delta estimate: -0.22438 (small)
# 95 percent confidence interval:
#      inf      sup 
# -0.54654  0.15563

Using Wilcoxon signed-rank test is also wrong here!

  • According to the Wilcoxon signed-rank test we should retain the null hypothesis
  • However, this test is not appropriate here
    • Requirement: data symmetric (which was not the case)
  • Important: take note of test assumptions!

What test should we use?

  • \(t\)-test requires normality for small samples (\(\leq\) 30)
  • Our dataset is too small for the \(t\)-test and not symmetric enough for the Wilcoxon signed-rank test, so the alternative is the sign test
    • Much less powerful (due to information loss)!

Sign test

  • Divides data into three classes: + (higher), - (lower) and 0 (no change)
  • Use when distribution non-normal and asymmetric
  • Compares the proportion of + to the proportion of -
  • Tests whether the division is roughly chance-like:
    • \(H_0\): no weighting toward + (or -) (about same number of +'s as -'s)
    • \(H_a\): weighting toward + (and/or -)
  • Based on binomial distribution \(B(n,p)\), with \(p = 0.5\)

Binomial distribution

  • E.g., coin toss 100 times, record number of heads: \(B(100,0.5)\)
    • For large samples: binomial distribution \(\approx\) normal distribution (\(z\) usable)
      • Red bars: 2 or more \(\sigma\) from mean \(\mu\)

[Figure: binomial distribution \(B(100,0.5)\)]
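
For the coin toss example, the mean and standard deviation of the binomial distribution follow from \(\mu = np\) and \(\sigma = \sqrt{np(1-p)}\). A brief sketch comparing the exact binomial probability of ending up 2 or more \(\sigma\) from the mean with the normal approximation (no continuity correction is applied here, which is a simplification):

```r
n <- 100
p <- 0.5
mu    <- n * p                     # expected number of heads: 50
sigma <- sqrt(n * p * (1 - p))     # standard deviation: 5

# 2 or more sigma from the mean: at most 40 or at least 60 heads
lower <- mu - 2 * sigma
upper <- mu + 2 * sigma
p_binom  <- pbinom(lower, n, p) + pbinom(upper - 1, n, p, lower.tail = FALSE)
p_normal <- 2 * pnorm(-2)          # normal approximation
c(binomial = p_binom, normal = p_normal)
```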

Sign test applied to articulography data

subject  pos. /θ/  pos. /t/  pos. /θ/ - /t/  sign
1        0.738     0.781     -0.043          -
2        0.767     0.766      0.001          +
3        0.879     0.884     -0.005          -
4        0.761     0.748      0.013          +
5        0.774     0.748      0.027          +
6        0.749     0.752     -0.003          -
...      ...       ...        ...            ...
  • 12 out of 19 Dutch speakers show more frontal positions for /θ/
  • Significant at \(\alpha\)-level 0.05 (one-tailed)?
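
Determining the signs can be sketched in R for the six subjects shown in the table (the remaining subjects are not listed, so this covers only part of the data). Dropping differences of exactly 0 before counting is the common convention and is an assumption here; the variable names are chosen for illustration.

```r
# Positions of the six subjects shown in the table
pos_th <- c(0.738, 0.767, 0.879, 0.761, 0.774, 0.749)
pos_t  <- c(0.781, 0.766, 0.884, 0.748, 0.748, 0.752)

d <- pos_th - pos_t
signs <- ifelse(d > 0, "+", ifelse(d < 0, "-", "0"))
signs
# [1] "-" "+" "-" "+" "+" "-"

n_plus  <- sum(d > 0)    # subjects with a more frontal /th/
n_total <- sum(d != 0)   # differences of exactly 0 are dropped
c(n_plus = n_plus, n_total = n_total)
```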

Sign test applied to articulography data: R code

binom.test(x = 12, n = 19, p = 0.5, alternative = "greater")
# 
#   Exact binomial test
# 
# data:  12 and 19
# number of successes = 12, number of trials = 19, p-value = 0.18
# alternative hypothesis: true probability of success is greater than 0.5
# 95 percent confidence interval:
#  0.41806 1.00000
# sample estimates:
# probability of success 
#                0.63158
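
The \(p\)-value reported by binom.test() can be reproduced directly from the binomial distribution \(B(19, 0.5)\): it is the probability of observing 12 or more plus signs out of 19 if plus and minus are equally likely. A minimal sketch:

```r
# P(X >= 12) for X ~ B(19, 0.5): add the probabilities of 12, 13, ..., 19
p_tail <- sum(dbinom(12:19, size = 19, prob = 0.5))
round(p_tail, 2)
# [1] 0.18

# The same probability via the cumulative distribution function
p_tail_cdf <- pbinom(11, size = 19, prob = 0.5, lower.tail = FALSE)
```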

Visualization

[Figure: visualization]

Question 7

Non-parametric tests: summary

  • Non-parametric tests are applied when the distribution is unknown or the required assumptions of the parametric test are violated
    • They can also be applied to data assumed to be normally distributed, but the power to detect an effect is generally lower
    • Lower power caused by using ranks or signs rather than the actual values
  • Often best option for nonnumeric data (next lecture: \(\chi^2\))
  • We discussed: Mann-Whitney U test, Wilcoxon signed-rank test, and sign test
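
The choices made in this lecture can be summarized in a small helper function. This is a rough sketch of the rules of thumb stated on the slides (normality or \(n > 30\) for the \(t\)-test, symmetry for the Wilcoxon signed-rank test), not an official decision procedure. The function choose_test and its arguments are hypothetical, and the measurement scale (e.g., ordinal data) is not taken into account.

```r
# design:    "independent", "paired" or "single"
# n:         sample size (for independent samples: size of the smallest group)
# normal:    is the data (in all groups) roughly normally distributed?
# symmetric: is the distribution (of the differences) roughly symmetric?
choose_test <- function(design = c("independent", "paired", "single"),
                        n, normal, symmetric = NA) {
  design <- match.arg(design)
  t_name <- c(independent = "independent samples t-test",
              paired      = "paired t-test",
              single      = "single sample t-test")
  # t-test: data roughly normal, or sample large enough
  if (normal || n > 30) {
    return(unname(t_name[design]))
  }
  if (design == "independent") {
    return("Mann-Whitney U test")
  }
  # paired or single sample: Wilcoxon requires rough symmetry
  if (isTRUE(symmetric)) "Wilcoxon signed-rank test" else "sign test"
}

choose_test("independent", n = 19, normal = FALSE)
choose_test("paired", n = 19, normal = FALSE, symmetric = FALSE)
```

The two calls mirror the two articulography examples: the first returns the Mann-Whitney U test, the second the sign test.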

Question 8

Decision tree

 

Reporting results: example

  • Consider the following situation:
    • It is suspected that the Spanish language proficiency of social workers in larger cities is different from that of social workers from smaller cities and towns (simply due to their different exposure to the language). Your company wishes to test this, since training programs may differ depending on proficiency levels. You obtain data from twenty social workers, ten from each group, and you wish to test whether the groups are different.

Question 9

Hypotheses

  • We compare large cities (\(l\)) and small cities (\(s\))
    • \(H_0: \mu_l = \mu_{s}\)
    • \(H_a: \mu_l \neq \mu_{s}\)
  • Hypothesis is two-sided (text indicates "different")
    • One-sided example: is the group from the cities (with more exposure) better?
  • Read problem statements carefully!
  • How to test?

How to test?

  • We will test a hypothesis about differences in means in two different groups using a \(t\)-test for independent samples
    • At least if the assumptions hold, otherwise we use the non-parametric alternative (Mann-Whitney U test)

Assumptions met?

  • Data randomly selected from population ✓
  • Data measured at interval scale (proficiency test) ✓
  • Independent observations, also between groups ✓
  • Observations roughly normally distributed in both groups (as \(n \leq\) 30)?
    • How can we test normality?

Testing normality

  • Test normality using normal quantile plot
    • Show this when reporting results!

[Figure: normal quantile plot]

What if the normal quantile plot is unclear?

  • If you are uncertain whether the distribution is roughly normal, you can test this using the Shapiro-Wilk test for normality
    • \(p\)-value of test \(< \alpha\): data cannot be assumed to be normal
    • It is harder to reject the null hypothesis for small samples
    • Use in addition to visualization, not instead
shapiro.test(spanish[spanish$Group == "LargeCity", ]$Score)
# 
#   Shapiro-Wilk normality test
# 
# data:  spanish[spanish$Group == "LargeCity", ]$Score
# W = 0.92, p-value = 0.35
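
Both checks can be combined as sketched below. The scores are simulated here (hypothetical values, not the lecture data), so the outcome only illustrates the procedure.

```r
set.seed(1)                               # for reproducibility
scores <- rnorm(10, mean = 28, sd = 5)    # hypothetical scores of one group

# Normal quantile plot: points close to the line suggest rough normality
qqnorm(scores)
qqline(scores)

# Shapiro-Wilk test as an additional check (not a replacement for the plot)
sw <- shapiro.test(scores)
sw$p.value
sw$p.value < 0.05    # TRUE would mean: data cannot be assumed to be normal
```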

Analysis

  • Results show:
    • \(m_l = 28.4\)
    • \(m_{s} = 26.2\)
    • \(sd\) \(\approx 5\)
    • \(t(18) = 0.98\)
    • \(p = 0.34\)
  • How to report?
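
The reported values are consistent with each other, which can be verified with a few lines of R. This sketch assumes a standard deviation of exactly 5 in both groups (the slide only states \(sd \approx 5\)) and 10 participants per group, so the outcomes are approximate.

```r
m_l <- 28.4    # mean of the large city group
m_s <- 26.2    # mean of the small city group
s   <- 5       # (approximate) standard deviation in both groups
n   <- 10      # participants per group

d     <- (m_l - m_s) / s                    # Cohen's d: 0.44
t_val <- (m_l - m_s) / (s * sqrt(2 / n))    # t-value for equal n: about 0.98
df    <- 2 * n - 2                          # degrees of freedom: 18
p     <- 2 * pt(abs(t_val), df = df, lower.tail = FALSE)   # two-sided: about 0.34
round(c(d = d, t = t_val, df = df, p = p), 2)
```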

Report

We suspected that the Spanish language proficiency of social workers in larger cities was different from that of social workers from smaller cities and towns (simply due to their different exposure to the language). We wished to test this since training programs may differ depending on proficiency levels. Our \(H_0: \mu_l = \mu_{s}\) and our \(H_a: \mu_l \neq \mu_{s}\). We obtained data from twenty randomly selected social workers, ten from each group, verified that the samples were roughly normally distributed, and tested whether the groups differed in means (see Figure 1 for the box plots), obtaining \(t(18)= 0.98, p = 0.34\). The more urban group scored 28.4 and was 0.44 sd better than the other group with score 26.2 (Cohen's \(d\): medium effect). We retained the null hypothesis that the groups do not differ, as the \(p\)-value was higher than the \(\alpha\)-value (significance threshold) of 0.05.

Question 10

Recap

  • In this lecture, we've covered:
    • Three different non-parametric tests:
      • Mann-Whitney U test as alternative to independent samples \(t\)-test
      • Wilcoxon signed-rank test as alternative to paired and single sample \(t\)-test
      • Sign test as alternative to the Wilcoxon signed-rank test
    • How to report statistical analyses
  • Next lecture: Relating same-type variables (\(\chi^2\) test, correlation, Cronbach's \(\alpha\))

Please evaluate this lecture!

Exam question

Questions?

Thank you for your attention!

http://www.martijnwieling.nl
m.b.wieling@rug.nl