Martijn Wieling

University of Groningen

- Three variants of the \(t\)-test
- How to calculate the effect size (Cohen's \(d\))
- How to report results of a statistical test

- Nonparametric tests:
- Mann-Whitney U test: alternative to independent samples \(t\)-test
- Wilcoxon signed-rank test: alternative to paired and single sample \(t\)-test
- Sign test: alternative to Wilcoxon signed-rank test

- Reporting statistical analyses (again)

- Nonparametric tests do **not** assume an underlying distribution and therefore have no parameters
  - In contrast to, e.g., \(N(0,1)\) and \(t(18)\)

- Nonparametric tests are applied when the distribution is unknown or the required assumptions of the parametric test are violated
- They can also be applied to data assumed to be normally distributed

- Often best option for nonnumeric data (next lecture: \(\chi^2\))
- Less sensitive than parametric tests (i.e. less **power**)!

- **Mann-Whitney U test**: alternative to independent samples \(t\)-test
  - When data normally distributed: 95% of power of \(t\)-test

- **Wilcoxon signed-rank test**: alternative to paired \(t\)-test
  - Requirement: distribution symmetrical
  - When data normally distributed: 95% of power of \(t\)-test

- **Sign test**: alternative to Wilcoxon signed-rank test when data not symmetrical

- Alternative to independent samples \(t\)-test (i.e. comparing two indep. samples)
- Applicable to ordinal data (there is an ordering: no exact scale) and num. data
- Also when \(n\) < 30 and data in (at least) one group not normally distributed
- \(H_0\): \(P(X > Y) = P(Y > X)\), \(H_a\): \(P(X > Y) \neq P(Y > X)\)
- If distributions of samples the same, this also means:

\(H_0\): medians of both groups equal, \(H_a\): medians of both groups differ

- Frequently applied to Likert data: on a scale from 1 (easiest) to 5 (hardest) ...
- (Identical to: Wilcoxon's **rank sum** test)

- Idea: combine the two sets of values, order them from low to high and count how often the items in one set come after items in the other set
- Group A: (2, 4, 6, 10, 20), Group B: (8, 12, 14, 16, 18)
- Ordered: A A A B A B B B B A (values: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
- \(U_A = 0 + 0 + 0 + 1 + 5 = 6\) (A)
- \(U_B = 3 + 4 + 4 + 4 + 4 = 19\) (B)

- Mann-Whitney U \(= \min(U_A, U_B) = 6\)
- The lower U, the more likely to be significantly different
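The counting procedure above is easy to verify programmatically. A minimal sketch (in Python for concreteness; the lecture's own code is in R) that computes \(U_A\) and \(U_B\) by brute force over all pairs:

```python
# Brute-force Mann-Whitney U for the toy example above (illustrative sketch).
A = [2, 4, 6, 10, 20]
B = [8, 12, 14, 16, 18]

# U_A: for each value in A, count how many values in B it exceeds (and vice versa).
U_A = sum(a > b for a in A for b in B)
U_B = sum(b > a for a in A for b in B)

print(U_A, U_B, min(U_A, U_B))  # 6 19 6
```

With no ties, \(U_A + U_B\) always equals \(n_A \times n_B\) (here \(5 \times 5 = 25\)), which is a handy sanity check on the hand calculation.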

- \(p\)-value: obtained by assessing where \(U\) is located in the distribution of all possible \(U\)-values for given sample sizes \(n_A\) and \(n_B\)
- Distribution of all \(U\)-values resembles a **normal distribution** (for larger \(n\))

- In `R`: `wilcox.test()`
  - Identical usage as `t.test()`
- With the Mann-Whitney U test: data is converted to ranks
  - Actual values are ignored: loss of information!

- Research question: Do native English speakers show a stronger distinction of /t/ from /θ/ ("th") with their tongue than non-native (Dutch) speakers of English?
- Hypothesis: The tongue position difference between /t/ and /θ/ is larger for English native speakers than for non-native Dutch speakers of English
- \(H_0\): same frontal /t/-/θ/ position difference for Dutch and English speakers
- \(H_a\): larger frontal /t/-/θ/ position difference for English (versus Dutch) speakers

- Data: 22 English and 19 Dutch participants who pronounced 10 minimal pairs /t/:/θ/ while connected to the articulography device:
  - 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth', 'tent'-'tenth'
  - 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'

- For each speaker, we calculated the average difference in frontal tongue position between /t/-words and /θ/-words

- As the values in one group are not normally distributed, we used the **Mann-Whitney U test** to analyze the difference between the two groups
- Our \(\alpha\)-level is set at 0.05 (one-tailed)

```
wilcox.test(diffEN$Diff, diffNL$Diff, alternative = "greater") # 1st > 2nd?
```

```
#
# Wilcoxon rank sum test
#
# data: diffEN$Diff and diffNL$Diff
# W = 315, p-value = 0.0025
# alternative hypothesis: true location shift is greater than 0
```

- We reject the null hypothesis, and accept the alternative hypothesis that the native speakers show a greater tongue distinction between /t/ and /θ/ than non-native speakers
- If we had incorrectly analyzed the data using the independent samples \(t\)-test, we would also have rejected the null hypothesis
- But with \(p = 0.004\)

- Cliff's delta (or \(d\)) measures effect size of the Mann-Whitney U test
- \(|d| < 0.147\): negl.; \(|d| < 0.33\): small; \(|d| < 0.474\): medium; \(|d| \geq 0.474\): large

```
library(effsize)
cliff.delta(diffEN$Diff, diffNL$Diff)
```

```
#
# Cliff's Delta
#
# delta estimate: 0.50718 (large)
# 95 percent confidence interval:
# inf sup
# 0.13076 0.75579
```
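Cliff's delta is simply the difference between the two dominance proportions: the share of pairs where the first group's value is higher, minus the share where it is lower. A small sketch (Python for concreteness; the lecture's own code is in R) applying this to the toy groups from the U example:

```python
# Cliff's delta computed from dominance counts (illustrative sketch).
A = [2, 4, 6, 10, 20]
B = [8, 12, 14, 16, 18]
n_pairs = len(A) * len(B)

more = sum(a > b for a in A for b in B)  # pairs where A dominates
less = sum(a < b for a in A for b in B)  # pairs where B dominates
delta = (more - less) / n_pairs

print(delta)  # -0.52: a 'large' effect by the thresholds above
```

Note the direct link to the Mann-Whitney statistic: with no ties, \(d = 2U_A/(n_A n_B) - 1\), so here \(2 \cdot 6/25 - 1 = -0.52\).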

- Instead of the Mann-Whitney U test, an independent samples \(t\)-test on the ranks gives a \(p\)-value close to that of the Mann-Whitney U test
- E.g., A A A B A B B B B A: ranks group A = (1,2,3,5,10), ranks group B: (4,6,7,8,9)
- For our example: \(p = 0.0024\) (Mann-Whitney U test: \(p = 0.0025\))
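The rank conversion behind this equivalence is easy to make explicit. A minimal sketch (Python for concreteness; the lecture's own code is in R, and no ties are assumed) that produces the ranks from the example above:

```python
# Convert the combined toy data to ranks; a regular t-test on these ranks
# approximates the Mann-Whitney U test (illustrative sketch, no ties).
A = [2, 4, 6, 10, 20]
B = [8, 12, 14, 16, 18]

combined = sorted(A + B)
rank = {value: i + 1 for i, value in enumerate(combined)}  # rank 1 = smallest

ranks_A = [rank[a] for a in A]  # [1, 2, 3, 5, 10]
ranks_B = [rank[b] for b in B]  # [4, 6, 7, 8, 9]
print(ranks_A, ranks_B)
```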

- The Mann-Whitney U test cannot be applied to single samples, nor to paired data
- For that we use the **Wilcoxon signed-rank test**

- Alternative to single sample or paired \(t\)-test
- Applied when data is non-normal
- However, distribution should be roughly symmetric, not skewed
- If distribution is skewed, the **sign test** should be used

- Applicable to ordinal and scaled data

- For paired samples:
- \(H_0\): median of the differences \(=\) 0
- \(H_a\): median of the differences \(\neq\) 0 (for two-tailed hypothesis)

- For single sample:
- \(H_0\): distribution symmetric around \(x\) (\(\approx \mu = x\), due to symmetry)
- \(H_a\): distribution non-symmetric around \(x\) (\(\approx \mu \neq x\))
- If \(H_0\) rejected: results may be reported as being significantly different from \(x\)

- Calculate pairwise differences (single sample: with respect to single value)
- Rank the **absolute** differences from low to high (excluding differences of 0)
- Add the signs of the differences to the ranks
- Sum the positive ranks: \(W\)
- If \(H_0\) true then \(W\) close to half of the total sum of all unsigned-ranks

- \(p\)-value: obtained by assessing where \(W\) is located in the distribution of all possible \(W\)-values for a given sample size \(n\)
- Distribution of all \(W\)-values resembles a **normal distribution** (for larger \(n\))

- Example: comparing English scores to 7.5 (only 6 cases)

|  | english_score | diff | abs_diff | rank | signed_rank |
|---|---|---|---|---|---|
| 10 | 8.94 | 1.44 | 1.44 | 5 | 5 |
| 11 | 6.27 | -1.23 | 1.23 | 4 | -4 |
| 12 | 7.99 | 0.49 | 0.49 | 1 | 1 |
| 13 | 5.77 | -1.73 | 1.73 | 6 | -6 |
| 14 | 6.78 | -0.72 | 0.72 | 2 | -2 |
| 15 | 8.45 | 0.95 | 0.95 | 3 | 3 |

- \(W = 5 + 1 + 3 = 9\)
- Compared to half of total sum of ranks (\(21 / 2 = 10.5\), so quite close)
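The steps above can be reproduced in a few lines. A minimal sketch (Python for concreteness; the lecture's own code is in R) computing the signed ranks and \(W\) for the six scores in the table:

```python
# Wilcoxon signed-rank statistic W for the worked example above (sketch).
scores = [8.94, 6.27, 7.99, 5.77, 6.78, 8.45]
mu = 7.5

diffs = [round(s - mu, 2) for s in scores]           # differences from 7.5
abs_sorted = sorted(abs(d) for d in diffs)           # no zero differences here
rank = {a: i + 1 for i, a in enumerate(abs_sorted)}  # rank 1 = smallest |diff|

W = sum(rank[abs(d)] for d in diffs if d > 0)        # sum of positive ranks
print(W)  # 9, vs. half the total rank sum: 21 / 2 = 10.5
```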

- Fortunately we don't have to do this manually!
- In `R`: `wilcox.test()` (same as for the Mann-Whitney U test)
- Data is converted to ranks: actual values are ignored (i.e. information loss)

- Given our English proficiency data, we'd like to assess if the average English score is different from 7.5 (with \(\alpha = 0.05\))
- \(H_0\): \(\mu = 7.5\)
- \(H_a\): \(\mu \neq 7.5\)

- Visualization:

- Normally distributed and more than 30 values, so the \(t\)-test is more appropriate (i.e. more powerful)
- Nevertheless, here we illustrate how to conduct the Wilcoxon signed-rank test

```
wilcox.test(dat$english_score, alternative = "two.sided", mu = 7.5)
```

```
#
# Wilcoxon signed rank test with continuity correction
#
# data: dat$english_score
# V = 21400, p-value = 0.032
# alternative hypothesis: true location is not equal to 7.5
```

- Effect size for single sample Wilcoxon signed-rank test: \(r = z/\sqrt{n}\)
- \(|r| < 0.3\) (small), \(0.3 \leq |r| < 0.5\) (medium), \(|r| \geq 0.5\) (large)
- \(z\) can be found via the \(p\)-value

```
pval <- wilcox.test(dat$english_score, alternative = "two.sided", mu = 7.5)$p.value
zval <- qnorm(pval/2, lower.tail = FALSE) # pval/2 because of two-tailed test
n <- nrow(dat)
(effectsize <- zval/sqrt(n))
```

```
# [1] 0.12052
```

- Research question: Do Dutch speakers of English distinguish /t/ from /θ/ ("th")?
- Hypothesis: The tongue position of Dutch speakers of English is more frontal when pronouncing /θ/ than /t/.
- \(H_0\): no (median) difference in frontal position between /t/ and /θ/
- \(H_a\): more frontal (median) position for /θ/ than for /t/

- Data: we randomly selected 19 Dutch participants who pronounced 10 minimal pairs /t/:/θ/, when connected to the articulography device:
- 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth', 'tent'-'tenth'
- 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
- For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words) and their difference

- \(t\)-test is not appropriate as \(n < 30\) and the distribution is not normal
- **Wilcoxon signed-rank test** is also not appropriate as the data is not symmetric
- **Sign test** is needed

- But first we illustrate the analysis using the **Wilcoxon signed-rank test** for paired data
- Our \(\alpha\)-level is set at 0.05 (one-tailed)

```
levels(datNL$Sound) # shows which level is first, and which is second
```

```
# [1] "T" "TH"
```

```
# formula interface with alternative='less': first level < second level?
wilcox.test(Frontness ~ Sound, data = datNL, paired = TRUE, alternative = "less")
```

```
#
# Wilcoxon signed rank test
#
# data: Frontness by Sound
# V = 57, p-value = 0.067
# alternative hypothesis: true location shift is less than 0
```

- Using a \(t\)-test instead would show a significant result (\(p = 0.04\); see last lecture)!

```
cliff.delta(Frontness ~ Sound, data = datNL) # effect size for Wilcoxon test
```

```
#
# Cliff's Delta
#
# delta estimate: -0.22438 (small)
# 95 percent confidence interval:
# inf sup
# -0.54654 0.15563
```

- According to the Wilcoxon signed-rank test we should retain the null hypothesis
- However, this test is not appropriate here
- Requirement: data symmetric (which was not the case)

**Important: take note of test assumptions!**

- \(t\)-test requires normality for small samples (\(\leq\) 30)
- Our dataset is too small, so the alternative is the **sign test**
  - Much less powerful (due to information loss)!

- Divides data into three classes: `+` (higher), `-` (lower) and `0` (no change)
- Use when distribution non-normal and asymmetric
- Compares the proportion of `+`'s to the proportion of `-`'s
- Tests whether the division is roughly chance-like:
  - \(H_0\): no weighting toward `+` (or `-`) (about the same number of `+`'s as `-`'s)
  - \(H_a\): weighting toward `+` (and/or `-`)
- Based on the binomial distribution \(B(n,p)\), with \(p = 0.5\)

- E.g., coin toss 100 times, record number of heads: \(B(100,0.5)\)
- For large samples: binomial distribution \(\approx\) normal distribution (\(z\) usable)
  - (Figure: red bars mark values 2 or more \(\sigma\) from mean \(\mu\))
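How good this normal approximation is can be checked numerically. A sketch (Python for concreteness; the lecture's own code is in R) comparing an exact tail of \(B(100, 0.5)\) with its normal approximation, using mean \(np = 50\) and sd \(\sqrt{np(1-p)} = 5\):

```python
import math

n, p = 100, 0.5
mean, sd = n * p, math.sqrt(n * p * (1 - p))  # 50 and 5

# Exact tail probability P(X >= 60) under B(100, 0.5).
exact = sum(math.comb(n, k) for k in range(60, n + 1)) / 2**n

# Normal approximation with continuity correction: P(Z >= (59.5 - 50) / 5).
z = (59.5 - mean) / sd
approx = 0.5 * math.erfc(z / math.sqrt(2))

print(round(exact, 4), round(approx, 4))  # both close to 0.03
```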

| subject | pos. /\(\theta\)/ | pos. /t/ | pos. /\(\theta\)/ - /t/ | sign |
|---|---|---|---|---|
| 1 | 0.738 | 0.781 | -0.043 | - |
| 2 | 0.767 | 0.766 | 0.001 | + |
| 3 | 0.879 | 0.884 | -0.005 | - |
| 4 | 0.761 | 0.748 | 0.013 | + |
| 5 | 0.774 | 0.748 | 0.027 | + |
| 6 | 0.749 | 0.752 | -0.003 | - |
| ... | ... | ... | ... | ... |

- 12 out of 19 Dutch speakers show more frontal positions for /θ/
- Significant at \(\alpha\)-level 0.05 (one-tailed)?

```
binom.test(x = 12, n = 19, p = 0.5, alternative = "greater")
```

```
#
# Exact binomial test
#
# data: 12 and 19
# number of successes = 12, number of trials = 19, p-value = 0.18
# alternative hypothesis: true probability of success is greater than 0.5
# 95 percent confidence interval:
# 0.41806 1.00000
# sample estimates:
# probability of success
# 0.63158
```
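The exact \(p\)-value reported by `binom.test()` is just a binomial tail sum, \(P(X \geq 12)\) under \(B(19, 0.5)\). A hand computation (sketched in Python for concreteness; the lecture's own code is in R):

```python
import math

n, k = 19, 12  # 12 of 19 speakers show a more frontal position for the TH sound

# One-tailed exact p-value: P(X >= 12) under B(19, 0.5).
p_value = sum(math.comb(n, i) for i in range(k, n + 1)) / 2**n
print(round(p_value, 2))  # 0.18, matching the binom.test() output above
```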

- Non-parametric tests are applied when the distribution is unknown or the required assumptions of the parametric test are violated
- They can also be applied to data assumed to be normally distributed, but the power to detect an effect is generally **lower**
  - Lower power caused by using ranks or signs rather than the actual values
- Often best option for nonnumeric data (next lecture: \(\chi^2\))
- We discussed: Mann-Whitney U test, Wilcoxon signed-rank test, and sign test

- Consider the following situation:
- It is suspected that the Spanish language proficiency of social workers in larger cities is different from that of social workers from smaller cities and towns (simply due to their different exposure to the language). Your company wishes to test this, since training programs may differ depending on proficiency levels. You obtain data from twenty social workers, ten from each group, and you wish to test whether the groups are different.

- We compare large cities (\(l\)) and small cities (\(s\))
- \(H_0: \mu_l = \mu_{s}\)
- \(H_a: \mu_l \neq \mu_{s}\)

- Hypothesis is two-sided (text indicates "different")
- One-sided example: is the group from the cities (with more exposure) **better**?
- Read problem statements carefully!
- How to test?

- We will test a hypothesis about differences in means in two different groups using a \(t\)-test for independent samples
- At least if the assumptions hold, otherwise we use the non-parametric alternative (Mann-Whitney U test)

- Data randomly selected from population ✓
- Data measured at interval scale (proficiency test) ✓
- Independent observations, also between groups ✓
- Observations roughly normally distributed in both groups (as \(n \leq\) 30)?
- How can we test normality?

- Test normality using normal quantile plot
- Show this when reporting results!

- If you are uncertain if the distribution is roughly normal, you can test this using the Shapiro-Wilk test for normality
- \(p\)-value of test \(< \alpha\): data cannot be assumed to be normal
- It is harder to reject the null hypothesis for small samples
- Use **in addition** to visualization, not **instead**

```
shapiro.test(spanish[spanish$Group == "LargeCity", ]$Score)
```

```
#
# Shapiro-Wilk normality test
#
# data: spanish[spanish$Group == "LargeCity", ]$Score
# W = 0.92, p-value = 0.35
```

- Results show:
- \(m_l = 28.4\)
- \(m_{s} = 26.2\)
- \(sd\) \(\approx 5\)
- \(t(18) = 0.98\)
- \(p = 0.34\)

- How to report?

We suspected that the Spanish language proficiency of social workers in larger cities was different from that of social workers from smaller cities and towns (simply due to their different exposure to the language). We wished to test this since training programs may differ depending on proficiency levels. Our \(H_0: \mu_l = \mu_{s}\) and our \(H_a: \mu_l \neq \mu_{s}\). We obtained data from twenty randomly selected social workers, ten from each group, verified that the samples were roughly normally distributed, and tested whether the groups differed in means (see Figure 1 for the box plots), obtaining \(t(18)= 0.98, p = 0.34\). The more urban group scored 28.4 and was 0.44 sd better than the other group with score 26.2 (Cohen's \(d\): medium effect). We retained the null hypothesis that the groups do *not* differ, as the \(p\)-value was higher than the \(\alpha\)-value (significance threshold) of 0.05.

- In this lecture, we've covered:
- Three different non-parametric tests:
- Mann-Whitney U test as alternative to independent samples \(t\)-test
- Wilcoxon signed-rank test as alternative to paired and single sample \(t\)-test
- Sign test as alternative to the Wilcoxon signed-rank test

- How to report statistical analyses

- Three different non-parametric tests:
- Next lecture: **Relating same-type variables** (\(\chi^2\) test, correlation, Cronbach's \(\alpha\))

Thank you for your attention!