Statistiek I

$t$-tests

Martijn Wieling
University of Groningen

Question 1: last lecture

Last lecture

  • How to reason about the population using a sample (CLT)
  • Calculating the standard error ($SE$)
    • Standard error: used when reasoning about the population using a sample
    • Standard deviation: used when comparing an individual to the population
  • Calculating a confidence interval
  • Specifying a concrete testable hypothesis based on a research question
  • Specifying the null ($H_0$) and alternative hypothesis ($H_a$)
  • Conducting a \(z\)-test and using the results to evaluate a hypothesis
  • Definition of a \(p\)-value: probability of the observed (or more extreme) data given that \(H_0\) is true
  • Evaluating the statistical significance given \(p\)-value and \(\alpha\)-level
  • Difference between a one-tailed and a two-tailed test
  • Type I and II errors

This lecture

  • Introduction to \(t\)-test
  • Three types of \(t\)-tests:
    • Single sample \(t\)-test
    • Independent samples \(t\)-test
    • Paired samples \(t\)-test
  • Effect size
  • How to report?

Introduction: \(t\)-test similar to \(z\)-test

  • Last lecture: \(z\)-test is used for comparing averages when \(\sigma\) is known
    • \(\sigma\) is only known for standardized tests, such as IQ tests
  • When \(\sigma\) is not known (in most cases), we can use the $t$-test
    • This test includes an estimation of \(\sigma\) based on sample standard deviation \(s\)

Calculating \(t\)-value

  • Very similar to calculating the \(z\)-value for a sample (using the standard error):

$$t = \frac{m - \mu}{s / \sqrt{n}} \hspace{70pt} z = \frac{m - \mu}{\sigma / \sqrt{n}}$$

  • Only difference: sample standard deviation \(s\) is used instead of \(\sigma\)
  • The precise formula depends on the type of \(t\)-test (independent samples, etc.)
    • (But for the exam, you only have to know the basic formulas shown above)
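
A minimal sketch in R (with made-up values) showing that only the denominator differs:
m <- 105; mu <- 100; n <- 25  # made-up sample values
s <- 12  # sample standard deviation (estimated from the sample)
sigma <- 15  # population standard deviation (only known for standardized tests)
(m - mu) / (s / sqrt(n))  # t-value
# [1] 2.0833
(m - mu) / (sigma / sqrt(n))  # z-value
# [1] 1.6667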

Obtaining \(p\)-values on the basis of \(t\)-values

  • \(z\)-values are compared to the standard normal distribution
  • But $t$-values are compared to the $t$-distribution
  • \(t\)-distributions look similar to the standard normal distribution
    • but their shape depends on the number of degrees of freedom (dF)

What are degrees of freedom?

  • There are five balloons, each with a different color
  • There are five students ($n = 5$) who need to select a balloon
    • Once 4 students have selected a balloon (dF = 4), the fifth student gets the last balloon
  • Similarly: if we have a fixed mean value calculated from 10 values
    • 9 values may vary in their value, but the 10th is fixed: dF = 10 - 1 = 9
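
The same idea as a minimal R sketch (values made up): with the mean of 10 values fixed, nine values can vary freely, but the tenth is then determined:
x9 <- c(4, 7, 5, 6, 8, 5, 7, 6, 4)  # nine freely chosen values
m <- 6  # fixed mean of all ten values
10 * m - sum(x9)  # the tenth value is fully determined
# [1] 8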

Question 2

\(t\)-distribution vs. normal distribution

  • Difference between normal distribution and \(t\)-distribution is large for small dFs
  • When dF \(\geq\) 100, the difference is negligible
  • As the shape differs, the \(p\)-value associated with a certain \(t\)-value also changes
  • That is why it is essential to specify dF when describing the results of a \(t\)-test:
    $t$(dF)

Visualizing \(t\)-distributions

[Figure: \(t\)-distributions for several values of dF compared to the standard normal distribution]

  • For significance (given \(\alpha\)), higher absolute \(t\)-values are needed than \(z\)-values (but only when dF < 100; otherwise \(z\) and \(t\) are practically equal)
qt(0.025, df = 10, lower.tail = F)  # crit. t-value for dF = 10 (0.025 per tail, i.e. two-tailed alpha = 0.05)
# [1] 2.2281

Question 3

Answer to question 3

pt(2, 10, lower.tail = F) * 2  # two-sided p-value = 2 * one-sided p-value
# [1] 0.073388
[Figure: \(t\)-distribution (dF = 10) with the two-tailed rejection region in dark gray]

  • Dark gray area: \(p\) < 0.05 (2-tailed)

Three types of \(t\)-tests

  • Single sample \(t\)-test: compare mean with fixed value
  • Independent sample \(t\)-test: compare the means of two independent groups
  • Paired \(t\)-test: compare pairs of (dependent) values (e.g., repeated measurements of same subjects)
  • Requirement for all \(t\)-tests: Data should be approximately normally distributed
    • Otherwise: use non-parametric tests (discussed in next lecture)

Single sample \(t\)-test

$$t = \frac{m - \mu}{s / \sqrt{n}}$$

  • Used to compare mean to fixed value
  • \(H_0\): \(\mu = \mu_0\) and \(H_a\): \(\mu \neq \mu_0\)
  • Larger absolute \(t\)-values give reason to reject \(H_0\)
  • Automatic calculation in R using function t.test()
  • Standardized effect size is measured as difference in standard deviations
    • Cohen’s \(d\): \(d = (m - \mu) / s\)

Assumptions for the single sample \(t\)-test

  • Data randomly selected from population
  • Data measured at interval or ratio scale
  • Observations are independent
  • Observations are approximately normally distributed
    • But \(t\)-test is robust to non-normality for larger samples ($n > 30$)

Single sample \(t\)-test: example

  • Given our English proficiency data, we’d like to assess if the average English score is different from 7.5
  • \(H_0\): \(\mu = 7.5\) and \(H_a\): \(\mu \neq 7.5\)
  • We use \(\alpha\) = 0.05
  • Sample mean \(m\) = 7.62
  • Sample standard deviation \(s\) = 0.92
  • Sample size \(n\) = 500
  • Degrees of freedom of \(t\)-test equals 500 - 1 = 499

Step 1: \(t\)-test assumptions met?

  • Data randomly selected from population ?
  • Data measured at interval scale ✓
  • Independent observations ✓
  • Data roughly normally distributed (or > 30 observations) ✓
[Figure: distribution of the English scores (normality check)]

Step 2: visualization

boxplot(dat$english_score)  # box plot of the English scores
abline(h = 7.5, lty = 2, lwd = 2)  # dashed reference line at mu = 7.5

Step 3: calculation of \(t\)-value (and \(p\)-value)

$$t = \frac{m - \mu}{s / \sqrt{n}}$$

  • with \(\mu\) = 7.5, \(m\) = 7.62, \(s\) = 0.92, \(n\) = 500
  • \(t = (7.62 - 7.5) / (0.92 / \sqrt{500}) \approx 2.9\) (R reports \(t = 2.86\), as it computes with unrounded sample values)
  • Of course we will use R to calculate the \(t\)-value automatically, but you also need to be able to calculate the \(t\)-value manually on the exam (with simple values)!
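
A sketch of this calculation in R (using the unrounded sample mean 7.6178), with the \(p\)-value obtained from the \(t\)-distribution:
t_val <- (7.6178 - 7.5) / (0.92 / sqrt(500))
t_val
# [1] 2.8631
pt(t_val, df = 499, lower.tail = F) * 2  # two-sided p-value
# close to the 0.0045 reported by t.test(), which also uses the unrounded s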

Question 4

Automatic calculation of \(t\)-value and \(p\)-value in R

t.test(dat$english_score, alternative = "two.sided", mu = 7.5)
# 
# 	One Sample t-test
# 
# data:  dat$english_score
# t = 2.86, df = 499, p-value = 0.0045
# alternative hypothesis: true mean is not equal to 7.5
# 95 percent confidence interval:
#  7.5368 7.6988
# sample estimates:
# mean of x 
#    7.6178
  • \(p\)-value < \(\alpha\): reject \(H_0\) and accept \(H_a\)

Final step (4): calculation of effect size

  • Effect size is used to quantify the magnitude of the difference (i.e. the effect):
    • Significant results may be uninteresting if the effect is only small
    • For interpretability, effect size may also be reported in practical terms, e.g. as the score difference
  • Cohen’s \(d\) can be used as a standardized measure of effect size for the \(t\)-test
    • Cohen’s \(d = (m - \mu) / s = (7.62 - 7.5) / 0.92 = 0.13\)
    • Effect sizes ($|d|$): negligible (< 0.2), small (< 0.5), medium (< 0.8), large ($\geq$ 0.8)
  • See also: https://rpsychologist.com/d3/cohend/
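
As a quick check, the same calculation in R:
(7.62 - 7.5) / 0.92  # Cohen's d: negligible (|d| < 0.2)
# [1] 0.13043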

Question 5

Effect size and sample size (1)

  • Statistical significance: effect unlikely to have arisen by chance
  • Very small differences may be significant when samples are large
  • As we saw when discussing the normal distribution: a difference of two or more standard errors arises less than 5% of the time by chance
    • Standard error is reduced for larger samples (divided by \(\sqrt{n}\))

Effect size and sample size (2)

difference (in \(s\))   \(n\)     \(p\)
0.01                   40,000    0.05
0.10                   400       0.05
0.25                   64        0.05
0.38                   30        0.05
0.54                   16        0.05
  • We recommend samples of “about 30”: effects too small to detect with such samples are usually uninteresting, unless even small differences are important (e.g., health)
    • Take note of effect size when reading research reports!
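
The table can be reproduced in R (a sketch, assuming a two-sided single sample test): the smallest significant difference in units of \(s\) equals the critical \(t\)-value times \(1/\sqrt{n}\), rounded up here to two decimals:
n <- c(40000, 400, 64, 30, 16)
ceiling(qt(0.975, df = n - 1) / sqrt(n) * 100) / 100  # difference (in s)
# [1] 0.01 0.10 0.25 0.38 0.54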

Independent samples \(t\)-test

$$t = \frac{m_1 - m_2}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

  • Used to compare means of two different groups
  • \(H_0\) is always \(\mu_1 = \mu_2\), i.e. both populations have the same mean
  • Two-sided \(H_a\): \(\mu_1 \neq \mu_2\) (one-sided: \(H_a\): \(\mu_1 < \mu_2\) or \(H_a\): \(\mu_1 > \mu_2\))
  • Degrees of freedom: \((n_1 - 1) + (n_2 - 1)\) (when assuming equal variance in both populations)
  • (You don’t have to know the formula by heart, it is shown here to help you understand what is going on)
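
A sketch with simulated data, writing out the pooled standard deviation \(s_p\) (assuming equal variances; the result matches t.test(g1, g2, var.equal = TRUE)):
set.seed(1)  # simulated data, for illustration only
g1 <- rnorm(40, mean = 8.1, sd = 0.9)
g2 <- rnorm(40, mean = 7.6, sd = 0.9)
n1 <- length(g1); n2 <- length(g2)
s_p <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2))  # pooled sd
(mean(g1) - mean(g2)) / (s_p * sqrt(1/n1 + 1/n2))  # t-value, dF = n1 + n2 - 2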

Independent samples \(t\)-test assumptions

  • Data randomly selected from population
  • Data measured at interval or ratio scale
  • Observations are independent, both within and between the groups
  • Variances are homogeneous (i.e. values are spread out similarly in both groups)
    • In practice this can be ignored, as t.test() applies Welch’s adjustment by default to correct for unequal variances (more conservative: degrees of freedom are reduced)
  • Observations in both samples are approximately normally distributed
    • But \(t\)-test is robust to non-normality for larger samples ($n > 30$)

Independent samples \(t\)-test: example

  • Given our English proficiency data, we’d like to assess if the average English score is higher for those who followed bilingual education as opposed to monolingual education
  • \(H_0\): \(\mu_b = \mu_m\) and \(H_a\): \(\mu_b > \mu_m\)
  • We use \(\alpha\) = 0.05
  • Sample mean \(m_b\) = 8.15
  • Sample mean \(m_m\) = 7.56

Step 1: \(t\)-test assumptions met?

  • Data randomly selected from population ?
  • Data measured at interval scale ✓
  • Independent observations ✓
  • Data roughly normally distributed (or > 30 observations) ✓
[Figure: distribution of the English scores per group (normality check)]

Step 2: visualization

dat$bl_edu = relevel(dat$bl_edu,'Y') # make 'Y' first level (default is 'N')
boxplot(english_score ~ bl_edu, data=dat) # formula notation is easy to use
# or: boxplot(dat[dat$bl_edu=='Y',]$english_score, dat[dat$bl_edu=='N',]$english_score)

Step 3: calculation of \(t\)-value and \(p\)-value

t.test(english_score ~ bl_edu, data = dat, alternative = "greater")  # 1st > 2nd level?
# 
# 	Welch Two Sample t-test
# 
# data:  english_score by bl_edu
# t = 3.92, df = 54.3, p-value = 0.00013
# alternative hypothesis: true difference in means between group Y and group N is greater than 0
# 95 percent confidence interval:
#  0.33529     Inf
# sample estimates:
# mean in group Y mean in group N 
#          8.1483          7.5627
  • \(p\)-value < \(\alpha\): reject \(H_0\) and accept \(H_a\)
  • Note that dF is much lower than ($n_1$ - 1) + ($n_2$ - 1) due to the correction for unequal variances

Step 4: effect size

  • Cohen’s \(d = (m_1 - m_2) / s\)
  • But with two samples there are two standard deviations…
  • For Cohen’s \(d\), a single (pooled) standard deviation is necessary
    • Instead of calculating this manually, we use the function cohen.d
library(effsize)  # to install: install.packages('effsize')
cohen.d(english_score ~ bl_edu, data = dat)
# 
# Cohen's d
# 
# d estimate: 0.64563 (medium)
# 95 percent confidence interval:
#   lower   upper 
# 0.34188 0.94937

Paired \(t\)-test

  • Used to compare means of two sets of data
  • Data in both sets are from same individuals: paired
  • Analysis is more powerful than independent samples \(t\)-test
    • Sources of individual variation cancelled out via pairwise comparison
      • Of course still many sources of variation present: e.g., measurement error

Paired \(t\)-test: approach

  • Calculate the pairwise differences and test whether the average difference differs from 0:
    \(H_0: \mu_{(x_i-y_i)} = 0\) and \(H_a: \mu_{(x_i-y_i)} \ne 0\)
  • Similar to single sample \(t\)-test of differences:
    1. Calculate differences
    2. Use single sample \(t\)-test to assess if mean of differences is different from 0
  • Degrees of freedom: \(n_{\textrm{pairs}} - 1\)
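
A sketch of this equivalence with made-up paired measurements:
x <- c(5.1, 6.0, 5.6, 7.2, 6.8, 5.9)  # made-up first measurements
y <- c(4.8, 5.7, 5.8, 6.9, 6.1, 5.5)  # made-up second measurements (same subjects)
t.test(x, y, paired = TRUE)$statistic  # paired t-test
t.test(x - y, mu = 0)$statistic  # single sample t-test of the differences
# both yield the same t-value, with dF = 6 - 1 = 5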

Paired \(t\)-test assumptions

  • Subjects randomly sampled from population
  • Data measured at interval or ratio scale
  • Observations are independent within each group
  • Differences between paired values are approximately normally distributed
    • But \(t\)-test is robust to non-normality for larger samples ($n_{\textrm{pairs}} > 30$)
  • Paired \(t\)-test is inappropriate when scores differ in scale, e.g., comparing percentages with grades
    • Consider regression for those cases (not covered in this course)

Paired \(t\)-test data: Lowlands Science 2019 experiment

Paired \(t\)-test: Lowlands Science 2019 data

  • Question: do people converge in their speech after interaction?
  • Data collected at Lowlands Science 2019
  • Two measurements per pair: difference in /a/ vowel formants (between the two players) at the first and the last trial of the game
  • \(H_0: \mu_{(f_i-l_i)} = 0\) and \(H_a: \mu_{(f_i-l_i)} > 0\)
  • We use \(\alpha\) = 0.05
  • Sample mean \(m_f\) = 108.5
  • Sample mean \(m_l\) = 96.63

Step 1: \(t\)-test assumptions met?

  • Data randomly selected from population ?
  • Data measured at interval scale ✓
  • Independent observations ✓
  • Differences roughly normally distributed (or > 30 pairs: here 74 pairs) ✓
[Figure: distribution of the pairwise differences (normality check)]

Step 2: visualization

boxplot(lls$Diff1, lls$Diff2, names = c("First trial", "Last trial"))

Question 6

Step 3: calculation of \(t\)-value and \(p\)-value

t.test(lls$Diff1, lls$Diff2, paired = TRUE, alternative = "greater")
# 
# 	Paired t-test
# 
# data:  lls$Diff1 and lls$Diff2
# t = 2.31, df = 73, p-value = 0.012
# alternative hypothesis: true mean difference is greater than 0
# 95 percent confidence interval:
#  3.2997    Inf
# sample estimates:
# mean difference 
#           11.87
  • \(p\)-value < \(\alpha\): reject \(H_0\) and accept \(H_a\)

Step 4: effect size

cohen.d(lls$Diff1, lls$Diff2, paired = T)
# 
# Cohen's d
# 
# d estimate: 0.19936 (negligible)
# 95 percent confidence interval:
#    lower    upper 
# 0.026919 0.371803

Paired data incorrectly analyzed

  • What if we would have incorrectly analyzed the data using an independent samples \(t\)-test?
t.test(lls$Diff1, lls$Diff2, paired = FALSE, alternative = "greater")$statistic
#      t 
# 1.2138
# with the paired t-test:
t.test(lls$Diff1, lls$Diff2, paired = TRUE, alternative = "greater")$statistic
#      t 
# 2.3074
  • The independent samples \(t\)-test generally results in a lower absolute \(t\)-value

Question 7

Paired data incorrectly analyzed: lesson

  • More sophisticated statistical tests are more sensitive to the structure of the data:
    • Incorrectly using the independent samples \(t\)-test (ignoring the pairing) increases the probability of a type-II error
      • Increased chance of not rejecting a false null hypothesis

Decision tree ($z$-test vs. \(t\)-test)

[Figure: decision tree for choosing between the \(z\)-test and the \(t\)-test]

\(t\)-test: summary (1)

Simple \(t\) statistic:

$$t = \; \frac{m_1 - m_2}{s/\sqrt{n}}$$

  • For numeric data: compares means of two groups (or two series of values), or one mean versus a fixed value, and determines whether the difference is significant
  • Population statistics ($\sigma$) unnecessary, but sample statistics needed
  • Three applications:
    • Single sample: compares mean of sample to fixed value
    • Independent (i.e. unrelated) samples: compares two means
    • Paired: compares pairs of values (= single sample test of differences)

\(t\)-test: summary (2)

  • Assumptions with all \(t\)-tests:
    • Distribution roughly normal if \(n \leq 30\)
    • Randomly selected data
    • Data at interval or ratio scale
    • Data independent within one series of values, and (if independent samples \(t\)-test) between both groups
  • Additionally
    • Report effect size using Cohen’s \(d = (m_1 - m_2)/s\)

Example of reporting results of \(t\)-test

  • For the example of the independent samples \(t\)-test:

We tested whether the average English score of students taking Statistiek I was significantly higher for those who had bilingual education than for those who did not. Our hypotheses were: $H_0$: \(\mu_b = \mu_m\) and $H_a$: \(\mu_b > \mu_m\). We obtained English scores in a sample of 500 students of the Statistiek I course via an online questionnaire. Since \(\sigma\) was unknown and the samples were independent, we conducted an independent samples \(t\)-test (corrected for unequal variances) after verifying that the assumptions for the test were met (normally distributed data, or more than 30 values per group). The mean English score for the students with bilingual education in the sample was 8.15, whereas it was 7.56 for those who followed monolingual education. The effect size was medium (Cohen’s \(d\): 0.65; see box plot), and the difference was significant at $\alpha$-level 0.05: $t$(54.3) = 3.92, \(p\) < 0.001. We therefore reject the null hypothesis and accept the alternative hypothesis that students who had bilingual education have higher English scores than those who did not.

Question 8

How to report results: guidelines

  1. State the issue in terms of the population(s) (not merely the samples)
  2. Formulate \(H_0\) and \(H_a\)
  3. State how your hypothesis is to be tested, how samples were obtained (including sample size), and what procedures (test materials) were used to obtain measurements
  4. Identify the \(\alpha\)-level and the statistical test to be used, and indicate why
  5. Illustrate your research question graphically, if possible (e.g., box plots)
  6. Present the results of the study on the sample, the \(p\)-value, whether the result is significant or not, and an effect size
  7. State conclusions about the hypotheses
  8. Discuss and interpret your results

Practice this in laboratory exercises!

Another real world example (time permitting)

Obtaining data

Recorded data

Study - part I: native English

  • Research question: Do native English speakers distinguish /t/ from /θ/ (“th”) with their tongue?
  • Hypothesis: The tongue position of English native speakers is more frontal when pronouncing /θ/ than /t/.
    • \(H_0: \mu_{(th_i-t_i)} = 0\) (no difference in frontal position)
    • \(H_a: \mu_{(th_i-t_i)} > 0\) (more frontal position for /θ/ than for /t/)

Data set used in the study

  • We randomly selected 22 English participants who pronounced 10 minimal pairs /t/:/θ/, when connected to the articulography device:
    • ‘fate’-‘faith’, ‘fort’-‘forth’, ‘kit’-‘kith’, ‘mitt’-‘myth’, ‘tent’-‘tenth’
    • ‘tank’-‘thank’, ‘team’-‘theme’, ‘tick’-‘thick’, ‘ties’-‘thighs’, ‘tongs’-‘thongs’
    • For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words)

Distribution of differences: native English speakers

[Figure: distribution of the pairwise differences (native English speakers)]

Which analysis?

  • We used a paired \(t\)-test to assess the hypothesis as our data consists of two measurement points per speaker, and the differences were approximately normally distributed
  • We used an \(\alpha\)-level of 0.05 (one-tailed)

Visualization: native English speakers

[Figure: frontal tongue position for /θ/- and /t/-words (native English speakers)]

Paired \(t\)-test: native English speakers

datEN$Sound = relevel(datEN$Sound, "TH")  # set TH as reference level
t.test(FrontPos ~ Sound, data = datEN, paired = T, alternative = "greater")  # paired
# 
# 	Paired t-test
# 
# data:  FrontPos by Sound
# t = 6.4, df = 21, p-value = 1.2e-06
# alternative hypothesis: true mean difference is greater than 0
# 95 percent confidence interval:
#  0.035207      Inf
# sample estimates:
# mean difference 
#        0.048154

Effect size and conclusion

cohen.d(FrontPos ~ Sound, data = datEN, paired = T)$estimate  # large effect size
# [1] 0.94206
  • Native English speakers have significantly more frontal tongue positions for /θ/-words than for /t/-words

Study - part II: non-native English

  • Research question: Do Dutch speakers of English distinguish /t/ from /θ/ (“th”) with their tongue?
  • Hypothesis: The tongue position of Dutch speakers of English is more frontal when pronouncing /θ/ than /t/.
    • \(H_0: \mu_{(th_i-t_i)} = 0\) (no difference in frontal position)
    • \(H_a: \mu_{(th_i-t_i)} > 0\) (more frontal position for /θ/ than for /t/)

Data set used in the study

  • We randomly selected 19 Dutch participants who pronounced 10 minimal pairs /t/:/θ/, when connected to the articulography device:
    • ‘fate’-‘faith’, ‘fort’-‘forth’, ‘kit’-‘kith’, ‘mitt’-‘myth’, ‘tent’-‘tenth’
    • ‘tank’-‘thank’, ‘team’-‘theme’, ‘tick’-‘thick’, ‘ties’-‘thighs’, ‘tongs’-‘thongs’
    • For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words)

Distribution of differences: Dutch speakers

(Note that the distribution is not normal!)

[Figure: distribution of the pairwise differences (Dutch speakers)]

Which analysis?

  • We used a paired \(t\)-test to assess the hypothesis as our data consists of two measurement points per speaker
    • Though note that this analysis actually is not appropriate here, as the data is not normally distributed (a non-parametric alternative should be used: next lecture)
  • We used an \(\alpha\)-level of 0.05 (one-tailed)

Visualization: Dutch speakers

[Figure: frontal tongue position for /θ/- and /t/-words (Dutch speakers)]

Paired \(t\)-test: Dutch speakers

datNL$Sound = relevel(datNL$Sound, "TH")
t.test(FrontPos ~ Sound, data = datNL, paired = T, alternative = "greater")
# 
# 	Paired t-test
# 
# data:  FrontPos by Sound
# t = 1.86, df = 18, p-value = 0.04
# alternative hypothesis: true mean difference is greater than 0
# 95 percent confidence interval:
#  0.0010611       Inf
# sample estimates:
# mean difference 
#        0.016263

Effect size and conclusion

cohen.d(FrontPos ~ Sound, data = datNL, paired = T)$estimate  # small effect size
# [1] 0.32382
  • Dutch speakers have a significantly more frontal tongue position for /θ/-words than for /t/-words

Interpretation incorrect!

  • We rejected the null hypothesis for the Dutch group
  • But this is incorrect!
    • We used an inappropriate test: \(t\)-test while distribution was non-normal and contained large outliers
    • Using a non-parametric test (next lecture) results in retaining \(H_0\):
      • Dutch speakers do not have more frontal tongue positions for /θ/-words than for /t/-words
  • Lesson: take note of test assumptions!

Question 9

Note about multiple testing

  • Using multiple tests risks finding significance through sheer chance
  • Suppose you run two tests (as we did here), always using \(\alpha\) = 0.05
    • Chance of finding one or more significant values (family-wise error rate) is: \(1 - (1 - \alpha)^2\) = \(1 - 0.95^2 = 0.0975\) (almost twice as high as we’d like!)
  • To guarantee a family-wise error rate of 0.05, we should divide \(\alpha\) by the number of tests: Bonferroni correction
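
A sketch in R, applying the built-in function p.adjust to the two (one-sided) \(p\)-values of the tongue position study:
1 - (1 - 0.05)^2  # family-wise error rate for two tests at alpha = 0.05
# [1] 0.0975
p.adjust(c(1.2e-06, 0.04), method = "bonferroni")  # multiplies each p by 2 (max. 1)
# [1] 2.4e-06 8.0e-02
# after correction, the Dutch result (p = 0.08) would not be significant at 0.05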

Recap

  • In this lecture, we’ve covered
    • the \(t\)-test (three variants)
    • how to calculate the effect size (Cohen’s \(d\))
    • how to report results of a statistical test
    • the problem of multiple testing
  • Experiment yourself: https://eolomea.let.rug.nl/Statistiek-I/HC4 (login with s-nr)
  • Next lecture: Non-parametric alternatives

Please evaluate this lecture!

Exam question

Questions?

Thank you for your attention!

https://www.martijnwieling.nl
m.b.wieling@rug.nl