Statistiek I

$t$-tests

Martijn Wieling
University of Groningen

Question 1: last lecture

Last lecture

  • How to reason about the population using a sample (CLT)
  • Calculating the standard error ($SE$)
    • Standard error: used when reasoning about the population using a sample
    • Standard deviation: used when comparing an individual to the population
  • Calculating a confidence interval
  • Specifying a concrete testable hypothesis based on a research question
  • Specifying the null ($H_0$) and alternative hypothesis ($H_a$)
  • Conducting a \(z\)-test and using the results to evaluate a hypothesis
  • Definition of a \(p\)-value: probability of the observed (or more extreme) data given that \(H_0\) is true
  • Evaluating the statistical significance given \(p\)-value and \(\alpha\)-level
  • Difference between a one-tailed and a two-tailed test
  • Type I and II errors

This lecture

  • Introduction to \(t\)-test
  • Three types of \(t\)-tests:
    • Single sample \(t\)-test
    • Independent samples \(t\)-test
    • Paired samples \(t\)-test
  • Effect size
  • How to report?

Introduction: \(t\)-test similar to \(z\)-test

  • Last lecture: \(z\)-test is used for comparing averages when \(\sigma\) is known
    • \(\sigma\) is only known for standardized tests, such as IQ tests
  • When \(\sigma\) is not known (in most cases), we can use the $t$-test
    • This test includes an estimation of \(\sigma\) based on sample standard deviation \(s\)

Calculating \(t\)-value

  • Very similar to calculating the \(z\)-value for a sample (using the standard error):

$$t = \frac{m - \mu}{s / \sqrt{n}} \hspace{70pt} z = \frac{m - \mu}{\sigma / \sqrt{n}}$$

  • Only difference: sample standard deviation \(s\) is used instead of \(\sigma\)
  • The precise formula depends on the type of \(t\)-test (independent samples, etc.)
    • (But for the exam, you only have to know the basic formulas shown above)
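
A minimal sketch in R (with made-up values) showing that only the denominator differs:
m <- 105; mu <- 100; n <- 25  # made-up sample values
s <- 12  # sample standard deviation (estimated from the sample)
sigma <- 15  # population standard deviation (only known for standardized tests)
(m - mu) / (s / sqrt(n))  # t-value
# [1] 2.0833
(m - mu) / (sigma / sqrt(n))  # z-value
# [1] 1.6667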

Obtaining \(p\)-values on the basis of \(t\)-values

  • \(z\)-values are compared to the standard normal distribution
  • But $t$-values are compared to the $t$-distribution
  • \(t\)-distributions look similar to the standard normal distribution
    • but their shape depends on the number of degrees of freedom (dF)

What are degrees of freedom?

  • There are five balloons, each with a different color
  • There are five students ($n = 5$) who need to select a balloon
    • Once 4 students have selected a balloon (dF = 4), the fifth student gets the last balloon
  • Similarly: if we have a fixed mean value calculated from 10 values
    • 9 values may vary in their value, but the 10th is fixed: dF = 10 - 1 = 9
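
The same idea as a minimal R sketch (values made up): with the mean of 10 values fixed, nine values can vary freely, but the tenth is then determined:
x9 <- c(4, 7, 5, 6, 8, 5, 7, 6, 4)  # nine freely chosen values
m <- 6  # fixed mean of all ten values
10 * m - sum(x9)  # the tenth value is fully determined
# [1] 8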

Question 2

\(t\)-distribution vs. normal distribution

  • Difference between normal distribution and \(t\)-distribution is large for small dFs
  • When dF \(\geq\) 100, the difference is negligible
  • As the shape differs, the \(p\)-value associated with a certain \(t\)-value also changes
  • That is why it is essential to specify dF when describing the results of a \(t\)-test:
    $t$(dF)

Visualizing \(t\)-distributions

[Figure: \(t\)-distributions for several values of dF compared to the standard normal distribution]

  • For significance (given \(\alpha\)), higher absolute \(t\)-values are needed than \(z\)-values (but only when dF < 100; otherwise \(z\) and \(t\) are practically equal)
qt(0.025, df = 10, lower.tail = F)  # crit. t-value for dF = 10 (0.025 per tail, i.e. two-tailed alpha = 0.05)
# [1] 2.2281

Question 3

Answer to question 3

pt(2, 10, lower.tail = F) * 2  # two-sided p-value = 2 * one-sided p-value
# [1] 0.073388
[Figure: \(t\)-distribution (dF = 10) with the two-tailed rejection region in dark gray]

  • Dark gray area: \(p\) < 0.05 (2-tailed)

Three types of \(t\)-tests

  • Single sample \(t\)-test: compare mean with fixed value
  • Independent sample \(t\)-test: compare the means of two independent groups
  • Paired \(t\)-test: compare pairs of (dependent) values (e.g., repeated measurements of same subjects)
  • Requirement for all \(t\)-tests: Data should be approximately normally distributed
    • Otherwise: use non-parametric tests (discussed in next lecture)

Single sample \(t\)-test

$$t = \frac{m - \mu}{s / \sqrt{n}}$$

  • Used to compare mean to fixed value
  • \(H_0\): \(\mu = \mu_0\) and \(H_a\): \(\mu \neq \mu_0\)
  • Larger absolute \(t\)-values give reason to reject \(H_0\)
  • Automatic calculation in R using function t.test()
  • Standardized effect size is measured as difference in standard deviations
    • Cohen’s \(d\): \(d = (m - \mu) / s\)

Assumptions for the single sample \(t\)-test

  • Data randomly selected from population
  • Data measured at interval or ratio scale
  • Observations are independent
  • Observations are approximately normally distributed
    • But \(t\)-test is robust to non-normality for larger samples ($n > 30$)

Single sample \(t\)-test: example

  • Given our English proficiency data, we’d like to assess if the average English score is different from 7.5
  • \(H_0\): \(\mu = 7.5\) and \(H_a\): \(\mu \neq 7.5\)
  • We use \(\alpha\) = 0.05
  • Sample mean \(m\) = 7.62
  • Sample standard deviation \(s\) = 0.92
  • Sample size \(n\) = 500
  • Degrees of freedom of \(t\)-test equals 500 - 1 = 499

Step 1: \(t\)-test assumptions met?

  • Data randomly selected from population ?
  • Data measured at interval scale ✓
  • Independent observations ✓
  • Data roughly normally distributed (or > 30 observations) ✓
[Figure: distribution of the English scores (normality check)]

Step 2: visualization

boxplot(dat$english_score)  # box plot of the English scores
abline(h = 7.5, lty = 2, lwd = 2)  # dashed reference line at mu = 7.5

Step 3: calculation of \(t\)-value (and \(p\)-value)

$$t = \frac{m - \mu}{s / \sqrt{n}}$$

  • with \(\mu\) = 7.5, \(m\) = 7.62, \(s\) = 0.92, \(n\) = 500
  • \(t = (7.62 - 7.5) / (0.92 / \sqrt{500}) \approx 2.9\) (R reports \(t = 2.86\), as it computes with unrounded sample values)
  • Of course we will use R to calculate the \(t\)-value automatically, but you also need to be able to calculate the \(t\)-value manually on the exam (with simple values)!
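
A sketch of this calculation in R (using the unrounded sample mean 7.6178), with the \(p\)-value obtained from the \(t\)-distribution:
t_val <- (7.6178 - 7.5) / (0.92 / sqrt(500))
t_val
# [1] 2.8631
pt(t_val, df = 499, lower.tail = F) * 2  # two-sided p-value
# close to the 0.0045 reported by t.test(), which also uses the unrounded s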

Question 4

Automatic calculation of \(t\)-value and \(p\)-value in R

t.test(dat$english_score, alternative = "two.sided", mu = 7.5)
# 
# 	One Sample t-test
# 
# data:  dat$english_score
# t = 2.86, df = 499, p-value = 0.0045
# alternative hypothesis: true mean is not equal to 7.5
# 95 percent confidence interval:
#  7.5368 7.6988
# sample estimates:
# mean of x 
#    7.6178
  • \(p\)-value < \(\alpha\): reject \(H_0\) and accept \(H_a\)

Final step (4): calculation of effect size

  • Effect size is used to quantify the magnitude of the difference (i.e. the effect):
    • Significant results may be uninteresting if the effect is only small
    • For interpretability, effect size may also be reported in practical terms, e.g. as the score difference
  • Cohen’s \(d\) can be used as a standardized measure of effect size for the \(t\)-test
    • Cohen’s \(d = (m - \mu) / s = (7.62 - 7.5) / 0.92 = 0.13\)
    • Effect sizes ($|d|$): negligible (< 0.2), small (< 0.5), medium (< 0.8), large ($\geq$ 0.8)
  • See also: https://rpsychologist.com/d3/cohend/
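
As a quick check, the same calculation in R:
(7.62 - 7.5) / 0.92  # Cohen's d: negligible (|d| < 0.2)
# [1] 0.13043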

Question 5

Effect size and sample size (1)

  • Statistical significance: effect unlikely to have arisen by chance
  • Very small differences may be significant when samples are large
  • As we saw when discussing the normal distribution: a difference of two or more standard errors arises less than 5% of the time by chance
    • Standard error is reduced for larger samples (divided by \(\sqrt{n}\))

Effect size and sample size (2)

difference (in \(s\))   \(n\)     \(p\)
0.01                   40,000    0.05
0.10                   400       0.05
0.25                   64        0.05
0.38                   30        0.05
0.54                   16        0.05
  • We recommend samples of “about 30”: effects too small to detect with such samples are usually uninteresting, unless even small differences are important (e.g., health)
    • Take note of effect size when reading research reports!
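
The table can be reproduced in R (a sketch, assuming a two-sided single sample test): the smallest significant difference in units of \(s\) equals the critical \(t\)-value times \(1/\sqrt{n}\), rounded up here to two decimals:
n <- c(40000, 400, 64, 30, 16)
ceiling(qt(0.975, df = n - 1) / sqrt(n) * 100) / 100  # difference (in s)
# [1] 0.01 0.10 0.25 0.38 0.54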

Independent samples \(t\)-test

$$t = \frac{m_1 - m_2}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

  • Used to compare means of two different groups
  • \(H_0\) is always \(\mu_1 = \mu_2\), i.e. both populations have the same mean
  • Two-sided \(H_a\): \(\mu_1 \neq \mu_2\) (one-sided: \(H_a\): \(\mu_1 < \mu_2\) or \(H_a\): \(\mu_1 > \mu_2\))
  • Degrees of freedom: \((n_1 - 1) + (n_2 - 1)\) (when assuming equal variance in both populations)
  • (You don’t have to know the formula by heart, it is shown here to help you understand what is going on)
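
A sketch with simulated data, writing out the pooled standard deviation \(s_p\) (assuming equal variances; the result matches t.test(g1, g2, var.equal = TRUE)):
set.seed(1)  # simulated data, for illustration only
g1 <- rnorm(40, mean = 8.1, sd = 0.9)
g2 <- rnorm(40, mean = 7.6, sd = 0.9)
n1 <- length(g1); n2 <- length(g2)
s_p <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2))  # pooled sd
(mean(g1) - mean(g2)) / (s_p * sqrt(1/n1 + 1/n2))  # t-value, dF = n1 + n2 - 2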

Independent samples \(t\)-test assumptions

  • Data randomly selected from population
  • Data measured at interval or ratio scale
  • Observations are independent, both within and between the groups
  • Variances are homogeneous (i.e. values are spread out similarly in both groups)
    • In practice this can be ignored, as t.test() applies Welch’s adjustment by default to correct for unequal variances (more conservative: degrees of freedom are reduced)
  • Observations in both samples are approximately normally distributed
    • But \(t\)-test is robust to non-normality for larger samples ($n > 30$)

Independent samples \(t\)-test: example

  • Given our English proficiency data, we’d like to assess if the average English score is higher for those who followed bilingual education as opposed to monolingual education
  • \(H_0\): \(\mu_b = \mu_m\) and \(H_a\): \(\mu_b > \mu_m\)
  • We use \(\alpha\) = 0.05
  • Sample mean \(m_b\) = 8.15
  • Sample mean \(m_m\) = 7.56

Step 1: \(t\)-test assumptions met?

  • Data randomly selected from population ?
  • Data measured at interval scale ✓
  • Independent observations ✓
  • Data roughly normally distributed (or > 30 observations) ✓
[Figure: distribution of the English scores per group (normality check)]

Step 2: visualization

dat$bl_edu = relevel(dat$bl_edu,'Y') # make 'Y' first level (default is 'N')
boxplot(english_score ~ bl_edu, data=dat) # formula notation is easy to use
# or: boxplot(dat[dat$bl_edu=='Y',]$english_score, dat[dat$bl_edu=='N',]$english_score)

Step 3: calculation of \(t\)-value and \(p\)-value

t.test(english_score ~ bl_edu, data = dat, alternative = "greater")  # 1st > 2nd level?
# 
# 	Welch Two Sample t-test
# 
# data:  english_score by bl_edu
# t = 3.92, df = 54.3, p-value = 0.00013
# alternative hypothesis: true difference in means between group Y and group N is greater than 0
# 95 percent confidence interval:
#  0.33529     Inf
# sample estimates:
# mean in group Y mean in group N 
#          8.1483          7.5627
  • \(p\)-value < \(\alpha\): reject \(H_0\) and accept \(H_a\)
  • Note that dF is much lower than ($n_1$ - 1) + ($n_2$ - 1) due to the correction for unequal variances

Step 4: effect size

  • Cohen’s \(d = (m_1 - m_2) / s\)
  • But with two samples there are two standard deviations…
  • For Cohen’s \(d\), a single (pooled) standard deviation is necessary
    • Instead of calculating this manually, we use the function cohen.d
library(effsize)  # to install: install.packages('effsize')
cohen.d(english_score ~ bl_edu, data = dat)
# 
# Cohen's d
# 
# d estimate: 0.64563 (medium)
# 95 percent confidence interval:
#   lower   upper 
# 0.34188 0.94937

Paired \(t\)-test

  • Used to compare means of two sets of data
  • Data in both sets are from same individuals: paired
  • Analysis is more powerful than independent samples \(t\)-test
    • Sources of individual variation cancelled out via pairwise comparison
      • Of course still many sources of variation present: e.g., measurement error

Paired \(t\)-test: approach

  • Calculate the pairwise differences and test whether the average difference differs from 0:
    \(H_0: \mu_{(x_i-y_i)} = 0\) and \(H_a: \mu_{(x_i-y_i)} \ne 0\)
  • Similar to single sample \(t\)-test of differences:
    1. Calculate differences
    2. Use single sample \(t\)-test to assess if mean of differences is different from 0
  • Degrees of freedom: \(n_{\textrm{pairs}} - 1\)
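
A sketch of this equivalence with made-up paired measurements:
x <- c(5.1, 6.0, 5.6, 7.2, 6.8, 5.9)  # made-up first measurements
y <- c(4.8, 5.7, 5.8, 6.9, 6.1, 5.5)  # made-up second measurements (same subjects)
t.test(x, y, paired = TRUE)$statistic  # paired t-test
t.test(x - y, mu = 0)$statistic  # single sample t-test of the differences
# both yield the same t-value, with dF = 6 - 1 = 5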

Paired \(t\)-test assumptions

  • Subjects randomly sampled from population
  • Data measured at interval or ratio scale
  • Observations are independent within each group
  • Differences between paired values are approximately normally distributed
    • But \(t\)-test is robust to non-normality for larger samples ($n_{\textrm{pairs}} > 30$)
  • Paired \(t\)-test is inappropriate when scores differ in scale, e.g., comparing percentages with grades
    • Consider regression for those cases (not covered in this course)

Paired \(t\)-test data: Lowlands Science 2019 experiment

Paired \(t\)-test: Lowlands Science 2019 data

  • Question: do people converge in their speech after interaction?
  • Data collected at Lowlands Science 2019
  • Two measurements per pair: difference in /a/ vowel formants (between the two players) at the first and the last trial of the game
  • \(H_0: \mu_{(f_i-l_i)} = 0\) and \(H_a: \mu_{(f_i-l_i)} > 0\)
  • We use \(\alpha\) = 0.05
  • Sample mean \(m_f\) = 108.5
  • Sample mean \(m_l\) = 96.63

Step 1: \(t\)-test assumptions met?

  • Data randomly selected from population ?
  • Data measured at interval scale ✓
  • Independent observations ✓
  • Differences roughly normally distributed (or > 30 pairs: here 74 pairs) ✓
[Figure: distribution of the pairwise differences (normality check)]

Step 2: visualization

boxplot(lls$Diff1, lls$Diff2, names = c("First trial", "Last trial"))

Question 6

Step 3: calculation of \(t\)-value and \(p\)-value

t.test(lls$Diff1, lls$Diff2, paired = TRUE, alternative = "greater")
# 
# 	Paired t-test
# 
# data:  lls$Diff1 and lls$Diff2
# t = 2.31, df = 73, p-value = 0.012
# alternative hypothesis: true mean difference is greater than 0
# 95 percent confidence interval:
#  3.2997    Inf
# sample estimates:
# mean difference 
#           11.87
  • \(p\)-value < \(\alpha\): reject \(H_0\) and accept \(H_a\)

Step 4: effect size

cohen.d(lls$Diff1, lls$Diff2, paired = T)
# 
# Cohen's d
# 
# d estimate: 0.19936 (negligible)
# 95 percent confidence interval:
#    lower    upper 
# 0.026919 0.371803

Paired data incorrectly analyzed

  • What if we would have incorrectly analyzed the data using an independent samples \(t\)-test?
t.test(lls$Diff1, lls$Diff2, paired = FALSE, alternative = "greater")$statistic
#      t 
# 1.2138
# with the paired t-test:
t.test(lls$Diff1, lls$Diff2, paired = TRUE, alternative = "greater")$statistic
#      t 
# 2.3074
  • The independent samples \(t\)-test generally results in a lower absolute \(t\)-value

Question 7

Paired data incorrectly analyzed: lesson

  • More sophisticated statistical tests are more sensitive to the structure of the data:
    • Incorrectly using the independent samples \(t\)-test (ignoring the pairing) increases the probability of a type-II error
      • Increased chance of not rejecting a false null hypothesis

Decision tree ($z$-test vs. \(t\)-test)

[Figure: decision tree for choosing between the \(z\)-test and the \(t\)-test]

\(t\)-test: summary (1)

Simple \(t\) statistic:

$$t = \; \frac{m_1 - m_2}{s/\sqrt{n}}$$

  • For numeric data: compares means of two groups (or two series of values), or one mean versus a fixed value, and determines whether the difference is significant
  • Population statistics ($\sigma$) unnecessary, but sample statistics needed
  • Three applications:
    • Single sample: compares mean of sample to fixed value
    • Independent (i.e. unrelated) samples: compares two means
    • Paired: compares pairs of values (= single sample test of differences)

\(t\)-test: summary (2)

  • Assumptions with all \(t\)-tests:
    • Distribution roughly normal if \(n \leq 30\)
    • Randomly selected data
    • Data at interval or ratio scale
    • Data independent within one series of values, and (if independent samples \(t\)-test) between both groups
  • Additionally
    • Report effect size using Cohen’s \(d = (m_1 - m_2)/s\)

Example of reporting results of \(t\)-test

  • For the example of the independent samples \(t\)-test:

We tested whether the average English score of students taking Statistiek I was significantly higher for those who had bilingual education than for those who did not. Our hypotheses were: $H_0$: \(\mu_b = \mu_m\) and $H_a$: \(\mu_b > \mu_m\). We obtained English scores in a sample of 500 students of the Statistiek I course via an online questionnaire. Since \(\sigma\) was unknown and the samples were independent, we conducted an independent samples \(t\)-test (corrected for unequal variances) after verifying that the assumptions for the test were met (normally distributed data, or more than 30 values per group). The mean English score for the students with bilingual education in the sample was 8.15, whereas it was 7.56 for those who followed monolingual education. The effect size was medium (Cohen’s \(d\): 0.65; see box plot), and the difference was significant at $\alpha$-level 0.05: $t$(54.3) = 3.92, \(p\) < 0.001. We therefore reject the null hypothesis and accept the alternative hypothesis that students who had bilingual education have higher English scores than those who did not.

Question 8

How to report results: guidelines

  1. State the issue in terms of the population(s) (not merely the samples)
  2. Formulate \(H_0\) and \(H_a\)
  3. State how your hypothesis is to be tested, how samples were obtained (including sample size), and what procedures (test materials) were used to obtain measurements
  4. Identify the \(\alpha\)-level and the statistical test to be used, and indicate why
  5. Illustrate your research question graphically, if possible (e.g., box plots)
  6. Present the results of the study on the sample, the \(p\)-value, whether the result is significant or not, and an effect size
  7. State conclusions about the hypotheses
  8. Discuss and interpret your results

Practice this in laboratory exercises!

Another real world example (time permitting)

Obtaining data

Recorded data

Study - part I: native English

  • Research question: Do native English speakers distinguish /t/ from /θ/ (“th”) with their tongue?
  • Hypothesis: The tongue position of English native speakers is more frontal when pronouncing /θ/ than /t/.
    • \(H_0: \mu_{(th_i-t_i)} = 0\) (no difference in frontal position)
    • \(H_a: \mu_{(th_i-t_i)} > 0\) (more frontal position for /θ/ than for /t/)

Data set used in the study

  • We randomly selected 22 English participants who pronounced 10 minimal pairs /t/:/θ/, when connected to the articulography device:
    • ‘fate’-‘faith’, ‘fort’-‘forth’, ‘kit’-‘kith’, ‘mitt’-‘myth’, ‘tent’-‘tenth’
    • ‘tank’-‘thank’, ‘team’-‘theme’, ‘tick’-‘thick’, ‘ties’-‘thighs’, ‘tongs’-‘thongs’
    • For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words)

Distribution of differences: native English speakers

[Figure: distribution of the pairwise differences (native English speakers)]

Which analysis?

  • We used a paired \(t\)-test to assess the hypothesis as our data consists of two measurement points per speaker, and the differences were approximately normally distributed
  • We used an \(\alpha\)-level of 0.05 (one-tailed)

Visualization: native English speakers

[Figure: frontal tongue position for /θ/- and /t/-words (native English speakers)]

Paired \(t\)-test: native English speakers

datEN$Sound = relevel(datEN$Sound, "TH")  # set TH as reference level
t.test(FrontPos ~ Sound, data = datEN, paired = T, alternative = "greater")  # paired
# 
# 	Paired t-test
# 
# data:  FrontPos by Sound
# t = 6.4, df = 21, p-value = 1.2e-06
# alternative hypothesis: true mean difference is greater than 0
# 95 percent confidence interval:
#  0.035207      Inf
# sample estimates:
# mean difference 
#        0.048154

Effect size and conclusion

cohen.d(FrontPos ~ Sound, data = datEN, paired = T)$estimate  # large effect size
# [1] 0.94206
  • Native English speakers have significantly more frontal tongue positions for /θ/-words than for /t/-words

Study - part II: non-native English

  • Research question: Do Dutch speakers of English distinguish /t/ from /θ/ (“th”) with their tongue?
  • Hypothesis: The tongue position of Dutch speakers of English is more frontal when pronouncing /θ/ than /t/.
    • \(H_0: \mu_{(th_i-t_i)} = 0\) (no difference in frontal position)
    • \(H_a: \mu_{(th_i-t_i)} > 0\) (more frontal position for /θ/ than for /t/)

Data set used in the study

  • We randomly selected 19 Dutch participants who pronounced 10 minimal pairs /t/:/θ/, when connected to the articulography device:
    • ‘fate’-‘faith’, ‘fort’-‘forth’, ‘kit’-‘kith’, ‘mitt’-‘myth’, ‘tent’-‘tenth’
    • ‘tank’-‘thank’, ‘team’-‘theme’, ‘tick’-‘thick’, ‘ties’-‘thighs’, ‘tongs’-‘thongs’
    • For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words)

Distribution of differences: Dutch speakers

(Note that the distribution is not normal!)

[Figure: distribution of the pairwise differences (Dutch speakers)]

Which analysis?

  • We used a paired \(t\)-test to assess the hypothesis as our data consists of two measurement points per speaker
    • Though note that this analysis actually is not appropriate here, as the data is not normally distributed (a non-parametric alternative should be used: next lecture)
  • We used an \(\alpha\)-level of 0.05 (one-tailed)

Visualization: Dutch speakers

[Figure: frontal tongue position for /θ/- and /t/-words (Dutch speakers)]

Paired \(t\)-test: Dutch speakers

datNL$Sound = relevel(datNL$Sound, "TH")
t.test(FrontPos ~ Sound, data = datNL, paired = T, alternative = "greater")
# 
# 	Paired t-test
# 
# data:  FrontPos by Sound
# t = 1.86, df = 18, p-value = 0.04
# alternative hypothesis: true mean difference is greater than 0
# 95 percent confidence interval:
#  0.0010611       Inf
# sample estimates:
# mean difference 
#        0.016263

Effect size and conclusion

cohen.d(FrontPos ~ Sound, data = datNL, paired = T)$estimate  # small effect size
# [1] 0.32382
  • Dutch speakers have a significantly more frontal tongue position for /θ/-words than for /t/-words

Interpretation incorrect!

  • We rejected the null hypothesis for the Dutch group
  • But this is incorrect!
    • We used an inappropriate test: \(t\)-test while distribution was non-normal and contained large outliers
    • Using a non-parametric test (next lecture) results in retaining \(H_0\):
      • Dutch speakers do not have more frontal tongue positions for /θ/-words than for /t/-words
  • Lesson: take note of test assumptions!

Question 9

Note about multiple testing

  • Using multiple tests risks finding significance through sheer chance
  • Suppose you run two tests (as we did here), always using \(\alpha\) = 0.05
    • Chance of finding one or more significant values (family-wise error rate) is: \(1 - (1 - \alpha)^2\) = \(1 - 0.95^2 = 0.0975\) (almost twice as high as we’d like!)
  • To guarantee a family-wise error rate of 0.05, we should divide \(\alpha\) by the number of tests: Bonferroni correction
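
A sketch in R, applying the built-in function p.adjust to the two (one-sided) \(p\)-values of the tongue position study:
1 - (1 - 0.05)^2  # family-wise error rate for two tests at alpha = 0.05
# [1] 0.0975
p.adjust(c(1.2e-06, 0.04), method = "bonferroni")  # multiplies each p by 2 (max. 1)
# [1] 2.4e-06 8.0e-02
# after correction, the Dutch result (p = 0.08) would not be significant at 0.05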

Recap

  • In this lecture, we’ve covered
    • the \(t\)-test (three variants)
    • how to calculate the effect size (Cohen’s \(d\))
    • how to report results of a statistical test
    • the problem of multiple testing
  • Experiment yourself: https://eolomea.let.rug.nl/Statistiek-I/HC4 (login with s-nr)
  • Next lecture: Non-parametric alternatives

Please evaluate this lecture!

Exam question

Questions?

Thank you for your attention!

https://www.martijnwieling.nl
m.b.wieling@rug.nl