# Statistiek I

## $t$-tests

Martijn Wieling
University of Groningen

## Last lecture

• Difference between the population and a sample
• Calculating the standard error ($SE$)
• Standard error: used when reasoning about the population using a sample
• Standard deviation: used when comparing an individual to the population
• Calculating a confidence interval
• Specifying a concrete testable hypothesis based on a research question
• Specifying the null ($H_0$) and alternative hypothesis ($H_a$)
• Conducting a $z$-test and using the results to evaluate a hypothesis
• Definition of a $p$-value: probability of the observed data (or more extreme data) given that $H_0$ is true
• Evaluating the statistical significance given $p$-value and $\alpha$-level
• Difference between a one-tailed and a two-tailed test
• Type I and Type II errors

## This lecture

• Introduction to $t$-test
• Three types of $t$-tests:
• Single sample $t$-test
• Independent samples $t$-test
• Paired samples $t$-test
• Effect size
• How to report?

## Introduction: $t$-test similar to $z$-test

• Last lecture: $z$-test is used for comparing averages when $\sigma$ is known
• $\sigma$ is only known for standardized tests, such as IQ tests
• When $\sigma$ is not known (in most cases), we can use the $t$-test
• This test includes an estimation of $\sigma$ based on sample standard deviation $s$

## Calculating $t$-value

• Very similar to calculating $z$-value for a sample (using standard error):

$t = \frac{m - \mu}{s / \sqrt{n}} \hspace{70pt} z = \frac{m - \mu}{\sigma / \sqrt{n}}$

• Only difference: sample standard deviation $s$ is used instead of $\sigma$
• The precise formula depends on the type of $t$-test (independent samples, etc.)
• (But for the exam, you only have to know the basic formulas shown above)
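As an illustrative check of the two formulas above, a minimal Python sketch (the numbers are made up for illustration; only the formulas come from the slides):

```python
import math

# Hypothetical example values (not from the lecture data)
m, mu, n = 102.0, 100.0, 25        # sample mean, population mean, sample size
s, sigma = 5.0, 4.0                # sample sd (estimated), population sd (known)

t = (m - mu) / (s / math.sqrt(n))      # t-test: uses sample sd s
z = (m - mu) / (sigma / math.sqrt(n))  # z-test: uses known population sd sigma

print(t)  # 2.0
print(z)  # 2.5
```

The only difference between the two lines is which standard deviation is used.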

## Obtaining $p$-values on the basis of $t$-values

• $z$-values are compared to the standard normal distribution
• But $t$-values are compared to the $t$-distribution
• $t$-distributions look similar to the standard normal distribution
• but their shape depends on the number of degrees of freedom (dF)

## What are degrees of freedom?

• There are five balloons each having a different color
• There are five students ($n = 5$) who need to select a balloon
• If 4 students have selected a balloon (dF = 4), student nr. 5 gets the last balloon
• Similarly: if we have a fixed mean value calculated from 10 values
• 9 values may vary in their value, but the 10th is fixed: dF = 10 - 1 = 9
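The same idea in a short Python sketch (illustrative numbers only): once the mean of 10 values is fixed, 9 values can be chosen freely and the 10th is determined.

```python
free_values = [4, 7, 5, 9, 6, 3, 8, 2, 7]  # 9 values that may vary freely
mean_fixed = 6.0                            # the mean of all 10 values is fixed

# The 10th value is fully determined by the other 9, so dF = 10 - 1 = 9
tenth = 10 * mean_fixed - sum(free_values)
print(tenth)  # 9.0
```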

## $t$-distribution vs. normal distribution

• Difference between normal distribution and $t$-distribution is large for small dFs
• When dF $\geq$ 100, the difference is negligible
• As the shape differs, the $p$-value associated with a certain $t$-value also changes
• That is why it is essential to specify dF when describing the results of a $t$-test: $t$(dF)

## Visualizing $t$-distributions

• For significance (given $\alpha$), higher (abs.) $t$-values are needed than $z$-values (but only when dF < 100; otherwise $z$ and $t$ are practically equal)
qt(0.025, df = 10, lower.tail = F)  # crit. t-value (alpha = 0.025) for dF = 10

# [1] 2.2281
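The same critical value can be reproduced in Python (assuming scipy is available); note that it exceeds the corresponding critical $z$-value of 1.96:

```python
from scipy import stats

t_crit = stats.t.ppf(0.975, df=10)  # critical t-value (alpha = 0.025 per tail)
z_crit = stats.norm.ppf(0.975)      # corresponding critical z-value

print(round(t_crit, 4))  # 2.2281, matching the R output
print(t_crit > z_crit)   # True: a higher t-value is needed for significance
```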


## Question 3

pt(2, 10, lower.tail = F) * 2  # two-sided p-value = 2 * one-sided p-value

# [1] 0.073388


• Dark gray area: $p$ < 0.05 (2-tailed)
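The R computation above can be mirrored in Python (scipy assumed):

```python
from scipy import stats

# Two-sided p-value for t = 2 with dF = 10: twice the one-sided tail area
p_two = 2 * stats.t.sf(2, df=10)
print(round(p_two, 6))  # 0.073388
```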

## Three types of $t$-tests

• Single sample $t$-test: compare mean with fixed value
• Independent sample $t$-test: compare the means of two independent groups
• Paired $t$-test: compare pairs of (dependent) values (e.g., repeated measurements of same subjects)
• Requirement for all $t$-tests: Data should be approximately normally distributed
• Otherwise: use non-parametric tests (discussed in next lecture)

## Single sample $t$-test

$t = \frac{m - \mu}{s / \sqrt{n}}$

• Used to compare mean to fixed value
• $H_0$: $\mu = \mu_0$ and $H_a$: $\mu \neq \mu_0$
• Larger $t$-values give reason to reject $H_0$
• Automatic calculation in R using function t.test()
• Standardized effect size is measured as difference in standard deviations
• Cohen's $d$: $d = (m - \mu) / s$
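A minimal Python sketch of this formula, using illustrative values (a sample mean of 7.35 compared to 7.5, with sample sd 1.13):

```python
def cohens_d_single(m, mu, s):
    """Cohen's d for a single sample t-test: (m - mu) / s."""
    return (m - mu) / s

# Illustrative values: sample mean 7.35, comparison value 7.5, sample sd 1.13
d = cohens_d_single(7.35, 7.5, 1.13)
print(round(d, 2))  # -0.13: a negligible effect (|d| < 0.2)
```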

## Assumptions for the single sample $t$-test

• Data randomly selected from population
• Data measured at interval or ratio scale
• Observations are independent
• Observations are approximately normally distributed
• But $t$-test is robust to non-normality for larger samples ($n > 30$)

## Single sample $t$-test: example

• Given our English proficiency data, we'd like to assess if the average English score is different from 7.5
• $H_0$: $\mu = 7.5$ and $H_a$: $\mu \neq 7.5$
• We use $\alpha$ = 0.05
• Sample mean $m$ = 7.35
• Sample standard deviation $s$ = 1.13
• Sample size $n$ = 315
• Degrees of freedom of $t$-test equals 315 - 1 = 314
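With these summary values, the $t$-value and two-sided $p$-value can be approximated in Python (scipy assumed; small differences from R's output arise because $m$ and $s$ are rounded here):

```python
import math
from scipy import stats

m, mu, s, n = 7.35, 7.5, 1.13, 315        # summary values from the slides
t = (m - mu) / (s / math.sqrt(n))         # hand calculation of the t-value
p_two = 2 * stats.t.sf(abs(t), df=n - 1)  # two-sided p-value with dF = 314

print(round(t, 2))  # -2.36 (R reports -2.39, computed from the unrounded data)
print(p_two)
```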

## Step 1: $t$-test assumptions met?

• Data randomly selected from population ?
• Data measured at interval scale ✓
• Independent observations ✓
• Data roughly normally distributed (or > 30 observations) ✓

## Step 2: visualization

boxplot(dat$english_score)
abline(h = 7.5, lty = 2, lwd = 2)

## Step 3: calculation of $t$-value (and $p$-value)

$t = \frac{m - \mu}{s / \sqrt{n}}$

• with $\mu$ = 7.5, $m$ = 7.35, $s$ = 1.13, $n$ = 315
• $t = (7.35 - 7.5) / (1.13 / \sqrt{315}) = -2.39$
• Of course we will use R to calculate the $t$-value automatically: but you also need to be able to calculate the $t$-value manually on the exam (with simple values)!

## Question 4

## Automatic calculation of $t$-value and $p$-value in R

t.test(dat$english_score, alternative = "two.sided", mu = 7.5)

#
#   One Sample t-test
#
# data:  dat$english_score
# t = -2.39, df = 314, p-value = 0.017
# alternative hypothesis: true mean is not equal to 7.5
# 95 percent confidence interval:
#  7.2212 7.4728
# sample estimates:
# mean of x
#     7.347

• $p$-value < $\alpha$: reject $H_0$ and accept $H_a$

## Final step (4): calculation of effect size

• Effect size is used to quantify the difference (i.e. the effect):
• Significant results may be uninteresting if the effects are only small
• For interpretability, effect size may be reported practically, e.g. as a score difference
• Cohen's $d$ can be used as a standardized measure of effect size for the $t$-test
• Cohen's $d = (m - \mu) / s = (7.35 - 7.5) / 1.13 = -0.13$
• Effect sizes ($|d|$): negligible ($<$ 0.2), small ($<$ 0.5), medium ($<$ 0.8), large ($\geq$ 0.8)
• See also: http://rpsychologist.com/d3/cohend/

## Question 5

## Effect size and sample size (1)

• Statistical significance: effect unlikely to have arisen by chance
• Very small differences may be significant when samples are large
• As we saw when discussing the normal distribution: a difference of two standard errors or more is likely to arise less than 5% of the time due to chance
• Standard error is reduced for larger samples (divided by $\sqrt{n}$)

## Effect size and sample size (2)

| difference (in $s$) | $n$    | $p$  |
|---------------------|--------|------|
| 0.01                | 40,000 | 0.05 |
| 0.10                | 400    | 0.05 |
| 0.25                | 64     | 0.05 |
| 0.38                | 30     | 0.05 |
| 0.54                | 16     | 0.05 |

• We recommend samples of "about 30", because small effect sizes are uninteresting, unless the differences are important (e.g., health)
• Take note of effect size when reading research reports!

## Independent samples $t$-test

$t = \frac{m_1 - m_2}{s_p \sqrt{1/n_1 + 1/n_2}}$

• Used to compare means of two different groups
• $H_0$ is always $\mu_1 = \mu_2$, i.e. both populations have the same mean
• Two-sided $H_a$: $\mu_1 \neq \mu_2$ (one-sided: $H_a$: $\mu_1 < \mu_2$ or $H_a$: $\mu_1 > \mu_2$)
• Degrees of freedom: $(n_1 - 1) + (n_2 - 1)$ (when assuming equal variance in both populations)
• (You don't have to know the formula by heart; it is shown here to help you understand what is going on)

## Independent samples $t$-test assumptions

• Data randomly selected from population
• Data measured at interval or ratio scale
• Observations are independent, both within and between the groups
• Variances are homogeneous (i.e. values are spread out similarly in both groups)
• Ignored as t.test() includes Welch's adjustment to correct for unequal variances (more conservative: degrees of freedom reduced)
• Observations in both samples are approximately normally distributed
• But $t$-test is robust to non-normality for larger samples ($n > 30$)

## Independent samples $t$-test: example

• Given our English proficiency data, we'd like to assess if the average English score is higher for those who followed bilingual education as opposed to monolingual education
• $H_0$: $\mu_b = \mu_m$ and $H_a$: $\mu_b > \mu_m$
• We use $\alpha$ = 0.05
• Sample mean $m_b$ = 8.15
• Sample mean $m_m$ = 7.26

## Step 1: $t$-test assumptions met?

• Data randomly selected from population ?
• Data measured at interval scale ✓
• Independent observations ✓
• Data roughly normally distributed (or > 30 observations) ✓

## Step 2: visualization

dat$bl_edu = relevel(dat$bl_edu, 'Y')  # make 'Y' first level (default is 'N')
boxplot(english_score ~ bl_edu, data = dat)  # formula notation is easy to use
# or: boxplot(dat[dat$bl_edu=='Y',]$english_score, dat[dat$bl_edu=='N',]$english_score)

## Step 3: calculation of $t$-value and $p$-value

t.test(english_score ~ bl_edu, data = dat, alternative = "greater")  # 1st > 2nd level?
#
#   Welch Two Sample t-test
#
# data:  english_score by bl_edu
# t = 4.15, df = 35.3, p-value = 1e-04
# alternative hypothesis: true difference in means is greater than 0
# 95 percent confidence interval:
#  0.52374     Inf
# sample estimates:
# mean in group Y mean in group N
#          8.1462          7.2629

• $p$-value < $\alpha$: reject $H_0$ and accept $H_a$
• Note that dF is much lower than $(n_1 - 1) + (n_2 - 1)$ due to the correction for unequal variances

## Step 4: effect size

• Cohen's $d = (\mu_1 - \mu_2) / s$
• But two samples: two standard deviations...
• For Cohen's $d$, a single (pooled) standard deviation is necessary
• Instead of manually calculating this, we use the function cohen.d

library(effsize)  # to install: install.packages('effsize')
cohen.d(english_score ~ bl_edu, data = dat)

#
# Cohen's d
#
# d estimate: -0.7965 (medium)
# 95 percent confidence interval:
#       inf       sup
# -1.17928 -0.41371

## Paired $t$-test

• Used to compare means of two sets of data
• Data in both sets are from the same individuals: paired
• Analysis is more powerful than the independent samples $t$-test
• Sources of individual variation cancelled out via pairwise comparison
• Of course still many sources of variation present: e.g., measurement error

## Paired $t$-test: approach

• Calculate pairwise differences and test if the average difference is different from 0
• $H_0: \mu_{(x_i-y_i)} = 0$ and $H_a: \mu_{(x_i-y_i)} \ne 0$
• Similar to single sample $t$-test of differences:
1. Calculate differences
2. Use single sample $t$-test to assess if mean of differences is different from 0
• Degrees of freedom: $n_{\textrm{pairs}} - 1$

## Paired $t$-test assumptions

• Subjects randomly sampled from population
• Data measured at interval or ratio scale
• Observations are independent within each group
• Differences between paired values are approximately normally distributed
• But $t$-test is robust to non-normality for larger samples ($n_{\textrm{pairs}} > 30$)
• Paired $t$-test is inappropriate when scores differ in scale, e.g., comparing percentages with grades
• Consider regression for those cases (not covered in this course)

## Paired $t$-test: example

• Given our English proficiency data, we'd like to assess if there is a difference between the English grades and the English scores (both on a scale from 1 to 10)
• $H_0: \mu_{(g_i-s_i)} = 0$ and $H_a: \mu_{(g_i-s_i)} \ne 0$
• We use $\alpha$ = 0.01
• Sample mean $m_g$ = 7.28
• Sample mean $m_s$ = 7.35

## Step 1: $t$-test assumptions met?

• Data randomly selected from population ?
• Data measured at interval scale ✓
• Independent observations ✓
• Differences roughly normally distributed (or > 30 pairs) ✓

## Step 2: visualization

boxplot(dat$english_score, dat$english_grade, names = c("EN scores", "EN grades"))

## Question 6

## For question 6: distribution of English grades

quantile(dat$english_grade)

#   0%  25%  50%  75% 100%
#  5.0  7.0  7.0  8.0  9.5

hist(dat$english_grade, col = "red", main = "English grades")

## Step 3: calculation of $t$-value and $p$-value

t.test(dat$english_grade, dat$english_score, paired = TRUE)

#
#   Paired t-test
#
# data:  dat$english_grade and dat$english_score
# t = -1.54, df = 314, p-value = 0.13
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.149849  0.018371
# sample estimates:
# mean of the differences
#               -0.065739

• $p$-value > $\alpha$: not enough evidence to reject $H_0$

## Step 4: effect size

cohen.d(dat$english_grade, dat$english_score, paired = T)

#
# Cohen's d
#
# d estimate: -0.086646 (negligible)
# 95 percent confidence interval:
#       inf       sup
# -0.243194  0.069903

## Paired data incorrectly analyzed

• What if we had incorrectly analyzed the data using an independent samples $t$-test?

t.test(dat$english_grade, dat$english_score, paired = FALSE)$statistic

#        t
# -0.82366

# with the paired t-test:
t.test(dat$english_grade, dat$english_score, paired = TRUE)$statistic

#       t
# -1.5378

• The independent $t$-test generally results in a lower absolute $t$-value

## Question 7

## Paired data incorrectly analyzed: lesson

• More sophisticated statistics allow more sensitivity to the data:
• Incorrectly using the independent samples $t$-test increases the probability of a type-II error
• Increased chance of the null hypothesis being false, but not rejected

## Decision tree ($z$-test vs. $t$-test)

## $t$-test: summary (1)

Simple $t$ statistic: $t = \frac{m_1 - m_2}{s/\sqrt{n}}$

• For numeric data: compares means of two groups (or two series of values), or one mean versus one value, and determines whether the difference is significant
• Population statistics ($\sigma$) unnecessary, but sample statistics needed
• Three applications:
• Single sample: compares mean of sample to fixed value
• Independent (i.e. unrelated) samples: compares two means
• Paired: compares pairs of values

## $t$-test: summary (2)

• Assumptions with all $t$-tests:
• Distribution roughly normal if $n \leq 30$
• Randomly selected data
• Data at interval or ratio scale
• Data independent within one series of values, and (if independent samples $t$-test) between both groups
• Additionally
• Report effect size using Cohen's $d = (m_1 - m_2)/s$

## Example of reporting results of $t$-test

• For the example of the independent samples $t$-test:

We tested whether the average English score of students following this course was significantly higher for those who had bilingual education than for those who did not. Our hypotheses were: $H_0$: $\mu_b = \mu_m$ and $H_a$: $\mu_b > \mu_m$. We obtained English scores in a sample of 315 students of the Statistiek I course via an online questionnaire. Since $\sigma$ is unknown and the samples were independent, we conducted an independent samples $t$-test (corrected for unequal variances) after verifying that the assumptions for the test were met (normally distributed, or more than 30 values). The mean of the English scores for the students with bilingual education in the sample was 8.15, whereas it was 7.26 for those who followed monolingual education. The effect size was medium (Cohen's $d$: -0.8; see box plot), and it reached significance at $\alpha$-level 0.05: $t$(35.3) = 4.15, $p$ < 0.001. We therefore reject the null hypothesis and accept the alternative hypothesis that students who had bilingual education have higher English scores than those who did not.

## Question 8

## How to report results: guidelines

1. State the issue in terms of the population(s) (not merely the samples)
2. Formulate $H_0$ and $H_a$
3. State how your hypothesis is to be tested, how samples were obtained (including sample size), and what procedures (test materials) were used to obtain measurements
4. Identify the $\alpha$-level and the statistical test to be used, and indicate why
5. Illustrate your research question graphically, if possible (e.g., box plots)
6. Present the results of the study on the sample, the $p$-value, whether the result is significant or not, and an effect size
7. State conclusions about the hypotheses
8. Discuss and interpret your results

Practice this in laboratory exercises!

## A real world example (time permitting)

## Obtaining data

## Recorded data

## Study - part I: native English

• Research question: Do native English speakers distinguish /t/ from /θ/ ("th") with their tongue?
• Hypothesis: The tongue position of English native speakers is more frontal when pronouncing /θ/ than /t/.
• $H_0: \mu_{(th_i-t_i)} = 0$ (no difference in frontal position)
• $H_a: \mu_{(th_i-t_i)} > 0$ (more frontal position for /θ/ than for /t/)

## Data set used in the study

• We randomly selected 22 English participants who pronounced 10 minimal pairs /t/:/θ/ while connected to the articulography device:
• 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth', 'tent'-'tenth'
• 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
• For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words)

## Distribution of differences: native English speakers

## Which analysis?

• We used a paired $t$-test to assess the hypothesis, as our data consist of two measurement points per speaker, and the differences were approximately normally distributed
• We used an $\alpha$-level of 0.05 (one-tailed)

## Visualization: native English speakers

## Paired $t$-test: native English speakers

datEN$Sound = relevel(datEN$Sound, "TH")  # set TH as reference level
t.test(FrontPos ~ Sound, data = datEN, paired = T, alternative = "greater")  # paired

#
#   Paired t-test
#
# data:  FrontPos by Sound
# t = 6.4, df = 21, p-value = 1.2e-06
# alternative hypothesis: true difference in means is greater than 0
# 95 percent confidence interval:
#  0.035207      Inf
# sample estimates:
# mean of the differences
#                0.048154

## Effect size and conclusion

cohen.d(FrontPos ~ Sound, data = datEN, paired = T)$estimate  # large effect size

# [1] -1.3645

• Native English speakers have significantly more frontal tongue positions for /θ/-words than for /t/-words
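As an aside, a paired test is equivalent to a single sample $t$-test on the pairwise differences, which can be sketched with hypothetical data (scipy assumed):

```python
from scipy import stats

# Hypothetical paired measurements: two conditions for the same 8 subjects
x = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7, 5.2, 6.1]
y = [4.9, 4.5, 5.7, 5.6, 5.4, 4.4, 5.0, 5.8]

t_paired, p_paired = stats.ttest_rel(x, y)        # paired t-test
diffs = [a - b for a, b in zip(x, y)]             # pairwise differences
t_single, p_single = stats.ttest_1samp(diffs, 0)  # single sample test vs 0

print(abs(t_paired - t_single) < 1e-8)  # True: the two tests coincide
```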

## Study - part II: non-native English

• Research question: Do Dutch speakers of English distinguish /t/ from /θ/ ("th") with their tongue?
• Hypothesis: The tongue position of Dutch speakers of English is more frontal when pronouncing /θ/ than /t/.
• $H_0: \mu_{(th_i-t_i)} = 0$ (no difference in frontal position)
• $H_a: \mu_{(th_i-t_i)} > 0$ (more frontal position for /θ/ than for /t/)

## Data set used in the study

• We randomly selected 19 Dutch participants who pronounced 10 minimal pairs /t/:/θ/, when connected to the articulography device:
• 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth', 'tent'-'tenth'
• 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
• For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words)

## Which analysis?

• We used a paired $t$-test to assess the hypothesis as our data consists of two measurement points per speaker
• Though note that this analysis actually is not appropriate here, as the data is not normally distributed (a non-parametric alternative should be used: next lecture)
• We used an $\alpha$-level of 0.05 (one-tailed)

## Paired $t$-test: Dutch speakers

datNL$Sound = relevel(datNL$Sound, "TH")
t.test(FrontPos ~ Sound, data = datNL, paired = T, alternative = "greater")

#
#   Paired t-test
#
# data:  FrontPos by Sound
# t = 1.86, df = 18, p-value = 0.04
# alternative hypothesis: true difference in means is greater than 0
# 95 percent confidence interval:
#  0.0010611       Inf
# sample estimates:
# mean of the differences
#                0.016263

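As a quick check (scipy assumed), the one-sided $p$-value implied by $t$ = 1.86 with dF = 18:

```python
from scipy import stats

# One-sided p-value for t = 1.86 with dF = 18 (values from the R output above)
p_one_sided = stats.t.sf(1.86, df=18)
print(round(p_one_sided, 2))  # 0.04
```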

## Effect size and conclusion

cohen.d(FrontPos ~ Sound, data = datNL, paired = T)$estimate  # medium effect size

# [1] -0.42559

• Dutch speakers have significantly more frontal tongue positions for /θ/-words than for /t/-words

## Interpretation incorrect!

• We rejected the null hypothesis for the Dutch group
• But this is incorrect!
• We used an inappropriate test: $t$-test while distribution was non-normal and contained large outliers
• Using a non-parametric test (next lecture) results in retaining $H_0$:
• Dutch speakers do not have more frontal tongue positions for /θ/-words than for /t/-words
• Lesson: take note of test assumptions!

## Question 9

• Using multiple tests risks finding significance through sheer chance
• Suppose you run two tests (as we did here), always using $\alpha$ = 0.05
• Chance of finding one or more significant values (family-wise error rate) is: $1 - (1 - \alpha)^2$ = $1 - 0.95^2 = 0.0975$ (almost twice as high as we'd like!)
• To guarantee a family-wise error rate of 0.05, we should divide $\alpha$ by the number of tests: Bonferroni correction
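A minimal sketch of the family-wise error rate calculation and the Bonferroni correction (using the slide's values of $\alpha$ = 0.05 and two tests):

```python
def fwer(alpha, k):
    """Family-wise error rate for k independent tests at per-test level alpha."""
    return 1 - (1 - alpha) ** k

print(round(fwer(0.05, 2), 4))    # 0.0975: almost twice the intended 0.05
print(fwer(0.05 / 2, 2) <= 0.05)  # True: Bonferroni keeps the rate below 0.05
```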

## Recap

• In this lecture, we've covered
• the $t$-test (three variants)
• how to calculate the effect size (Cohen's $d$)
• how to report results of a statistical test
• the problem of multiple testing
• Experiment yourself: http://eolomea.let.rug.nl/Statistiek-I/HC4 (login with s-nr)
• Next lecture: Non-parametric alternatives