Martijn Wieling

University of Groningen

- Difference between the population and a sample
- Calculating the standard error (\(SE\))
- Standard error: used when reasoning about the population using a **sample**
- Standard deviation: used when comparing an **individual** to the population
- Calculating a confidence interval
- Specifying a concrete testable hypothesis based on a research question
- Specifying the null (\(H_0\)) and alternative hypothesis (\(H_a\))
- Conducting a \(z\)-test and using the results to evaluate a hypothesis
- Definition of a \(p\)-value: chance of observing the data (or more extreme data) given that \(H_0\) is true
- Evaluating the statistical significance given \(p\)-value and \(\alpha\)-level
- Difference between a one-tailed and a two-tailed test
- Type I and Type II errors

- Introduction to \(t\)-test
- Three types of \(t\)-tests:
- Single sample \(t\)-test
- Independent samples \(t\)-test
- Paired samples \(t\)-test

- Effect size
- How to report?

- Last lecture: \(z\)-test is used for comparing averages when \(\sigma\) is known
- \(\sigma\) is only known for standardized tests, such as IQ tests

- When \(\sigma\) is not known (in most cases), we can use the \(t\)-test
- This test includes an estimation of \(\sigma\) based on sample standard deviation \(s\)

- Very similar to calculating \(z\)-value for a sample (using standard error):

\[t = \frac{m - \mu}{s / \sqrt{n}} \hspace{70pt} z = \frac{m - \mu}{\sigma / \sqrt{n}}\]

- Only difference: sample standard deviation \(s\) is used instead of \(\sigma\)
- The precise formula depends on the type of \(t\)-test (independent samples, etc.)
- (But for the exam, you only have to know the basic formulas shown above)

- \(z\)-values are compared to the standard normal distribution
- But \(t\)-values are compared to the \(t\)-distribution
- \(t\)-distributions look similar to the standard normal distribution
- but dependent on the number of **degrees of freedom (dF)**

- There are five balloons each having a different color
- There are five students (\(n = 5\)) who need to select a balloon
- If 4 students have selected a balloon (**dF = 4**), student nr. 5 gets the last balloon
- Similarly: if we have a *fixed* mean value calculated from 10 values
- 9 values may vary in their value, but the 10th is fixed: dF = 10 - 1 = 9

- Difference between normal distribution and \(t\)-distribution is large for small dFs
- When dF \(\geq\) 100, the difference is negligible
- As the shape differs, the \(p\)-value associated with a certain \(t\)-value also changes
- That is why it is **essential** to specify dF when describing the results of a \(t\)-test: \(t\)(dF)

- For significance (given \(\alpha\)), higher (abs.) \(t\)-values are needed than \(z\)-values (but only when dF < 100, otherwise \(z\) and \(t\) are equal)

```
qt(0.025, df = 10, lower.tail = F) # crit. t-value (alpha = 0.025) for dF = 10
```

```
# [1] 2.2281
```

```
pt(2, 10, lower.tail = F) * 2 # two-sided p-value = 2 * one-sided p-value
```

```
# [1] 0.073388
```
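To see how the \(t\)-distribution approaches the standard normal distribution, we can compare critical values for increasing dF (a small illustration in the same spirit as the code above, not part of the lecture code):

```r
# Two-sided critical values (alpha = 0.05) for increasing dF,
# compared to the standard normal value qnorm(0.975) = 1.96
round(sapply(c(5, 10, 30, 100, 1000), function(df) qt(0.975, df)), 3)
# the critical t-value shrinks towards 1.96 as dF grows
```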

- Dark gray area: \(p\) < 0.05 (2-tailed)

- Single sample \(t\)-test: compare mean with fixed value
- Independent samples \(t\)-test: compare the means of two **independent** groups
- Paired \(t\)-test: compare pairs of (dependent) values (e.g., repeated measurements of same subjects)

**Requirement for all \(t\)-tests**: Data should be approximately normally distributed

- Otherwise: use non-parametric tests (discussed in next lecture)

\[t = \frac{m - \mu}{s / \sqrt{n}}\]

- Used to compare mean to fixed value
- \(H_0\): \(\mu = \mu_0\) and \(H_a\): \(\mu \neq \mu_0\)
- Larger absolute \(t\)-values give reason to reject \(H_0\)
- Automatic calculation in `R` using function `t.test()`

- Standardized effect size is measured as difference in standard deviations
- Cohen's \(d\): \(d = (m - \mu) / s\)

- Data randomly selected from population
- Data measured at interval or ratio scale
- Observations are independent
- Observations are approximately normally distributed
- But \(t\)-test is robust to non-normality for larger samples (\(n > 30\))

- Given our English proficiency data, we'd like to assess if the average English score is different from 7.5
- \(H_0\): \(\mu = 7.5\) and \(H_a\): \(\mu \neq 7.5\)
- We use \(\alpha\) = 0.05
- Sample mean \(m\) = 7.35
- Sample standard deviation \(s\) = 1.13
- Sample size \(n\) = 315
- Degrees of freedom of \(t\)-test equals 315 - 1 = 314

- Data randomly selected from population ?
- Data measured at interval scale ✓
- Independent observations ✓
- Data roughly normally distributed (or > 30 observations) ✓

```
boxplot(dat$english_score)
abline(h = 7.5, lty = 2, lwd = 2)
```

\[t = \frac{m - \mu}{s / \sqrt{n}}\]

- with \(\mu\) = 7.5, \(m\) = 7.35, \(s\) = 1.13, \(n\) = 315
- \(t = (7.35 - 7.5) / (1.13 / \sqrt{315}) = -2.39\)
- Of course we will use `R` to calculate the \(t\)-value automatically: but you also need to be able to calculate the \(t\)-value manually at your exam (but with simple values)!
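The manual calculation can be checked directly in R (using the rounded sample statistics from the slide; the full, unrounded data give the \(t\) = -2.39 reported by `t.test()`):

```r
# Manual single-sample t-value from the rounded sample statistics
m <- 7.35; mu <- 7.5; s <- 1.13; n <- 315
se <- s / sqrt(n)       # standard error: about 0.0637
t_val <- (m - mu) / se  # about -2.36
t_val
```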

```
t.test(dat$english_score, alternative = "two.sided", mu = 7.5)
```

```
#
# One Sample t-test
#
# data: dat$english_score
# t = -2.39, df = 314, p-value = 0.017
# alternative hypothesis: true mean is not equal to 7.5
# 95 percent confidence interval:
# 7.2212 7.4728
# sample estimates:
# mean of x
# 7.347
```

- \(p\)-value < \(\alpha\): reject \(H_0\) and accept \(H_a\)

**Effect size** is used to quantify the difference (i.e. the effect):

- Significant results may be uninteresting if the effects are only small
- For interpretability, effect size may be reported practically, e.g. score difference

- Cohen's \(d\) can be used as a standardized measure of effect size for the \(t\)-test
- Cohen's \(d = (m - \mu) / s = (7.35 - 7.5) / 1.13 = -0.13\)
- Effect sizes (\(|d|\)): negligible (\(<\) 0.2), small (\(<\) 0.5), medium (\(<\) 0.8), large (\(\geq\) 0.8)
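These thresholds can be wrapped in a small helper function (a convenience sketch, not part of the lecture code; the function name is our own):

```r
# Label |d| using the thresholds above:
# negligible < 0.2 <= small < 0.5 <= medium < 0.8 <= large
effect_label <- function(d) {
  as.character(cut(abs(d), breaks = c(0, 0.2, 0.5, 0.8, Inf),
                   labels = c("negligible", "small", "medium", "large"),
                   right = FALSE))
}
effect_label(-0.13)  # "negligible"
```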

- See also: http://rpsychologist.com/d3/cohend/

- Statistical significance: effect unlikely to have arisen by chance
- Very small differences may be significant when samples are large
- As we saw when discussing the normal distribution: a difference of two standard errors or more is likely to arise less than 5% of the time due to chance
- Standard error is reduced for larger samples (divided by \(\sqrt{n}\))

| difference (in \(s\)) | \(n\) | \(p\) |
|---|---|---|
| 0.01 | 40,000 | 0.05 |
| 0.10 | 400 | 0.05 |
| 0.25 | 64 | 0.05 |
| 0.38 | 30 | 0.05 |
| 0.54 | 16 | 0.05 |

- We recommend samples of "about 30", because small effect sizes are uninteresting, unless differences are important (e.g., health)
- Take note of effect size when reading research reports!
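The table can be approximately reproduced in R: the smallest difference (in units of \(s\)) reaching two-sided significance is the critical \(t\)-value divided by \(\sqrt{n}\). This is a sketch assuming a single-sample test; the last two values come out marginally lower than the table, presumably due to rounding:

```r
# Minimal significant difference (in standard deviations) at alpha = 0.05 (two-sided)
min_diff <- function(n) qt(0.975, df = n - 1) / sqrt(n)
round(sapply(c(40000, 400, 64, 30, 16), min_diff), 2)
# roughly 0.01, 0.10, 0.25, 0.37, 0.53 (cf. the table above)
```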

\[t = \frac{m_1 - m_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\]

- Used to compare means of two different groups
- \(H_0\) is always \(\mu_1 = \mu_2\), i.e. both populations have same mean
- Two-sided \(H_a\): \(\mu_1 \neq \mu_2\) (one-sided: \(H_a\): \(\mu_1 < \mu_2\) or \(H_a\): \(\mu_1 > \mu_2\))
- Degrees of freedom: \((n_1 - 1) + (n_2 - 1)\) (when assuming equal variance in both populations)
- (You don't have to know the formula by heart, it is shown here to help you understand what is going on)
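A minimal sketch of the pooled-variance computation with hypothetical data (`x` and `y` are made up for illustration); `t.test(..., var.equal = TRUE)` performs the same computation internally:

```r
# Independent samples t-test with pooled standard deviation (equal variances assumed)
x <- c(6.5, 7.0, 8.0, 7.5, 8.5)  # hypothetical scores, group 1
y <- c(6.0, 6.5, 7.0, 6.0, 7.5)  # hypothetical scores, group 2
n1 <- length(x); n2 <- length(y)
sp <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))  # pooled sd
t_manual <- (mean(x) - mean(y)) / (sp * sqrt(1/n1 + 1/n2))
t_manual  # identical to t.test(x, y, var.equal = TRUE)$statistic
```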

- Data randomly selected from population
- Data measured at interval or ratio scale
- Observations are independent, both within and between the groups
- Variances are homogeneous (i.e. values are spread out similarly in both groups)
- Ignored as `t.test()` includes Welch's adjustment to correct for unequal variances (more conservative: degrees of freedom reduced)
- Observations **in both samples** are approximately normally distributed
- But \(t\)-test is robust to non-normality for larger samples (\(n > 30\))

- Given our English proficiency data, we'd like to assess if the average English score is higher for those who followed bilingual education as opposed to monolingual education
- \(H_0\): \(\mu_b = \mu_m\) and \(H_a\): \(\mu_b > \mu_m\)
- We use \(\alpha\) = 0.05
- Sample mean \(m_b\) = 8.15
- Sample mean \(m_m\) = 7.26

- Data randomly selected from population ?
- Data measured at interval scale ✓
- Independent observations ✓
- Data roughly normally distributed (or > 30 observations) ✓

```
dat$bl_edu = relevel(dat$bl_edu,'Y') # make 'Y' first level (default is 'N')
boxplot(english_score ~ bl_edu, data=dat) # formula notation is easy to use
```

```
# or: boxplot(dat[dat$bl_edu=='Y',]$english_score, dat[dat$bl_edu=='N',]$english_score)
```

```
t.test(english_score ~ bl_edu, data = dat, alternative = "greater") # 1st > 2nd level?
```

```
#
# Welch Two Sample t-test
#
# data: english_score by bl_edu
# t = 4.15, df = 35.3, p-value = 1e-04
# alternative hypothesis: true difference in means is greater than 0
# 95 percent confidence interval:
# 0.52374 Inf
# sample estimates:
# mean in group Y mean in group N
# 8.1462 7.2629
```

- \(p\)-value < \(\alpha\): reject \(H_0\) and accept \(H_a\)
- Note that dF much lower than (\(n_1\) - 1) + (\(n_2\) - 1) due to correction for uneq. var.

- Cohen's \(d = (m_1 - m_2) / s\)
- But two samples: two standard deviations...
- For Cohen's \(d\), a single (pooled) standard deviation necessary
- Instead of manually calculating this, we use the function `cohen.d`

```
library(effsize) # to install: install.packages('effsize')
cohen.d(english_score ~ bl_edu, data = dat)
```

```
#
# Cohen's d
#
# d estimate: -0.7965 (medium)
# 95 percent confidence interval:
# inf sup
# -1.17928 -0.41371
```

- Used to compare means of two sets of data
- Data in both sets are from same individuals: paired
- Analysis is more powerful than independent samples \(t\)-test
- Sources of individual variation cancelled out via pairwise comparison
- Of course still many sources of variation present: e.g., measurement error


- Calculate pairwise differences and test if the average difference is different from 0

- \(H_0: \mu_{(x_i-y_i)} = 0\) and \(H_a: \mu_{(x_i-y_i)} \ne 0\)
- Similar to single sample \(t\)-test of differences:
- Calculate differences
- Use single sample \(t\)-test to assess if mean of differences is different from 0

- Degrees of freedom: \(n_{\textrm{pairs}} - 1\)
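The equivalence with a single-sample test on the differences is easy to verify (hypothetical paired scores, made up for illustration):

```r
# Paired t-test equals a single-sample t-test on the pairwise differences
before <- c(7.0, 6.5, 8.0, 7.5, 6.0, 7.0)  # hypothetical first measurement
after  <- c(7.5, 7.0, 8.0, 8.0, 6.5, 7.5)  # hypothetical second measurement
t.test(after, before, paired = TRUE)$p.value  # paired test
t.test(after - before, mu = 0)$p.value        # same p-value
```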

- Subjects randomly sampled from population
- Data measured at interval or ratio scale
- Observations are independent within each group
- **Differences** between paired values are approximately normally distributed
- But \(t\)-test is robust to non-normality for larger samples (\(n_{\textrm{pairs}} > 30\))

- Paired \(t\)-test is **inappropriate** when scores differ in scale, e.g., comparing percentages with grades
- Consider regression for those cases (not covered in this course)

- Given our English proficiency data, we'd like to assess if there is a difference between the English grades and the English scores (both on a scale from 1 to 10)
- \(H_0: \mu_{(g_i-s_i)} = 0\) and \(H_a: \mu_{(g_i-s_i)} \ne 0\)
- We use \(\alpha\) = 0.01
- Sample mean \(m_g\) = 7.28
- Sample mean \(m_s\) = 7.35

- Data randomly selected from population ?
- Data measured at interval scale ✓
- Independent observations ✓
- Differences roughly normally distributed (or > 30 pairs) ✓

```
boxplot(dat$english_score, dat$english_grade, names = c("EN scores", "EN grades"))
```

```
quantile(dat$english_grade)
```

```
# 0% 25% 50% 75% 100%
# 5.0 7.0 7.0 8.0 9.5
```

```
hist(dat$english_grade, col = "red", main = "English grades")
```

```
t.test(dat$english_grade, dat$english_score, paired = TRUE)
```

```
#
# Paired t-test
#
# data: dat$english_grade and dat$english_score
# t = -1.54, df = 314, p-value = 0.13
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.149849 0.018371
# sample estimates:
# mean of the differences
# -0.065739
```

- \(p\)-value > \(\alpha\): not enough evidence to reject \(H_0\)

```
cohen.d(dat$english_grade, dat$english_score, paired = T)
```

```
#
# Cohen's d
#
# d estimate: -0.086646 (negligible)
# 95 percent confidence interval:
# inf sup
# -0.243194 0.069903
```

- What if we would have incorrectly analyzed the data using an independent samples \(t\)-test?

```
t.test(dat$english_grade, dat$english_score, paired = FALSE)$statistic
```

```
# t
# -0.82366
```

```
# with the paired t-test:
t.test(dat$english_grade, dat$english_score, paired = TRUE)$statistic
```

```
# t
# -1.5378
```

- Independent \(t\)-test generally results in lower absolute \(t\)-value

- More sophisticated statistics allow more sensitivity to data:
- Incorrectly using the independent samples \(t\)-test increases the probability of a **type-II** error
- Increased chance of the null hypothesis being false, but not rejected

Simple \(t\) statistic:

\[t = \; \frac{m_1 - m_2}{s/\sqrt{n}}\]

- For numeric data: compares means of two groups (or two series of values), or one mean versus one value, and determines whether difference is significant
- Population statistics (\(\sigma\)) unnecessary, but sample statistics needed
- Three applications:
- Single sample: compares mean of sample to fixed value
- Independent (i.e. **unrelated**) samples: compares two means
- Paired: compares pairs of values

- Assumptions with all \(t\)-tests:
- Distribution roughly normal if \(n \leq 30\)
- Randomly selected data
- Data at interval or ratio scale
- Data independent within one series of values, and (if independent samples \(t\)-test) between both groups

- Additionally
- Report effect size using Cohen's \(d = (m_1 - m_2)/s\)

- For the example of the independent samples \(t\)-test:

We tested whether the average English score of students following this course was significantly higher for those who had bilingual education than for those who did not. Our hypotheses were: \(H_0\): \(\mu_b = \mu_m\) and \(H_a\): \(\mu_b > \mu_m\). We obtained English scores in a sample of 315 students of the *Statistiek I* course via an online questionnaire. Since \(\sigma\) is unknown and the samples were independent, we conducted an independent samples \(t\)-test (corrected for unequal variances) after verifying that the assumptions for the test were met (normally distributed, or more than 30 values). The mean of the English scores for the students with bilingual education in the sample was 8.15, whereas it was 7.26 for those who followed monolingual education. The effect size was medium (Cohen's \(d\): -0.8; see box plot), and it reached significance at \(\alpha\)-level 0.05: \(t\)(35.3) = 4.15, \(p\) < 0.001. We therefore reject the null hypothesis and accept the alternative hypothesis that students who had bilingual education had higher English scores than those who did not.

- State the issue in terms of the population(s) (not merely the samples)
- Formulate \(H_0\) and \(H_a\)
- State how your hypothesis is to be tested, how samples were obtained (including sample size), and what procedures (test materials) were used to obtain measurements
- Identify the \(\alpha\)-level and the statistical test to be used, and indicate why
- Illustrate your research question graphically, if possible (e.g., box plots)
- Present the results of the study on the sample, the \(p\)-value, whether the result is significant or not, and an effect size
- State conclusions about the hypotheses
- Discuss and interpret your results

Practice this in laboratory exercises!

- Research question: Do native English speakers distinguish /t/ from /θ/ ("th") with their tongue?
- Hypothesis: The tongue position of English native speakers is more frontal when pronouncing /θ/ than /t/.
- \(H_0: \mu_{(th_i-t_i)} = 0\) (no difference in frontal position)
- \(H_a: \mu_{(th_i-t_i)} > 0\) (more frontal position for /θ/ than for /t/)

- We randomly selected 22 English participants who pronounced 10 minimal pairs /t/:/θ/, when connected to the articulography device:
- 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth', 'tent'-'tenth'
- 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
- For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words)

- We used a **paired \(t\)-test** to assess the hypothesis as our data consists of two measurement points per speaker, and the differences were approximately normally distributed
- We used an \(\alpha\)-level of 0.05 (one-tailed)

```
datEN$Sound = relevel(datEN$Sound, "TH") # set TH as reference level
t.test(FrontPos ~ Sound, data = datEN, paired = T, alternative = "greater") # paired
```

```
#
# Paired t-test
#
# data: FrontPos by Sound
# t = 6.4, df = 21, p-value = 1.2e-06
# alternative hypothesis: true difference in means is greater than 0
# 95 percent confidence interval:
# 0.035207 Inf
# sample estimates:
# mean of the differences
# 0.048154
```

```
cohen.d(FrontPos ~ Sound, data = datEN, paired = T)$estimate # large effect size
```

```
# [1] -1.3645
```

- Native English speakers have significantly more frontal tongue positions for /θ/-words than for /t/-words

- Research question: Do Dutch speakers of English distinguish /t/ from /θ/ ("th") with their tongue?
- Hypothesis: The tongue position of Dutch speakers of English is more frontal when pronouncing /θ/ than /t/.
- \(H_0: \mu_{(th_i-t_i)} = 0\) (no difference in frontal position)
- \(H_a: \mu_{(th_i-t_i)} > 0\) (more frontal position for /θ/ than for /t/)

- We randomly selected 19 Dutch participants who pronounced 10 minimal pairs /t/:/θ/, when connected to the articulography device:
- 'fate'-'faith', 'forth'-'fort', 'kit'-'kith', 'mitt'-'myth', 'tent'-'tenth'
- 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
- For each speaker, we calculated the average normalized frontal tongue position for both sets of words (/t/-words, /θ/-words)

- We used a **paired \(t\)-test** to assess the hypothesis as our data consists of two measurement points per speaker
- Though note that this analysis actually is **not** appropriate here, as the data is not normally distributed (a non-parametric alternative should be used: next lecture)
- We used an \(\alpha\)-level of 0.05 (one-tailed)

```
datNL$Sound = relevel(datNL$Sound, "TH")
t.test(FrontPos ~ Sound, data = datNL, paired = T, alternative = "greater")
```

```
#
# Paired t-test
#
# data: FrontPos by Sound
# t = 1.86, df = 18, p-value = 0.04
# alternative hypothesis: true difference in means is greater than 0
# 95 percent confidence interval:
# 0.0010611 Inf
# sample estimates:
# mean of the differences
# 0.016263
```

```
cohen.d(FrontPos ~ Sound, data = datNL, paired = T)$estimate # medium effect size
```

```
# [1] -0.42559
```

- Dutch speakers have significantly more frontal tongue positions for /θ/-words than for /t/-words

- We rejected the null hypothesis for the Dutch group
- But this is **incorrect**!
- We used an *inappropriate* test: \(t\)-test while distribution was non-normal and contained large outliers
- Using a non-parametric test (next lecture) results in retaining \(H_0\):
- Dutch speakers do not have more frontal tongue positions for /θ/-words than for /t/-words

**Lesson: take note of test assumptions!**

- Using multiple tests risks finding significance through sheer chance
- Suppose you run two tests (as we did here), always using \(\alpha\) = 0.05
- Chance of finding one or more significant values (family-wise error rate) is: \(1 - (1 - \alpha)^2\) = \(1 - 0.95^2 = 0.0975\) (almost twice as high as we'd like!)

- To guarantee a family-wise error rate of 0.05, we should divide \(\alpha\) by the number of tests: Bonferroni correction
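A quick check of both numbers, plus R's built-in `p.adjust()` (which multiplies the \(p\)-values by the number of tests instead of dividing \(\alpha\), an equivalent approach):

```r
alpha <- 0.05
k <- 2                 # number of tests
1 - (1 - alpha)^k      # family-wise error rate: 0.0975
alpha / k              # Bonferroni-corrected alpha: 0.025
p.adjust(c(0.04, 0.001), method = "bonferroni")  # 0.080 0.002
```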

- In this lecture, we've covered
- the \(t\)-test (three variants)
- how to calculate the effect size (Cohen's \(d\))
- how to report results of a statistical test
- the problem of multiple testing

**Experiment yourself**: http://eolomea.let.rug.nl/Statistiek-I/HC4 (login with s-nr)

- Next lecture: **Non-parametric alternatives**

Thank you for your attention!