Relation between population (mean) and sample (mean)
Confidence interval for population mean based on sample mean
Testing a hypothesis about the population using a sample
One-sided hypothesis vs. two-sided hypothesis
Statistical significance
Error types
Introduction
Selecting a sample from a population includes an element of chance: which individuals are studied?
Question of this lecture: How to reason about the population using a sample?
Answered using the Central Limit Theorem
Central Limit Theorem
If we were to gather many different samples from the population, the distribution of the sample means will be (approximately) normally distributed
The mean of these sample means (\(\bar{x}\)) will be the population mean (\(m_{\bar{x}} = \mu\))
The standard deviation of the sample means (i.e. standard error \(\textit{SE}\)) is dependent on the sample size \(n\) and the population standard deviation \(\sigma\) : \(\textit{SE} = s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)
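The CLT can be illustrated with a small simulation (a sketch in Python rather than R; the exponential population, its mean of 2, and the sample size of 50 are arbitrary choices for illustration):

```python
import math
import random
import statistics

random.seed(1)

# A clearly non-normal population: exponential with mean 2 (its sd is also 2)
mu, sigma, n = 2.0, 2.0, 50

# Draw many samples of size n and record each sample mean
sample_means = [
    statistics.mean(random.expovariate(1 / mu) for _ in range(n))
    for _ in range(10_000)
]

# The mean of the sample means approximates mu,
# and their standard deviation approximates SE = sigma / sqrt(n)
print(round(statistics.mean(sample_means), 2))   # close to 2.0
print(round(statistics.stdev(sample_means), 2))  # close to 2 / sqrt(50), i.e. about 0.28
```

Even though the individual observations are strongly skewed, the histogram of the 10,000 sample means is close to normal with the predicted mean and standard error.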
Standard deviation vs. standard error
Standard deviation of population (\(\sigma\)):
Relate individual to population
Standard deviation of sample means = standard error (\(\sigma / \sqrt{n}\))
Relate sample to population
Question 2
Reasoning about the population (1)
Given that the distribution of sample means is normally distributed \(N(\mu,\sigma/\sqrt{n})\), having one randomly selected sample allows us to reason about the population
Requirement: sample is representative (unbiased sample)
Random selection helps avoid bias
Question 3
Reasoning about the population (2)
Given a representative sample:
We estimate the population mean as equal to the sample mean (best guess)
How certain we are of this estimate depends on the standard error: \(\sigma/\sqrt{n}\)
Increasing sample size \(n\) reduces uncertainty
Hard work pays off (in exactness), but it doesn’t pay off quickly: \(\sqrt{n}\)
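The slow payoff is easy to verify numerically (a Python sketch; \(\sigma = 1\) is an arbitrary choice):

```python
import math

sigma = 1.0  # arbitrary population standard deviation for illustration

for n in (25, 100, 400, 1600):
    se = sigma / math.sqrt(n)
    print(n, round(se, 3))
# Quadrupling n only halves the SE: 0.2, 0.1, 0.05, 0.025
```

To halve your uncertainty you need four times as many observations.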
Sample means are normally distributed (CLT):
We can relate a sample mean to the population mean by using characteristics of the normal distribution
Normal distribution
We know the probability of a sample mean \(\bar{x}\) having a value close to the population mean \(\mu\):
Sample means can be related to the population in two ways:
Using a confidence interval
An interval which is calculated in such a way that a large proportion of the calculated intervals contains the true population mean
Using a hypothesis test
Tests if hypothesis about population is compatible with sample result
Confidence interval
Definition: there is an \(x\)% probability that when computing an \(x\)% confidence interval (CI) on the basis of a sample, it contains \(\mu\)
The CI can be seen as an estimate of plausible values of \(\mu\)
(For those who are interested: there is much confusion about interpreting CIs)
Confidence interval: example (1)
Consider the following example: You want to know how many hours per week a student of the university spends speaking English. The standard deviation \(\sigma\) for the university is 1 hr/wk.
You collect data from 100 randomly chosen students
You calculate the sample mean \(m = 5\) hr/wk (note: \(m\) = \(\bar{x}\))
You therefore estimate the population mean \(\mu = 5\) hr/wk and standard error \(\textit{SE} = 1/\sqrt{100} = 0.1\) hr/wk
What is the 95% confidence interval (CI) of the mean?
Confidence interval: example (2)
According to the CLT, the sample means are normally distributed
95% of the sample means lie within \(m \pm 2\,\textit{SE}\)
(i.e. actually it is \(m \pm 1.96\,\textit{SE}\), but we round this to \(m \pm 2\,\textit{SE}\))
With \(m\) = 5 and \(\textit{SE}\) = 0.1, 95% CI is 5 \(\pm\) 2 \(\times\) 0.1 = (4.8 hr/wk, 5.2 hr/wk)
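The same computation as a sketch in Python (values taken from the example above):

```python
import math

m, sigma, n = 5.0, 1.0, 100  # sample mean, population sd, sample size
se = sigma / math.sqrt(n)    # 0.1

# 95% CI using the rounded z-value of 2 (1.96 would be slightly more precise)
ci = (m - 2 * se, m + 2 * se)
print(ci)  # (4.8, 5.2)
```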
Question 4
Hypotheses
Besides computing confidence intervals, we often use samples to test hypotheses about populations
Examples of hypotheses
Answering online lecture questions is related to the course grade
Females and males differ in their English proficiency
Nouns take longer to read than verbs
Hypothesis testing (1)
Testing these hypotheses requires empirical and variable data
Empirical: based on observation rather than theory alone
Variable: individual cases vary
Hypotheses can be derived from theory, but also from observations if theory is incomplete
Hypothesis testing (2)
We start from a research question: Is answering online lecture questions related to the course grade?
Which we then formulate as a hypothesis (i.e. a statement): Answering online lecture questions is related to the course grade
For statistics to be useful, this needs to be translated to a concrete form: Students answering online lecture questions score higher than those who do not
Hypothesis testing (3)
Students answering online lecture questions score higher than those who do not
What is meant by this? All students answering online lecture questions score higher than those who do not?
Probably not: the data is variable and there are other factors:
Attention level of each student
Difficulty of the lecture
If the questions were answered seriously
We need statistics to abstract away from the variability of the observations (i.e. unsystematic variation)
Hypothesis testing (4)
Students answering online lecture questions score higher than those who do not
Meaning:
Not: all students answering online lecture questions score higher than those who do not
But: on average, students answering online lecture questions score higher than those who do not
Testing a hypothesis using a sample (1)
On average, students answering online lecture questions score higher than those who do not
This hypothesis must be studied on the basis of a sample, i.e. a limited number of students following a course with online lecture questions
Of course we’re interested in the population, i.e. all students who followed a course with online lecture questions
Testing a hypothesis using a sample (2)
The hypothesis concerns the population, but it is studied through a representative sample
Students answering online lecture questions score higher than those who do not (study based on 30 students who answered the questions and 30 who did not)
Females have higher English proficiency than males (study based on 40 males and 40 females)
Nouns take longer to read than verbs (studied on the basis of 35 people’s reading of 100 nouns and verbs)
Question 5
Analysis: when is a difference real?
Given a testable hypothesis: Students answering online lecture questions score higher than those who do not
You collect the final course grade for 30 randomly selected students who answered the online questions and 30 who did not
Will any difference in average grade (in the right direction) be proof?
Probably not: very small differences might be due to chance (unsystematic variation)
Therefore we use statistics to analyze the results
Statistically significant results are those unlikely to be due to chance
Comparing a sample to population: \(z\)-test
\(z\)-test allows assessing difference between sample and population
\(\mu\) and \(\sigma\) for the population should be known (standardized tests: e.g., IQ test)
Sample mean \(m\) is compared to population mean \(\mu\)
Example of \(z\)-test
You think Computer Assisted Language Learning (CALL) may be effective for kids
You give a standard test of language proficiency (\(\mu\) = 70, \(\sigma\) = 14) to 49 randomly chosen children who followed a CALL program
You find \(m\) = 74
You calculate \(\textit{SE}\) = \(\sigma/\sqrt{n} = 14/\sqrt{49} = 2\)
74 is 2 \(\textit{SE}\) above the population mean: at the 97.5th percentile
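The same \(z\)-test, sketched in Python (the normal CDF is computed via the error function, so no external library is needed):

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n, m = 70, 14, 49, 74  # values from the CALL example

se = sigma / math.sqrt(n)         # 14 / 7 = 2
z = (m - mu) / se                 # (74 - 70) / 2 = 2

print(norm_cdf(z))      # about 0.977 (the 97.5th percentile when using z = 1.96)
print(1 - norm_cdf(z))  # one-sided p-value, about 0.023
```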
Conclusions of \(z\)-test
Group with CALL scored 2 \(\textit{SE}\) above mean (\(z\)-score of 2)
Chance of this (or more extreme score) is only 2.5%, so very unlikely that this is due to chance
Conclusion: CALL programs are probably helping
However, it is also possible that CALL is not helping, but the effect is caused by some other factor
Such as the sample including many proficient kids
This is a confounding factor: an influential hidden variable (a variable not used in a study)
Question 6
Importance of sample size
Suppose we would have used 9 children as opposed to 49, at what percentile would a sample mean of \(m\) = 74 be?
\(m\) = 74 is less than 1 \(\textit{SE}\) above the mean, i.e. at less than the 84th percentile
Sample means of at least this value are found by chance more than 16% of the time: not enough reason to suspect a CALL effect
\(z\)-test in R
sigma <- 14; mu <- 70; m <- 74; n <- 9
(se <- sigma/sqrt(n))
[1] 4.67
(zval <- (m - mu) / se)
[1] 0.857
pnorm(zval) # yields percentile: p(z < zval)
[1] 0.804
Statistical reasoning: two hypotheses (1)
Rather than one hypothesis, we create two hypotheses about the data:
The null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_a\))
The null hypothesis states that there is no relationship between two measured phenomena (e.g., CALL program and test score), while the alternative hypothesis states there is
Statistical reasoning: two hypotheses (2)
For the CALL example (49 children):
\(H_0\): \(\mu_{CALL} = 70\) (the population mean of people using CALL is 70)
\(H_a\): \(\mu_{CALL} > 70\) (the population mean of people using CALL is higher than 70)
While \(m = 74\) suggests that \(H_a\) is right, this might be due to chance, so we would need enough evidence (i.e. a low \(\textit{SE}\)) to accept it over the null hypothesis
Logically, \(H_0\) is the negation of \(H_a\), so we’d expect \(H_0\): \(\mu_{CALL} \leq 70\), but we usually see ‘\(=\)’ in formulations
The chance of observing a sample at least this extreme given \(H_0\) is true is 0.025
This is the \(p\)-value (measured significance level)
If \(H_0\) were correct and kids with CALL experience had the same language proficiency as others, the observed sample would be expected only 2.5% of the time
Strong evidence against the null hypothesis
Statistically significant?
We have determined \(H_0\), \(H_a\) and the \(p\)-value
The classical hypothesis test assesses how unlikely a sample must be for a test to count as significant
We compare the \(p\)-value against this threshold significance level or \(\alpha\)-level
If the \(p\)-value is lower than the \(\alpha\)-level (usually 0.05, but it may be lower as well), we regard the result as significant and reject the null hypothesis
Statistically significant: summary
The \(p\)-value is the chance of encountering the sample, given that the null hypothesis is true
The \(\alpha\)-level is the threshold for the \(p\)-value, below which we regard the result as significant
If the result is significant, we reject \(H_0\) and assume \(H_a\) is true
Two-sided \(z\)-test (1)
Sometimes \(H_a\) might predict not lower or higher, but just different
For example, you use a standardized test for aphasia in NL developed in the UK
The developers claim that for non-aphasics, the distribution is \(N(100,10)\)
You specify \(H_0\): \(\mu = 100\) and \(H_a\): \(\mu \neq 100\)
Two-sided \(z\)-test (2)
With a significance level \(\alpha\) of 0.05, both very high (2.5% highest) and very low (2.5% lowest) values give reason to reject \(H_0\)
Two-sided \(z\)-test: example
Consider a sample of 81 Dutch people who took the UK aphasia test, \(N(100,10)\)
The mean score of the test in the sample is 98
Is there reason to believe the Dutch population differs from the UK population?
Two-sided \(z\)-test: calculation
pnorm(98,mean=100,sd=10/sqrt(81))
[1] 0.0359
Two-sided test: reject \(H_0\) for 2.5% lowest and 2.5% highest values (when \(\alpha\) = 0.05)
(one-tailed) \(p\)-value > 0.025: \(H_0\) not rejected
With a one-sided test (\(H_a\): \(\mu < 100\)), \(H_0\) would have been rejected (\(p\) < 0.05)
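The one- and two-sided \(p\)-values for this example can be compared directly (a Python sketch of the R `pnorm` call above):

```python
import math

def norm_cdf(x, mean=0.0, sd=1.0):
    """Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

mu, sigma, n, m = 100, 10, 81, 98   # values from the aphasia-test example

se = sigma / math.sqrt(n)           # 10 / 9
p_one_sided = norm_cdf(m, mu, se)   # about 0.0359
p_two_sided = 2 * p_one_sided       # about 0.0719

print(round(p_one_sided, 4))  # significant one-sided (p < 0.05)
print(round(p_two_sided, 4))  # not significant two-sided (p > 0.05)
```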
Statistical significance and confidence interval
Statistical significance and a confidence interval (CI) are linked
A 95% CI based on the sample mean \(m\) represents the values for \(\mu\) for which the difference between \(\mu\) and \(m\) is not significant (at the 0.05 significance threshold for a two-sided test)
A value outside of the CI indicates a statistically significant difference
Two-sided \(z\)-test: calculation using confidence interval
mu <- 100; se <- 10/sqrt(81)
(conf <- c(mu - 2*se, mu + 2*se)) # 95% CI: 2 SE below and above mean
[1] 97.8 102.2
The value 98 lies within the 95% confidence interval: not significant at \(\alpha\) = 0.05 for a two-tailed test
Identifying hypotheses
Hypothesis: average performance of students answering online lecture questions is higher than that of those who do not
The alternative hypothesis \(H_a\) (the original hypothesis) is contrasted with the null hypothesis \(H_0\) (the hypothesis that nothing out of the ordinary is going on)
\(H_a\): higher performance for students answering online lecture questions
\(H_0\): answering online lecture questions does not impact performance
Logically \(H_0\) should imply \(\neg H_a\)
Possible errors
Of course, you could be wrong (e.g., due to an unrepresentative sample)!
                    \(H_0\) true     \(H_0\) false
\(H_0\) accepted    correct          type-II error
\(H_0\) rejected    type-I error     correct
Hypothesis testing focuses on type-I errors
\(p\)-value: chance of type-I error
\(\alpha\)-level: boundary of acceptable level of type-I error
Type-II errors
\(\beta\): chance of type-II error
\(1 - \beta\): power of statistical test
More sensitive (and useful) tests have more power to detect an effect
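Power can be computed for the one-sided CALL \(z\)-test, assuming some true effect size (the true mean of 74 below is an illustrative assumption, not part of the lecture's data):

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu0, sigma, n = 70, 14, 49   # H0 mean, population sd, sample size
mu_true = 74                 # assumed true mean under Ha (illustrative)
z_crit = 1.645               # critical z for a one-sided test at alpha = 0.05

se = sigma / math.sqrt(n)    # 2
# Reject H0 when the sample mean exceeds this cutoff
cutoff = mu0 + z_crit * se   # 73.29

beta = norm_cdf((cutoff - mu_true) / se)  # P(type-II error), about 0.36
power = 1 - beta                          # about 0.64
print(round(power, 2))
```

With these assumptions the test would detect the effect only about 64% of the time; increasing \(n\) raises the power.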
Possible errors: easier to remember
False positive: incorrect positive (accepting \(H_a\)) result
False negative: incorrect negative (not rejecting \(H_0\)) result
How to formulate the results?
Results with \(p = 0.051\) are not very different from \(p = 0.049\), but we need a boundary
An \(\alpha\)-level of \(0.05\) is low as the “burden of proof” is on the alternative
If \(p = 0.051\), we haven’t proven \(H_0\); we have just failed to show that it’s very wrong
When we cannot reject \(H_0\), we indicate that we have “retained \(H_0\)”
Note about multiple testing
Using multiple tests risks finding significance through sheer chance
Suppose you run two tests, always using \(\alpha\) = 0.05
Chance of finding one or more significant values (family-wise error rate) is: \(1 - (1 - \alpha)^2\) = \(1 - 0.95^2 = 0.0975\) (almost twice as high as we’d like!)
To guarantee a family-wise error rate of 0.05, we should divide \(\alpha\) by the number of tests: Bonferroni correction
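The family-wise error rate and the effect of the Bonferroni correction can be checked numerically (a Python sketch):

```python
alpha, k = 0.05, 2  # per-test alpha, number of (independent) tests

# Chance of at least one false positive across k independent tests
fwer = 1 - (1 - alpha) ** k
print(round(fwer, 4))  # 0.0975

# Bonferroni correction: divide alpha by the number of tests
alpha_bonf = alpha / k
fwer_bonf = 1 - (1 - alpha_bonf) ** k
print(round(fwer_bonf, 4))  # 0.0494, safely below 0.05
```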
Recap
In this lecture, we’ve covered
how to reason about the population using a sample (CLT)
how to calculate a confidence interval
how to specify a concrete testable hypothesis based on a research question
how to specify the null hypothesis
how to conduct a \(z\)-test and use the results to evaluate a hypothesis
what statistical significance entails
how to evaluate if a result is statistically significant given a specific \(\alpha\)-level
the difference between a one-tailed and a two-tailed test