Statistiek I

Sampling

Martijn Wieling

Question 1: last lecture

Last lecture

Descriptive vs. inferential statistics
Sample vs. population
(Types of) variables
Distribution of a variable
Measures of central tendency and spread
Standardized scores
Checking for a normal distribution

This lecture

Reasoning about the population using a sample
- Relation between population (mean) and sample (mean)
- Confidence interval for population mean based on sample mean
- Testing a hypothesis about the population using a sample
  - One-sided hypothesis vs. two-sided hypothesis
- Statistical significance
- Error types

Introduction

Selecting a sample from a population includes an element of chance: which individuals are studied?
Question of this lecture: How to reason about the population using a sample?
- Answered using the Central Limit Theorem

Central Limit Theorem

Suppose we gathered many different samples from the population: the sample means will then be (approximately) normally distributed, provided the samples are large enough
- The mean of these sample means (\(\bar{x}\)) will be the population mean (\(m_{\bar{x}} = \mu\))
- The standard deviation of the sample means (i.e. standard error \(\textit{SE}\)) is dependent on the sample size \(n\) and the population standard deviation \(\sigma\) : \(\textit{SE} = s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)
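The two properties above can be checked with a small simulation. This is a minimal sketch, assuming an exponential population with rate 1 (so \(\mu = 1\) and \(\sigma = 1\)); the population, the sample size and the number of samples are illustrative choices, not part of the lecture.

```r
set.seed(1)                      # make the simulation reproducible

n      <- 50                     # size of each sample (assumed)
n_samp <- 10000                  # number of samples drawn (assumed)
mu     <- 1                      # mean of an exponential population with rate 1
sigma  <- 1                      # standard deviation of that population

# draw n_samp samples of size n and store the mean of each sample
sample_means <- replicate(n_samp, mean(rexp(n, rate = 1)))

mean(sample_means)               # close to mu
sd(sample_means)                 # close to the standard error below
sigma / sqrt(n)                  # theoretical standard error
```

Even though the population itself is clearly skewed, a histogram of the sample means (`hist(sample_means)`) should look roughly bell-shaped.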

Standard deviation vs. standard error

Standard deviation of population (\(\sigma\)):
- Relate individual to population
Standard deviation of sample means = standard error (\(\sigma / \sqrt{n}\))
- Relate sample to population

Question 2

Reasoning about the population (1)

Given that the sample means are normally distributed \(N(\mu,\sigma/\sqrt{n})\), having one randomly selected sample allows us to reason about the population
Requirement: sample is representative (unbiased sample)
- Random selection helps avoid bias

Question 3

Reasoning about the population (2)

Given a representative sample:
- We estimate the population mean as equal to the sample mean (best guess)
- How certain we are of this estimate depends on the standard error: \(\sigma/\sqrt{n}\)
  - Increasing sample size \(n\) reduces uncertainty
    - Hard work pays off (in exactness), but it doesn’t pay off quickly: \(\sqrt{n}\)
  - Sample means are normally distributed (CLT):
    - We can relate a sample mean to the population mean by using characteristics of the normal distribution
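The slow pay-off of a larger sample can be made concrete: because of the \(\sqrt{n}\), the sample has to become four times larger to halve the standard error. A minimal sketch, assuming \(\sigma = 1\) and three illustrative sample sizes:

```r
sigma <- 1                        # population standard deviation (illustrative)
n     <- c(25, 100, 400)          # each sample is four times larger than the previous one
se    <- sigma / sqrt(n)          # standard error for each sample size

data.frame(n = n, se = se)        # se: 0.20, 0.10, 0.05
```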

Normal distribution

We know the probability of a sample mean \(\bar{x}\) having a value close to the population mean \(\mu\):

\(P(\mu - \textit{SE} \leq \bar{x} \leq \mu + \textit{SE}) \approx 68\%\)     (34 + 34)
\(P(\mu - 2\,\textit{SE} \leq \bar{x} \leq \mu + 2\,\textit{SE}) \approx 95\%\)     (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\,\textit{SE} \leq \bar{x} \leq \mu + 3\,\textit{SE}) \approx 99.7\%\)     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
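These percentages can be reproduced with `pnorm()`. The sketch below works on the standard normal distribution, i.e. it expresses the distance to the mean in units of \(\textit{SE}\); the variable names are illustrative.

```r
k    <- 1:3                       # number of standard errors around the mean
prob <- pnorm(k) - pnorm(-k)      # P(-k <= z <= k) for a standard normal variable

round(prob, 4)                    # 0.6827 0.9545 0.9973
```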

Reasoning about the population (3)

Sample means can be related to the population in two ways:
- Using a confidence interval
  - An interval which is calculated in such a way that a large proportion of the calculated intervals contains the true population mean
- Using a hypothesis test
  - Tests if hypothesis about population is compatible with sample result

Confidence interval

Definition: there is an \(x\)% probability that when computing an \(x\)% confidence interval (CI) on the basis of a sample, it contains \(\mu\)
The CI can be seen as an estimate of plausible values of \(\mu\)
- (For those who are interested: there is much confusion about interpreting CIs)
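The definition can be illustrated by simulation: if we repeatedly draw a sample and compute a 95% CI each time, about 95% of those intervals should contain \(\mu\). This is a minimal sketch, assuming a normally distributed population with illustrative values \(\mu = 5\), \(\sigma = 1\) and \(n = 100\).

```r
set.seed(1)                           # make the simulation reproducible
mu <- 5; sigma <- 1; n <- 100         # illustrative population values and sample size
se <- sigma / sqrt(n)                 # standard error of the sample mean

# for each simulated sample: does the interval m +/- 1.96 SE contain mu?
contains_mu <- replicate(10000, {
  m <- mean(rnorm(n, mean = mu, sd = sigma))
  (m - 1.96 * se <= mu) && (mu <= m + 1.96 * se)
})

mean(contains_mu)                     # proportion of intervals containing mu, close to 0.95
```

Note that the 95% describes the procedure: any single interval either contains \(\mu\) or it does not.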

Confidence interval: example (1)

Consider the following example:
You want to know how many hours per week a student of the university spends speaking English. The standard deviation \(\sigma\) for the university is 1 hr/wk.
- You collect data from 100 randomly chosen students
- You calculate the sample mean \(m = 5\) hr/wk (note: \(m\) = \(\bar{x}\))
- You therefore estimate the population mean \(\mu = 5\) hr/wk and standard error \(\textit{SE} = 1/\sqrt{100} = 0.1\) hr/wk
What is the 95% confidence interval (CI) of the mean?

Confidence interval: example (2)

According to the CLT, the sample means are normally distributed

95% of the sample means lie within \(m \pm 2\,\textit{SE}\)
- (i.e. actually it is \(m \pm 1.96\,\textit{SE}\), but we round this to \(m \pm 2\,\textit{SE}\))
With \(m\) = 5 and \(\textit{SE}\) = 0.1, 95% CI is 5 \(\pm\) 2 \(\times\) 0.1 = (4.8 hr/wk, 5.2 hr/wk)
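The same interval can be computed in R. The sketch below uses the unrounded multiplier from `qnorm(0.975)` (about 1.96) instead of 2, so the bounds differ slightly from the rounded result above.

```r
m <- 5; sigma <- 1; n <- 100          # values from the example
se     <- sigma / sqrt(n)             # 0.1
z_crit <- qnorm(0.975)                # about 1.96: leaves 2.5% in each tail

(ci <- c(m - z_crit * se, m + z_crit * se))   # about 4.80 and 5.20
```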

Question 4

Hypotheses

Instead of computing confidence intervals, we often use samples to test hypotheses about populations
Examples of hypotheses
- Answering online lecture questions is related to the course grade
- Females and males differ in their English proficiency
- Nouns take longer to read than verbs

Hypothesis testing (1)

Testing these hypotheses requires empirical and variable data
- Empirical: based on observation rather than theory alone
- Variable: individual cases vary
Hypotheses can be derived from theory, but also from observations if theory is incomplete

Hypothesis testing (2)

We start from a research question:
Is answering online lecture questions related to the course grade?
Which we then formulate as a hypothesis (i.e. a statement):
Answering online lecture questions is related to the course grade
For statistics to be useful, this needs to be translated to a concrete form:
Students answering online lecture questions score higher than those who do not

Hypothesis testing (3)

Students answering online lecture questions score higher than those who do not
What is meant by this?
All students answering online lecture questions score higher than those who do not?
- Probably not: the data is variable, and other factors play a role:
  - Attention level of each student
  - Difficulty of the lecture
  - If the questions were answered seriously
We need statistics to abstract away from the variability of the observations (i.e. unsystematic variation)

Hypothesis testing (4)

Students answering online lecture questions score higher than those who do not
Meaning:
- Not: All students answering online lecture questions score higher than those who do not
- But: On average, students answering online lecture questions score higher than those who do not

Testing a hypothesis using a sample (1)

On average, students answering online lecture questions score higher than those who do not
This hypothesis must be studied on the basis of a sample, i.e. a limited number of students following a course with online lecture questions
- Of course we’re interested in the population, i.e. all students who followed a course with online lecture questions

Testing a hypothesis using a sample (2)

The hypothesis concerns the population, but it is studied through a representative sample
- Students answering online lecture questions score higher than those who do not
  (study based on 30 students who answered the questions and 30 who did not)
- Females have higher English proficiency than males
  (study based on 40 males and 40 females)
- Nouns take longer to read than verbs
  (studied on the basis of 35 people’s reading of 100 nouns and verbs)

Question 5

Analysis: when is a difference real?

Given a testable hypothesis:
Students answering online lecture questions score higher than those who do not
- You collect the final course grade for 30 randomly selected students who answered the online questions and 30 who did not
Will any difference in average grade (in the right direction) be proof?
- Probably not: very small differences might be due to chance (unsystematic variation)
Therefore we use statistics to analyze the results
- Statistically significant results are those unlikely to be due to chance

Comparing a sample to population: \(z\)-test

\(z\)-test allows assessing difference between sample and population
- \(\mu\) and \(\sigma\) for the population should be known (standardized tests: e.g., IQ test)
Sample mean \(m\) is compared to population mean \(\mu\)

Example of \(z\)-test

You think Computer Assisted Language Learning (CALL) may be effective for kids
You give a standard test of language proficiency (\(\mu\) = 70, \(\sigma\) = 14) to 49 randomly chosen children who followed a CALL program
- You find \(m\) = 74
- You calculate \(\textit{SE}\) = \(\sigma/\sqrt{n} = 14/\sqrt{49} = 2\)
- 74 is 2 \(\textit{SE}\) above the population mean: at roughly the 97.5th percentile

Conclusions of \(z\)-test

Group with CALL scored 2 \(\textit{SE}\) above mean (\(z\)-score of 2)
- Chance of this (or more extreme score) is only 2.5%, so very unlikely that this is due to chance
Conclusion: CALL programs are probably helping
- However, it is also possible that CALL is not helping, but the effect is caused by some other factor
  - Such as the sample including many proficient kids
  - This is a confounding factor: an influential hidden variable (a variable not used in a study)

Question 6

Importance of sample size

Suppose we had used 9 children instead of 49: at what percentile would a sample mean of \(m\) = 74 be?
- \(\textit{SE}\) = \(\sigma/\sqrt{n} = 14/\sqrt{9} \approx 4.7\)
- \(m\) = 74 is less than 1 \(\textit{SE}\) above the mean, i.e. at less than the 84th percentile
  - Sample means of at least this value are found by chance more than 16% of the time: not enough reason to suspect a CALL effect

\(z\)-test in `R`

sigma <- 14; mu <- 70; m <- 74; n <- 9

(se <- sigma/sqrt(n))

[1] 4.67

(zval <- (m - mu) / se)

[1] 0.857

pnorm(zval) # yields percentile: p(z < zval)

[1] 0.804

Statistical reasoning: two hypotheses (1)

Rather than one hypothesis, we create two hypotheses about the data:
- The null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_a\))
- The null hypothesis states that there is no relationship between two measured phenomena (e.g., CALL program and test score), while the alternative hypothesis states there is

Statistical reasoning: two hypotheses (2)

For the CALL example (49 children):
- \(H_0\): \(\mu_{CALL} = 70\) (the population mean of people using CALL is 70)
- \(H_a\): \(\mu_{CALL} > 70\) (the population mean of people using CALL is higher than 70)
- While \(m\) = 74 suggests that \(H_a\) is right, this might be due to chance, so we would need enough evidence (i.e. low \(\textit{SE}\)) to accept it over the null hypothesis
- Logically, \(H_0\) is the inverse of \(H_a\), and we’d expect \(H_0\): \(\mu_{CALL} \leq 70\), but we usually see ‘\(=\)’ in formulations

Statistical reasoning (1)

\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)

The reasoning goes as follows:
- Suppose \(H_0\) is true, what is the chance \(p\) of observing a sample with \(m \geq\) 74?
- To determine this, we convert 74 to a \(z\)-score: \(z = (m - \mu) / \textit{SE}\) \(= (74-70)/2 = 2\)
- And find the associated \(p\)-value:

1 - pnorm(2) # pnorm(2) yields p(z < 2) => 1 - pnorm(2) = p(z >= 2)

[1] 0.0228

pnorm(2, lower.tail = FALSE) # alternative formulation for p(z >= 2)

[1] 0.0228

Statistical reasoning (2)

\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)

In R we can also calculate the probability directly without conversion to \(z\)-scores by supplying the mean and standard error (sd parameter):

pnorm(74, mean = 70, sd = 2, lower.tail = FALSE)

[1] 0.0228

Statistical reasoning (3)

\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)

\(P(z \geq 2) \approx 0.025\)
- The chance of observing a sample at least this extreme given \(H_0\) is true is 0.025
- This is the \(p\)-value (measured significance level)
- If \(H_0\) were correct and kids with CALL experience had the same language proficiency as others, the observed sample would be expected only 2.5% of the time
  - Strong evidence against the null hypothesis

Statistically significant?

We have determined \(H_0\), \(H_a\) and the \(p\)-value
The classical hypothesis test assesses how unlikely a sample must be for a test to count as significant
We compare the \(p\)-value against this threshold significance level or \(\alpha\)-level
If the \(p\)-value is lower than the \(\alpha\)-level (usually 0.05, but it may be lower as well), we regard the result as significant and reject the null hypothesis

Statistically significant: summary

The \(p\)-value is the chance of encountering a sample at least as extreme as the one observed, given that the null hypothesis is true
The \(\alpha\)-level is the threshold for the \(p\)-value, below which we regard the result as significant
- If the result is significant, we reject \(H_0\) and assume \(H_a\) is true

Question 7

Visualizing the answer to question 7

\(m = 74\), \(\mu = 70\), \(\sigma = 14\), \(n = 49\), \(\textit{SE} = 14/\sqrt{49} = 2 \implies z = \frac{m - \mu}{\textit{SE}} = \frac{74 - 70}{2} = 2\)

Steps for assessing statistical significance

Specify \(H_0\) and \(H_a\)
Specify test statistic (e.g., mean) and underlying distribution (assuming \(H_0\))
Specify the \(\alpha\)-level at which \(H_0\) will be rejected
Determine the value of the statistic (e.g., mean) on the basis of a sample
Calculate the \(p\)-value and compare to \(\alpha\)
- \(p\)-value \(< \alpha\): reject \(H_0\) (significant result)
- \(p\)-value \(\geq \alpha\): do not reject \(H_0\) (non-significant result)
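The steps above can be sketched as a small function for the one-sided case \(H_a\): \(\mu > \mu_0\). The function name `z_test_upper` and its arguments are hypothetical helpers for illustration, not an existing R function.

```r
# hypothetical helper: one-sided z-test for H_a: mu > mu0
z_test_upper <- function(m, mu0, sigma, n, alpha = 0.05) {
  se <- sigma / sqrt(n)                  # standard error under H_0
  z  <- (m - mu0) / se                   # test statistic
  p  <- pnorm(z, lower.tail = FALSE)     # p-value: P(Z >= z) given H_0
  list(z = z, p = p, reject_h0 = p < alpha)
}

# CALL example: m = 74, mu0 = 70, sigma = 14, n = 49
res <- z_test_upper(m = 74, mu0 = 70, sigma = 14, n = 49)
res$z            # 2
res$p            # about 0.0228
res$reject_h0    # TRUE at alpha = 0.05
```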

Critical values

Critical values: those values of the sample statistic resulting in a rejection of \(H_0\)
E.g., if \(\alpha\) is set at 0.05 (one-sided test), the critical region contains the \(z\)-values with \(P(Z \geq z) \leq 0.05\), i.e. \(z \geq 1.64\)
We can transform this to raw values using the \(z\) formula
\(z = (x-\mu)/SE\)
\(1.64 = (x-70)/2\)
\(3.3 = x-70\)
\(x = 73.3\)
Thus a sample mean of at least 73.3 will result in rejection of \(H_0\)

Calculating critical values in R

# critical z-value
qnorm(p = 0.05, lower.tail = FALSE)

[1] 1.64

# critical value
qnorm(p = 0.05, mean = 70, sd = 2, lower.tail = FALSE)

[1] 73.3

\(z\)-test

The CALL example is a \(z\)-test: based on a normal distribution with known \(\mu\) and \(\sigma\)
On the basis of the sample mean \(m\), we calculate the \(z\)-value: \(z = (m - \mu) / (\sigma / \sqrt{n})\)
We obtain the \(p\)-value linked with the \(z\)-value and compare that to the \(\alpha\)-level

One-sided \(z\)-test

There are different forms of \(z\)-tests:
- \(H_a\) predicts high \(m\): CALL improves language ability
- \(H_a\) predicts low \(m\): Eating broccoli lowers cholesterol levels

Two-sided \(z\)-test (1)

Sometimes \(H_a\) might predict not lower or higher, but just different
For example, you use a standardized aphasia test in the Netherlands that was developed in the UK
- The developers claim that for non-aphasics, the distribution is \(N(100,10)\)
- You specify \(H_0\): \(\mu = 100\) and \(H_a\): \(\mu \neq 100\)

Two-sided \(z\)-test (2)

With a significance level \(\alpha\) of 0.05, both very high (2.5% highest) and very low (2.5% lowest) values give reason to reject \(H_0\)

Two-sided \(z\)-test: example

Consider a sample of 81 Dutch people who took the UK aphasia test, \(N(100,10)\)
The mean score of the test in the sample is 98
Is there reason to believe the Dutch population differs from the UK population?

Two-sided \(z\)-test: calculation

pnorm(98,mean=100,sd=10/sqrt(81))

[1] 0.0359

Two-sided test: reject \(H_0\) for 2.5% lowest and 2.5% highest values (when \(\alpha\) = 0.05)
- (one-tailed) \(p\)-value > 0.025: \(H_0\) not rejected
With a one-sided test (\(H_a\): \(\mu < 100\)), \(H_0\) would have been rejected (\(p\) < 0.05)
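An equivalent way to evaluate the two-sided test is to double the one-tailed \(p\)-value and compare it directly to \(\alpha\) = 0.05. A minimal sketch for the aphasia example (variable names are illustrative):

```r
m <- 98; mu <- 100; sigma <- 10; n <- 81   # values from the example
z <- (m - mu) / (sigma / sqrt(n))          # -1.8

p_two_sided <- 2 * pnorm(-abs(z))          # both tails: about 0.072
p_two_sided < 0.05                         # FALSE: H_0 is not rejected
```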

Statistical significance and confidence interval

Statistical significance and a confidence interval (CI) are linked
A 95% CI based on the sample mean \(m\) represents the values for \(\mu\) for which the difference between \(\mu\) and \(m\) is not significant (at the 0.05 significance threshold for a two-sided test)
- A value outside of the CI indicates a statistically significant difference

Two-sided \(z\)-test: calculation using confidence interval

mu <- 100
se <- 10 / sqrt(81)
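(conf <- c(mu - 2 * se, mu + 2 * se)) # region within 2 SE of the hypothesized mean

[1]  97.8 102.2

The sample mean 98 lies within 2 \(\textit{SE}\) of the hypothesized mean \(\mu\) = 100 (equivalently, 100 lies within the 95% confidence interval around the sample mean, 98 \(\pm\) 2.2): not significant at \(\alpha\) = 0.05 for a two-tailed test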

Significance and sample size

Recall our CALL example: \(H_0\): \(\mu_{CALL} = 70\), \(H_a\): \(\mu_{CALL} > 70\)
With a sample of 49, the sample mean \(m\) was 74, with a \(p\)-value of \(\approx\) 0.025 (one-tailed)
- This was significant at the \(\alpha\)-level of 0.05, but not 0.01
If you are certain about \(m\) = 74 and wanted significance at the 0.01 \(\alpha\)-level, you could increase the sample size
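How large the sample would have to be can be sketched as follows. The calculation assumes that the sample mean would stay at exactly 74 in the larger sample, which is not guaranteed; it solves \(z = (m - \mu)/(\sigma/\sqrt{n}) \geq z_{crit}\) for \(n\).

```r
mu <- 70; sigma <- 14; m <- 74; alpha <- 0.01

z_crit <- qnorm(alpha, lower.tail = FALSE)        # about 2.33 for a one-sided test
n_min  <- ceiling((z_crit * sigma / (m - mu))^2)  # smallest n with z >= z_crit: 67

# check: p-value at this sample size, assuming m is still 74
p_at_n_min <- pnorm(m, mean = mu, sd = sigma / sqrt(n_min), lower.tail = FALSE)
p_at_n_min < alpha                                # TRUE
```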

Chasing significance

Suppose the statistic (\(m = 74\)) stayed the same with a sample size \(n\) of 100
The standard error would be \(14 / \sqrt{100} = 1.4\)
And the resulting \(p\)-value would be:

pnorm(74, mean = 70, sd = 1.4, lower.tail = FALSE)

[1] 0.00214

So with a sample size of 100, the result would be significant at the \(\alpha\) = 0.01 level
But would it make sense to collect the additional data?

Understanding significance (1)

Is it sensible to collect the extra data to “push” a result to significance?
- No. At least, usually not.
The real result (effect size) is the difference (4 points), nearly 0.3 \(\sigma\) (4 / 14)

Understanding significance (2)

“Statistically significant” implies that an effect probably is not due to chance, but the effect can be very small
- If you want to know whether you should buy CALL software to learn a language, statistical significance does not tell you this
- This is a two-edged sword: if an effect is not statistically significant, it does not mean that nothing important is going on
  - You are just not sure: it could be a chance effect

Question 8

Misuse of significance

Garbage in, garbage out: statistics won’t help an experiment with a poor design, or where data was poorly collected
No significance hunting: hypotheses should be formulated before data collection and analysis
- Modern danger: if there are many potential variables, it is likely that a few turn out to be significant
  - Specific tests are necessary to correct for this

Recap: hypothesis testing

A statistical hypothesis concerns a population (not a sample!) and involves a statistic (such as mean, frequency, etc.)
- Population: all students attending a course using online lecture questions
- Parameter (statistic): (average) course performance
- Hypothesis: average performance of students answering online lecture questions is higher than of those who do not

Identifying hypotheses

Alternative hypothesis \(H_a\) (original hypothesis) is contrasted with null hypothesis \(H_0\) (hypothesis that nothing out of the ordinary is going on)
- \(H_a\): higher performance for students answering online lecture questions
- \(H_0\): answering online lecture questions does not impact performance
Logically \(H_0\) should imply \(\neg H_a\)

Possible errors

Of course, you could be wrong (e.g., due to an unrepresentative sample)!

\(H_0\)	true	false
accepted	correct	type-II error
rejected	type-I error	correct

Hypothesis testing focuses on type-I errors
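- \(p\)-value: chance of a result at least as extreme as the observed one if \(H_0\) is true
- \(\alpha\)-level: maximum acceptable chance of a type-I error (rejecting a true \(H_0\)); rejecting only when \(p < \alpha\) keeps the type-I error rate at \(\alpha\)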
Type-II errors
- \(\beta\): chance of type-II error
- \(1 - \beta\): power of statistical test
  - More sensitive (and useful) tests have more power to detect an effect
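Power can only be computed for a specific assumed true population mean. The sketch below does this for the one-sided CALL test with \(n = 49\), under the hypothetical assumption that the true mean of CALL users is 74; this value is chosen for illustration and is not given in the lecture.

```r
mu0 <- 70; sigma <- 14; n <- 49; alpha <- 0.05
mu_true <- 74                          # assumed true mean (hypothetical)
se <- sigma / sqrt(n)                  # 2

# smallest sample mean that leads to rejection of H_0 (critical value)
crit <- qnorm(alpha, mean = mu0, sd = se, lower.tail = FALSE)      # about 73.3

# power: chance that a sample mean exceeds crit if mu_true is correct
power <- pnorm(crit, mean = mu_true, sd = se, lower.tail = FALSE)  # about 0.64
beta  <- 1 - power                     # chance of a type-II error, about 0.36
```

Under this assumption, roughly one in three studies of this size would miss the effect.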

Possible errors: easier to remember

False positive (type-I error): incorrect positive (accepting \(H_a\)) result
False negative (type-II error): incorrect negative (not rejecting \(H_0\)) result

How to formulate the results?

Results with \(p = 0.051\) are not very different from \(p = 0.049\), but we need a boundary
- An \(\alpha\)-level of \(0.05\) is low as the “burden of proof” is on the alternative
If \(p = 0.051\), we haven’t proven \(H_0\), as we just failed to show that it’s very wrong
- When we cannot reject \(H_0\), we indicate that we have “retained \(H_0\)”

Note about multiple testing

Using multiple tests risks finding significance through sheer chance
Suppose you run two tests, always using \(\alpha\) = 0.05
- Chance of finding one or more significant values (family-wise error rate) is: \(1 - (1 - \alpha)^2\) = \(1 - 0.95^2 = 0.0975\) (almost twice as high as we’d like!)
To guarantee a family-wise error rate of 0.05, we should divide \(\alpha\) by the number of tests: Bonferroni correction
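The effect of the correction can be checked numerically. A minimal sketch, assuming the tests are independent (as the formula above does); the variable names are illustrative.

```r
alpha <- 0.05
k     <- 2                                     # number of tests

fwer_uncorrected <- 1 - (1 - alpha)^k          # 0.0975

alpha_bonf      <- alpha / k                   # Bonferroni-corrected threshold: 0.025
fwer_bonferroni <- 1 - (1 - alpha_bonf)^k      # 0.049375, just below 0.05
```

Changing `k` shows how quickly the uncorrected family-wise error rate grows with the number of tests.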

Recap

In this lecture, we’ve covered
- how to reason about the population using a sample (CLT)
- how to calculate a confidence interval
- how to specify a concrete testable hypothesis based on a research question
- how to specify the null hypothesis
- how to conduct a \(z\)-test and use the results to evaluate a hypothesis
- what statistical significance entails
- how to evaluate if a result is statistically significant given a specific \(\alpha\)-level
- the difference between a one-tailed and a two-tailed test
- the different error types
- risk of multiple testing
For practice: https://martijnwieling.shinyapps.io/sampling/
Next lecture: introduction to linear regression

Please evaluate this lecture

Exam question

Questions?

Thank you for your attention!

https://www.martijnwieling.nl

m.b.wieling@rug.nl
