# Statistiek I

## Sampling

Martijn Wieling
University of Groningen

## Last lecture

• Descriptive vs. inferential statistics
• Sample vs. population
• (Types of) variables
• Distribution of a variable
• Measures of central tendency
• Standardized scores
• Checking for a normal distribution

## This lecture

• Reasoning about the population using a sample
• Relation between population (mean) and sample (mean)
• Confidence interval for population mean based on sample mean
• Testing a hypothesis about the population using a sample
• One-sided hypothesis vs. two-sided hypothesis
• Statistical significance
• Error types

## Introduction

• Selecting a sample from a population includes an element of chance: which individuals are studied?
• Question of this lecture: How to reason about the population using a sample?
• Answered using the Central Limit Theorem

## Central Limit Theorem

• If we gathered many different samples from the population, the sample means would be (approximately) normally distributed, provided the samples are large enough
• The mean of these sample means ($\bar{x}$) will be the population mean ($m_{\bar{x}} = \mu$)
• The standard deviation of the sample means (standard error SE) is dependent on the sample size $n$ and the population standard deviation $\sigma$ : SE $= s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
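The three claims above can be checked with a small simulation. This is only a sketch: the exponential population, the sample size of 50, the number of samples and the seed are assumptions chosen for illustration, not values from the lecture.

```r
# Simulate the Central Limit Theorem with a clearly non-normal population.
set.seed(1)                  # arbitrary seed, only for reproducibility
n <- 50                      # size of each sample (assumed)
n_samples <- 10000           # number of samples drawn (assumed)
mu <- 2                      # an exponential distribution with rate 1/2 ...
sigma <- 2                   # ... has mean 2 and standard deviation 2

# Draw many samples and store the mean of each one
sample_means <- replicate(n_samples, mean(rexp(n, rate = 1 / mu)))

mean(sample_means)           # should be close to mu
sd(sample_means)             # should be close to sigma / sqrt(n)
sigma / sqrt(n)              # theoretical standard error

# hist(sample_means)         # uncomment to see the roughly bell-shaped histogram
```

Even though the individual observations are strongly skewed, the sample means cluster symmetrically around the population mean.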

## Standard deviation vs. standard error

• Standard deviation ($\sigma$): relate an individual to a population
• Standard error ($\frac{\sigma}{\sqrt{n}}$): relate a sample to a population

## Reasoning about the population (1)

• Given that the distribution of sample means is normally distributed $N(\mu,\sigma/\sqrt{n})$, having one randomly selected sample allows us to reason about the population
• Requirement: sample is representative (unbiased sample)
• Random selection helps avoid bias

## Reasoning about the population (2)

• Given a representative sample:
• We estimate the population mean as equal to the sample mean (best guess)
• How certain we are of this estimate depends on the standard error: $\sigma/\sqrt{n}$
• Increasing sample size $n$ reduces uncertainty
• Hard work pays off (in exactness), but it doesn't pay off quickly: the standard error only shrinks with $\sqrt{n}$
• Sample means are normally distributed (CLT):
• We can relate a sample mean to the population mean by using characteristics of the normal distribution
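To make the square-root relation concrete, the sketch below computes the standard error for a few sample sizes. The value of `sigma` and the sample sizes are assumptions for illustration.

```r
# Standard error for increasing sample sizes (sigma is an assumed value)
sigma <- 1
n <- c(25, 100, 400, 1600)       # each sample is four times larger than the previous one
se <- sigma / sqrt(n)

data.frame(n = n, se = se)       # se: 0.2, 0.1, 0.05, 0.025
```

Quadrupling the sample size only halves the standard error.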

## Normal distribution

• We know the probability of a sample mean $\bar{x}$ having a value close to the population mean $\mu$:

$P(\mu - SE \leq \bar{x} \leq \mu + SE) \approx 68\%$     (34 + 34)
$P(\mu - 2SE \leq \bar{x} \leq \mu + 2SE) \approx 95\%$     (34 + 34 + 13.5 + 13.5)
$P(\mu - 3SE \leq \bar{x} \leq \mu + 3SE) \approx 99.7\%$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
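These percentages follow directly from the standard normal distribution. A quick check in R, using only `pnorm()` and `qnorm()`:

```r
# Probability of falling within 1, 2 and 3 standard errors of the mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
round(100 * coverage, 1)
# [1] 68.3 95.4 99.7

# Exact number of standard errors that covers 95%
qnorm(0.975)                     # about 1.96
```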

## Reasoning about the population (3)

• Sample means can be related to the population in two ways:
• Using a confidence interval
• An interval which is calculated in such a way that a large proportion of the calculated intervals contains the true population mean
• Using a hypothesis test
• Tests if hypothesis about population is compatible with sample result

## Confidence interval

• Definition: there is an $x$% probability that when computing an $x$% confidence interval (CI) on the basis of a sample, it contains $\mu$
• The CI can be seen as an estimate of plausible values of $\mu$
• (For those who are interested: there is a lot of confusion about interpreting CIs)
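The definition can be illustrated by simulation: draw many samples, compute a 95% CI for each, and count how often the interval contains $\mu$. This is a sketch under assumed values (a normal population with $\mu = 5$ and $\sigma = 1$, samples of size 100, 5000 repetitions, arbitrary seed).

```r
# How often does a 95% confidence interval contain the population mean?
set.seed(42)                     # arbitrary seed, only for reproducibility
mu <- 5; sigma <- 1; n <- 100    # assumed population values and sample size
se <- sigma / sqrt(n)
z <- qnorm(0.975)                # about 1.96

covered <- replicate(5000, {
  m <- mean(rnorm(n, mean = mu, sd = sigma))    # mean of one sample
  (m - z * se <= mu) && (mu <= m + z * se)      # does this CI contain mu?
})

mean(covered)                    # proportion of intervals containing mu, near 0.95
```

Any single interval either contains $\mu$ or it does not; the 95% describes the procedure across many samples.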

## Confidence interval: example (1)

• Consider the following example:
You want to know how many hours per week a student of the university spends speaking English. The standard deviation $\sigma$ for the university is 1 hr/wk.
• You collect data from 100 randomly chosen students
• You calculate the sample mean $m = 5$ hr/wk (N.B. in my notation: $m$ = $\bar{x}$)
• You therefore estimate the population mean $\mu = 5$ hr/wk and SE $= 1/\sqrt{100} = 0.1$ hr/wk
• What is the 95% confidence interval (CI) of the mean?

## Confidence interval: example (2)

• According to the CLT, the sample means are normally distributed

• 95% of the sample means lie within $\mu \pm$ 2 SE, so for 95% of the samples the interval $m \pm$ 2 SE contains $\mu$
• (Strictly speaking it is 1.96 SE, but we round this to 2 SE)
• With $m$ = 5 and SE = 0.1, 95% CI is 5 $\pm$ 2$\times$0.1 = (4.8 hr/wk, 5.2 hr/wk)
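The same interval can be computed in R. The sketch below uses the exact multiplier from `qnorm()` instead of the rounded value 2; the variable names are chosen for illustration.

```r
# 95% confidence interval for the English-speaking example
m <- 5; sigma <- 1; n <- 100
se <- sigma / sqrt(n)            # 0.1
z <- qnorm(0.975)                # about 1.96

ci <- m + c(-1, 1) * z * se      # lower and upper bound
round(ci, 2)
# [1] 4.8 5.2
```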

## Hypotheses

• We often use samples to test hypotheses about populations
• Examples of hypotheses
• Women and men differ in their English proficiency
• Nouns take longer to read than verbs

## Hypothesis testing (1)

• Testing these hypotheses requires empirical and variable data
• Empirical: based on observation rather than theory alone
• Variable: individual cases vary
• Hypotheses can be derived from theory, but also from observations if theory is incomplete

## Hypothesis testing (2)

• We start from a research question:
• Which we then formulate as a hypothesis (i.e. a statement):
• For statistics to be useful, this needs to be translated to a concrete form:
Students answering online lecture questions score higher than those who do not

## Hypothesis testing (3)

• Students answering online lecture questions score higher than those who do not
• What is meant by this?
All students answering online lecture questions score higher than those who do not?
• Probably not, the data is variable, there are other factors:
• Attention level of each student
• Difficulty of the lecture
• Whether the questions were answered seriously
• We need statistics to abstract away from the variability of the observations (i.e. unsystematic variation)

## Hypothesis testing (4)

• Students answering online lecture questions score higher than those who do not
• Meaning:
• Not: All students answering online lecture questions score higher than those who do not
• But: On average, students answering online lecture questions score higher than those who do not

## Testing a hypothesis using a sample (1)

• On average, students answering online lecture questions score higher than those who do not
• This hypothesis must be studied on the basis of a sample, i.e. a limited number of students following a course with online lecture questions
• Of course we're interested in the population, i.e. all students who followed a course with online lecture questions

## Testing a hypothesis using a sample (2)

• The hypothesis concerns the population, but it is studied through a representative sample
• Students answering online lecture questions score higher than those who do not
(study based on 30 students who answered the questions and 30 who did not)
• Women have higher English proficiency than men
(study based on 40 men and 40 women)
• Nouns take longer to read than verbs
(studied on the basis of 35 people's reading of 100 nouns and verbs)

## Analysis: when is a difference real?

• Given a testable hypothesis:
Students answering online lecture questions score higher than those who do not
• You collect the final course grade for 30 randomly selected students who answered the online questions and 30 who did not
• Will any difference in average grade (in the right direction) be proof?
• Probably not: very small differences might be due to chance (unsystematic variation)
• Therefore we use statistics to analyze the results
• Statistically significant results are those unlikely to be due to chance

## Comparing a sample to population: $z$-test

• $z$-test allows assessing difference between sample and population
• $\mu$ and $\sigma$ for the population should be known (standardized tests: e.g., IQ test)
• Sample mean $m$ is compared to population mean $\mu$

## Example of $z$-test

• You think Computer Assisted Language Learning may be effective for kids
• You give a standard test of language proficiency ($\mu$ = 70, $\sigma$ = 14) to 49 randomly chosen children who followed a CALL program
• You find $m$ = 74
• You calculate SE = $\sigma/\sqrt{n} = 14/\sqrt{49} = 2$
• 74 is 2 SE above the population mean: at the 97.5th percentile
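The calculation for this example can be sketched in R as follows. Note that `pnorm(2)` gives roughly 0.977; the 97.5th percentile on the slide results from rounding 1.96 SE to 2 SE.

```r
# z-test for the CALL example with 49 children
sigma <- 14; mu <- 70; m <- 74; n <- 49

se <- sigma / sqrt(n)            # standard error: 2
zval <- (m - mu) / se            # z-score of the sample mean: 2
percentile <- pnorm(zval)        # p(z < zval), about 0.977

c(se = se, z = zval, percentile = percentile)
```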

## Conclusions of $z$-test

• Group with CALL scored 2 SE above mean ($z$-score of 2)
• The chance of this (or a more extreme) score is only about 2.5%, so it is very unlikely to be due to chance
• Conclusion: CALL programs are probably helping
• However, it is also possible that CALL is not helping, but the effect is caused by some other factor
• Such as the sample including lots of proficient kids
• This is a confounding factor: an influential hidden variable (a variable not used in a study)

## Importance of sample size

• Suppose we had used 9 children instead of 49: at what percentile would a sample mean of $m$ = 74 be?
• SE = $\sigma/\sqrt{n} = 14/\sqrt{9} \approx 4.7$
• $m$ = 74 is less than 1 SE above the mean, i.e. at less than the 84th percentile
• Sample means of this value are found by chance more than 16% of the time (i.e. likely due to chance): not enough reason to suspect an effect of CALL

## $z$-test in R

sigma <- 14; mu <- 70; m <- 74; n <- 9

(se <- sigma/sqrt(n))

# [1] 4.67

(zval <- (m - mu)/se)

# [1] 0.857

pnorm(zval)  # yields percentile: p(z < zval)

# [1] 0.804


## Statistical reasoning: two hypotheses (1)

• Rather than one hypothesis, we create two hypotheses about the data:
• The null hypothesis ($H_0$) and the alternative hypothesis ($H_a$)
• The null hypothesis states that there is no relationship between two measured phenomena (e.g., CALL program and test score), while the alternative hypothesis states there is

## Statistical reasoning: two hypotheses (2)

• For the CALL example (49 children):
• $H_0$: $\mu_{CALL} = 70$ (the population mean of people using CALL is 70)
• $H_a$: $\mu_{CALL} > 70$ (the population mean of people using CALL is higher than 70)
• While $m$ = 74 suggests that $H_a$ is right, this might be due to chance, so we need enough evidence (i.e. a low SE) to accept it over the null hypothesis
• Logically, $H_0$ is the inverse of $H_a$, and we'd expect $H_0$: $\mu_{CALL} \leq 70$, but we usually see '$=$' in formulations

## Statistical reasoning (1)

$H_0$: $\mu_{CALL} = 70$              $H_a$: $\mu_{CALL} > 70$

• The reasoning goes as follows:
• Suppose $H_0$ is true, what is the chance $p$ of observing a sample with $m \geq$ 74?
• To determine this, we convert 74 to a $z$-score: $z = (m - \mu) /$SE = (74-70)/2 = 2
• And find the associated $p$-value:
1 - pnorm(2)  # pnorm(2) yields p(z < 2) => 1 - pnorm(2) = p(z >= 2)

# [1] 0.0228

pnorm(2, lower.tail = FALSE)  # alternative formulation for p(z >= 2)

# [1] 0.0228


## Statistical reasoning (2)

$H_0$: $\mu_{CALL} = 70$              $H_a$: $\mu_{CALL} > 70$

• In R we can also calculate the probability directly without conversion to $z$-scores by supplying the mean and standard error (sd parameter):
pnorm(74, mean = 70, sd = 2, lower.tail = FALSE)

# [1] 0.0228


## Statistical reasoning (3)

$H_0$: $\mu_{CALL} = 70$              $H_a$: $\mu_{CALL} > 70$

• $P(z \geq 2) \approx 0.025$
• The chance of observing a sample at least this extreme given $H_0$ is true is 0.025
• This is the $p$-value (measured significance level)
• If $H_0$ were correct and kids with CALL experience had the same language proficiency as others, a sample at least this extreme would be expected only about 2.5% of the time
• Strong evidence against the null hypothesis

## Statistically significant?

• We have determined $H_0$, $H_a$ and the $p$-value
• The classical hypothesis test specifies in advance how unlikely a sample must be (given $H_0$) for the result to count as significant
• We compare the $p$-value against this threshold significance level or $\alpha$-level
• If the $p$-value is lower than the $\alpha$-level (usually 0.05, but it may be lower as well), we regard the result as significant and reject the null hypothesis

## Statistically significant: summary

• The $p$-value is the chance of encountering a sample at least as extreme as the one observed, given that the null hypothesis is true
• The $\alpha$-level is the threshold for the $p$-value, below which we regard the result as significant
• If the result is significant, we reject $H_0$ and assume $H_a$ is true

## Visualizing the answer to question 7

$m = 74$, $\mu = 70$, $\sigma = 14$, $n = 49$, $\textrm{SE} = 14/\sqrt{49} = 2 \implies z = \frac{m - \mu}{\textrm{SE}} = \frac{74 - 70}{2} = 2$

## Steps for assessing statistical significance

1. Specify $H_0$ and $H_a$
2. Specify test statistic (e.g., mean) and underlying distribution (assuming $H_0$)
3. Specify the $\alpha$-level at which $H_0$ will be rejected
4. Determine the value of the statistic (e.g., mean) on the basis of a sample
5. Calculate the $p$-value and compare to $\alpha$
• $p$-value $< \alpha$: reject $H_0$ (significant result)
• $p$-value $\geq \alpha$: do not reject $H_0$ (non-significant result)
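The five steps can be walked through in R for the CALL example (49 children). This is a sketch; the variable names are chosen for illustration.

```r
# Step 1: H0: mu = 70, Ha: mu > 70 (one-sided)
mu0 <- 70; sigma <- 14

# Step 2: test statistic is the sample mean, which under H0 follows N(mu0, sigma / sqrt(n))
n <- 49
se <- sigma / sqrt(n)

# Step 3: alpha-level at which H0 will be rejected
alpha <- 0.05

# Step 4: value of the statistic in the sample
m <- 74

# Step 5: p-value (chance of a sample mean of at least m under H0) and decision
p <- pnorm(m, mean = mu0, sd = se, lower.tail = FALSE)   # about 0.0228
decision <- if (p < alpha) "reject H0" else "do not reject H0"
decision
```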

## Critical values

• Critical values: those values of the sample statistic resulting in a rejection of $H_0$
• E.g., if $\alpha$ is set at 0.05 (one-sided test), the critical region consists of the $z$-values for which $P(Z \geq z) \leq 0.05$, i.e. $z \geq 1.64$
• We can transform this to raw values using the $z$ formula: $z = (x-\mu)/\textrm{SE}$, so $1.64 = (x-70)/2$, which gives $x - 70 \approx 3.3$ and thus $x \approx 73.3$
• Thus a sample mean of at least 73.3 will result in rejection of $H_0$

## Calculating critical values in R

# critical z-value
qnorm(p = 0.05, lower.tail = FALSE)

# [1] 1.64

# critical value
qnorm(p = 0.05, mean = 70, sd = 2, lower.tail = FALSE)

# [1] 73.3


## $z$-test

• The CALL example is a $z$-test: based on a normal distribution with known $\mu$ and $\sigma$
• On the basis of the sample mean $m$, we calculate the $z$-value: $z = (m - \mu) / (\sigma / \sqrt{n})$
• We obtain the $p$-value linked with the $z$-value and compare that to the $\alpha$-level

## One-sided $z$-test

• There are different forms of $z$-tests:
• $H_a$ predicts high $m$: CALL improves language ability
• $H_a$ predicts low $m$: Eating broccoli lowers cholesterol levels

## Two-sided $z$-test (1)

• Sometimes $H_a$ might predict not lower or higher, but just different
• For example, you use a test for aphasia in the Netherlands (NL) that was developed in the UK
• The developers claim that for non-aphasics, the distribution is $N(100,10)$
• You specify $H_0$: $\mu = 100$ and $H_a$: $\mu \neq 100$

## Two-sided $z$-test (2)

• With a significance level $\alpha$ of 0.05, both very high (2.5% highest) and very low (2.5% lowest) values give reason to reject $H_0$

## Two-sided $z$-test: example

• Consider a sample of 81 Dutch people who took the UK aphasia test, $N(100,10)$
• The mean score of the test in the sample is 98
• Is there reason to believe the Dutch population differs from the UK population?

## Two-sided $z$-test: calculation

pnorm(98, mean = 100, sd = 10/sqrt(81))

# [1] 0.0359

• Two-sided test: reject $H_0$ for 2.5% lowest and 2.5% highest values (when $\alpha$ = 0.05)
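• The lower-tail probability (0.0359) is greater than 0.025, so $H_0$ is not rejected (equivalently, the two-sided $p$-value $2 \times 0.0359 \approx 0.072$ exceeds 0.05)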
• With a one-sided test ($H_a$: $\mu < 100$), $H_0$ would have been rejected ($p$ < 0.05)
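The one-sided and two-sided $p$-values for this example can be compared directly. A minimal sketch, in which the two-sided $p$-value is obtained by doubling the tail probability of the absolute $z$-value:

```r
# Aphasia test example: one-sided versus two-sided p-value
m <- 98; mu <- 100; sigma <- 10; n <- 81

z <- (m - mu) / (sigma / sqrt(n))     # -1.8
p_one_sided <- pnorm(z)               # p(Z <= z), about 0.036
p_two_sided <- 2 * pnorm(-abs(z))     # both tails, about 0.072

alpha <- 0.05
c(one_sided = p_one_sided < alpha,    # TRUE: significant for Ha: mu < 100
  two_sided = p_two_sided < alpha)    # FALSE: not significant for Ha: mu != 100
```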

## Statistical significance and confidence interval

• Statistical significance and a confidence interval (CI) are linked
• A 95% CI based on the sample mean $m$ represents the values for $\mu$ for which the difference between $\mu$ and $m$ is not significant (at the 0.05 significance threshold for a two-sided test)
• A value outside of the CI indicates a statistically significant difference

## Two-sided $z$-test: calculation using confidence interval

mu <- 100
se <- 10 / sqrt(81)
(conf <- c(mu - 2 * se, mu + 2 * se)) # 95% CI: 2 SE below and above mean

# [1]  97.8 102.2

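• The value 98 lies within this interval around $\mu$; equivalently, $\mu = 100$ lies within the 95% CI around the sample mean ($98 \pm 2$ SE, i.e. 95.8 to 100.2): not significant at $\alpha$ = 0.05 for a two-tailed test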
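The link between the CI and the two-sided test can also be shown by centering the interval on the sample mean and checking whether the hypothesized $\mu$ falls inside it. A sketch using the exact multiplier 1.96 rather than 2:

```r
# 95% CI around the sample mean for the aphasia example
m <- 98; mu0 <- 100
se <- 10 / sqrt(81)

ci <- m + c(-1, 1) * qnorm(0.975) * se   # about 95.82 and 100.18
round(ci, 2)

# mu0 inside the CI means: not significant at alpha = 0.05 (two-sided)
mu0_in_ci <- ci[1] <= mu0 && mu0 <= ci[2]
mu0_in_ci
# [1] TRUE
```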

## Significance and sample size

• Recall our CALL example: $H_0$: $\mu_{CALL} = 70$, $H_a$: $\mu_{CALL} > 70$
• With a sample of 49, the sample mean $m$ was 74, with a $p$-value of approximately 0.025 (one-tailed)
• This was significant at the $\alpha$-level of 0.05, but not 0.01
• If you are certain about $m$ = 74 and wanted significance at the 0.01 $\alpha$-level, you could increase the sample size
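How large the sample would have to be can be worked out from the $z$ formula. The sketch below assumes, as in the bullet above, that the sample mean stays at exactly 74, which is a strong assumption.

```r
# Smallest n for which m = 74 is significant at alpha = 0.01 (one-sided)
mu0 <- 70; sigma <- 14; m <- 74; alpha <- 0.01

z_crit <- qnorm(1 - alpha)                         # about 2.33
# Solve (m - mu0) / (sigma / sqrt(n)) >= z_crit for n
n_min <- ceiling((z_crit * sigma / (m - mu0))^2)
n_min
# [1] 67

# Check: p-value with n_min observations
pnorm(m, mean = mu0, sd = sigma / sqrt(n_min), lower.tail = FALSE)  # just below 0.01
```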

## Chasing significance

• Suppose the statistic ($m = 74$) stayed the same with a sample size $n$ of 100
• The standard error would be $14 / \sqrt{100} = 1.4$
• And the resulting $p$-value would be:
pnorm(74, mean = 70, sd = 1.4, lower.tail = FALSE)

# [1] 0.00214

• So with a sample size of 100, the result would be significant at the $\alpha$ = 0.01 level
• Would it make sense to collect the additional data?

## Understanding significance (1)

• Is it sensible to collect the extra data to "push" a result to significance?
• No. At least, usually not.
• The real result (effect size) is the difference (4 points), nearly 0.3 $\sigma$ (4 / 14)
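The contrast between effect size and $p$-value can be made explicit: the effect size does not depend on $n$, while the $p$-value does. A sketch using the three sample sizes that appear in this lecture:

```r
# Same effect size, different sample sizes
mu0 <- 70; sigma <- 14; m <- 74

effect_size <- (m - mu0) / sigma          # about 0.29 sigma, independent of n
n <- c(9, 49, 100)
p <- pnorm(m, mean = mu0, sd = sigma / sqrt(n), lower.tail = FALSE)

data.frame(n = n, effect_size = round(effect_size, 2), p = round(p, 4))
```

The $p$-value drops as $n$ grows, but the difference of 4 points is exactly as large (or small) as it was before.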

## Understanding significance (2)

• "Statistically significant" implies that an effect probably is not due to chance, but the effect can be very small
• If you want to know whether you should buy CALL software to learn a language, statistically significant does not tell you this
• This cuts both ways: if an effect is not statistically significant, it does not mean nothing important is going on
• You are just not sure: it could be a chance effect

## Misuse of significance

• Garbage in, garbage out: statistics won't help an experiment with a poor design, or where data was poorly collected
• No significance hunting: hypotheses should be formulated before data collection and analysis
• Modern danger: if there are many potential variables, it is likely that a few turn out to be significant
• Specific tests are necessary to correct for this
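The danger of testing many variables can be sketched with a simulation in which $H_0$ is true for every test. The number of tests (20), the seed and the reuse of the CALL parameters are assumptions for illustration. The Bonferroni correction via `p.adjust()` is shown as one possible correction; the lecture does not name a specific method.

```r
# 20 one-sided z-tests for which H0 is true in every case
set.seed(123)                    # arbitrary seed, only for reproducibility
n_tests <- 20; n <- 49
mu0 <- 70; sigma <- 14; alpha <- 0.05

p_values <- replicate(n_tests, {
  m <- mean(rnorm(n, mean = mu0, sd = sigma))              # sample drawn under H0
  pnorm(m, mean = mu0, sd = sigma / sqrt(n), lower.tail = FALSE)
})

sum(p_values < alpha)            # number of "significant" results by chance alone

# Chance of at least one false positive across 20 independent tests
fwer <- 1 - (1 - alpha)^n_tests  # about 0.64

# One possible correction: Bonferroni-adjusted p-values
p_adjusted <- p.adjust(p_values, method = "bonferroni")
sum(p_adjusted < alpha)
```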

## Some remarks about hypothesis testing

• A statistical hypothesis concerns a population (not a sample!) and involves a statistic (such as mean, frequency, etc.)
• Population: all students attending a course using online lecture questions
• Parameter (statistic): (average) course performance
• Hypothesis: average performance of students answering online lecture questions is higher than those who do not

## Identifying hypotheses

• Alternative hypothesis $H_a$ (original hypothesis) is contrasted with null hypothesis $H_0$ (hypothesis that nothing out of the ordinary is going on)
• $H_a$: average performance of students answering online lecture questions higher
• $H_0$: answering online lecture questions does not impact performance
• Logically $H_0$ should imply $\neg H_a$

## Possible errors

Of course, you could be wrong (e.g., due to an unrepresentative sample)!

| $H_0$ | true | false |
|---|---|---|
| accepted | correct | type II error |
| rejected | type I error | correct |

• Hypothesis testing focuses on type I errors
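• $p$-value: chance of a result at least this extreme when $H_0$ is true, i.e. the risk of a type I error if $H_0$ is rejected on the basis of this result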
• $\alpha$-level: boundary of acceptable level of type I error
• Type II errors (not covered further in this course)
• $\beta$: chance of type II error
• $1 - \beta$: power of statistical test
• More sensitive (and useful) tests have more power to detect an effect
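Both error rates can be illustrated for the one-sided CALL test. This is a sketch under assumptions: the type I error rate is simulated with $H_0$ true, and the power is computed for an assumed true mean of 74 (the lecture does not state the true population mean).

```r
# Type I error rate and power of the one-sided z-test (alpha = 0.05, n = 49)
set.seed(7)                      # arbitrary seed, only for reproducibility
mu0 <- 70; sigma <- 14; n <- 49; alpha <- 0.05
se <- sigma / sqrt(n)
crit <- qnorm(alpha, mean = mu0, sd = se, lower.tail = FALSE)   # about 73.3

# Type I error: how often is H0 rejected although it is true?
means_h0 <- replicate(10000, mean(rnorm(n, mean = mu0, sd = sigma)))
type1_rate <- mean(means_h0 >= crit)     # close to alpha

# Power: chance of rejecting H0 if the true mean were 74 (assumed value)
power <- pnorm(crit, mean = 74, sd = se, lower.tail = FALSE)    # about 0.64
beta <- 1 - power                        # chance of a type II error

c(type1_rate = type1_rate, power = power, beta = beta)
```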

## Possible errors: easier to remember

• False positive (type I error): incorrectly rejecting $H_0$ (accepting $H_a$)
• False negative (type II error): incorrectly not rejecting $H_0$

## How to formulate the results?

• Results with $p = 0.051$ are not very different from $p = 0.049$, but we need a boundary
• An $\alpha$-level of $0.05$ is low as the "burden of proof" is on the alternative
• If $p = 0.051$ we haven't proven $H_0$, only failed to show that it's really wrong
• This is called "retaining $H_0$"

## Recap

• In this lecture, we've covered
• the difference between the population and a sample
• how to calculate a confidence interval
• how to specify a concrete testable hypothesis based on a research question
• how to specify the null hypothesis
• how to conduct a $z$-test and use the results to evaluate a hypothesis
• what statistical significance entails
• how to evaluate if a result is statistically significant given a specific $\alpha$-level
• the difference between a one-tailed and a two-tailed test
• the different error types
• For practice: http://eolomea.let.rug.nl/Statistiek-I/HC3 (login: s-nr, lowercase s!)
• Next lecture: $t$-tests