# Statistiek I

## Sampling

Martijn Wieling
University of Groningen

## Last lecture

• Descriptive vs. inferential statistics
• Sample vs. population
• (Types of) variables
• Distribution of a variable
• Measures of central tendency and spread
• Standardized scores
• Checking for a normal distribution

## This lecture

• Reasoning about the population using a sample
• Relation between population (mean) and sample (mean)
• Confidence interval for population mean based on sample mean
• Testing a hypothesis about the population using a sample
• One-sided hypothesis vs. two-sided hypothesis
• Statistical significance
• Error types

## Introduction

• Selecting a sample from a population includes an element of chance: which individuals are studied?
• Question of this lecture: How to reason about the population using a sample?
• Answered using the Central Limit Theorem

## Central Limit Theorem

• Suppose we were to gather many different samples from the population: the distribution of the sample means will then be (approximately) normally distributed
• The mean of these sample means ($$\bar{x}$$) will be the population mean ($$\mu_{\bar{x}} = \mu$$)
• The standard deviation of the sample means (the standard error, SE) depends on the sample size $$n$$ and the population standard deviation $$\sigma$$: SE $$= \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
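A quick simulation sketch in R can make this concrete. The exponential population used here is purely illustrative (it is not from the lecture); the point is that even for a non-normal population the sample means come out approximately normal, centered on $$\mu$$ with spread $$\sigma/\sqrt{n}$$:

```r
# CLT sketch: even for a non-normal population, the distribution of sample
# means is approximately normal with mean mu and sd sigma/sqrt(n).
# Illustrative population: exponential with rate 1, so mu = sigma = 1.
set.seed(1)
n <- 49
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to mu = 1
sd(sample_means)     # close to sigma/sqrt(n) = 1/7 ~ 0.143
```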

## Standard deviation vs. standard error

• Standard deviation of population ($$\sigma$$):
• Relate individual to population
• Standard deviation of sample means = standard error ($$\sigma / \sqrt{n}$$)
• Relate sample to population

## Reasoning about the population (1)

• Given that the distribution of sample means is normally distributed $$N(\mu,\sigma/\sqrt{n})$$, having one randomly selected sample allows us to reason about the population
• Requirement: sample is representative (unbiased sample)
• Random selection helps avoid bias

## Reasoning about the population (2)

• Given a representative sample:
• We estimate the population mean as equal to the sample mean (best guess)
• How certain we are of this estimate depends on the standard error: $$\sigma/\sqrt{n}$$
• Increasing sample size $$n$$ reduces uncertainty
• Hard work pays off (in precision), but it does not pay off quickly: the SE shrinks only with $$\sqrt{n}$$
• Sample means are normally distributed (CLT):
• We can relate a sample mean to the population mean by using characteristics of the normal distribution
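The slow payoff of a larger sample can be sketched in R (the population standard deviation of 1 is an illustrative choice): quadrupling $$n$$ only halves the SE.

```r
# SE shrinks only with sqrt(n): each quadrupling of n halves the SE
sigma <- 1                       # illustrative population sd
n <- c(25, 100, 400, 1600)
(se <- sigma / sqrt(n))          # 0.2, 0.1, 0.05, 0.025
```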

## Normal distribution

• We know the probability of a sample mean $$\bar{x}$$ having a value close to the population mean $$\mu$$:

$$P(\mu - \textrm{SE} \leq \bar{x} \leq \mu + \textrm{SE}) \approx 68\%$$     (34 + 34)
$$P(\mu - 2\,\textrm{SE} \leq \bar{x} \leq \mu + 2\,\textrm{SE}) \approx 95\%$$     (34 + 34 + 13.5 + 13.5)
$$P(\mu - 3\,\textrm{SE} \leq \bar{x} \leq \mu + 3\,\textrm{SE}) \approx 99.7\%$$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
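These probabilities follow directly from the standard normal distribution and can be checked in R with `pnorm`:

```r
# The 68-95-99.7 rule from the standard normal distribution:
# probability of lying within 1, 2, or 3 standard deviations of the mean
pnorm(1) - pnorm(-1)   # ~0.683
pnorm(2) - pnorm(-2)   # ~0.954
pnorm(3) - pnorm(-3)   # ~0.997
```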

## Reasoning about the population (3)

• Sample means can be related to the population in two ways:
• Using a confidence interval
• An interval which is calculated in such a way that a large proportion of the calculated intervals contains the true population mean
• Using a hypothesis test
• Tests if hypothesis about population is compatible with sample result

## Confidence interval

• Definition: there is an $$x$$% probability that when computing an $$x$$% confidence interval (CI) on the basis of a sample, it contains $$\mu$$
• The CI can be seen as an estimate of plausible values of $$\mu$$
• (For those who are interested: there is a lot of confusion about interpreting CIs)

## Confidence interval: example (1)

• Consider the following example:
You want to know how many hours per week a student of the university spends speaking English. The standard deviation $$\sigma$$ for the university is 1 hr/wk.
• You collect data from 100 randomly chosen students
• You calculate the sample mean $$m = 5$$ hr/wk (N.B. in my notation: $$m$$ = $$\bar{x}$$)
• You therefore estimate the population mean $$\mu = 5$$ hr/wk and standard error SE $$= 1/\sqrt{100} = 0.1$$ hr/wk
• What is the 95% confidence interval (CI) of the mean?

## Confidence interval: example (2)

• According to the CLT, the sample means are normally distributed

• 95% of the sample means lie within $$m \pm$$ 2 SE
• (i.e. actually it is $$m \pm$$ 1.96 SE, but we round this to $$m \pm$$ 2 SE)
• With $$m$$ = 5 and SE = 0.1, the 95% CI is 5 $$\pm$$ $$2 \times 0.1$$ = (4.8 hr/wk, 5.2 hr/wk)
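The CI computation for this example can be reproduced in R, both with the exact $$z$$-value of 1.96 (from `qnorm`) and with the rounded rule of thumb:

```r
# 95% CI for the English-speaking example: m = 5, sigma = 1, n = 100
m <- 5; sigma <- 1; n <- 100
se <- sigma / sqrt(n)                  # 0.1

m + c(-1, 1) * qnorm(0.975) * se      # exact: z = 1.96
m + c(-1, 1) * 2 * se                 # rounded: (4.8, 5.2)
```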

## Hypotheses

• Instead of constructing confidence intervals, we often use samples to test hypotheses about the population
• Examples of hypotheses
• Answering online lecture questions is related to the course grade
• Women and men differ in their English proficiency
• Nouns take longer to read than verbs

## Hypothesis testing (1)

• Testing these hypotheses requires empirical and variable data
• Empirical: based on observation rather than theory alone
• Variable: individual cases vary
• Hypotheses can be derived from theory, but also from observations if theory is incomplete

## Hypothesis testing (2)

• We start from a research question:
Is answering online lecture questions related to the course grade?
• Which we then formulate as a hypothesis (i.e. a statement):
Answering online lecture questions is related to the course grade
• For statistics to be useful, this needs to be translated to a concrete form:
Students answering online lecture questions score higher than those who do not

## Hypothesis testing (3)

• Students answering online lecture questions score higher than those who do not
• What is meant by this?
All students answering online lecture questions score higher than those who do not?
• Probably not, the data is variable, there are other factors:
• Attention level of each student
• Difficulty of the lecture
• If the questions were answered seriously
• We need statistics to abstract away from the variability of the observations (i.e. unsystematic variation)

## Hypothesis testing (4)

• Students answering online lecture questions score higher than those who do not
• Meaning:
• Not: All students answering online lecture questions score higher than those who do not
• But: On average, students answering online lecture questions score higher than those who do not

## Testing a hypothesis using a sample (1)

• On average, students answering online lecture questions score higher than those who do not
• This hypothesis must be studied on the basis of a sample, i.e. a limited number of students following a course with online lecture questions
• Of course we're interested in the population, i.e. all students who followed a course with online lecture questions

## Testing a hypothesis using a sample (2)

• The hypothesis concerns the population, but it is studied through a representative sample
• Students answering online lecture questions score higher than those who do not
(study based on 30 students who answered the questions and 30 who did not)
• Women have higher English proficiency than men
(study based on 40 men and 40 women)
• Nouns take longer to read than verbs
(studied on the basis of 35 people's reading of 100 nouns and verbs)

## Analysis: when is a difference real?

• Given a testable hypothesis:
Students answering online lecture questions score higher than those who do not
• You collect the final course grade for 30 randomly selected students who answered the online questions and 30 who did not
• Will any difference in average grade (in the right direction) be proof?
• Probably not: very small differences might be due to chance (unsystematic variation)
• Therefore we use statistics to analyze the results
• Statistically significant results are those unlikely to be due to chance

## Comparing a sample to population: $$z$$-test

• $$z$$-test allows assessing difference between sample and population
• $$\mu$$ and $$\sigma$$ for the population should be known (standardized tests: e.g., IQ test)
• Sample mean $$m$$ is compared to population mean $$\mu$$

## Example of $$z$$-test

• You think Computer Assisted Language Learning may be effective for kids
• You give a standard test of language proficiency ($$\mu$$ = 70, $$\sigma$$ = 14) to 49 randomly chosen children who followed a CALL program
• You find $$m$$ = 74
• You calculate SE = $$\sigma/\sqrt{n} = 14/\sqrt{49} = 2$$
• 74 is 2 SE above the population mean: at the 97.5th percentile

## Conclusions of $$z$$-test

• Group with CALL scored 2 SE above mean ($$z$$-score of 2)
• Chance of this (or more extreme score) is only 2.5%, so very unlikely that this is due to chance
• Conclusion: CALL programs are probably helping
• However, it is also possible that CALL is not helping, but the effect is caused by some other factor
• Such as the sample including many proficient kids
• This is a confounding factor: an influential hidden variable (a variable not used in a study)

## Importance of sample size

• Suppose we had used 9 children instead of 49: at what percentile would a sample mean of $$m$$ = 74 be?
• SE = $$\sigma/\sqrt{n} = 14/\sqrt{9} \approx 4.7$$
• $$m$$ = 74 is less than 1 SE above the mean, i.e. at less than the 84th percentile
• Sample means of at least this value are found by chance more than 16% of the time: not enough reason to suspect a CALL effect

## $$z$$-test in R

```r
sigma <- 14; mu <- 70; m <- 74; n <- 9

(se <- sigma/sqrt(n))
# [1] 4.67

(zval <- (m - mu)/se)
# [1] 0.857

pnorm(zval)  # yields percentile: p(z < zval)
# [1] 0.804
```


## Statistical reasoning: two hypotheses (1)

• Rather than one hypothesis, we create two hypotheses about the data:
• The null hypothesis ($$H_0$$) and the alternative hypothesis ($$H_a$$)
• The null hypothesis states that there is no relationship between two measured phenomena (e.g., CALL program and test score), while the alternative hypothesis states there is

## Statistical reasoning: two hypotheses (2)

• For the CALL example (49 children):
• $$H_0$$: $$\mu_{CALL} = 70$$ (the population mean of people using CALL is 70)
• $$H_a$$: $$\mu_{CALL} > 70$$ (the population mean of people using CALL is higher than 70)
• While $$m$$ = 74 suggests that $$H_a$$ is right, this might be due to chance, so we need enough evidence (i.e. a low SE) to accept it over the null hypothesis
• Logically, $$H_0$$ is the complement of $$H_a$$, so we would expect $$H_0$$: $$\mu_{CALL} \leq 70$$, but '$$=$$' is the usual formulation

## Statistical reasoning (1)

$$H_0$$: $$\mu_{CALL} = 70$$              $$H_a$$: $$\mu_{CALL} > 70$$

• The reasoning goes as follows:
• Suppose $$H_0$$ is true, what is the chance $$p$$ of observing a sample with $$m \geq$$ 74?
• To determine this, we convert 74 to a $$z$$-score: $$z = (m - \mu)/\textrm{SE} = (74 - 70)/2 = 2$$
• And find the associated $$p$$-value:
```r
1 - pnorm(2)  # pnorm(2) yields p(z < 2) => 1 - pnorm(2) = p(z >= 2)
# [1] 0.0228

pnorm(2, lower.tail = F)  # alternative formulation for p(z >= 2)
# [1] 0.0228
```


## Statistical reasoning (2)

$$H_0$$: $$\mu_{CALL} = 70$$              $$H_a$$: $$\mu_{CALL} > 70$$

• In R we can also calculate the probability directly without conversion to $$z$$-scores by supplying the mean and standard error (sd parameter):
```r
pnorm(74, mean = 70, sd = 2, lower.tail = F)
# [1] 0.0228
```


## Statistical reasoning (3)

$$H_0$$: $$\mu_{CALL} = 70$$              $$H_a$$: $$\mu_{CALL} > 70$$

• $$P(z \geq 2) \approx 0.025$$
• The chance of observing a sample at least this extreme given $$H_0$$ is true is 0.025
• This is the $$p$$-value (measured significance level)
• If $$H_0$$ were correct and kids with CALL exp. had the same language proficiency as others, the observed sample would be expected only 2.5% of the time
• Strong evidence against the null hypothesis

## Statistically significant?

• We have determined $$H_0$$, $$H_a$$ and the $$p$$-value
• The classical hypothesis test assesses how unlikely a sample must be for a test to count as significant
• We compare the $$p$$-value against this threshold significance level or $$\alpha$$-level
• If the $$p$$-value is lower than the $$\alpha$$-level (usually 0.05, but it may be lower as well), we regard the result as significant and reject the null hypothesis

## Statistically significant: summary

• The $$p$$-value is the chance of encountering a sample at least as extreme as the observed one, given that the null hypothesis is true
• The $$\alpha$$-level is the threshold for the $$p$$-value, below which we regard the result as significant
• If result significant, we reject $$H_0$$ and assume $$H_a$$ is true
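As a minimal sketch in R, the full decision rule for the CALL example (all values from the slides, with the conventional $$\alpha$$ = 0.05):

```r
# Decision rule: reject H0 if the p-value is below the alpha-level
p_value <- pnorm(74, mean = 70, sd = 2, lower.tail = FALSE)  # ~0.0228
alpha <- 0.05

p_value < alpha   # TRUE: the result is significant, so we reject H0
```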

## Visualizing the answer to question 7

$$m = 74$$, $$\mu = 70$$, $$\sigma = 14$$, $$n = 49$$, $$\textrm{SE} = 14/\sqrt{49} = 2 \implies z = \frac{m - \mu}{\textrm{SE}} = \frac{74 - 70}{2} = 2$$