Martijn Wieling

University of Groningen

- Descriptive vs. inferential statistics
- Sample vs. population
- (Types of) variables
- Distribution of a variable
- Measures of central tendency
- Standardized scores
- Checking for a normal distribution

- Reasoning about the population using a sample
- Relation between population (mean) and sample (mean)
- Confidence interval for population mean based on sample mean
- Testing a hypothesis about the population using a sample
- One-sided hypothesis vs. two-sided hypothesis

- Statistical significance
- Error types

- Selecting a sample from a population includes an element of chance: which individuals are studied?
- Question of this lecture: **How to reason about the population using a sample?**
- Answered using the **Central Limit Theorem**

- Suppose we gathered many different samples from the population: the distribution of the sample means will then be (approximately) normally distributed, **whatever** the shape of the population distribution, provided the samples are large enough
- The mean of these sample means (\(\bar{x}\)) will be the population mean (\(m_{\bar{x}} = \mu\))
- The standard deviation of the sample means (standard error *SE*) depends on the sample size \(n\) and the population standard deviation \(\sigma\): *SE* \(= s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)
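
The Central Limit Theorem can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not part of the lecture: the population is exponential with rate 1 (so \(\mu = 1\) and \(\sigma = 1\)), and the sample size (50), the number of samples (10,000) and the seed are arbitrary choices.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
n <- 50                            # size of each sample (assumption)
reps <- 10000                      # number of samples drawn (assumption)

# Draw many samples from a skewed population (exponential: mu = 1, sigma = 1)
# and store the mean of each sample
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)                 # should be close to mu = 1
sd(sample_means)                   # should be close to sigma / sqrt(n)
1 / sqrt(n)                        # theoretical standard error
```

Plotting `hist(sample_means)` should show a roughly bell-shaped distribution, even though the population itself is strongly skewed.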

- Standard deviation (\(\sigma\)): relates an **individual** to a population
- Standard error (\(\frac{\sigma}{\sqrt{n}}\)): relates a **sample** to a population

- Given that the sample means are normally distributed, \(N(\mu,\sigma/\sqrt{n})\), having one randomly selected sample allows us to reason about the population
- Requirement: the sample is **representative** (unbiased sample)

- Random selection helps avoid bias

- Given a representative sample:
- We estimate the population mean as equal to the sample mean (best guess)
- How certain we are of this estimate depends on the standard error: \(\sigma/\sqrt{n}\)
- Increasing sample size \(n\) reduces uncertainty
- Hard work pays off (in exactness), but it doesn't pay off quickly: uncertainty only shrinks with \(\sqrt{n}\)
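
To make the \(\sqrt{n}\) relation concrete, here is a minimal sketch with hypothetical values (\(\sigma = 1\) and three arbitrary sample sizes): quadrupling the sample size only halves the standard error.

```r
sigma <- 1                         # hypothetical population standard deviation
n <- c(25, 100, 400)               # hypothetical sample sizes, each 4x the previous
se <- sigma / sqrt(n)              # standard error for each sample size
data.frame(n, se)                  # se: 0.20, 0.10, 0.05
```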

- Sample means are normally distributed (CLT):
- We can relate a **sample** mean to the **population** mean by using characteristics of the normal distribution

- Increasing sample size \(n\) reduces uncertainty

- We know the probability of a sample mean \(\bar{x}\) having a value close to the population mean \(\mu\):

\(P(\mu - SE \leq \bar{x} \leq \mu + SE) \approx 68\%\) (34 + 34)

\(P(\mu - 2SE \leq \bar{x} \leq \mu + 2SE) \approx 95\%\) (34 + 34 + 13.5 + 13.5)

\(P(\mu - 3SE \leq \bar{x} \leq \mu + 3SE) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
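
These percentages follow directly from the standard normal distribution. A short check in `R` (the variable names are just for illustration):

```r
k <- 1:3                           # number of standard errors around the mean
coverage <- pnorm(k) - pnorm(-k)   # P(-k <= z <= k) for a standard normal
round(coverage, 4)
# [1] 0.6827 0.9545 0.9973
```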

- Sample means can be related to the population in two ways:
- Using a **confidence interval**: an interval calculated in such a way that a large proportion of the intervals calculated this way contain the true population mean
- Using a **hypothesis test**: tests whether a hypothesis about the population is compatible with the sample result

**Definition**: there is an \(x\)% probability that when computing an \(x\)% confidence interval (CI) on the basis of a sample, it contains \(\mu\)

- The CI can be seen as an estimate of plausible values of \(\mu\)
- (For those who are interested: there is a lot of confusion about interpreting CIs)

- Consider the following example: *You want to know how many hours per week a student of the university spends speaking English. The standard deviation \(\sigma\) for the university is 1 hr/wk*.
- You collect data from 100 randomly chosen students
- You calculate the sample mean \(m = 5\) hr/wk (N.B. in my notation: \(m\) = \(\bar{x}\))
- You therefore estimate the population mean \(\mu = 5\) hr/wk and *SE* \(= 1/\sqrt{100} = 0.1\) hr/wk

- What is the 95% confidence interval (CI) of the mean?

- According to the CLT, the sample means are normally distributed

- 95% of the sample means lie within \(\mu \pm\) 2 *SE*, so for 95% of samples \(\mu\) lies within \(m \pm\) 2 *SE*
- (Strictly, it is \(m \pm\) 1.96 *SE*, but we round this to \(m \pm\) 2 *SE*)
- With \(m\) = 5 and *SE* = 0.1, the 95% CI is 5 \(\pm\) 2 \(\times\) 0.1 = (4.8 hr/wk, 5.2 hr/wk)
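
The same interval can be computed in `R`. The sketch below uses the exact critical value from `qnorm()` rather than the rounded value of 2, so the bounds differ slightly from the hand calculation; the variable names are illustrative.

```r
m <- 5; sigma <- 1; n <- 100       # values from the example
se <- sigma / sqrt(n)              # standard error: 0.1
zcrit <- qnorm(0.975)              # exact two-sided 95% critical value (about 1.96)
ci <- c(m - zcrit * se, m + zcrit * se)
round(ci, 3)
# [1] 4.804 5.196
```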

- We often use samples to test hypotheses about populations
- Examples of hypotheses:
- *Answering online lecture questions is related to the course grade*
- *Women and men differ in their English proficiency*
- *Nouns take longer to read than verbs*

- Testing these hypotheses requires **empirical** and **variable** data
- Empirical: based on observation rather than theory alone
- Variable: individual cases vary

- Hypotheses can be derived from theory, but also from observations if theory is incomplete

- We start from a research question: *Is answering online lecture questions related to the course grade?*
- Which we then formulate as a hypothesis (i.e. a statement): *Answering online lecture questions is related to the course grade*
- For statistics to be useful, this needs to be translated to a concrete form: *Students answering online lecture questions score higher than those who do not*

*Students answering online lecture questions score higher than those who do not*

- What is meant by this?
- ***All** students answering online lecture questions score higher than those who do not*?
- Probably not: the data are variable, and there are other factors:
- Attention level of each student
- Difficulty of the lecture
- Whether the questions were answered seriously
- We need statistics to abstract away from the variability of the observations (i.e. unsystematic variation)

*Students answering online lecture questions score higher than those who do not*

- Meaning:
- Not: ***All** students answering online lecture questions score higher than those who do not*
- But: ***On average**, students answering online lecture questions score higher than those who do not*

*On average, students answering online lecture questions score higher than those who do not*

- This hypothesis **must** be studied on the basis of a sample, i.e. a limited number of students following a course with online lecture questions
- Of course we're interested in the population, i.e. all students who followed a course with online lecture questions

- The hypothesis concerns the population, but it is studied through a **representative sample**
- *Students answering online lecture questions score higher than those who do not* (study based on 30 students who answered the questions and 30 who did not)
- *Women have higher English proficiency than men* (study based on 40 men and 40 women)
- *Nouns take longer to read than verbs* (study based on 35 people's reading of 100 nouns and verbs)

- Given a testable hypothesis: *Students answering online lecture questions score higher than those who do not*
- You collect the final course grade for 30 randomly selected students who answered the online questions and 30 who did not

- Will any difference in average grade (in the right direction) be proof?
- Probably not: very small differences might be due to **chance** (unsystematic variation)
- Therefore we use **statistics** to analyze the results
- **Statistically significant** results are those unlikely to be due to chance

- \(z\)-test allows assessing difference between sample and population
- \(\mu\) and \(\sigma\) for the population should be known (standardized tests: e.g., IQ test)

- Sample mean \(m\) is compared to population mean \(\mu\)

- You think Computer Assisted Language Learning (CALL) may be effective for kids
- You give a standard test of language proficiency (\(\mu\) = 70, \(\sigma\) = 14) to 49 randomly chosen children who followed a CALL program
- You find \(m\) = 74
- You calculate *SE* = \(\sigma/\sqrt{n} = 14/\sqrt{49} = 2\)
- 74 is 2 *SE* above the population mean: at approximately the 97.5th percentile

- The group with CALL scored 2 *SE* above the mean (\(z\)-score of 2)
- The chance of this (or a more extreme score) is only about 2.5%, so it is very unlikely that this is due to chance

- Conclusion: CALL programs are probably helping
- However, it is also possible that CALL is not helping, but the effect is caused by some other factor
- Such as the sample including lots of proficient kids
- This is a **confounding** factor: an influential **hidden** variable (a variable not used in a study)

- Suppose we had used 9 children as opposed to 49: at what percentile would a sample mean of \(m\) = 74 be?
- *SE* = \(\sigma/\sqrt{n} = 14/\sqrt{9} \approx 4.7\)
- \(m\) = 74 is less than 1 *SE* above the mean, i.e. at less than the 84th percentile
- Sample means at least this high occur by chance more than 16% of the time (i.e. likely due to chance): not enough reason to suspect an effect of CALL

```
sigma <- 14; mu <- 70; m <- 74; n <- 9
```

```
(se <- sigma/sqrt(n))
```

```
# [1] 4.67
```

```
(zval <- (m - mu)/se)
```

```
# [1] 0.857
```

```
pnorm(zval) # yields percentile: p(z < zval)
```

```
# [1] 0.804
```

- Rather than one hypothesis, we create **two hypotheses** about the data:
- The null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_a\))
- The null hypothesis states that there is no relationship between two measured phenomena (e.g., CALL program and test score), while the alternative hypothesis states there is

- For the CALL example (49 children):
- \(H_0\): \(\mu_{CALL} = 70\) (the population mean of people using CALL is 70)
- \(H_a\): \(\mu_{CALL} > 70\) (the population mean of people using CALL is higher than 70)
- While \(m\) = 74 suggests that \(H_a\) is right, this might be due to chance, so we would need enough evidence (i.e. low *SE*) to accept it over the null hypothesis
- Logically, \(H_0\) is the inverse of \(H_a\), and we'd expect \(H_0\): \(\mu_{CALL} \leq 70\), but we usually see '\(=\)' in formulations

\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)

- The reasoning goes as follows:
- Suppose \(H_0\) is true, what is the chance \(p\) of observing a sample with \(m \geq\) 74?
- To determine this, we convert 74 to a \(z\)-score: \(z = (m - \mu) / \textrm{SE} = (74-70)/2 = 2\)
- And find the associated \(p\)-value:

```
1 - pnorm(2) # pnorm(2) yields p(z < 2) => 1 - pnorm(2) = p(z >= 2)
```

```
# [1] 0.0228
```

```
pnorm(2, lower.tail = FALSE) # alternative formulation for p(z >= 2)
```

```
# [1] 0.0228
```

\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)

- In `R` we can also calculate the probability directly without conversion to \(z\)-scores by supplying the mean and standard error (`sd` parameter):

```
pnorm(74, mean = 70, sd = 2, lower.tail = FALSE)
```

```
# [1] 0.0228
```

\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)

- \(P(z \geq 2) \approx 0.025\)
- The chance of observing a sample at least this extreme given \(H_0\) is true is 0.025
- This is the \(p\)-value (measured significance level)
- If \(H_0\) were correct and kids with CALL experience had the same language proficiency as others, a sample mean at least this high would be expected only about 2.5% of the time
- Strong evidence **against** the null hypothesis

- We have determined \(H_0\), \(H_a\) and the \(p\)-value
- The classical hypothesis test assesses how **unlikely** a sample must be for a test to count as significant
- We compare the \(p\)-value against this threshold significance level or \(\alpha\)-level
- If the \(p\)-value is **lower** than the \(\alpha\)-level (usually 0.05, but it may be lower as well), we regard the result as significant and reject the null hypothesis

- The \(p\)-value is the chance of encountering a sample at least as extreme as the one observed, given that the null hypothesis is true
- The \(\alpha\)-level is the threshold for the \(p\)-value, below which we regard the result as significant
- If the result is significant, we reject \(H_0\) and assume \(H_a\) is true

\(m = 74\), \(\mu = 70\), \(\sigma = 14\), \(n = 49\), \(\textrm{SE} = 14/\sqrt{49} = 2 \implies z = \frac{m - \mu}{\textrm{SE}} = \frac{74 - 70}{2} = 2\)

- Specify \(H_0\) and \(H_a\)
- Specify test statistic (e.g., mean) and underlying distribution (assuming \(H_0\))
- Specify the \(\alpha\)-level at which \(H_0\) will be rejected
- Determine the value of the statistic (e.g., mean) on the basis of a sample
- Calculate the \(p\)-value and compare to \(\alpha\)
- \(p\)-value \(< \alpha\): reject \(H_0\) (significant result)
- \(p\)-value \(\geq \alpha\): do not reject \(H_0\) (non-significant result)
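
The steps above can be sketched as a small `R` function. This is only a minimal illustration for the one-sided case \(H_a\): \(\mu > \mu_0\) with known \(\sigma\); the function name `z_test_upper` and its arguments are hypothetical and not part of the lecture or of base `R`.

```r
# Hypothetical helper: one-sided z-test for H_a: mu > mu0 (sigma known)
z_test_upper <- function(m, mu0, sigma, n, alpha = 0.05) {
  se <- sigma / sqrt(n)                    # standard error of the mean
  z <- (m - mu0) / se                      # test statistic, assuming H_0
  p <- pnorm(z, lower.tail = FALSE)        # p-value: P(Z >= z)
  list(z = z, p = p, reject = p < alpha)   # decision at the chosen alpha-level
}

# CALL example: m = 74, mu = 70, sigma = 14, n = 49
res <- z_test_upper(m = 74, mu0 = 70, sigma = 14, n = 49)
res$z                                      # 2
round(res$p, 4)                            # 0.0228
res$reject                                 # TRUE: significant at alpha = 0.05
```

Calling the function with `alpha = 0.01`, or with `n = 9`, leads to a non-significant result, in line with the examples in this lecture.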

- Critical values: those values of the sample statistic resulting in a rejection of \(H_0\)
- E.g., if \(\alpha\) is set at 0.05, the critical region consists of the \(z\)-values with \(P(Z \geq z) < 0.05\), i.e. \(z \geq 1.64\)
- We can transform this to raw values using the \(z\) formula \[\begin{aligned} z &= (x-\mu)/SE\\ 1.64 &= (x-70)/2\\ 3.3 &\approx x-70\\ x &\approx 73.3 \end{aligned}\]
- Thus a sample mean of at least 73.3 will result in rejection of \(H_0\)

```
# critical z-value
qnorm(p = 0.05, lower.tail = FALSE)
```

```
# [1] 1.64
```

```
# critical value
qnorm(p = 0.05, mean = 70, sd = 2, lower.tail = FALSE)
```

```
# [1] 73.3
```

- The CALL example is a \(z\)-test: based on a normal distribution with known \(\mu\) and \(\sigma\)
- On the basis of the sample mean \(m\), we calculate the \(z\)-value: \(z = (m - \mu) / (\sigma / \sqrt{n})\)
- We obtain the \(p\)-value linked with the \(z\)-value and compare that to the \(\alpha\)-level

- There are different forms of \(z\)-tests:
- \(H_a\) predicts high \(m\): CALL improves language ability
- \(H_a\) predicts low \(m\): Eating broccoli lowers cholesterol levels

- Sometimes \(H_a\) might predict not lower or higher, but just **different**
- For example, you use a test for aphasia in the Netherlands (NL) that was developed in the UK
- The developers claim that for non-aphasics, the distribution is \(N(100,10)\)
- You specify \(H_0\): \(\mu = 100\) and \(H_a\): \(\mu \neq 100\)

- With a significance level \(\alpha\) of 0.05, both very high (2.5% highest) **and** very low (2.5% lowest) values give reason to reject \(H_0\)

- Consider a sample of 81 Dutch people who took the UK aphasia test, \(N(100,10)\)
- The mean score of the test in the sample is 98
- Is there reason to believe the Dutch population differs from the UK population?

```
pnorm(98, mean = 100, sd = 10/sqrt(81))
```

```
# [1] 0.0359
```

- Two-sided test: reject \(H_0\) for 2.5% lowest and 2.5% highest values (when \(\alpha\) = 0.05)
- The lower-tail probability (0.0359) is larger than 0.025: \(H_0\) not rejected

- With a one-sided test (\(H_a\): \(\mu < 100\)), \(H_0\) would have been rejected (\(p\) < 0.05)
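
An equivalent way to run the two-sided test is to double the tail probability and compare it to \(\alpha\) itself. A minimal sketch using the numbers of the aphasia example (the variable names are illustrative):

```r
m <- 98; mu <- 100; sigma <- 10; n <- 81
z <- (m - mu) / (sigma / sqrt(n))          # z-score of the sample mean: -1.8
p_lower <- pnorm(z)                        # one tail: P(Z <= z)
p_two_sided <- 2 * pnorm(-abs(z))          # both tails together
round(c(p_lower, p_two_sided), 4)
# [1] 0.0359 0.0719
```

The two-sided \(p\)-value is larger than 0.05, which is the same conclusion as comparing the one-tail probability to 0.025.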

- Statistical significance and a confidence interval (CI) are linked
- A 95% CI based on the sample mean \(m\) represents the values for \(\mu\) for which the difference between \(\mu\) and \(m\) is not significant (at the 0.05 significance threshold for a two-sided test)
- A value outside of the CI indicates a statistically significant difference

```
mu <- 100
se <- 10 / sqrt(81)
(conf <- c(mu - 2 * se, mu + 2 * se)) # 95% CI: 2 SE below and above mean
```

```
# [1] 97.8 102.2
```

- The value 98 lies within the 95% confidence interval: not significant at \(\alpha\) = 0.05 for a two-tailed test
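
The same conclusion follows when the interval is centered on the sample mean \(m\) and we check whether it contains the hypothesized \(\mu\). The sketch below assumes the exact critical value from `qnorm()` instead of the rounded value of 2; the names `mu0` and `ci` are illustrative.

```r
m <- 98; mu0 <- 100                        # sample mean and hypothesized population mean
se <- 10 / sqrt(81)
zcrit <- qnorm(0.975)                      # about 1.96
ci <- c(m - zcrit * se, m + zcrit * se)    # 95% CI around the sample mean
round(ci, 1)
# [1]  95.8 100.2
mu0 >= ci[1] && mu0 <= ci[2]               # TRUE: 100 lies inside, so not significant
```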

- Recall our CALL example: \(H_0\): \(\mu_{CALL} = 70\), \(H_a\): \(\mu_{CALL} > 70\)
- With a sample of 49, the sample mean \(m\) was 74, with a \(p\)-value of \(\approx\) 0.025 (one-tailed)
- This was significant at the \(\alpha\)-level of 0.05, but not 0.01

- If you are certain about \(m\) = 74 and wanted significance at the 0.01 \(\alpha\)-level, you **could** increase the sample size

- Suppose the statistic (\(m = 74\)) stayed the same with a sample size \(n\) of 100
- The standard error would be \(14 / \sqrt{100} = 1.4\)
- And the resulting \(p\)-value would be:

```
pnorm(74, mean = 70, sd = 1.4, lower.tail = FALSE)
```

```
# [1] 0.00214
```

- So with a sample size of 100, the result would be significant at the \(\alpha\) = 0.01 level
- Would it make sense to collect the additional data?

- Is it sensible to collect the extra data to "push" a result to significance?
- No. At least, usually not.

- The real result (effect size) is the difference (4 points), nearly 0.3 \(\sigma\) (4 / 14)

- "Statistically significant" implies that an effect probably is not due to chance, but the effect can be **very small**
- If you want to know whether you should buy CALL software to learn a language, statistical significance does not tell you this
- This is a two-edged sword: if an effect was not statistically significant, it does not mean nothing important is going on
- You are just not sure: it could be a chance effect
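
The difference between significance and effect size can be illustrated with the CALL numbers. In this sketch the sample mean is assumed to stay at 74 for every sample size, which is a simplification for illustration only.

```r
mu <- 70; sigma <- 14; m <- 74
n <- c(9, 49, 100)                               # sample sizes used in this lecture
p <- pnorm(m, mean = mu, sd = sigma / sqrt(n),
           lower.tail = FALSE)                   # one-sided p-value per sample size
effect <- (m - mu) / sigma                       # effect size in units of sigma
round(p, 3)                                      # 0.196 0.023 0.002
round(effect, 2)                                 # 0.29, regardless of n
```

The \(p\)-value shrinks as \(n\) grows, while the effect itself (4 points, about 0.3 \(\sigma\)) stays exactly the same.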

- **Garbage in, garbage out**: statistics won't help an experiment with a poor design, or where data was poorly collected
- **No significance hunting**: hypotheses should be formulated before data collection and analysis
- Modern danger: if there are many potential variables, it is *likely* that a few turn out to be significant
- Specific tests are necessary to correct for this
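
A rough calculation shows why testing many variables is dangerous. The sketch below assumes that all null hypotheses are true and that the tests are independent; the numbers of tests are hypothetical. The last line shows the Bonferroni correction, one common (conservative) way to correct for multiple tests, named here as an example rather than as the method intended in the lecture.

```r
alpha <- 0.05
k <- c(1, 5, 20)                       # hypothetical numbers of independent tests
fwer <- 1 - (1 - alpha)^k              # P(at least one false positive)
round(fwer, 3)
# [1] 0.050 0.226 0.642
alpha / k                              # Bonferroni-corrected alpha-level per test
```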

- A statistical hypothesis concerns a population (not a sample!) and involves a statistic (such as mean, frequency, etc.)
- Population: all students attending a course using online lecture questions
- Parameter (statistic): (average) course performance
- Hypothesis: average performance of students answering online lecture questions is higher than those who do not

- **Alternative hypothesis** \(H_a\) (original hypothesis) is contrasted with **null hypothesis** \(H_0\) (hypothesis that nothing out of the ordinary is going on)
- \(H_a\): average performance of students answering online lecture questions higher
- \(H_0\): answering online lecture questions does not impact performance

- Logically \(H_0\) should imply \(\neg H_a\)

Of course, you could be wrong (e.g., due to an unrepresentative sample)!

| \(H_0\) | true | false |
|---|---|---|
| accepted | correct | type II error |
| rejected | type I error | correct |

- Hypothesis testing focuses on **type I errors**
- \(p\)-value: chance of a result at least this extreme when \(H_0\) is true, i.e. the risk of a type I error if we reject \(H_0\)
- \(\alpha\)-level: boundary of acceptable level of type I error
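
That the \(\alpha\)-level bounds the type I error rate can be checked by simulation. The sketch below assumes \(H_0\) is true for the CALL setting (\(\mu = 70\), \(\sigma = 14\), \(n = 49\)); the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)                                        # arbitrary seed
mu <- 70; sigma <- 14; n <- 49; alpha <- 0.05
reps <- 10000                                      # number of simulated studies

# Simulate sample means for studies in which H_0 is true
m_sim <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
z <- (m_sim - mu) / (sigma / sqrt(n))              # z-score of each simulated mean
reject <- z >= qnorm(alpha, lower.tail = FALSE)    # one-sided test at alpha
mean(reject)                                       # proportion of type I errors, close to alpha
```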

- Type II errors (not covered further in this course)
- \(\beta\): chance of type II error
- \(1 - \beta\): power of statistical test
- More sensitive (and useful) tests have more power to detect an effect

- False positive: incorrect positive (accepting \(H_a\)) result
- False negative: incorrect negative (not rejecting \(H_0\)) result

- Results with \(p = 0.051\) are not very different from \(p = 0.049\), but we need a boundary
- An \(\alpha\)-level of \(0.05\) is low as the "burden of proof" is on the alternative

- If \(p = 0.051\) we haven't **proven** \(H_0\), only failed to show that it's really wrong
- This is called "retaining \(H_0\)"

- In this lecture, we've covered
- the difference between the population and a sample
- how to calculate a confidence interval
- how to specify a concrete testable hypothesis based on a research question
- how to specify the null hypothesis
- how to conduct a \(z\)-test and use the results to evaluate a hypothesis
- what statistical significance entails
- how to evaluate if a result is statistically significant given a specific \(\alpha\)-level
- the difference between a one-tailed and a two-tailed test
- the different error types

- **For practice**: http://eolomea.let.rug.nl/Statistiek-I/HC3 (login: s-nr, lowercase s!)
- Next lecture: **\(t\)-tests**

Thank you for your attention!