Statistiek I

Sampling

Martijn Wieling

Question 1: last lecture

Last lecture

  • Descriptive vs. inferential statistics
  • Sample vs. population
  • (Types of) variables
  • Distribution of a variable
  • Measures of central tendency and spread
  • Standardized scores
  • Checking for a normal distribution

This lecture

  • Reasoning about the population using a sample
    • Relation between population (mean) and sample (mean)
    • Confidence interval for population mean based on sample mean
    • Testing a hypothesis about the population using a sample
      • One-sided hypothesis vs. two-sided hypothesis
    • Statistical significance
    • Error types

Introduction

  • Selecting a sample from a population includes an element of chance: which individuals are studied?
  • Question of this lecture: How to reason about the population using a sample?
    • Answered using the Central Limit Theorem

Central Limit Theorem

  • Suppose we gathered many different samples from the population; the sample means will then be (approximately) normally distributed, provided the samples are sufficiently large
    • The mean of these sample means (\(\bar{x}\)) will be the population mean (\(m_{\bar{x}} = \mu\))
    • The standard deviation of the sample means (i.e. standard error \(\textit{SE}\)) is dependent on the sample size \(n\) and the population standard deviation \(\sigma\) : \(\textit{SE} = s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)
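
The two claims above can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the skewed (exponential) population with \(\mu = 1\) and \(\sigma = 1\), the sample size of 50 and the number of repetitions are all assumptions chosen for illustration.

```r
set.seed(1)                        # fixed seed so the sketch is reproducible
n <- 50                            # assumed sample size
mu <- 1; sigma <- 1                # mean and sd of an exponential population with rate 1

# draw 10,000 samples of size n and store the mean of each sample
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

mean(sample_means)                 # should be close to mu
sd(sample_means)                   # should be close to the standard error
sigma / sqrt(n)                    # theoretical standard error
hist(sample_means)                 # roughly bell-shaped despite the skewed population
```

Even though individual observations are strongly skewed here, the histogram of the sample means should look close to a normal distribution.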

Standard deviation vs. standard error

  • Standard deviation of population (\(\sigma\)):
    • Relate individual to population
  • Standard deviation of sample means = standard error (\(\sigma / \sqrt{n}\))
    • Relate sample to population

Question 2

Reasoning about the population (1)

  • Given that the distribution of sample means is normally distributed \(N(\mu,\sigma/\sqrt{n})\), having one randomly selected sample allows us to reason about the population
  • Requirement: sample is representative (unbiased sample)
    • Random selection helps avoid bias

Question 3

Reasoning about the population (2)

  • Given a representative sample:
    • We estimate the population mean as equal to the sample mean (best guess)
    • How certain we are of this estimate depends on the standard error: \(\sigma/\sqrt{n}\)
      • Increasing sample size \(n\) reduces uncertainty
        • Hard work pays off (in exactness), but it doesn’t pay off quickly: the standard error shrinks only with \(\sqrt{n}\)
      • Sample means are normally distributed (CLT):
        • We can relate a sample mean to the population mean by using characteristics of the normal distribution

Normal distribution

  • We know the probability of a sample mean \(\bar{x}\) having a value close to the population mean \(\mu\):

\(P(\mu - SE \leq \bar{x} \leq \mu + SE) \approx 68\%\)     (34 + 34)
\(P(\mu - 2SE \leq \bar{x} \leq \mu + 2SE) \approx 95\%\)     (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3SE \leq \bar{x} \leq \mu + 3SE) \approx 99.7\%\)     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
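
These percentages follow directly from the standard normal distribution. As a small sketch (the helper name `within_k` is made up for this illustration), they can be reproduced with `pnorm`:

```r
# probability that a normally distributed value lies within k SE of the mean
within_k <- function(k) pnorm(k) - pnorm(-k)

within_k(1)   # approx. 0.683
within_k(2)   # approx. 0.954
within_k(3)   # approx. 0.997
```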

Reasoning about the population (3)

  • Sample means can be related to the population in two ways:
    • Using a confidence interval
      • An interval which is calculated in such a way that a large proportion of the calculated intervals contains the true population mean
    • Using a hypothesis test
      • Tests if hypothesis about population is compatible with sample result

Confidence interval

  • Definition: when an \(x\)% confidence interval (CI) is computed on the basis of a random sample, there is an \(x\)% probability that the interval contains \(\mu\)
  • The CI can be seen as an estimate of plausible values of \(\mu\)
    • (For those who are interested: there is much confusion about interpreting CIs)
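
The definition refers to the procedure, not to a single interval. A simulation can make this concrete. The sketch below assumes a hypothetical normal population (\(\mu = 10\), \(\sigma = 2\)) and samples of size 25; none of these values come from the lecture.

```r
set.seed(42)                       # fixed seed so the sketch is reproducible
mu <- 10; sigma <- 2; n <- 25      # hypothetical population and sample size
se <- sigma / sqrt(n)
z <- qnorm(0.975)                  # about 1.96 for a 95% interval

# repeat the procedure many times: sample, compute the CI, check whether it contains mu
covered <- replicate(10000, {
  m <- mean(rnorm(n, mean = mu, sd = sigma))
  (m - z * se <= mu) && (mu <= m + z * se)
})

mean(covered)                      # proportion of intervals containing mu, expected near 0.95
```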

Confidence interval: example (1)

  • Consider the following example:
    You want to know how many hours per week a student of the university spends speaking English. The standard deviation \(\sigma\) for the university is 1 hr/wk.
    • You collect data from 100 randomly chosen students
    • You calculate the sample mean \(m = 5\) hr/wk (note: \(m\) = \(\bar{x}\))
    • You therefore estimate the population mean \(\mu = 5\) hr/wk and standard error \(\textit{SE} = 1/\sqrt{100} = 0.1\) hr/wk
  • What is the 95% confidence interval (CI) of the mean?

Confidence interval: example (2)

  • According to the CLT, the sample means are normally distributed

  • 95% of the sample means lie within \(\mu \pm 2\,\textit{SE}\), so for 95% of the samples the interval \(m \pm 2\,\textit{SE}\) contains \(\mu\)
    • (Actually it is \(m \pm 1.96\,\textit{SE}\), but we round this to \(m \pm 2\,\textit{SE}\))
  • With \(m\) = 5 and \(\textit{SE}\) = 0.1, 95% CI is 5 \(\pm\) 2 \(\times\) 0.1 = (4.8 hr/wk, 5.2 hr/wk)
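
For comparison, the same interval can be computed without rounding 1.96 to 2. This is a short sketch using the values of the example; the variable names are chosen for this illustration.

```r
m <- 5; sigma <- 1; n <- 100       # values from the example
se <- sigma / sqrt(n)              # 0.1
z <- qnorm(0.975)                  # about 1.96: cuts off 2.5% in each tail
(ci <- c(m - z * se, m + z * se))  # roughly 4.80 to 5.20 hr/wk
```

The result is practically identical to the rounded interval (4.8, 5.2).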

Question 4

Hypotheses

  • Instead of computing confidence intervals, we often use samples to test hypotheses about populations
  • Examples of hypotheses
    • Answering online lecture questions is related to the course grade
    • Females and males differ in their English proficiency
    • Nouns take longer to read than verbs

Hypothesis testing (1)

  • Testing these hypotheses requires empirical and variable data
    • Empirical: based on observation rather than theory alone
    • Variable: individual cases vary
  • Hypotheses can be derived from theory, but also from observations if theory is incomplete

Hypothesis testing (2)

  • We start from a research question:
    Is answering online lecture questions related to the course grade?
  • Which we then formulate as a hypothesis (i.e. a statement):
    Answering online lecture questions is related to the course grade
  • For statistics to be useful, this needs to be translated to a concrete form:
    Students answering online lecture questions score higher than those who do not

Hypothesis testing (3)

  • Students answering online lecture questions score higher than those who do not
  • What is meant by this?
    All students answering online lecture questions score higher than those who do not?
    • Probably not, as the data is variable and there are other factors:
      • Attention level of each student
      • Difficulty of the lecture
      • Whether the questions were answered seriously
  • We need statistics to abstract away from the variability of the observations (i.e. unsystematic variation)

Hypothesis testing (4)

  • Students answering online lecture questions score higher than those who do not
  • Meaning:
    • Not: All students answering online lecture questions score higher than those who do not
    • But: On average, students answering online lecture questions score higher than those who do not

Testing a hypothesis using a sample (1)

  • On average, students answering online lecture questions score higher than those who do not
  • This hypothesis must be studied on the basis of a sample, i.e. a limited number of students following a course with online lecture questions
    • Of course we’re interested in the population, i.e. all students who followed a course with online lecture questions

Testing a hypothesis using a sample (2)

  • The hypothesis concerns the population, but it is studied through a representative sample
    • Students answering online lecture questions score higher than those who do not
      (study based on 30 students who answered the questions and 30 who did not)
    • Females have higher English proficiency than males
      (study based on 40 males and 40 females)
    • Nouns take longer to read than verbs
      (studied on the basis of 35 people’s reading of 100 nouns and verbs)

Question 5

Analysis: when is a difference real?

  • Given a testable hypothesis:
    Students answering online lecture questions score higher than those who do not
    • You collect the final course grade for 30 randomly selected students who answered the online questions and 30 who did not
  • Will any difference in average grade (in the right direction) be proof?
    • Probably not: very small differences might be due to chance (unsystematic variation)
  • Therefore we use statistics to analyze the results
    • Statistically significant results are those unlikely to be due to chance

Comparing a sample to population: \(z\)-test

  • \(z\)-test allows assessing difference between sample and population
    • \(\mu\) and \(\sigma\) for the population should be known (standardized tests: e.g., IQ test)
  • Sample mean \(m\) is compared to population mean \(\mu\)

Example of \(z\)-test

  • You think Computer Assisted Language Learning (CALL) may be effective for kids
  • You give a standard test of language proficiency (\(\mu\) = 70, \(\sigma\) = 14) to 49 randomly chosen children who followed a CALL program
    • You find \(m\) = 74
    • You calculate \(\textit{SE}\) = \(\sigma/\sqrt{n} = 14/\sqrt{49} = 2\)
    • 74 is 2 \(\textit{SE}\) above the population mean: at roughly the 97.5th percentile

Conclusions of \(z\)-test

  • Group with CALL scored 2 \(\textit{SE}\) above mean (\(z\)-score of 2)
    • Chance of this (or more extreme score) is only 2.5%, so very unlikely that this is due to chance
  • Conclusion: CALL programs are probably helping
    • However, it is also possible that CALL is not helping, but the effect is caused by some other factor
      • Such as the sample including many proficient kids
      • This is a confounding factor: an influential hidden variable (a variable not used in a study)

Question 6

Importance of sample size

  • Suppose we would have used 9 children as opposed to 49, at what percentile would a sample mean of \(m\) = 74 be?
    • \(\textit{SE}\) = \(\sigma/\sqrt{n} = 14/\sqrt{9} \approx 4.7\)
    • \(m\) = 74 is less than 1 \(\textit{SE}\) above the mean, i.e. at less than the 84th percentile
      • Sample means of at least this value are found by chance more than 16% of the time: not enough reason to suspect a CALL effect

\(z\)-test in R

sigma <- 14; mu <- 70; m <- 74; n <- 9
(se <- sigma/sqrt(n))
[1] 4.67
(zval <- (m - mu) / se)
[1] 0.857
pnorm(zval) # yields percentile: p(z < zval)
[1] 0.804

Statistical reasoning: two hypotheses (1)

  • Rather than one hypothesis, we create two hypotheses about the data:
    • The null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_a\))
    • The null hypothesis states that there is no relationship between two measured phenomena (e.g., CALL program and test score), while the alternative hypothesis states there is

Statistical reasoning: two hypotheses (2)

  • For the CALL example (49 children):
    • \(H_0\): \(\mu_{CALL} = 70\) (the population mean of people using CALL is 70)
    • \(H_a\): \(\mu_{CALL} > 70\) (the population mean of people using CALL is higher than 70)
    • While \(m\) = 74 suggests that \(H_a\) is right, this might be due to chance, so we would need enough evidence (i.e. low \(\textit{SE}\)) to accept it over the null hypothesis
    • Logically, \(H_0\) is the inverse of \(H_a\), and we’d expect \(H_0\): \(\mu_{CALL} \leq 70\), but we usually see ‘\(=\)’ in formulations

Statistical reasoning (1)

\(H_0\): \(\mu_{CALL} = 70\)              \(H_a\): \(\mu_{CALL} > 70\)

  • The reasoning goes as follows:
    • Suppose \(H_0\) is true, what is the chance \(p\) of observing a sample with \(m \geq\) 74?
    • To determine this, we convert 74 to a \(z\)-score: \(z = (m - \mu) / \textit{SE}\) \(= (74-70)/2 = 2\)
    • And find the associated \(p\)-value:
1 - pnorm(2) # pnorm(2) yields p(z < 2) => 1 - pnorm(2) = p(z >= 2)
[1] 0.0228
pnorm(2, lower.tail = FALSE) # alternative formulation for p(z >= 2)
[1] 0.0228

Statistical reasoning (2)

\(H_0\): \(\mu_{CALL} = 70\)              \(H_a\): \(\mu_{CALL} > 70\)

  • In R we can also calculate the probability directly without conversion to \(z\)-scores by supplying the mean and standard error (sd parameter):
pnorm(74, mean = 70, sd = 2, lower.tail = FALSE)
[1] 0.0228

Statistical reasoning (3)

\(H_0\): \(\mu_{CALL} = 70\)              \(H_a\): \(\mu_{CALL} > 70\)

  • \(P(z \geq 2) \approx 0.025\)
    • The chance of observing a sample at least this extreme given \(H_0\) is true is 0.025
    • This is the \(p\)-value (measured significance level)
    • If \(H_0\) were correct and kids with CALL experience had the same language proficiency as others, the observed sample would be expected only 2.5% of the time
      • Strong evidence against the null hypothesis

Statistically significant?

  • We have determined \(H_0\), \(H_a\) and the \(p\)-value
  • The classical hypothesis test specifies in advance how unlikely a sample must be (given \(H_0\)) for the result to count as significant
  • We compare the \(p\)-value against this threshold significance level or \(\alpha\)-level
  • If the \(p\)-value is lower than the \(\alpha\)-level (usually 0.05, but it may be lower as well), we regard the result as significant and reject the null hypothesis

Statistically significant: summary

  • The \(p\)-value is the chance of encountering a sample at least as extreme as the observed one, given that the null hypothesis is true
  • The \(\alpha\)-level is the threshold for the \(p\)-value, below which we regard the result as significant
    • If the result is significant, we reject \(H_0\) and assume \(H_a\) is true

Question 7

Visualizing the answer to question 7

\(m = 74\), \(\mu = 70\), \(\sigma = 14\), \(n = 49\), \(\textit{SE} = 14/\sqrt{49} = 2 \implies z = \frac{m - \mu}{\textit{SE}} = \frac{74 - 70}{2} = 2\)

Steps for assessing statistical significance

  1. Specify \(H_0\) and \(H_a\)
  2. Specify test statistic (e.g., mean) and underlying distribution (assuming \(H_0\))
  3. Specify the \(\alpha\)-level at which \(H_0\) will be rejected
  4. Determine the value of the statistic (e.g., mean) on the basis of a sample
  5. Calculate the \(p\)-value and compare to \(\alpha\)
    • \(p\)-value \(< \alpha\): reject \(H_0\) (significant result)
    • \(p\)-value \(\geq \alpha\): do not reject \(H_0\) (non-significant result)
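
The five steps can be sketched as a small R function. This is an illustration only: the function name `z_test_upper` and its arguments are made up for this example, and it covers just the case where \(H_a\) predicts a higher mean and \(\sigma\) is known.

```r
# One-sided z-test (H_a: population mean is higher than mu0), assuming known sigma
z_test_upper <- function(m, mu0, sigma, n, alpha = 0.05) {
  se <- sigma / sqrt(n)                  # step 2: distribution of the mean under H_0 is N(mu0, se)
  z <- (m - mu0) / se                    # step 4: standardize the observed sample mean
  p <- pnorm(z, lower.tail = FALSE)      # step 5: chance of a value at least this high under H_0
  list(z = z, p = p, reject_H0 = p < alpha)
}

# CALL example: H_0: mu = 70, H_a: mu > 70, alpha = 0.05 (steps 1 and 3)
(res <- z_test_upper(m = 74, mu0 = 70, sigma = 14, n = 49))
```

For the CALL example this gives \(z = 2\) and \(p \approx 0.0228\), so \(H_0\) is rejected at \(\alpha\) = 0.05.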

Critical values

  • Critical values: those values of the sample statistic resulting in a rejection of \(H_0\)
  • E.g., if \(\alpha\) is set at 0.05 (one-sided test), the critical region consists of the \(z\)-values with \(P(Z \geq z) \leq 0.05\), i.e. \(z \geq 1.64\)
  • We can transform this to raw values using the \(z\) formula
    \(z = (x-\mu)/SE\)
    \(1.64 = (x-70)/2\)
    \(3.3 = x-70\)
    \(x = 73.3\)
  • Thus a sample mean of at least 73.3 will result in rejection of \(H_0\)

Calculating critical values in R

# critical z-value
qnorm(p = 0.05, lower.tail = FALSE)
[1] 1.64
# critical value
qnorm(p = 0.05, mean = 70, sd = 2, lower.tail = FALSE)
[1] 73.3

\(z\)-test

  • The CALL example is a \(z\)-test: based on a normal distribution with known \(\mu\) and \(\sigma\)
  • On the basis of the sample mean \(m\), we calculate the \(z\)-value: \(z = (m - \mu) / (\sigma / \sqrt{n})\)
  • We obtain the \(p\)-value linked with the \(z\)-value and compare that to the \(\alpha\)-level

One-sided \(z\)-test

  • There are different forms of \(z\)-tests:
    • \(H_a\) predicts high \(m\): CALL improves language ability
    • \(H_a\) predicts low \(m\): Eating broccoli lowers cholesterol levels  

Two-sided \(z\)-test (1)

  • Sometimes \(H_a\) might predict not lower or higher, but just different
  • For example, you use a standardized test for aphasia in NL developed in the UK
    • The developers claim that for non-aphasics, the distribution is \(N(100,10)\)
    • You specify \(H_0\): \(\mu = 100\) and \(H_a\): \(\mu \neq 100\)

Two-sided \(z\)-test (2)

  • With a significance level \(\alpha\) of 0.05, both very high (2.5% highest) and very low (2.5% lowest) values give reason to reject \(H_0\)

Two-sided \(z\)-test: example

  • Consider a sample of 81 Dutch people who took the UK aphasia test, \(N(100,10)\)
  • The mean score of the test in the sample is 98
  • Is there reason to believe the Dutch population differs from the UK population?

Two-sided \(z\)-test: calculation

pnorm(98,mean=100,sd=10/sqrt(81))
[1] 0.0359
  • Two-sided test: reject \(H_0\) for 2.5% lowest and 2.5% highest values (when \(\alpha\) = 0.05)
    • (one-tailed) \(p\)-value > 0.025: \(H_0\) not rejected
  • With a one-sided test (\(H_a\): \(\mu < 100\)), \(H_0\) would have been rejected (\(p\) < 0.05)
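
An equivalent way to carry out the two-sided test is to double the one-tailed \(p\)-value and compare it to \(\alpha\) directly. The sketch below uses the values of the aphasia example; the variable names are chosen for this illustration.

```r
mu <- 100; sigma <- 10; n <- 81; m <- 98   # values from the example
se <- sigma / sqrt(n)
z <- (m - mu) / se                         # -1.8
p_one <- pnorm(z)                          # one-tailed (lower tail), approx. 0.0359
p_two <- 2 * pnorm(-abs(z))                # two-tailed, approx. 0.0719
p_two < 0.05                               # FALSE: H_0 is not rejected
```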

Statistical significance and confidence interval

  • Statistical significance and a confidence interval (CI) are linked
  • A 95% CI based on the sample mean \(m\) represents the values for \(\mu\) for which the difference between \(\mu\) and \(m\) is not significant (at the 0.05 significance threshold for a two-sided test)
    • A value outside of the CI indicates a statistically significant difference

Two-sided \(z\)-test: calculation using confidence interval

mu <- 100
se <- 10 / sqrt(81)
(conf <- c(mu - 2 * se, mu + 2 * se)) # 95% CI: 2 SE below and above mean
[1]  97.8 102.2
  • The value 98 lies within this interval (equivalently, 100 lies within the 95% CI around \(m\) = 98): not significant at \(\alpha\) = 0.05 for a two-tailed test
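
The interval can also be centered on the sample mean \(m\), which matches the description of the CI on the previous slide. A minimal sketch, using 1.96 instead of the rounded value 2 (variable names are chosen for this illustration):

```r
m <- 98; mu0 <- 100                        # sample mean and hypothesized population mean
se <- 10 / sqrt(81)
(ci <- m + c(-1, 1) * qnorm(0.975) * se)   # 95% CI around m: roughly 95.8 to 100.2
ci[1] <= mu0 && mu0 <= ci[2]               # TRUE: 100 is a plausible value for mu
```

Both views lead to the same conclusion: the difference between 98 and 100 is not significant at \(\alpha\) = 0.05 (two-tailed).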

Significance and sample size

  • Recall our CALL example: \(H_0\): \(\mu_{CALL} = 70\), \(H_a\): \(\mu_{CALL} > 70\)
  • With a sample of 49, the sample mean \(m\) was 74, with a (one-tailed) \(p\)-value of \(\approx\) 0.025
    • This was significant at the \(\alpha\)-level of 0.05, but not 0.01
  • If you are certain about \(m\) = 74 and wanted significance at the 0.01 \(\alpha\)-level, you could increase the sample size

Chasing significance

  • Suppose the statistic (\(m = 74\)) stayed the same with a sample size \(n\) of 100
  • The standard error would be \(14 / \sqrt{100} = 1.4\)
  • And the resulting \(p\)-value would be:
pnorm(74, mean = 70, sd = 1.4, lower.tail = FALSE)
[1] 0.00214
  • So with a sample size of 100, the result would be significant at the \(\alpha\) = 0.01 level
  • But would it make sense to collect the additional data?

Understanding significance (1)

  • Is it sensible to collect the extra data to “push” a result to significance?
    • No. At least, usually not.
  • The real result (effect size) is the difference (4 points), nearly 0.3 \(\sigma\) (4 / 14)
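
To illustrate the point, the sketch below computes the \(p\)-value for the same sample mean at several sample sizes. The sample size of 400 is a hypothetical addition; the others appear in the lecture.

```r
mu <- 70; sigma <- 14; m <- 74
n <- c(9, 49, 100, 400)                    # 400 is a hypothetical extra sample size
se <- sigma / sqrt(n)
p <- pnorm(m, mean = mu, sd = se, lower.tail = FALSE)  # one-tailed p-value per sample size
effect <- (m - mu) / sigma                 # effect size in units of sigma: about 0.29
data.frame(n, se, p, effect)
```

The \(p\)-value shrinks as \(n\) grows, while the effect size stays exactly the same.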

Understanding significance (2)

  • “Statistically significant” implies that an effect probably is not due to chance, but the effect can be very small
    • If you want to know whether you should buy CALL software to learn a language, statistical significance does not tell you this
    • This cuts both ways: if an effect is not statistically significant, it does not mean nothing important is going on
      • You are just not sure: it could be a chance effect

Question 8

Misuse of significance

  • Garbage in, garbage out: statistics won’t help an experiment with a poor design, or where data was poorly collected
  • No significance hunting: hypotheses should be formulated before data collection and analysis
    • Modern danger: if there are many potential variables, it is likely that a few turn out to be significant
      • Specific tests are necessary to correct for this

Recap: hypothesis testing

  • A statistical hypothesis concerns a population (not a sample!) and involves a statistic (such as mean, frequency, etc.)
    • Population: all students attending a course using online lecture questions
    • Parameter (statistic): (average) course performance
    • Hypothesis: average performance of students answering online lecture questions is higher than of those who do not

Identifying hypotheses

  • Alternative hypothesis \(H_a\) (original hypothesis) is contrasted with null hypothesis \(H_0\) (hypothesis that nothing out of the ordinary is going on)
    • \(H_a\): higher performance for students answering online lecture questions
    • \(H_0\): answering online lecture questions does not impact performance
  • Logically \(H_0\) should imply \(\neg H_a\)

Possible errors

Of course, you could be wrong (e.g., due to an unrepresentative sample)!

\(H_0\)     true           false
accepted    correct        type-II error
rejected    type-I error   correct
  • Hypothesis testing focuses on type-I errors
    • \(p\)-value: chance of a result at least as extreme as the observed one if \(H_0\) is true
    • \(\alpha\)-level: maximum acceptable chance of a type-I error (rejecting a true \(H_0\))
  • Type-II errors
    • \(\beta\): chance of type-II error
    • \(1 - \beta\): power of statistical test
      • More sensitive (and useful) tests have more power to detect an effect
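
As a rough illustration of \(\beta\) and power, the sketch below revisits the CALL example. It assumes, purely hypothetically, that the true population mean for CALL users is 74; the lecture does not state this value.

```r
mu0 <- 70; sigma <- 14; n <- 49; alpha <- 0.05
mu_true <- 74                              # hypothetical true mean under H_a (assumption)
se <- sigma / sqrt(n)

# smallest sample mean that leads to rejecting H_0 (one-sided test)
crit <- qnorm(alpha, mean = mu0, sd = se, lower.tail = FALSE)   # about 73.3

# power: chance of obtaining a sample mean of at least crit if the true mean is mu_true
power <- pnorm(crit, mean = mu_true, sd = se, lower.tail = FALSE)
beta <- 1 - power                          # chance of a type-II error
c(power = power, beta = beta)              # power is roughly 0.64 under these assumptions
```

Under this assumption, roughly one in three studies of this size would fail to detect the effect.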

Possible errors: easier to remember

  • False positive (type-I error): incorrect positive (accepting \(H_a\)) result
  • False negative (type-II error): incorrect negative (not rejecting \(H_0\)) result

How to formulate the results?

  • Results with \(p = 0.051\) are not very different from \(p = 0.049\), but we need a boundary
    • An \(\alpha\)-level of \(0.05\) is low as the “burden of proof” is on the alternative
  • If \(p = 0.051\), we haven’t proven \(H_0\), as we just failed to show that it’s very wrong
    • When we cannot reject \(H_0\), we indicate that we have “retained \(H_0\)”

Note about multiple testing

  • Using multiple tests risks finding significance through sheer chance
  • Suppose you run two tests, always using \(\alpha\) = 0.05
    • Chance of finding one or more significant values (family-wise error rate) is: \(1 - (1 - \alpha)^2\) = \(1 - 0.95^2 = 0.0975\) (almost twice as high as we’d like!)
  • To guarantee a family-wise error rate of 0.05, we should divide \(\alpha\) by the number of tests: Bonferroni correction
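
The growth of the family-wise error rate, and the effect of the Bonferroni correction, can be sketched for several numbers of tests. The calculation assumes independent tests, as in the formula above; the numbers of tests other than 2 are chosen for illustration.

```r
alpha <- 0.05
k <- c(1, 2, 5, 10, 20)                    # number of tests (values other than 2 are illustrative)
fwer <- 1 - (1 - alpha)^k                  # family-wise error rate without correction
alpha_bonf <- alpha / k                    # Bonferroni-corrected alpha per test
fwer_bonf <- 1 - (1 - alpha_bonf)^k        # family-wise error rate with correction
round(data.frame(k, fwer, alpha_bonf, fwer_bonf), 4)
```

With the correction, the family-wise error rate stays at or just below 0.05, whatever the number of tests.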

Recap

  • In this lecture, we’ve covered
    • how to reason about the population using a sample (CLT)
    • how to calculate a confidence interval
    • how to specify a concrete testable hypothesis based on a research question
    • how to specify the null hypothesis
    • how to conduct a \(z\)-test and use the results to evaluate a hypothesis
    • what statistical significance entails
    • how to evaluate if a result is statistically significant given a specific \(\alpha\)-level
    • the difference between a one-tailed and a two-tailed test
    • the different error types
    • risk of multiple testing
  • For practice: https://martijnwieling.shinyapps.io/sampling/
  • Next lecture: introduction to linear regression

Please evaluate this lecture

Exam question

Questions?

Thank you for your attention!

 

https://www.martijnwieling.nl

m.b.wieling@rug.nl