Basic concepts of statistics
Martijn Wieling (University of Groningen)
This lecture
- Descriptive vs. inferential statistics
- Sample vs. population
- (Types of) variables
- Distribution of a variable: central tendency and variation
- Standardized scores
- Checking for a normal distribution
- Reasoning about the population using a sample
- Relation between population (mean) and sample (mean)
- Confidence interval for population mean based on sample mean
- Testing a hypothesis about the population using a sample
- Statistical significance
- Error types
Why use statistics?
- Why use statistics?
- Summarize data (descriptive statistics)
- Assess relationships in data (inferential statistics)
Descriptive vs. inferential statistics
- Descriptive statistics:
- Statistics used to describe (sample) data without further conclusions
- Measures of central tendency: Mean, median, mode
- Measures of variation (or spread): range, IQR, variance, standard deviation
- Inferential statistics:
- Describe data of sample in order to infer patterns in the population
- Statistical tests: \(t\)-test, \(\chi^2\)-test, etc.
Why study a sample?
- Studying the whole population is (frequently) practically impossible
- Sample is a (selected) subset of population and thus more accessible
- Selection of representative sample is very important!
Characterizing nominal variables
Characterizing numerical variables: distribution
- We are generally not interested in individual values of a variable, but rather all values and their frequency
- This is captured by a distribution
- Famous distribution: Normal distribution (“bell-shaped” curve): e.g., IQ scores
Interpreting a density curve
- The total area under a density curve is equal to 1
- A density curve does not provide information about the frequency of one value
- E.g., there might be no one who has a value of exactly 6.1
- It only provides information about an interval
- E.g., more than 50% of the values lie between 5.5 and 7.5
Interpreting a density curve: normal distribution
- The normal distribution has convenient characteristics
- Completely symmetric
- Red area: (about) 68%
- Red and green area: (about) 95%
Characterizing the distribution of numerical variables
- A distribution can also be characterized by measures of center and variation
- (skewness measures the symmetry of the distribution; not covered further)
Characterizing numerical variables: central tendency
- Mode: most frequent element (for nominal data: only meaningful measure)
- Median: when data is sorted from small to large, it is the middle value
- Mean: arithmetical average
\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i\]
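To make these three measures concrete, here is a minimal Python sketch using only the standard library (the scores are hypothetical):

```python
from statistics import mean, median, mode

scores = [4, 6, 6, 7, 8, 9, 10]  # hypothetical sample, already sorted

print(mode(scores))    # 6: the most frequent value
print(median(scores))  # 7: the middle value of the sorted data
print(mean(scores))    # 7.142857...: the arithmetic average
```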
Measure of variation: quantiles
- Quantiles: cutpoints to divide the sorted data in subsets of equal size
- Quartiles: three cutpoints to divide the data in four equal-sized sets
- \(q_1\) (1st quartile): cutpoint between 1st and 2nd group
- \(q_2\) (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
- \(q_3\) (3rd quartile): cutpoint between 3rd and 4th group
- Percentiles: divide data in hundred equal-sized subsets
- \(q_1\) = 25th percentile
- \(q_2\) (= median) = 50th percentile
- A score at the \(n\)th percentile is higher than \(n\)% of scores
Measure of variation: range
- Minimum, maximum: lowest and highest value
- Range: difference between minimum and maximum
- Interquartile range (IQR): \(q_3\) - \(q_1\)
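A short Python sketch of these measures with hypothetical data (`statistics.quantiles` with `n=4` returns the three quartile cutpoints; note that different quantile definitions give slightly different cutpoints):

```python
from statistics import quantiles

scores = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]  # hypothetical sample

q1, q2, q3 = quantiles(scores, n=4)  # three cutpoints -> four equal-sized groups
print(q2)                            # the median
print(max(scores) - min(scores))     # range: 10
print(q3 - q1)                       # interquartile range (IQR)
```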
Visualizing variation: box plot (box-and-whisker plot)
- A box plot is used to visualize variation of a variable
- Box (IQR): \(q_1\) (bottom), median (thickest line), \(q_3\) (top)
- (In example below, \(q_1\) and median have the same value)
- Whiskers: maximum (top) and minimum (bottom) non-outlier value
- Circle(s): outliers (> 1.5 IQR distance from box)
Important measure of variation: variance
- Deviation: difference between mean and individual value
- Variance: average squared deviation
- Squared in order to make negative differences positive
- Population variance: \[\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2\]
- As sample mean (\(\bar{x}\) or \(m\)) is approximation of population mean (\(\mu\)), sample variance formula contains division by \(n-1\) (results in slightly higher variance): \[s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2\]
Important measure of variation: standard deviation
- Standard deviation is square root of variance \[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2}\] \[s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2}\]
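In Python, the standard library offers both versions (a minimal sketch with hypothetical data; `pvariance`/`pstdev` divide by \(n\), `variance`/`stdev` by \(n-1\)):

```python
from statistics import pstdev, pvariance, stdev, variance

x = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

print(pvariance(x), pstdev(x))  # population formulas: 4.0, 2.0
print(variance(x), stdev(x))    # sample formulas: ~4.571, ~2.138 (slightly higher)
```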
Normal distribution and standard deviation (1)
\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\) (34 + 34)
\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
Normal distribution and standard deviation (2)
\(P(85 \leq \rm{IQ} \leq 115) \approx 68\%\) (34 + 34)
\(P(70 \leq \rm{IQ} \leq 130) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(55 \leq \rm{IQ} \leq 145) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
- IQ scores are normally distributed with mean 100 and standard deviation 15
Standardized scores
- Standardization facilitates interpretation
- E.g., how to interpret: “Emma got a score of 112” and “Tom got a score of 105”
- Interpretation should be done with respect to mean \(\mu\) and standard deviation \(\sigma\)
- Raw scores can be transformed to standardized scores (\(z\)-scores or \(z\)-values) \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
- Interpretation: difference of value from mean in number of standard deviations
Calculating standardized values
- Suppose \(\mu = 108\), \(\sigma = 4\), then: \[z_{112} = \frac{x - \mu}{\sigma} = \frac{112 - 108}{4} = 1\] \[z_{105} = \frac{105 - 108}{4} = -0.75\]
- \(z\) shows distance from mean in number of standard deviations
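The same calculation as a small Python function (a sketch; the function name is mine):

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean, in standard deviations."""
    return (x - mu) / sigma

print(z_score(112, mu=108, sigma=4))  # 1.0
print(z_score(105, mu=108, sigma=4))  # -0.75
```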
Distribution of standardized variables
- If we transform all raw scores of a variable into \(z\)-scores using: \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
- We obtain a new transformed variable whose
- Mean is 0
- Standard deviation is 1
- In sum: \(z\)-score = distance from \(\mu\) in \(\sigma\)’s
- \(z\)-scores are useful for interpretation and hypothesis testing
Standard normal distribution
\(P(-1 \leq z \leq 1) \approx 68\%\) (34 + 34)
\(P(-2 \leq z \leq 2) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(-3 \leq z \leq 3) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
For comparison: normal distribution
\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\) (34 + 34)
\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
Checking normality assumption
- Some statistical tests (e.g., \(t\)-test) require that the data is (roughly) normally distributed
- How to test this?
- Using visual inspection of a normal quantile plot (or: quantile-quantile plot)
- A straight line in this graph indicates a (roughly) normal distribution
- (Alternatively, you can use the Shapiro-Wilk test)
Normal quantile plot: how it works
- Sort the data from smallest to largest to determine quantiles (e.g., percentiles)
- E.g., median for 50th percentile
- Calculate \(z\)-values belonging to the quantiles (e.g., percentiles) of a standard normal distribution
- E.g., \(z =\) 0 for 50th percentile, \(z =\) 2 for 97.5th percentile, etc.
- Plot data values (\(y\)-axis) against normal quantile values (\(x\)-axis)
- If points on (or close to) straight line: values normally distributed
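The construction above can be sketched directly in Python (simulated, hypothetical data; `NormalDist.inv_cdf` gives the z-value of each percentile; in practice ready-made functions such as `scipy.stats.probplot` do this for you):

```python
import random
from statistics import NormalDist

import matplotlib.pyplot as plt

random.seed(1)
data = sorted(random.gauss(100, 15) for _ in range(200))  # simulated data

# z-values of the (i + 0.5)/n percentiles of the standard normal distribution
n = len(data)
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Data values (y-axis) against normal quantile values (x-axis):
# points close to a straight line suggest a (roughly) normal distribution
plt.scatter(theoretical, data, s=8)
plt.xlabel("normal quantiles (z)")
plt.ylabel("observed values")
plt.show()
```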
Normal quantile plot example
Selecting a sample
- Selecting a sample from a population includes an element of chance: which individuals are studied?
- Important question: How to reason about the population using a sample?
- Answered using the Central Limit Theorem
Central Limit Theorem
- Suppose we were to gather many different samples from the population; the distribution of the sample means would then be (approximately) normally distributed
- The mean of these sample means (\(\bar{x}\)) will be the population mean (\(m_{\bar{x}} = \mu\))
- The standard deviation of the sample means (the standard error SE) depends on the sample size \(n\) and the population standard deviation \(\sigma\): SE \(= s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)
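A small simulation illustrating these claims (a sketch; the population here is hypothetical and deliberately skewed, to show that the result does not require a normal population):

```python
import random
from statistics import mean, stdev

random.seed(42)
n = 25  # sample size

# A skewed (non-normal) population with mean and SD of about 15
population = [random.expovariate(1 / 15) for _ in range(100_000)]

# Gather many samples and record each sample mean
sample_means = [mean(random.sample(population, n)) for _ in range(2_000)]

print(mean(sample_means))            # close to the population mean (~15)
print(stdev(sample_means))           # close to the SE (~3)
print(stdev(population) / n ** 0.5)  # the prediction: sigma / sqrt(n)
```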
Reasoning about the population (1)
- Given that the distribution of sample means is normally distributed \(N(\mu,\sigma/\sqrt{n})\), having one randomly selected sample allows us to reason about the population
- Requirement: sample is representative (unbiased sample)
- Random selection helps avoid bias
Reasoning about the population (2)
- Given a representative sample:
- We estimate the population mean as equal to the sample mean (best guess)
- How certain we are of this estimate depends on the standard error: \(\sigma/\sqrt{n}\)
- Increasing sample size \(n\) reduces uncertainty
- Hard work pays off (in exactness), but it doesn't pay off quickly: uncertainty shrinks only with \(\sqrt{n}\)
- Sample means are normally distributed (CLT):
- We can relate a sample mean to the population mean by using characteristics of the normal distribution
Normal distribution
- We know the probability of an element \(x\) having a value close to the mean \(\mu\):
\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\) (34 + 34)
\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
Normal distribution: \(z\)-scores
- With standardized values: \(z = (x - \mu)/\sigma \Rightarrow \mu = 0\) and \(\sigma = 1\)
\(P(-1 \leq z \leq 1) \approx 68\%\) (34 + 34)
\(P(-2 \leq z \leq 2) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(-3 \leq z \leq 3) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
Reasoning about the population (3)
- Sample means can be related to the population in two ways:
- Using a confidence interval
- An interval which is calculated in such a way that a large proportion of the calculated intervals contains the true population mean
- Using a hypothesis test
- Tests if hypothesis about population is compatible with sample result
Confidence interval
- Definition: there is an \(x\)% probability that when computing an \(x\)% confidence interval (CI) on the basis of a sample, it contains \(\mu\)
- The CI can be seen as an estimate of plausible values of \(\mu\)
- (For those who are interested: there is a lot of confusion about interpreting CIs)
Confidence interval: example (1)
- Consider the following example:
You want to know how many hours per week a student of the university spends speaking English. The standard deviation \(\sigma\) for the university is 1 hr/wk.
- You collect data from 100 randomly chosen students
- You calculate the sample mean \(m = 5\) hr/wk (N.B. in my notation: \(m\) = \(\bar{x}\))
- You therefore estimate the population mean \(\mu = 5\) hr/wk and standard error SE \(= 1/\sqrt{100} = 0.1\) hr/wk
- What is the 95% confidence interval (CI) of the mean?
Confidence interval: example (2)
- According to the CLT, the sample means are normally distributed
- 95% of the sample means lie within \(m \pm\) 2 SE
- (i.e. actually it is \(m \pm\) 1.96 SE, but we round this to \(m \pm\) 2 SE)
- With \(m = 5\) and SE \(= 0.1\), the 95% CI is \(5 \pm 2 \times 0.1\) = (4.8 hr/wk, 5.2 hr/wk)
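The same computation in Python (a minimal sketch; 1.96 is the exact z-value that the slides round to 2):

```python
from statistics import NormalDist

sigma, n, m = 1, 100, 5          # population SD, sample size, sample mean
se = sigma / n ** 0.5            # standard error: 0.1
z = NormalDist().inv_cdf(0.975)  # ~1.96: middle 95% of the normal distribution
print(m - z * se, m + z * se)    # ~ (4.8, 5.2) hr/wk
```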
Hypotheses
- Instead of using samples for confidence intervals, we often use them to test hypotheses about populations
- Examples of hypotheses
- Answering online lecture questions is related to the course grade
- Females and males differ in their English proficiency
- Nouns take longer to read than verbs
Hypothesis testing (1)
- Testing these hypotheses requires empirical and variable data
- Empirical: based on observation rather than theory alone
- Variable: individual cases vary
- Hypotheses can be derived from theory, but also from observations if theory is incomplete
Hypothesis testing (2)
- We start from a research question:
Is answering online lecture questions related to the course grade?
- Which we then formulate as a hypothesis (i.e. a statement):
Answering online lecture questions is related to the course grade
- For statistics to be useful, this needs to be translated to a concrete form:
Students answering online lecture questions score higher than those who do not
Hypothesis testing (3)
- Students answering online lecture questions score higher than those who do not
- What is meant by this?
All students answering online lecture questions score higher than those who do not?
- Probably not, the data is variable, there are other factors:
- Attention level of each student
- Difficulty of the lecture
- If the questions were answered seriously
- We need statistics to abstract away from the variability of the observations (i.e. unsystematic variation)
Hypothesis testing (4)
- Students answering online lecture questions score higher than those who do not
- Meaning:
- Not: All students answering online lecture questions score higher than those who do not
- But: On average, students answering online lecture questions score higher than those who do not
Testing a hypothesis using a sample (1)
- On average, students answering online lecture questions score higher than those who do not
- This hypothesis must be studied on the basis of a sample, i.e. a limited number of students following a course with online lecture questions
- Of course we’re interested in the population, i.e. all students who followed a course with online lecture questions
Testing a hypothesis using a sample (2)
- The hypothesis concerns the population, but it is studied through a representative sample
- Students answering online lecture questions score higher than those who do not
(study based on 30 students who answered the questions and 30 who did not)
- Females have higher English proficiency than males
(study based on 40 males and 40 females)
- Nouns take longer to read than verbs
(studied on the basis of 35 people’s reading of 100 nouns and verbs)
Analysis: when is a difference real?
- Given a testable hypothesis:
Students answering online lecture questions score higher than those who do not
- You collect the final course grade for 30 randomly selected students who answered the online questions and 30 who did not
- Will any difference in average grade (in the right direction) be proof?
- Probably not: very small differences might be due to chance (unsystematic variation)
- Therefore we use statistics to analyze the results
- Statistically significant results are those unlikely to be due to chance
Comparing a sample to population: \(z\)-test
- \(z\)-test allows assessing difference between sample and population
- \(\mu\) and \(\sigma\) for the population should be known (standardized tests: e.g., IQ test)
- Sample mean \(m\) is compared to population mean \(\mu\)
Example of \(z\)-test
- You think Computer Assisted Language Learning may be effective for kids
- You give a standard test of language proficiency (\(\mu\) = 70, \(\sigma\) = 14) to 49 randomly chosen children who followed a CALL program
- You find \(m\) = 74
- You calculate SE = \(\sigma/\sqrt{n} = 14/\sqrt{49} = 2\)
- 74 is 2 SE above the population mean: at the 97.5th percentile
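As a computation (a sketch using the values from this slide; `NormalDist().cdf` gives the exact percentile, 97.7, which the slides round to 97.5 via the 2-SE rule):

```python
from statistics import NormalDist

mu, sigma, n, m = 70, 14, 49, 74
se = sigma / n ** 0.5       # 2.0
z = (m - mu) / se           # 2.0
print(NormalDist().cdf(z))  # ~0.977: m lies at about the 97.7th percentile
```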
Conclusions of \(z\)-test
- Group with CALL scored 2 SE above mean (\(z\)-score of 2)
- Chance of this (or more extreme score) is only 2.5%, so very unlikely that this is due to chance
- Conclusion: CALL programs are probably helping
- However, it is also possible that CALL is not helping, but the effect is caused by some other factor
- Such as the sample including many proficient kids
- This is a confounding factor: an influential hidden variable (a variable not used in a study)
Importance of sample size
- Suppose we had used 9 children instead of 49: at what percentile would a sample mean of \(m = 74\) be?
- SE = \(\sigma/\sqrt{n} = 14/\sqrt{9} \approx 4.7\)
- \(m\) = 74 is less than 1 SE above the mean, i.e. at less than the 84th percentile
- Sample means of at least this value are found by chance more than 16% of the time: not enough reason to suspect a CALL effect
Statistical reasoning: two hypotheses (1)
- Rather than one hypothesis, we create two hypotheses about the data:
- The null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_a\))
- The null hypothesis states that there is no relationship between two measured phenomena (e.g., CALL program and test score), while the alternative hypothesis states there is
Statistical reasoning: two hypotheses (2)
- For the CALL example (49 children):
- \(H_0\): \(\mu_{CALL} = 70\) (the population mean of people using CALL is 70)
- \(H_a\): \(\mu_{CALL} > 70\) (the population mean of people using CALL is higher than 70)
- While \(m = 74\) suggests that \(H_a\) is right, this might be due to chance, so we would need enough evidence (i.e. a low SE) to accept it over the null hypothesis
- Logically, \(H_0\) is the negation of \(H_a\), so we would expect \(H_0\): \(\mu_{CALL} \leq 70\), but we usually see '\(=\)' in formulations
Statistical reasoning (1)
\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)
- The reasoning goes as follows:
- Suppose \(H_0\) is true, what is the chance \(p\) of observing a sample with \(m \geq\) 74?
- To determine this, we convert 74 to a \(z\)-score: \(z = (m - \mu) / \textrm{SE} = (74-70)/2 = 2\)
- And find the associated \(p\)-value (about 0.025)
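In Python (a sketch; the exact value is ~0.0228, which the slides round to 0.025 via the 95%-within-2-SE rule):

```python
from statistics import NormalDist

z = (74 - 70) / 2            # 2.0
p = 1 - NormalDist().cdf(z)  # one-sided: P(Z >= 2)
print(p)                     # ~0.0228
```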
Statistical reasoning (2)
\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)
- \(P(z \geq 2) \approx 0.025\)
- The chance of observing a sample at least this extreme given \(H_0\) is true is 0.025
- This is the \(p\)-value (measured significance level)
- If \(H_0\) were correct and kids with CALL exp. had the same language proficiency as others, the observed sample would be expected only 2.5% of the time
- Strong evidence against the null hypothesis
Statistically significant?
- We have determined \(H_0\), \(H_a\) and the \(p\)-value
- The classical hypothesis test specifies how unlikely a sample must be (assuming \(H_0\)) for the result to count as significant
- We compare the \(p\)-value against this threshold significance level or \(\alpha\)-level
- If the \(p\)-value is lower than the \(\alpha\)-level (usually 0.05, but it may be lower as well), we regard the result as significant and reject the null hypothesis
Statistical significance: summary
- The \(p\)-value is the chance of encountering a sample at least as extreme as the observed one, given that the null hypothesis is true
- The \(\alpha\)-level is the threshold for the \(p\)-value, below which we regard the result as significant
- If result significant, we reject \(H_0\) and assume \(H_a\) is true
Steps for assessing statistical significance
- Specify \(H_0\) and \(H_a\)
- Specify test statistic (e.g., mean) and underlying distribution (assuming \(H_0\))
- Specify the \(\alpha\)-level at which \(H_0\) will be rejected
- Determine the value of the statistic (e.g., mean) on the basis of a sample
- Calculate the \(p\)-value and compare to \(\alpha\)
- \(p\)-value \(< \alpha\): reject \(H_0\) (significant result)
- \(p\)-value \(\geq \alpha\): retain \(H_0\) (non-significant result)
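These steps can be wrapped into a small helper (a sketch; the function name is mine, and it assumes a one-sided z-test with known population \(\sigma\)):

```python
from statistics import NormalDist

def z_test_one_sided(m, mu, sigma, n, alpha=0.05):
    """One-sided z-test of H0: population mean = mu vs Ha: mean > mu."""
    se = sigma / n ** 0.5
    z = (m - mu) / se
    p = 1 - NormalDist().cdf(z)  # P(Z >= z) assuming H0
    return z, p, "reject H0" if p < alpha else "retain H0"

print(z_test_one_sided(m=74, mu=70, sigma=14, n=49))  # (2.0, ~0.023, 'reject H0')
```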
Critical values
- Critical values: those values of the sample statistic resulting in a rejection of \(H_0\)
- E.g., if \(\alpha\) is set at 0.05 (one-sided), the critical region is where \(P(Z \geq z) \leq 0.05\), i.e. \(z \geq 1.64\)
- We can transform this to raw values using the \(z\) formula \[z = (x-\mu)/\textrm{SE}\\
1.64 = (x-70)/2\\
3.3 = x - 70\\
x = 73.3\]
- Thus a sample mean of at least 73.3 will result in rejection of \(H_0\)
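The same computation in Python (a sketch; 1.64 is the slides' rounding of the exact 1.645):

```python
from statistics import NormalDist

mu, se, alpha = 70, 2, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)  # ~1.645
x_crit = mu + z_crit * se                 # ~73.3: the critical sample mean
print(z_crit, x_crit)
```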
One-sided test
- There are different forms of statistical tests:
- \(H_a\) predicts high \(m\): CALL improves language ability
- \(H_a\) predicts low \(m\): Eating broccoli lowers cholesterol levels
Two-sided test
- Sometimes \(H_a\) predicts neither a lower nor a higher value, but simply a different one
- With a significance level \(\alpha\) of 0.05, both very high (2.5% highest) and very low (2.5% lowest) values give reason to reject \(H_0\)
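For a two-sided test, the \(p\)-value counts both tails (a minimal sketch):

```python
from statistics import NormalDist

z = 2.0                                           # observed z-score
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails count
print(p_two_sided)                                # ~0.0455
```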
Statistical significance and confidence interval
- Statistical significance and a confidence interval (CI) are linked
- A 95% CI based on the sample mean \(m\) represents the values for \(\mu\) for which the difference between \(\mu\) and \(m\) is not significant (at the 0.05 significance threshold for a two-sided test)
- A value outside of the CI indicates a statistically significant difference
Chasing significance?
- If your result is not significant, you could try to obtain more data (reducing the standard error)
- Is it sensible to collect the extra data to “push” a result to significance?
- No. At least, usually not.
- The real result is the effect size (e.g., the difference between the groups)
Understanding significance
- “Statistically significant” implies that an effect probably is not due to chance, but the effect can be very small
- If you want to know whether you should buy CALL software to learn a language, statistically significant does not tell you this
- This is a two-edged sword: if an effect is not statistically significant, it does not mean nothing important is going on
- You are just not sure: it could be a chance effect
Misuse of significance
- Garbage in, garbage out: statistics won’t help an experiment with a poor design, or where data was poorly collected
- No significance hunting: hypotheses should be formulated before data collection and analysis
- Modern danger: if there are many potential variables, a few are likely to turn out significant purely by chance
- Specific tests are necessary to correct for this
Recap: hypothesis testing
- A statistical hypothesis concerns a population (not a sample!) and involves a statistic (such as mean, frequency, etc.)
- Population: all students attending a course using online lecture questions
- Parameter (statistic): (average) course performance
- Hypothesis: average performance of students answering online lecture questions is higher than those who do not
Identifying hypotheses
- Alternative hypothesis \(H_a\) (original hypothesis) is contrasted with null hypothesis \(H_0\) (hypothesis that nothing out of the ordinary is going on)
- \(H_a\): average performance of students answering online lecture questions higher
- \(H_0\): answering online lecture questions does not impact performance
- Logically \(H_0\) should imply \(\neg H_a\)
Possible errors
Of course, you could be wrong (e.g., due to an unrepresentative sample)!
| | \(H_0\) true | \(H_0\) false |
|---|---|---|
| \(H_0\) accepted | correct | type II error |
| \(H_0\) rejected | type I error | correct |
- Hypothesis testing focuses on type I errors:
- \(p\)-value: chance of type I error
- \(\alpha\)-level: boundary of acceptable level of type I error
- Type II errors:
- \(\beta\): chance of type II error
- \(1 - \beta\): power of statistical test
- More sensitive (and useful) tests have more power to detect an effect
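A simulation sketch of both error rates, reusing the CALL numbers (all choices here are mine; sample means are drawn directly from their CLT distribution \(N(\mu, \textrm{SE})\)):

```python
import random
from statistics import NormalDist

random.seed(7)
mu0, sigma, n, alpha = 70, 14, 49, 0.05
se = sigma / n ** 0.5                                # 2.0
x_crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # ~73.3

def rejection_rate(true_mu, trials=10_000):
    # By the CLT, sample means follow N(true_mu, se)
    hits = sum(random.gauss(true_mu, se) > x_crit for _ in range(trials))
    return hits / trials

print(rejection_rate(70))  # H0 true: ~0.05 = alpha (type I error rate)
print(rejection_rate(74))  # H0 false: ~0.64 = power (1 - beta)
```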
Possible errors: easier to remember
- False positive: incorrect positive (accepting \(H_a\)) result
- False negative: incorrect negative (not rejecting \(H_0\)) result
Recap
- In this lecture, we’ve covered
- Descriptive vs. inferential statistics
- Sample vs. population
- (Types of) variables
- Distribution of a variable: central tendency and variation
- Standardized scores
- Checking for a normal distribution
- Relation between population (mean) and sample (mean)
- Confidence interval for population mean based on sample mean
- Testing a hypothesis about the population using a sample
- Statistical significance
- Error types
Questions?
Thank you for your attention!
https://www.martijnwieling.nl
m.b.wieling@rug.nl