# Basic concepts

Martijn Wieling
University of Groningen

## This lecture

• Descriptive vs. inferential statistics
• Sample vs. population
• (Types of) variables
• Distribution of a variable: central tendency and variation
• Standardized scores
• Checking for a normal distribution
• Reasoning about the population using a sample
• Relation between population (mean) and sample (mean)
• Confidence interval for population mean based on sample mean
• Testing a hypothesis about the population using a sample
• Statistical significance
• Error types

## Why use statistics?

• Why use statistics?
• Summarize data (descriptive statistics)
• Assess relationships in data (inferential statistics)

## Descriptive vs. inferential statistics

• Descriptive statistics:
• Statistics used to describe (sample) data without further conclusions
• Measures of central tendency: Mean, median, mode
• Measures of variation (or spread): range, IQR, variance, standard deviation
• Inferential statistics:
• Describe data of sample in order to infer patterns in the population
• Statistical tests: $t$-test, $\chi^2$-test, etc.

## Why study a sample?

• Studying the whole population is (frequently) practically impossible
• Sample is a (selected) subset of population and thus more accessible
• Selection of representative sample is very important!

## Characterizing numerical variables: distribution

• We are generally not interested in individual values of a variable, but rather all values and their frequency
• This is captured by a distribution
• Famous distribution: Normal distribution ("bell-shaped" curve): e.g., IQ scores

## Interpreting a density curve

• The total area under a density curve is equal to 1
• A density curve does not provide information about the frequency of one value
• E.g., there might be no one who has a value of exactly 6.1
• It only provides information about an interval
• E.g., more than 50% of the values lie between 5.5 and 7.5

## Interpreting a density curve: normal distribution

• The normal distribution has convenient characteristics
• Completely symmetric
• Red and green area: (about) 95%

## Characterizing the distribution of numerical variables

• A distribution can also be characterized by measures of center and variation
• (skewness measures the symmetry of the distribution; not covered further)

## Characterizing numerical variables: central tendency

• Mode: most frequent element (for nominal data: only meaningful measure)
• Median: when data is sorted from small to large, it is the middle value
• Mean: arithmetical average

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i$
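As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The scores below are made up for illustration and include one extreme value, to show that the mean is pulled towards it while the median is not.

```python
import statistics

# Hypothetical scores (not from the lecture); 31 is a deliberately extreme value.
scores = [5, 6, 6, 7, 8, 9, 9, 9, 31]

mode = statistics.mode(scores)      # most frequent value
median = statistics.median(scores)  # middle value of the sorted data
mean = statistics.mean(scores)      # sum of the values divided by n

print(mode, median, mean)
```
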

## Measure of variation: quantiles

• Quantiles: cutpoints to divide the sorted data in subsets of equal size
• Quartiles: three cutpoints to divide the data in four equal-sized sets
• $q_1$ (1st quartile): cutpoint between 1st and 2nd group
• $q_2$ (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
• $q_3$ (3rd quartile): cutpoint between 3rd and 4th group
• Percentiles: divide data in hundred equal-sized subsets
• $q_1$ = 25th percentile
• $q_2$ (= median) = 50th percentile
• Score at $n$th percentile is better than $n$% of scores

## Measure of variation: range

• Minimum, maximum: lowest and highest value
• Range: difference between minimum and maximum
• Interquartile range (IQR): $q_3$ - $q_1$

## Visualizing variation: box plot (box-and-whisker plot)

• A box plot is used to visualize variation of a variable
• Box (IQR): $q_1$ (bottom), median (thickest line), $q_3$ (top)
• (In example below, $q_1$ and median have the same value)
• Whiskers: maximum (top) and minimum (bottom) non-outlier value
• Circle(s): outliers (> 1.5 IQR distance from box)
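The quantities shown in a box plot can be sketched in a few lines. The data are hypothetical, and the quartiles use the `inclusive` method of `statistics.quantiles`; other software may use a slightly different quartile convention and give slightly different cutpoints.

```python
import statistics

# Hypothetical data, made up for illustration.
scores = [5, 6, 6, 7, 8, 9, 9, 9, 31]

# Three cutpoints that divide the sorted data into four groups.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
value_range = max(scores) - min(scores)

# Outliers lie more than 1.5 IQR away from the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

# Whiskers end at the lowest and highest non-outlier values.
non_outliers = [x for x in scores if lower_fence <= x <= upper_fence]
whiskers = (min(non_outliers), max(non_outliers))

print(q1, q2, q3, iqr, value_range, outliers, whiskers)
```
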

## Important measure of variation: variance

• Deviation: difference between mean and individual value
• Variance: average squared deviation
• Squared in order to make negative differences positive
• Population variance: $\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2$
• As sample mean ($\bar{x}$ or $m$) is approximation of population mean ($\mu$), sample variance formula contains division by $n-1$ (results in slightly higher variance): $s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2$

## Important measure of variation: standard deviation

• Standard deviation is the square root of the variance:

$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2}$

$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2}$
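The formulas for variance and standard deviation can be written out step by step. This is a sketch on hypothetical values; the last lines compare the result with the `statistics` functions that implement the same definitions.

```python
import math
import statistics

# Hypothetical values, made up for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n

# Squared deviations from the mean.
squared_deviations = [(x - mean) ** 2 for x in values]

population_variance = sum(squared_deviations) / n    # division by n
sample_variance = sum(squared_deviations) / (n - 1)  # division by n - 1
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# The standard library gives the same results.
print(population_variance, statistics.pvariance(values))
print(sample_variance, statistics.variance(values))
```

Dividing by $n-1$ gives a slightly larger value than dividing by $n$, and the difference shrinks as $n$ grows.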

## Normal distribution and standard deviation (1)

$P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%$     (34 + 34)
$P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%$     (34 + 34 + 13.5 + 13.5)
$P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

## Normal distribution and standard deviation (2)

$P(85 \leq \rm{IQ} \leq 115) \approx 68\%$     (34 + 34)
$P(70 \leq \rm{IQ} \leq 130) \approx 95\%$     (34 + 34 + 13.5 + 13.5)
$P(55 \leq \rm{IQ} \leq 145) \approx 99.7\%$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

• IQ scores are normally distributed with mean 100 and standard deviation 15
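The three percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `prob_between` is introduced here for illustration only.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

def prob_between(dist, low, high):
    """Probability that a value from dist lies between low and high."""
    return dist.cdf(high) - dist.cdf(low)

within_1_sd = prob_between(iq, 85, 115)  # about 0.68
within_2_sd = prob_between(iq, 70, 130)  # about 0.95
within_3_sd = prob_between(iq, 55, 145)  # about 0.997

print(within_1_sd, within_2_sd, within_3_sd)
```
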

## Standardized scores

• Standardization helps facilitate interpretation
• E.g., how to interpret: "Emma got a score of 112" and "Tom got a score of 105"
• Interpretation should be done with respect to mean $\mu$ and standard deviation $\sigma$
• Raw scores can be transformed to standardized scores ($z$-scores or $z$-values) $z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}$
• Interpretation: difference of value from mean in number of standard deviations

## Calculating standardized values

• Suppose $\mu = 108$, $\sigma = 4$, then:
• $z_{112} = \frac{x - \mu}{\sigma} = \frac{112 - 108}{4} = 1$
• $z_{105} = \frac{x - \mu}{\sigma} = \frac{105 - 108}{4} = -0.75$
• $z$ shows distance from mean in number of standard deviations

## Distribution of standardized variables

• If we transform all raw scores of a variable into $z$-scores using: $z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}$
• We obtain a new transformed variable whose
• Mean is 0
• Standard deviation is 1
• In sum: $z$-score = distance from $\mu$ in $\sigma$'s
• $z$-scores are useful for interpretation and hypothesis testing
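A small sketch of standardizing a whole variable. The raw scores are hypothetical and are treated here as the complete population, so the population standard deviation (`pstdev`) is used. They were chosen so that $\mu = 108$ and $\sigma = 4$, matching the worked example.

```python
import statistics

# Hypothetical raw scores, treated as the whole population.
raw = [102, 106, 106, 106, 108, 108, 112, 116]

mu = statistics.mean(raw)       # 108
sigma = statistics.pstdev(raw)  # 4.0

# z-score: distance from the mean in number of standard deviations.
z_scores = [(x - mu) / sigma for x in raw]

print(z_scores)
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

The transformed variable has mean 0 and standard deviation 1, and the raw score 112 becomes $z = 1$.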

## Standard normal distribution

$P(-1 \leq z \leq 1) \approx 68\%$     (34 + 34)
$P(-2 \leq z \leq 2) \approx 95\%$     (34 + 34 + 13.5 + 13.5)
$P(-3 \leq z \leq 3) \approx 99.7\%$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

## For comparison: normal distribution

$P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%$     (34 + 34)
$P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%$     (34 + 34 + 13.5 + 13.5)
$P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

## Checking normality assumption

• Some statistical tests (e.g., $t$-test) require that the data is (roughly) normally distributed
• How to test this?
• Using visual inspection of a normal quantile plot (or: quantile-quantile plot)
• A straight line in this graph indicates a (roughly) normal distribution
• (Alternatively, you can use the Shapiro-Wilk test)

## Normal quantile plot: how it works

• Sort the data from smallest to largest to determine quantiles (e.g., percentiles)
• E.g., median for 50th percentile
• Calculate $z$-values belonging to the quantiles (e.g., percentiles) of a standard normal distribution
• E.g., $z =$ 0 for 50th percentile, $z \approx$ 2 for 97.5th percentile, etc.
• Plot data values ($y$-axis) against normal quantile values ($x$-axis)
• If points on (or close to) straight line: values normally distributed
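The coordinates of a normal quantile plot can be computed without any plotting library. This is a sketch only: the function name is hypothetical, and the plotting position $(i - 0.5)/n$ is one common convention among several, so statistical software may place the points slightly differently.

```python
from statistics import NormalDist

def normal_quantile_points(values):
    """Return (normal quantile, data value) pairs for a normal quantile plot."""
    ordered = sorted(values)   # sort the data from smallest to largest
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered, start=1):
        p = (i - 0.5) / n               # assumed plotting position of the i-th value
        z = std_normal.inv_cdf(p)       # z-value belonging to that quantile
        points.append((z, value))       # x-axis: z, y-axis: data value
    return points

# Hypothetical data, made up for illustration.
points = normal_quantile_points([102, 106, 106, 106, 108, 108, 112, 116])
for z, value in points:
    print(round(z, 2), value)
```

Plotting these pairs and checking whether they fall close to a straight line is the visual inspection described above.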

## Selecting a sample

• Selecting a sample from a population includes an element of chance: which individuals are studied?
• Important question: How to reason about the population using a sample?
• Answered using the Central Limit Theorem

## Central Limit Theorem

• Suppose we gather many different samples of size $n$ from the population; the distribution of the sample means will then be (approximately) normal, provided $n$ is large enough
• The mean of these sample means ($\bar{x}$) will be the population mean ($m_{\bar{x}} = \mu$)
• The standard deviation of the sample means (standard error SE) is dependent on the sample size $n$ and the population standard deviation $\sigma$ : SE $= s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
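A simulation can make this concrete. The sketch below assumes a clearly non-normal population (exponential, with $\mu = 1$ and $\sigma = 1$), chosen only for illustration; the sample size, the number of samples and the seed are arbitrary choices as well.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 50                   # size of each sample
num_samples = 2000       # number of repeated samples

sample_means = []
for _ in range(num_samples):
    # Draw one sample from the assumed exponential population (mu = 1, sigma = 1).
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)  # expected to be close to mu = 1
sd_of_means = statistics.stdev(sample_means)    # expected to be close to the SE
expected_se = 1.0 / n ** 0.5                    # sigma / sqrt(n)

print(mean_of_means, sd_of_means, expected_se)
```
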

## Reasoning about the population (1)

• Given that the distribution of sample means is normally distributed $N(\mu,\sigma/\sqrt{n})$, having one randomly selected sample allows us to reason about the population
• Requirement: sample is representative (unbiased sample)
• Random selection helps avoid bias

## Reasoning about the population (2)

• Given a representative sample:
• We estimate the population mean as equal to the sample mean (best guess)
• How certain we are of this estimate depends on the standard error: $\sigma/\sqrt{n}$
• Increasing sample size $n$ reduces uncertainty
• Hard work pays off (in exactness), but it doesn't pay off quickly: the standard error only shrinks with $\sqrt{n}$
• Sample means are normally distributed (CLT):
• We can relate a sample mean to the population mean by using characteristics of the normal distribution

## Normal distribution

• We know the probability of an element $x$ having a value close to the mean $\mu$:

$P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%$     (34 + 34)
$P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%$     (34 + 34 + 13.5 + 13.5)
$P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

## Normal distribution: $z$-scores

• With standardized values: $z = (x - \mu)/\sigma \Rightarrow \mu = 0$ and $\sigma = 1$

$P(-1 \leq z \leq 1) \approx 68\%$     (34 + 34)
$P(-2 \leq z \leq 2) \approx 95\%$     (34 + 34 + 13.5 + 13.5)
$P(-3 \leq z \leq 3) \approx 99.7\%$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

## Reasoning about the population (3)

• Sample means can be related to the population in two ways:
• Using a confidence interval
• An interval which is likely to contain the true population mean
• Using a hypothesis test
• Tests if hypothesis about population is compatible with sample result

## Confidence interval

• Definition: there is an $x$% probability that when computing an $x$% confidence interval on the basis of a sample, it contains $\mu$
• Confidence interval gives estimate of plausible values for the population mean

## Confidence interval: example (1)

• Consider the following example:
You want to know how many hours per week a student of the university spends speaking English. The standard deviation $\sigma$ for the university is 1 hr/wk.
• You collect data from 100 randomly chosen students
• You calculate the sample mean $m = 5$ hr/wk (N.B. in my notation: $m$ = $\bar{x}$)
• You therefore estimate the population mean $\mu = 5$ hr/wk and SE $= 1/\sqrt{100} = 0.1$ hr/wk
• What is the 95% confidence interval (CI) of the mean?

## Confidence interval: example (2)

• According to the CLT, the sample means are normally distributed

• About 95% of the sample means lie within $\mu \pm$ 2 SE, so for about 95% of samples the interval $m \pm$ 2 SE contains $\mu$
• (More precisely it is $m \pm$ 1.96 SE, but we round this to $m \pm$ 2 SE)
• With $m$ = 5 and SE = 0.1, 95% CI is 5 $\pm$ 2$\times$0.1 = (4.8 hr/wk, 5.2 hr/wk)
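The same interval can be computed without rounding 1.96 to 2. This is a minimal sketch assuming a known population standard deviation; the function name `confidence_interval` is introduced here for illustration.

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean, sigma, n, level=0.95):
    """Confidence interval for the population mean when sigma is known."""
    se = sigma / math.sqrt(n)                       # standard error
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return sample_mean - z_star * se, sample_mean + z_star * se

low, high = confidence_interval(sample_mean=5, sigma=1, n=100)
print(low, high)  # roughly 4.80 and 5.20 hr/wk
```
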

## Hypotheses

• Instead of computing confidence intervals, we often use samples to test hypotheses about populations
• Examples of hypotheses
• Women and men differ in their English proficiency
• Nouns take longer to read than verbs

## Hypothesis testing (1)

• Testing these hypotheses requires empirical and variable data
• Empirical: based on observation rather than theory alone
• Variable: individual cases vary
• Hypotheses can be derived from theory, but also from observations if theory is incomplete

## Hypothesis testing (2)

• We start from a research question:
• Which we then formulate as a hypothesis (i.e. a statement):
• For statistics to be useful, this needs to be translated to a concrete form:
Students answering online lecture questions score higher than those who do not

## Hypothesis testing (3)

• Students answering online lecture questions score higher than those who do not
• What is meant by this?
All students answering online lecture questions score higher than those who do not?
• Probably not, the data is variable, there are other factors:
• Attention level of each student
• Difficulty of the lecture
• Whether the questions were answered seriously
• We need statistics to abstract away from the variability of the observations (i.e. unsystematic variation)

## Hypothesis testing (4)

• Students answering online lecture questions score higher than those who do not
• Meaning:
• Not: All students answering online lecture questions score higher than those who do not
• But: On average, students answering online lecture questions score higher than those who do not

## Testing a hypothesis using a sample (1)

• On average, students answering online lecture questions score higher than those who do not
• This hypothesis must be studied on the basis of a sample, i.e. a limited number of students following a course with online lecture questions
• Of course we're interested in the population, i.e. all students who followed a course with online lecture questions

## Testing a hypothesis using a sample (2)

• The hypothesis concerns the population, but it is studied through a representative sample
• Students answering online lecture questions score higher than those who do not
(study based on 30 students who answered the questions and 30 who did not)
• Women have higher English proficiency than men
(study based on 40 men and 40 women)
• Nouns take longer to read than verbs
(studied on the basis of 35 people's reading of 100 nouns and verbs)

## Analysis: when is a difference real?

• Given a testable hypothesis:
Students answering online lecture questions score higher than those who do not
• You collect the final course grade for 30 randomly selected students who answered the online questions and 30 who did not
• Will any difference in average grade (in the right direction) be proof?
• Probably not: very small differences might be due to chance (unsystematic variation)
• Therefore we use statistics to analyze the results
• Statistically significant results are those unlikely to be due to chance

## Comparing a sample to population: $z$-test

• $z$-test allows assessing difference between sample and population
• $\mu$ and $\sigma$ for the population should be known (standardized tests: e.g., IQ test)
• Sample mean $m$ is compared to population mean $\mu$

## Example of $z$-test

• You think Computer Assisted Language Learning may be effective for kids
• You give a standard test of language proficiency ($\mu$ = 70, $\sigma$ = 14) to 49 randomly chosen children who followed a CALL program
• You find $m$ = 74
• You calculate SE = $\sigma/\sqrt{n} = 14/\sqrt{49} = 2$
• 74 is 2 SE above the population mean: at the 97.5th percentile

## Conclusions of $z$-test

• Group with CALL scored 2 SE above mean ($z$-score of 2)
• Chance of this (or more extreme result) is only 2.5%, so very unlikely that this is due to chance
• Conclusion: CALL programs are probably helping
• However, it is also possible that CALL is not helping, but the effect is caused by some other factor
• Such as the sample including lots of proficient kids
• This is a confounding factor: an influential hidden variable (a variable not used in a study)

## Importance of sample size

• Suppose we had used 9 children as opposed to 49, at what percentile would a sample mean of $m$ = 74 be?
• SE = $\sigma/\sqrt{n} = 14/\sqrt{9} \approx 4.7$
• $m$ = 74 is less than 1 SE above the mean, i.e. at less than the 84th percentile
• Sample means of this value are found by chance more than 16% of the time (i.e. likely due to chance): not enough reason to suspect an effect of CALL
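The effect of sample size on the percentile can be checked directly. In this sketch the sample means are taken to be normally distributed around $\mu$ with standard deviation SE; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def percentile_of_sample_mean(sample_mean, mu, sigma, n):
    """Fraction of sample means (for samples of size n) below sample_mean."""
    se = sigma / math.sqrt(n)
    return NormalDist(mu, se).cdf(sample_mean)

pct_49 = percentile_of_sample_mean(74, mu=70, sigma=14, n=49)  # SE = 2
pct_9 = percentile_of_sample_mean(74, mu=70, sigma=14, n=9)    # SE about 4.7

print(pct_49, pct_9)  # roughly 0.977 and 0.80
```
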

## Statistical reasoning: two hypotheses (1)

• Rather than one hypothesis, we create two hypotheses about the data:
• The null hypothesis ($H_0$) and the alternative hypothesis ($H_a$)
• The null hypothesis states that there is no relationship between two measured phenomena (e.g., CALL program and test score), while the alternative hypothesis states there is

## Statistical reasoning: two hypotheses (2)

• For the CALL example (49 children):
• $H_0$: $\mu_{CALL} = 70$ (the population mean of people using CALL is 70)
• $H_a$: $\mu_{CALL} > 70$ (the population mean of people using CALL is higher than 70)
• While $m$ = 74 suggests that $H_a$ is right, this might be due to chance, so we would need enough evidence (i.e. low SE) to accept it over the null hypothesis
• Logically, $H_0$ is the inverse of $H_a$, and we'd expect $H_0$: $\mu_{CALL} \leq 70$, but we usually see '$=$' in formulations

## Statistical reasoning (1)

$H_0$: $\mu_{CALL} = 70$              $H_a$: $\mu_{CALL} > 70$

• The reasoning goes as follows:
• Suppose $H_0$ is true, what is the chance $p$ of observing a sample with $m \geq$ 74?
• To determine this, we convert 74 to a $z$-score: $z = (m - \mu) /$SE = (74-70)/2 = 2
• And find the associated $p$-value (about 0.025)

## Statistical reasoning (2)

$H_0$: $\mu_{CALL} = 70$              $H_a$: $\mu_{CALL} > 70$

• $P(z \geq 2) \approx 0.025$
• The chance of observing a sample at least this extreme given $H_0$ is true is 0.025
• This is the $p$-value (measured significance level)
• If $H_0$ were correct and kids with CALL experience had the same language proficiency as others, a sample at least this extreme would be expected only 2.5% of the time
• Strong evidence against the null hypothesis

## Statistically significant?

• We have determined $H_0$, $H_a$ and the $p$-value
• The classical hypothesis test specifies in advance how unlikely a sample must be for a result to count as significant
• We compare the $p$-value against this threshold significance level or $\alpha$-level
• If the $p$-value is lower than the $\alpha$-level (usually 0.05, but it may be lower as well), we regard the result as significant and reject the null hypothesis

## Statistical significance: summary

• The $p$-value is the chance of encountering a sample at least as extreme as the observed one, given that the null hypothesis is true
• The $\alpha$-level is the threshold for the $p$-value, below which we regard the result as significant
• If the result is significant, we reject $H_0$ and assume $H_a$ is true

## Steps for assessing statistical significance

1. Specify $H_0$ and $H_a$
2. Specify test statistic (e.g., mean) and underlying distribution (assuming $H_0$)
3. Specify the $\alpha$-level at which $H_0$ will be rejected
4. Determine the value of the statistic (e.g., mean) on the basis of a sample
5. Calculate the $p$-value and compare to $\alpha$
• $p$-value $< \alpha$: reject $H_0$ (significant result)
• $p$-value $\geq \alpha$: retain $H_0$ (non-significant result)
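The five steps can be sketched as one function for the one-sided $z$-test of the CALL example. The function name and return format are choices made for this illustration, not a fixed recipe.

```python
import math
from statistics import NormalDist

def z_test_greater(sample_mean, mu0, sigma, n, alpha=0.05):
    """One-sided z-test of H0: mu = mu0 against Ha: mu > mu0 (step 1)."""
    # Step 2: under H0 the sample mean is normally distributed
    # around mu0 with standard deviation SE.
    se = sigma / math.sqrt(n)
    # Step 3: alpha is fixed by the caller before looking at the data.
    # Step 4: value of the statistic, expressed as a z-score.
    z = (sample_mean - mu0) / se
    # Step 5: p-value (chance of a z at least this high) compared to alpha.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha

z, p_value, reject_h0 = z_test_greater(74, mu0=70, sigma=14, n=49)
print(z, p_value, reject_h0)
```

With 49 children the null hypothesis is rejected at $\alpha = 0.05$; with only 9 children and the same sample mean it would be retained.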

## Critical values

• Critical values: those values of the sample statistic resulting in a rejection of $H_0$
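• E.g., if $\alpha$ is set at 0.05 (one-sided test), the critical region consists of the $z$-values with $P(Z \geq z) < 0.05$, i.e. $z \geq 1.64$
• We can transform this to raw values using the $z$ formula: $z = (x-\mu)/\mathrm{SE} \Rightarrow 1.64 = (x-70)/2 \Rightarrow x - 70 \approx 3.3 \Rightarrow x \approx 73.3$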
• Thus a sample mean of at least 73.3 will result in rejection of $H_0$
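The critical value can also be obtained from the inverse of the standard normal distribution. A short sketch for the CALL example ($\mu = 70$, $\sigma = 14$, $n = 49$, one-sided test):

```python
import math
from statistics import NormalDist

alpha = 0.05
mu0, sigma, n = 70, 14, 49

se = sigma / math.sqrt(n)                     # 2.0
z_critical = NormalDist().inv_cdf(1 - alpha)  # about 1.645
m_critical = mu0 + z_critical * se            # about 73.3

print(z_critical, m_critical)
```
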

## One-sided test

• There are different forms of statistical tests:
• $H_a$ predicts high $m$: CALL improves language ability
• $H_a$ predicts low $m$: Eating broccoli lowers cholesterol levels

## Two-sided test

• Sometimes $H_a$ might predict not lower or higher, but just different
• With a significance level $\alpha$ of 0.05, both very high (2.5% highest) and very low (2.5% lowest) values give reason to reject $H_0$
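The difference between the one-sided and two-sided tests shows up in how the $p$-value is computed from the same $z$-score. A minimal sketch; the labels `"greater"`, `"less"` and `"two-sided"` are names chosen here.

```python
from statistics import NormalDist

def p_value(z, alternative):
    """p-value for a z-score under the given alternative hypothesis."""
    std_normal = NormalDist()
    if alternative == "greater":    # Ha predicts a high sample mean
        return 1 - std_normal.cdf(z)
    if alternative == "less":       # Ha predicts a low sample mean
        return std_normal.cdf(z)
    if alternative == "two-sided":  # Ha predicts only a difference
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

print(p_value(2, "greater"))    # about 0.023
print(p_value(2, "two-sided"))  # about 0.046
```

For the same $z$, the two-sided $p$-value is twice the one-sided value in the predicted direction.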

## Statistical significance and confidence interval

• Statistical significance and a confidence interval (CI) are linked
• A 95% CI based on the sample mean $m$ represents the values for $\mu$ for which the difference between $\mu$ and $m$ is not significant (at the 0.05 significance threshold for a two-sided test)
• A value outside of the CI indicates a statistically significant difference
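The link can be illustrated with the CALL numbers ($m = 74$, SE $= 2$). The sketch computes a 95% CI and the two-sided $p$-value for two candidate values of $\mu$; the value 71 is a hypothetical second candidate added for contrast.

```python
from statistics import NormalDist

def ci_and_two_sided_p(sample_mean, se, mu0, level=0.95):
    """Return the CI around the sample mean and the two-sided p-value for mu0."""
    std_normal = NormalDist()
    z_star = std_normal.inv_cdf(0.5 + level / 2)
    ci = (sample_mean - z_star * se, sample_mean + z_star * se)
    z = (sample_mean - mu0) / se
    p = 2 * (1 - std_normal.cdf(abs(z)))
    return ci, p

for mu0 in (70, 71):
    ci, p = ci_and_two_sided_p(74, se=2, mu0=mu0)
    inside = ci[0] <= mu0 <= ci[1]
    print(mu0, ci, round(p, 3), "inside CI" if inside else "outside CI")
```

Here 70 falls just outside the interval and has $p < 0.05$, while 71 falls inside and has $p > 0.05$.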

## Chasing significance?

• If your result is not significant, you could try to obtain more data (reducing the standard error)
• Is it sensible to collect the extra data to "push" a result to significance?
• No. At least, usually not.
• The real result is the effect size (e.g., the difference between the groups)

## Understanding significance

• "Statistically significant" implies that an effect probably is not due to chance, but the effect can be very small
• If you want to know whether you should buy CALL software to learn a language, statistically significant does not tell you this
• This is a two-edged sword: if an effect was not statistically significant, it does not mean nothing important is going on
• You are just not sure: it could be a chance effect

## Misuse of significance

• Garbage in, garbage out: statistics won't help an experiment with a poor design, or where data was poorly collected
• No significance hunting: hypotheses should be formulated before data collection and analysis
• Modern danger: if there are many potential variables, it is likely that a few turn out to be significant
• Specific tests are necessary to correct for this

## Some remarks about hypothesis testing

• A statistical hypothesis concerns a population (not a sample!) and involves a statistic (such as mean, frequency, etc.)
• Population: all students attending a course using online lecture questions
• Parameter (statistic): (average) course performance
• Hypothesis: average performance of students answering online lecture questions is higher than those who do not

## Identifying hypotheses

• Alternative hypothesis $H_a$ (original hypothesis) is contrasted with null hypothesis $H_0$ (hypothesis that nothing out of the ordinary is going on)
• $H_a$: average performance of students answering online lecture questions higher
• $H_0$: answering online lecture questions does not impact performance
• Logically $H_0$ should imply $\neg H_a$

## Possible errors

Of course, you could be wrong (e.g., due to an unrepresentative sample)!

| | $H_0$ true | $H_0$ false |
|---|---|---|
| $H_0$ retained | correct | type II error |
| $H_0$ rejected | type I error | correct |

• Hypothesis testing focuses on type I errors:
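• $p$-value: chance of a result at least as extreme as the observed one if $H_0$ is true (rejecting only when $p < \alpha$ keeps the chance of a type I error at $\alpha$)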
• $\alpha$-level: boundary of acceptable level of type I error
• Type II errors:
• $\beta$: chance of type II error
• $1 - \beta$: power of statistical test
• More sensitive (and useful) tests have more power to detect an effect
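Power can only be computed for an assumed true population mean. The sketch below uses the one-sided $z$-test of the CALL example and assumes, purely for illustration, that the true mean of children using CALL is 74; that value is not a result from the lecture.

```python
import math
from statistics import NormalDist

def power_one_sided(true_mu, mu0, sigma, n, alpha=0.05):
    """Chance of rejecting H0: mu = mu0 (against mu > mu0) if the mean is true_mu."""
    se = sigma / math.sqrt(n)
    # Smallest sample mean that leads to rejection of H0.
    m_critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability that a sample mean exceeds it when the true mean is true_mu.
    return 1 - NormalDist(true_mu, se).cdf(m_critical)

power_49 = power_one_sided(true_mu=74, mu0=70, sigma=14, n=49)
power_9 = power_one_sided(true_mu=74, mu0=70, sigma=14, n=9)
beta_49 = 1 - power_49  # chance of a type II error with 49 children

print(power_49, power_9, beta_49)
```

Under this assumption the larger sample has considerably more power. If the true mean equals 70, the same function returns $\alpha$, the type I error rate.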

## Possible errors: easier to remember

• False positive (type I error): incorrect positive (accepting $H_a$) result
• False negative (type II error): incorrect negative (not rejecting $H_0$) result

## How to formulate the results?

• Results with $p = 0.051$ are not very different from $p = 0.049$, but we need a boundary
• An $\alpha$-level of $0.05$ is low as the "burden of proof" is on the alternative
• If $p = 0.051$ we haven't proven $H_0$, only failed to show that it's really wrong
• This is called "retaining $H_0$"

## Recap

• In this lecture, we've covered
• Descriptive vs. inferential statistics
• Sample vs. population
• (Types of) variables
• Distribution of a variable: central tendency and variation
• Standardized scores
• Checking for a normal distribution
• Relation between population (mean) and sample (mean)
• Confidence interval for population mean based on sample mean
• Testing a hypothesis about the population using a sample
• Statistical significance
• Error types