Basic concepts of statistics
Martijn Wieling (University of Groningen)
This lecture
- Descriptive vs. inferential statistics
- Sample vs. population
- (Types of) variables
- Distribution of a variable: central tendency and variation
- Standardized scores
- Checking for a normal distribution
- Reasoning about the population using a sample
- Relation between population (mean) and sample (mean)
- Confidence interval for population mean based on sample mean
- Testing a hypothesis about the population using a sample
- Statistical significance
- Error types
Why use statistics?
- Why use statistics?
- Summarize data (descriptive statistics)
- Assess relationships in data (inferential statistics)
Descriptive vs. inferential statistics
- Descriptive statistics:
- Statistics used to describe (sample) data without further conclusions
- Measures of central tendency: Mean, median, mode
- Measures of variation (or spread): range, IQR, variance, standard deviation
- Inferential statistics:
- Describe data of sample in order to infer patterns in the population
- Statistical tests: \(t\)-test, \(\chi^2\)-test, etc.
Why study a sample?
- Studying the whole population is (frequently) practically impossible
- Sample is a (selected) subset of population and thus more accessible
- Selection of representative sample is very important!
Characterizing nominal variables
Characterizing numerical variables: distribution
- We are generally not interested in individual values of a variable, but rather all values and their frequency
- This is captured by a distribution
- Famous distribution: Normal distribution (“bell-shaped” curve): e.g., IQ scores
Interpreting a density curve
- The total area under a density curve is equal to 1
- A density curve does not provide information about the frequency of one value
- E.g., there might be no one who has a value of exactly 6.1
- It only provides information about an interval
- E.g., more than 50% of the values lie between 5.5 and 7.5
Interpreting a density curve: normal distribution
- The normal distribution has convenient characteristics
- Completely symmetric
- Red area: (about) 68%
- Red and green area: (about) 95%
Characterizing the distribution of numerical variables
- A distribution can also be characterized by measures of center and variation
- (skewness measures the symmetry of the distribution; not covered further)
Characterizing numerical variables: central tendency
- Mode: most frequent element (for nominal data: only meaningful measure)
- Median: when data is sorted from small to large, it is the middle value
- Mean: arithmetical average
\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i\]
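To make these three measures concrete, here is a minimal Python sketch using only the standard library (the scores are hypothetical):

```python
from statistics import mean, median, mode

scores = [4, 6, 6, 7, 8, 9, 10]  # hypothetical sample, already sorted

print(mode(scores))    # 6: the most frequent value
print(median(scores))  # 7: the middle value of the sorted data
print(mean(scores))    # 7.142857...: the arithmetic average
```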
Measure of variation: quantiles
- Quantiles: cutpoints to divide the sorted data in subsets of equal size
- Quartiles: three cutpoints to divide the data in four equal-sized sets
- \(q_1\) (1st quartile): cutpoint between 1st and 2nd group
- \(q_2\) (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
- \(q_3\) (3rd quartile): cutpoint between 3rd and 4th group
- Percentiles: divide data in hundred equal-sized subsets
- \(q_1\) = 25th percentile
- \(q_2\) (= median) = 50th percentile
- A score at the \(n\)th percentile is higher than \(n\)% of scores
Measure of variation: range
- Minimum, maximum: lowest and highest value
- Range: difference between minimum and maximum
- Interquartile range (IQR): \(q_3\) - \(q_1\)
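A short Python sketch of these measures with hypothetical data (`statistics.quantiles` with `n=4` returns the three quartile cutpoints; note that different quantile definitions give slightly different cutpoints):

```python
from statistics import quantiles

scores = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]  # hypothetical sample

q1, q2, q3 = quantiles(scores, n=4)  # three cutpoints -> four equal-sized groups
print(q2)                            # the median
print(max(scores) - min(scores))     # range: 10
print(q3 - q1)                       # interquartile range (IQR)
```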
Visualizing variation: box plot (box-and-whisker plot)
- A box plot is used to visualize variation of a variable
- Box (IQR): \(q_1\) (bottom), median (thickest line), \(q_3\) (top)
- (In example below, \(q_1\) and median have the same value)
- Whiskers: maximum (top) and minimum (bottom) non-outlier value
- Circle(s): outliers (> 1.5 IQR distance from box)
Important measure of variation: variance
- Deviation: difference between mean and individual value
- Variance: average squared deviation
- Squared in order to make negative differences positive
- Population variance: \[\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2\]
- As sample mean (\(\bar{x}\) or \(m\)) is approximation of population mean (\(\mu\)), sample variance formula contains division by \(n-1\) (results in slightly higher variance): \[s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2\]
Important measure of variation: standard deviation
- Standard deviation is square root of variance \[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2}\] \[s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2}\]
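In Python, the standard library offers both versions (a minimal sketch with hypothetical data; `pvariance`/`pstdev` divide by \(n\), `variance`/`stdev` by \(n-1\)):

```python
from statistics import pstdev, pvariance, stdev, variance

x = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

print(pvariance(x), pstdev(x))  # population formulas: 4.0, 2.0
print(variance(x), stdev(x))    # sample formulas: ~4.571, ~2.138 (slightly higher)
```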
Normal distribution and standard deviation (1)
\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\) (34 + 34)
\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
Normal distribution and standard deviation (2)
\(P(85 \leq \rm{IQ} \leq 115) \approx 68\%\) (34 + 34)
\(P(70 \leq \rm{IQ} \leq 130) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(55 \leq \rm{IQ} \leq 145) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
- IQ scores are normally distributed with mean 100 and standard deviation 15
Standardized scores
- Standardization facilitates interpretation
- E.g., how to interpret: “Emma got a score of 112” and “Tom got a score of 105”
- Interpretation should be done with respect to mean \(\mu\) and standard deviation \(\sigma\)
- Raw scores can be transformed to standardized scores (\(z\)-scores or \(z\)-values) \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
- Interpretation: difference of value from mean in number of standard deviations
Calculating standardized values
- Suppose \(\mu = 108\), \(\sigma = 4\), then: \[z_{112} = \frac{x - \mu}{\sigma} = \frac{112 - 108}{4} = 1\] \[z_{105} = \frac{105 - 108}{4} = -0.75\]
- \(z\) shows distance from mean in number of standard deviations
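The same calculation as a small Python function (a sketch; the function name is mine):

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean, in standard deviations."""
    return (x - mu) / sigma

print(z_score(112, mu=108, sigma=4))  # 1.0
print(z_score(105, mu=108, sigma=4))  # -0.75
```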
Distribution of standardized variables
- If we transform all raw scores of a variable into \(z\)-scores using: \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
- We obtain a new transformed variable whose
- Mean is 0
- Standard deviation is 1
- In sum: \(z\)-score = distance from \(\mu\) in \(\sigma\)’s
- \(z\)-scores are useful for interpretation and hypothesis testing
Standard normal distribution
\(P(-1 \leq z \leq 1) \approx 68\%\) (34 + 34)
\(P(-2 \leq z \leq 2) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(-3 \leq z \leq 3) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
For comparison: normal distribution
\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\) (34 + 34)
\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
Checking normality assumption
- Some statistical tests (e.g., \(t\)-test) require that the data is (roughly) normally distributed
- How to test this?
- Using visual inspection of a normal quantile plot (or: quantile-quantile plot)
- A straight line in this graph indicates a (roughly) normal distribution
- (Alternatively, you can use the Shapiro-Wilk test)
Normal quantile plot: how it works
- Sort the data from smallest to largest to determine quantiles (e.g., percentiles)
- E.g., median for 50th percentile
- Calculate \(z\)-values belonging to the quantiles (e.g., percentiles) of a standard normal distribution
- E.g., \(z =\) 0 for 50th percentile, \(z =\) 2 for 97.5th percentile, etc.
- Plot data values (\(y\)-axis) against normal quantile values (\(x\)-axis)
- If points on (or close to) straight line: values normally distributed
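The construction above can be sketched directly in Python (simulated, hypothetical data; `NormalDist.inv_cdf` gives the z-value of each percentile; in practice ready-made functions such as `scipy.stats.probplot` do this for you):

```python
import random
from statistics import NormalDist

import matplotlib.pyplot as plt

random.seed(1)
data = sorted(random.gauss(100, 15) for _ in range(200))  # simulated data

# z-values of the (i + 0.5)/n percentiles of the standard normal distribution
n = len(data)
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Data values (y-axis) against normal quantile values (x-axis):
# points close to a straight line suggest a (roughly) normal distribution
plt.scatter(theoretical, data, s=8)
plt.xlabel("normal quantiles (z)")
plt.ylabel("observed values")
plt.show()
```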
Normal quantile plot example
Selecting a sample
- Selecting a sample from a population includes an element of chance: which individuals are studied?
- Important question: How to reason about the population using a sample?
- Answered using the Central Limit Theorem
Central Limit Theorem
- Suppose we were to gather many different samples from the population; the distribution of the sample means would then be (approximately) normally distributed
- The mean of these sample means (\(\bar{x}\)) will be the population mean (\(m_{\bar{x}} = \mu\))
- The standard deviation of the sample means (the standard error SE) depends on the sample size \(n\) and the population standard deviation \(\sigma\): SE \(= s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)
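A small simulation illustrating these claims (a sketch; the population here is hypothetical and deliberately skewed, to show that the result does not require a normal population):

```python
import random
from statistics import mean, stdev

random.seed(42)
n = 25  # sample size

# A skewed (non-normal) population with mean and SD of about 15
population = [random.expovariate(1 / 15) for _ in range(100_000)]

# Gather many samples and record each sample mean
sample_means = [mean(random.sample(population, n)) for _ in range(2_000)]

print(mean(sample_means))            # close to the population mean (~15)
print(stdev(sample_means))           # close to the SE (~3)
print(stdev(population) / n ** 0.5)  # the prediction: sigma / sqrt(n)
```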
Reasoning about the population (1)
- Given that the distribution of sample means is normally distributed \(N(\mu,\sigma/\sqrt{n})\), having one randomly selected sample allows us to reason about the population
- Requirement: sample is representative (unbiased sample)
- Random selection helps avoid bias
Reasoning about the population (2)
- Given a representative sample:
- We estimate the population mean as equal to the sample mean (best guess)
- How certain we are of this estimate depends on the standard error: \(\sigma/\sqrt{n}\)
- Increasing sample size \(n\) reduces uncertainty
- Hard work pays off (in exactness), but it doesn't pay off quickly: uncertainty shrinks only with \(\sqrt{n}\)
- Sample means are normally distributed (CLT):
- We can relate a sample mean to the population mean by using characteristics of the normal distribution
Normal distribution
- We know the probability of an element \(x\) having a value close to the mean \(\mu\):
\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\) (34 + 34)
\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
Normal distribution: \(z\)-scores
- With standardized values: \(z = (x - \mu)/\sigma \Rightarrow \mu = 0\) and \(\sigma = 1\)
\(P(-1 \leq z \leq 1) \approx 68\%\) (34 + 34)
\(P(-2 \leq z \leq 2) \approx 95\%\) (34 + 34 + 13.5 + 13.5)
\(P(-3 \leq z \leq 3) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
Reasoning about the population (3)
- Sample means can be related to the population in two ways:
- Using a confidence interval
- An interval which is calculated in such a way that a large proportion of the calculated intervals contains the true population mean
- Using a hypothesis test
- Tests if hypothesis about population is compatible with sample result
Confidence interval
- Definition: there is an \(x\)% probability that when computing an \(x\)% confidence interval (CI) on the basis of a sample, it contains \(\mu\)
- The CI can be seen as an estimate of plausible values of \(\mu\)
- (For those who are interested: there is a lot of confusion about interpreting CIs)
Confidence interval: example (1)
- Consider the following example:
You want to know how many hours per week a student of the university spends speaking English. The standard deviation \(\sigma\) for the university is 1 hr/wk.
- You collect data from 100 randomly chosen students
- You calculate the sample mean \(m = 5\) hr/wk (N.B. in my notation: \(m\) = \(\bar{x}\))
- You therefore estimate the population mean \(\mu = 5\) hr/wk and standard error SE \(= 1/\sqrt{100} = 0.1\) hr/wk
- What is the 95% confidence interval (CI) of the mean?
Confidence interval: example (2)
- According to the CLT, the sample means are normally distributed
- 95% of the sample means lie within \(m \pm\) 2 SE
- (i.e. actually it is \(m \pm\) 1.96 SE, but we round this to \(m \pm\) 2 SE)
- With \(m = 5\) and SE \(= 0.1\), the 95% CI is \(5 \pm 2 \times 0.1\) = (4.8 hr/wk, 5.2 hr/wk)
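The same computation in Python (a minimal sketch; 1.96 is the exact z-value that the slides round to 2):

```python
from statistics import NormalDist

sigma, n, m = 1, 100, 5          # population SD, sample size, sample mean
se = sigma / n ** 0.5            # standard error: 0.1
z = NormalDist().inv_cdf(0.975)  # ~1.96: middle 95% of the normal distribution
print(m - z * se, m + z * se)    # ~ (4.8, 5.2) hr/wk
```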
Hypotheses
- Instead of using samples for confidence intervals, we often use them to test hypotheses about populations
- Examples of hypotheses
- Answering online lecture questions is related to the course grade
- Females and males differ in their English proficiency
- Nouns take longer to read than verbs
Hypothesis testing (1)
- Testing these hypotheses requires empirical and variable data
- Empirical: based on observation rather than theory alone
- Variable: individual cases vary
- Hypotheses can be derived from theory, but also from observations if theory is incomplete
Hypothesis testing (2)
- We start from a research question:
Is answering online lecture questions related to the course grade?
- Which we then formulate as a hypothesis (i.e. a statement):
Answering online lecture questions is related to the course grade
- For statistics to be useful, this needs to be translated to a concrete form:
Students answering online lecture questions score higher than those who do not
Hypothesis testing (3)
- Students answering online lecture questions score higher than those who do not
- What is meant by this?
All students answering online lecture questions score higher than those who do not?
- Probably not, the data is variable, there are other factors:
- Attention level of each student
- Difficulty of the lecture
- If the questions were answered seriously
- We need statistics to abstract away from the variability of the observations (i.e. unsystematic variation)
Hypothesis testing (4)
- Students answering online lecture questions score higher than those who do not
- Meaning:
- Not: All students answering online lecture questions score higher than those who do not
- But: On average, students answering online lecture questions score higher than those who do not
Testing a hypothesis using a sample (1)
- On average, students answering online lecture questions score higher than those who do not
- This hypothesis must be studied on the basis of a sample, i.e. a limited number of students following a course with online lecture questions
- Of course we’re interested in the population, i.e. all students who followed a course with online lecture questions
Testing a hypothesis using a sample (2)
- The hypothesis concerns the population, but it is studied through a representative sample
- Students answering online lecture questions score higher than those who do not
(study based on 30 students who answered the questions and 30 who did not)
- Females have higher English proficiency than males
(study based on 40 males and 40 females)
- Nouns take longer to read than verbs
(studied on the basis of 35 people’s reading of 100 nouns and verbs)
Analysis: when is a difference real?
- Given a testable hypothesis:
Students answering online lecture questions score higher than those who do not
- You collect the final course grade for 30 randomly selected students who answered the online questions and 30 who did not
- Will any difference in average grade (in the right direction) be proof?
- Probably not: very small differences might be due to chance (unsystematic variation)
- Therefore we use statistics to analyze the results
- Statistically significant results are those unlikely to be due to chance
Comparing a sample to population: \(z\)-test
- \(z\)-test allows assessing difference between sample and population
- \(\mu\) and \(\sigma\) for the population should be known (standardized tests: e.g., IQ test)
- Sample mean \(m\) is compared to population mean \(\mu\)
Example of \(z\)-test
- You think Computer Assisted Language Learning may be effective for kids
- You give a standard test of language proficiency (\(\mu\) = 70, \(\sigma\) = 14) to 49 randomly chosen children who followed a CALL program
- You find \(m\) = 74
- You calculate SE = \(\sigma/\sqrt{n} = 14/\sqrt{49} = 2\)
- 74 is 2 SE above the population mean: at the 97.5th percentile
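As a computation (a sketch using the values from this slide; `NormalDist().cdf` gives the exact percentile, 97.7, which the slides round to 97.5 via the 2-SE rule):

```python
from statistics import NormalDist

mu, sigma, n, m = 70, 14, 49, 74
se = sigma / n ** 0.5       # 2.0
z = (m - mu) / se           # 2.0
print(NormalDist().cdf(z))  # ~0.977: m lies at about the 97.7th percentile
```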
Conclusions of \(z\)-test
- Group with CALL scored 2 SE above mean (\(z\)-score of 2)
- Chance of this (or more extreme score) is only 2.5%, so very unlikely that this is due to chance
- Conclusion: CALL programs are probably helping
- However, it is also possible that CALL is not helping, but the effect is caused by some other factor
- Such as the sample including many proficient kids
- This is a confounding factor: an influential hidden variable (a variable not used in a study)
Importance of sample size
- Suppose we had used 9 children instead of 49: at what percentile would a sample mean of \(m = 74\) be?
- SE = \(\sigma/\sqrt{n} = 14/\sqrt{9} \approx 4.7\)
- \(m\) = 74 is less than 1 SE above the mean, i.e. at less than the 84th percentile
- Sample means of at least this value are found by chance more than 16% of the time: not enough reason to suspect a CALL effect
Statistical reasoning: two hypotheses (1)
- Rather than one hypothesis, we create two hypotheses about the data:
- The null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_a\))
- The null hypothesis states that there is no relationship between two measured phenomena (e.g., CALL program and test score), while the alternative hypothesis states there is
Statistical reasoning: two hypotheses (2)
- For the CALL example (49 children):
- \(H_0\): \(\mu_{CALL} = 70\) (the population mean of people using CALL is 70)
- \(H_a\): \(\mu_{CALL} > 70\) (the population mean of people using CALL is higher than 70)
- While \(m = 74\) suggests that \(H_a\) is right, this might be due to chance, so we would need enough evidence (i.e. a low SE) to accept it over the null hypothesis
- Logically, \(H_0\) is the negation of \(H_a\), so we would expect \(H_0\): \(\mu_{CALL} \leq 70\), but we usually see '\(=\)' in formulations
Statistical reasoning (1)
\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)
- The reasoning goes as follows:
- Suppose \(H_0\) is true, what is the chance \(p\) of observing a sample with \(m \geq\) 74?
- To determine this, we convert 74 to a \(z\)-score: \(z = (m - \mu) / \textrm{SE} = (74-70)/2 = 2\)
- And find the associated \(p\)-value (about 0.025)
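In Python (a sketch; the exact value is ~0.0228, which the slides round to 0.025 via the 95%-within-2-SE rule):

```python
from statistics import NormalDist

z = (74 - 70) / 2            # 2.0
p = 1 - NormalDist().cdf(z)  # one-sided: P(Z >= 2)
print(p)                     # ~0.0228
```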
Statistical reasoning (2)
\(H_0\): \(\mu_{CALL} = 70\) \(H_a\): \(\mu_{CALL} > 70\)
- \(P(z \geq 2) \approx 0.025\)
- The chance of observing a sample at least this extreme given \(H_0\) is true is 0.025
- This is the \(p\)-value (measured significance level)
- If \(H_0\) were correct and kids with CALL exp. had the same language proficiency as others, the observed sample would be expected only 2.5% of the time
- Strong evidence against the null hypothesis
Statistically significant?
- We have determined \(H_0\), \(H_a\) and the \(p\)-value
- The classical hypothesis test specifies how unlikely a sample must be (assuming \(H_0\)) for the result to count as significant
- We compare the \(p\)-value against this threshold significance level or \(\alpha\)-level
- If the \(p\)-value is lower than the \(\alpha\)-level (usually 0.05, but it may be lower as well), we regard the result as significant and reject the null hypothesis
Statistical significance: summary
- The \(p\)-value is the chance of encountering a sample at least as extreme as the observed one, given that the null hypothesis is true
- The \(\alpha\)-level is the threshold for the \(p\)-value, below which we regard the result as significant
- If result significant, we reject \(H_0\) and assume \(H_a\) is true
Steps for assessing statistical significance
- Specify \(H_0\) and \(H_a\)
- Specify test statistic (e.g., mean) and underlying distribution (assuming \(H_0\))
- Specify the \(\alpha\)-level at which \(H_0\) will be rejected
- Determine the value of the statistic (e.g., mean) on the basis of a sample
- Calculate the \(p\)-value and compare to \(\alpha\)
- \(p\)-value \(< \alpha\): reject \(H_0\) (significant result)
- \(p\)-value \(\geq \alpha\): retain \(H_0\) (non-significant result)
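These steps can be wrapped into a small helper (a sketch; the function name is mine, and it assumes a one-sided z-test with known population \(\sigma\)):

```python
from statistics import NormalDist

def z_test_one_sided(m, mu, sigma, n, alpha=0.05):
    """One-sided z-test of H0: population mean = mu vs Ha: mean > mu."""
    se = sigma / n ** 0.5
    z = (m - mu) / se
    p = 1 - NormalDist().cdf(z)  # P(Z >= z) assuming H0
    return z, p, "reject H0" if p < alpha else "retain H0"

print(z_test_one_sided(m=74, mu=70, sigma=14, n=49))  # (2.0, ~0.023, 'reject H0')
```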
Critical values
- Critical values: those values of the sample statistic resulting in a rejection of \(H_0\)
- E.g., if \(\alpha\) is set at 0.05 (one-sided), the critical region is where \(P(Z \geq z) \leq 0.05\), i.e. \(z \geq 1.64\)
- We can transform this to raw values using the \(z\) formula \[z = (x-\mu)/\textrm{SE}\\
1.64 = (x-70)/2\\
3.3 = x - 70\\
x = 73.3\]
- Thus a sample mean of at least 73.3 will result in rejection of \(H_0\)
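The same computation in Python (a sketch; 1.64 is the slides' rounding of the exact 1.645):

```python
from statistics import NormalDist

mu, se, alpha = 70, 2, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)  # ~1.645
x_crit = mu + z_crit * se                 # ~73.3: the critical sample mean
print(z_crit, x_crit)
```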
One-sided test
- There are different forms of statistical tests:
- \(H_a\) predicts high \(m\): CALL improves language ability
- \(H_a\) predicts low \(m\): Eating broccoli lowers cholesterol levels
Two-sided test
- Sometimes \(H_a\) predicts neither a lower nor a higher value, but simply a different one
- With a significance level \(\alpha\) of 0.05, both very high (2.5% highest) and very low (2.5% lowest) values give reason to reject \(H_0\)
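For a two-sided test, the \(p\)-value counts both tails (a minimal sketch):

```python
from statistics import NormalDist

z = 2.0                                           # observed z-score
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails count
print(p_two_sided)                                # ~0.0455
```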
Statistical significance and confidence interval
- Statistical significance and a confidence interval (CI) are linked
- A 95% CI based on the sample mean \(m\) represents the values for \(\mu\) for which the difference between \(\mu\) and \(m\) is not significant (at the 0.05 significance threshold for a two-sided test)
- A value outside of the CI indicates a statistically significant difference
Chasing significance?
- If your result is not significant, you could try to obtain more data (reducing the standard error)
- Is it sensible to collect the extra data to “push” a result to significance?
- No. At least, usually not.
- The real result is the effect size (e.g., the difference between the groups)
Understanding significance
- “Statistically significant” implies that an effect probably is not due to chance, but the effect can be very small
- If you want to know whether you should buy CALL software to learn a language, statistically significant does not tell you this
- This is a two-edged sword: if an effect is not statistically significant, it does not mean nothing important is going on
- You are just not sure: it could be a chance effect
Misuse of significance
- Garbage in, garbage out: statistics won’t help an experiment with a poor design, or where data was poorly collected
- No significance hunting: hypotheses should be formulated before data collection and analysis
- Modern danger: if there are many potential variables, a few are likely to turn out significant purely by chance
- Specific tests are necessary to correct for this
Recap: hypothesis testing
- A statistical hypothesis concerns a population (not a sample!) and involves a statistic (such as mean, frequency, etc.)
- Population: all students attending a course using online lecture questions
- Parameter (statistic): (average) course performance
- Hypothesis: average performance of students answering online lecture questions is higher than those who do not
Identifying hypotheses
- Alternative hypothesis \(H_a\) (original hypothesis) is contrasted with null hypothesis \(H_0\) (hypothesis that nothing out of the ordinary is going on)
- \(H_a\): average performance of students answering online lecture questions higher
- \(H_0\): answering online lecture questions does not impact performance
- Logically \(H_0\) should imply \(\neg H_a\)
Possible errors
Of course, you could be wrong (e.g., due to an unrepresentative sample)!
| | \(H_0\) true | \(H_0\) false |
|---|---|---|
| \(H_0\) accepted | correct | type II error |
| \(H_0\) rejected | type I error | correct |
- Hypothesis testing focuses on type I errors:
- \(p\)-value: chance of type I error
- \(\alpha\)-level: boundary of acceptable level of type I error
- Type II errors:
- \(\beta\): chance of type II error
- \(1 - \beta\): power of statistical test
- More sensitive (and useful) tests have more power to detect an effect
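A simulation sketch of both error rates, reusing the CALL numbers (all choices here are mine; sample means are drawn directly from their CLT distribution \(N(\mu, \textrm{SE})\)):

```python
import random
from statistics import NormalDist

random.seed(7)
mu0, sigma, n, alpha = 70, 14, 49, 0.05
se = sigma / n ** 0.5                                # 2.0
x_crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # ~73.3

def rejection_rate(true_mu, trials=10_000):
    # By the CLT, sample means follow N(true_mu, se)
    hits = sum(random.gauss(true_mu, se) > x_crit for _ in range(trials))
    return hits / trials

print(rejection_rate(70))  # H0 true: ~0.05 = alpha (type I error rate)
print(rejection_rate(74))  # H0 false: ~0.64 = power (1 - beta)
```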
Possible errors: easier to remember
- False positive: incorrect positive (accepting \(H_a\)) result
- False negative: incorrect negative (not rejecting \(H_0\)) result
Recap
- In this lecture, we’ve covered
- Descriptive vs. inferential statistics
- Sample vs. population
- (Types of) variables
- Distribution of a variable: central tendency and variation
- Standardized scores
- Checking for a normal distribution
- Relation between population (mean) and sample (mean)
- Confidence interval for population mean based on sample mean
- Testing a hypothesis about the population using a sample
- Statistical significance
- Error types
Questions?
Thank you for your attention!
https://www.martijnwieling.nl
m.b.wieling@rug.nl