Martijn Wieling

University of Groningen

- Descriptive vs. inferential statistics
- Sample vs. population
- (Types of) variables
- Distribution of a variable: central tendency and variation
- Standardized scores
- Checking for a normal distribution
- Reasoning about the population using a sample
- Relation between population (mean) and sample (mean)
- Confidence interval for population mean based on sample mean
- Testing a hypothesis about the population using a sample
- Statistical significance
- Error types

- Why use statistics?
  - Summarize data (*descriptive statistics*)
  - Assess relationships in data (*inferential statistics*)

- Descriptive statistics:
  - Statistics used to **describe** (sample) data without further conclusions
    - Measures of **central tendency**: mean, median, mode
    - Measures of **variation** (or spread): range, IQR, variance, standard deviation
- Inferential statistics:
  - Describe data of **sample** in order to infer patterns in the **population**
    - Statistical tests: \(t\)-test, \(\chi^2\)-test, etc.

- Studying the whole population is (frequently) practically impossible
- Sample is a (selected) subset of population and thus more accessible
  - Selection of a **representative** sample is very important!

- We are generally not interested in individual values of a variable, but rather all values and their frequency
- This is captured by a **distribution**
  - Famous distribution: **Normal distribution** ("bell-shaped" curve): e.g., IQ scores

- The total area under a density curve is equal to 1
- A density curve does not provide information about the frequency of one value
- E.g., there might be no one who has a value of exactly 6.1

- It only provides information about an **interval**
  - E.g., more than 50% of the values lie between 5.5 and 7.5
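The two properties of a density curve can be checked numerically. The following is a minimal sketch, assuming a normal density with mean 6.5 and standard deviation 1.2; these parameters are hypothetical and chosen only for illustration, as the distribution behind the example values above is not specified.

```python
from statistics import NormalDist

# Hypothetical density: mean and standard deviation are assumptions.
dist = NormalDist(mu=6.5, sigma=1.2)

# Total area under the density curve, approximated with a midpoint
# Riemann sum over mean +/- 8 standard deviations.
step = 0.001
lo = dist.mean - 8 * dist.stdev
hi = dist.mean + 8 * dist.stdev
n_steps = round((hi - lo) / step)
total_area = sum(dist.pdf(lo + (i + 0.5) * step) for i in range(n_steps)) * step

# Probability of an interval = area under the curve between its endpoints.
p_interval = dist.cdf(7.5) - dist.cdf(5.5)

# The area of a very narrow interval around 6.1 is close to 0:
# a single exact value carries (almost) no probability.
p_tiny = dist.cdf(6.1 + 1e-6) - dist.cdf(6.1 - 1e-6)

print(round(total_area, 4), round(p_interval, 3), p_tiny)
```

Under these assumed parameters the interval from 5.5 to 7.5 covers more than half of the area, while the area around the single value 6.1 is negligible.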

- The normal distribution has convenient characteristics
  - Completely symmetric
  - Red area (within 1 standard deviation of the mean): (about) 68%
  - Red and green area (within 2 standard deviations of the mean): (about) 95%

- A distribution can also be characterized by measures of **center** and **variation**
  - (*skewness* measures the symmetry of the distribution; not covered further)

- **Mode**: most frequent element (for nominal data: *only* meaningful measure)
- **Median**: when data is sorted from small to large, it is the middle value
- **Mean**: arithmetical average

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i\]
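The three measures of central tendency can be illustrated with Python's standard `statistics` module. This is a small sketch; the values in `scores` and `languages` are made up for the example.

```python
from statistics import mean, median, mode

scores = [4, 7, 7, 8, 10]          # hypothetical numerical data
languages = ["nl", "en", "nl"]     # hypothetical nominal data

score_mode = mode(scores)          # most frequent value
score_median = median(scores)      # middle value of the sorted data
score_mean = mean(scores)          # arithmetical average

# The mean written out as in the formula: sum of all values divided by n.
manual_mean = sum(scores) / len(scores)

# For nominal data only the mode is meaningful.
language_mode = mode(languages)

print(score_mode, score_median, score_mean, language_mode)
```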

- **Quantiles**: cutpoints to divide the sorted data in subsets of equal size
  - Quartiles: three cutpoints to divide the data in **four** equal-sized sets
    - \(q_1\) (1st quartile): cutpoint between 1st and 2nd group
    - \(q_2\) (2nd quartile): cutpoint between 2nd and 3rd group (= **median**!)
    - \(q_3\) (3rd quartile): cutpoint between 3rd and 4th group
  - Percentiles: divide data in **hundred** equal-sized subsets
    - \(q_1\) = 25th percentile
    - \(q_2\) (= median) = 50th percentile
    - Score at \(n\)th percentile is better than \(n\)% of scores
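The relation between quartiles, the median and percentiles can be sketched with `statistics.quantiles`. The data are hypothetical, and the function's default interpolation method is used; other software may use a different convention and return slightly different cutpoints.

```python
from statistics import median, quantiles

data = [2, 4, 4, 5, 7, 8, 9]       # hypothetical scores

# Three cutpoints dividing the sorted data into four equal-sized sets.
q1, q2, q3 = quantiles(data, n=4)

# Ninety-nine cutpoints dividing the data into a hundred subsets;
# index 24 holds the 25th percentile, index 49 the 50th, index 74 the 75th.
percentiles = quantiles(data, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(q1, q2, q3)
print(q2 == median(data), p25 == q1, p50 == q2, p75 == q3)
```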

- **Minimum, maximum**: lowest and highest value
- **Range**: difference between minimum and maximum
- **Interquartile range** (IQR): \(q_3\) - \(q_1\)

- A **box plot** is used to visualize variation of a variable
  - Box (IQR): \(q_1\) (bottom), median (thickest line), \(q_3\) (top)
    - (In example below, \(q_1\) and median have the same value)
  - Whiskers: maximum (top) and minimum (bottom) non-outlier value
  - Circle(s): outliers (> 1.5 IQR distance from box)
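The quantities drawn in a box plot can be computed by hand. The sketch below uses hypothetical data and the default quartile method of `statistics.quantiles`; plotting software may compute the quartiles slightly differently, so the exact box edges can vary between tools.

```python
from statistics import quantiles

data = [1, 3, 3, 4, 5, 6, 7, 8, 9, 10, 30]   # hypothetical; 30 lies far away

q1, q2, q3 = quantiles(data, n=4)   # box: q1 (bottom), median, q3 (top)
iqr = q3 - q1                       # interquartile range = height of the box
value_range = max(data) - min(data) # range = maximum - minimum

# Values more than 1.5 * IQR away from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme values that are not outliers.
non_outliers = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(non_outliers), max(non_outliers)

print(q1, q2, q3, iqr, value_range)
print(outliers, whisker_low, whisker_high)
```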

- Deviation: difference between mean and individual value
- Variance: average **squared** deviation
  - Squared in order to make negative differences positive
  - *Population variance*: \[\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2\]
  - As sample mean (\(\bar{x}\) or \(m\)) is approximation of population mean (\(\mu\)), *sample variance* formula contains division by \(n-1\) (results in slightly higher variance): \[s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2\]

- Standard deviation is square root of variance \[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2}\] \[s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2}\]
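The difference between dividing by \(n\) and by \(n-1\) can be made concrete. The following sketch computes both variances directly from the formulas and compares them with the corresponding functions of Python's `statistics` module; the values in `x` are hypothetical.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

x = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical data

n = len(x)
m = sum(x) / n                                  # mean
sum_sq_dev = sum((xi - m) ** 2 for xi in x)     # sum of squared deviations

pop_var = sum_sq_dev / n           # population variance: division by n
samp_var = sum_sq_dev / (n - 1)    # sample variance: division by n - 1

pop_sd = sqrt(pop_var)             # standard deviation = square root of variance
samp_sd = sqrt(samp_var)

print(pop_var, samp_var, pop_sd, samp_sd)
print(pvariance(x), variance(x), pstdev(x), stdev(x))   # library equivalents
```

For these data the sum of squared deviations is 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7, which is slightly higher.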

\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\) (34 + 34)

\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\) (34 + 34 + 13.5 + 13.5)

\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

\(P(85 \leq \rm{IQ} \leq 115) \approx 68\%\) (34 + 34)

\(P(70 \leq \rm{IQ} \leq 130) \approx 95\%\) (34 + 34 + 13.5 + 13.5)

\(P(55 \leq \rm{IQ} \leq 145) \approx 99.7\%\) (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

- IQ scores are normally distributed with mean 100 and standard deviation 15
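The approximate percentages for IQ scores can be reproduced from the normal distribution with mean 100 and standard deviation 15. This is a small sketch using `statistics.NormalDist`; the helper `prob_between` is introduced here only for illustration.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

def prob_between(dist, low, high):
    """Probability that a value lies between low and high (area under the curve)."""
    return dist.cdf(high) - dist.cdf(low)

p_1sd = prob_between(iq, 85, 115)   # within 1 standard deviation of the mean
p_2sd = prob_between(iq, 70, 130)   # within 2 standard deviations
p_3sd = prob_between(iq, 55, 145)   # within 3 standard deviations

print(round(p_1sd, 3), round(p_2sd, 3), round(p_3sd, 3))
```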

- Standardization helps facilitate interpretation
- E.g., how to interpret: "Emma got a score of 112" and "Tom got a score of 105"
- Interpretation should be done with respect to mean \(\mu\) and standard deviation \(\sigma\)
- Raw scores can be transformed to **standardized scores** (**\(z\)-scores** or **\(z\)-values**) \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
  - Interpretation: difference of value from mean in number of standard deviations

- Suppose \(\mu = 108\), \(\sigma = 4\), then: \[z_{112} = \frac{x - \mu}{\sigma} = \frac{112 - 108}{4} = 1\] \[z_{105} = \frac{105 - 108}{4} = -0.75\]
- \(z\) shows distance from mean in number of standard deviations
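The same calculation for the two scores of 112 and 105 can be written as a short function. This is a minimal sketch; the function name `z_score` is chosen here for illustration.

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean mu, expressed in standard deviations."""
    return (x - mu) / sigma

z_112 = z_score(112, mu=108, sigma=4)   # one standard deviation above the mean
z_105 = z_score(105, mu=108, sigma=4)   # three quarters of a standard deviation below

print(z_112, z_105)
```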

- If we transform **all** raw scores of a variable into \(z\)-scores using: \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
- We obtain a new transformed variable whose
  - Mean is 0
  - Standard deviation is 1
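That the standardized variable has mean 0 and standard deviation 1 can be verified on any set of scores. The sketch below uses hypothetical raw scores and treats them as the whole population, so the population standard deviation (`pstdev`) is used throughout.

```python
from statistics import mean, pstdev

raw = [104, 105, 108, 110, 112, 109]   # hypothetical raw scores

mu = mean(raw)         # mean of the raw scores
sigma = pstdev(raw)    # population standard deviation of the raw scores

# Transform every raw score into a z-score.
z = [(x - mu) / sigma for x in raw]

z_mean = mean(z)       # approximately 0
z_sd = pstdev(z)       # approximately 1

print(round(z_mean, 10), round(z_sd, 10))
```

Because of floating-point rounding the results are only approximately 0 and 1, which is why they are rounded before printing.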

- In sum: \(z\)-score = distance from \(\mu\) in \(\sigma\)'s
- \(z\)-scores are useful for interpretation and hypothesis testing