# Basic concepts

Martijn Wieling
University of Groningen

## This lecture

• Descriptive vs. inferential statistics
• Sample vs. population
• (Types of) variables
• Distribution of a variable: central tendency and variation
• Standardized scores
• Checking for a normal distribution
• Reasoning about the population using a sample
• Relation between population (mean) and sample (mean)
• Confidence interval for population mean based on sample mean
• Testing a hypothesis about the population using a sample
• Statistical significance
• Error types

## Why use statistics?

• Summarize data (descriptive statistics)
• Assess relationships in data (inferential statistics)

## Descriptive vs. inferential statistics

• Descriptive statistics:
• Statistics used to describe (sample) data without further conclusions
• Measures of central tendency: Mean, median, mode
• Measures of variation (or spread): range, IQR, variance, standard deviation
• Inferential statistics:
• Describe data of sample in order to infer patterns in the population
• Statistical tests: $$t$$-test, $$\chi^2$$-test, etc.

## Why study a sample?

• Studying the whole population is (frequently) practically impossible
• Sample is a (selected) subset of population and thus more accessible
• Selection of representative sample is very important!

## Characterizing numerical variables: distribution

• We are generally not interested in individual values of a variable, but rather in all values and their frequencies
• This is captured by a distribution
• Famous distribution: Normal distribution ("bell-shaped" curve): e.g., IQ scores

## Interpreting a density curve

• The total area under a density curve is equal to 1
• A density curve does not provide information about the frequency of one value
• E.g., there might be no one who has a value of exactly 6.1
• It only provides information about an interval
• E.g., more than 50% of the values lie between 5.5 and 7.5
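
To make "area over an interval" concrete, here is a minimal Python sketch. It assumes, purely for illustration, a normal density with mean 6.5 and standard deviation 1. The curve discussed in the lecture is not specified here, so these parameters and the helper `area_between` are hypothetical.

```python
from statistics import NormalDist

# Hypothetical density (assumption for illustration, not the lecture's data):
# a normal curve with mean 6.5 and standard deviation 1.
density = NormalDist(mu=6.5, sigma=1.0)


def area_between(low: float, high: float) -> float:
    """Area under the density curve between low and high."""
    return density.cdf(high) - density.cdf(low)


# Proportion of values between 5.5 and 7.5
wide = area_between(5.5, 7.5)

# A very narrow interval around 6.1 covers almost no area, which is why a
# density curve says nothing useful about one exact value.
narrow = area_between(6.0995, 6.1005)

print(f"P(5.5 <= x <= 7.5) = {wide:.3f}")
print(f"P(6.0995 <= x <= 6.1005) = {narrow:.5f}")
```

The narrower the interval, the smaller the area, so the probability of one exact value shrinks towards zero.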

## Interpreting a density curve: normal distribution

• The normal distribution has convenient characteristics
• Completely symmetric
• Red area (within 1 standard deviation of the mean): (about) 68%
• Red and green area (within 2 standard deviations of the mean): (about) 95%

## Characterizing the distribution of numerical variables

• A distribution can also be characterized by measures of center and variation
• (skewness measures the symmetry of the distribution; not covered further)

## Characterizing numerical variables: central tendency

• Mode: most frequent element (for nominal data: only meaningful measure)
• Median: the middle value when the data are sorted from small to large (with an even number of values: the average of the two middle values)
• Mean: arithmetical average

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i$
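
The three measures can be compared on a small example. The sketch below uses Python's standard `statistics` module. The `scores` list is invented for illustration and is not data from the lecture.

```python
import statistics

# Hypothetical sample of seven scores (invented for illustration)
scores = [5, 6, 6, 7, 8, 9, 15]

# Mean, written out as in the formula: sum of all values divided by n
mean_manual = sum(scores) / len(scores)

# The same measures via the standard library
mean = statistics.mean(scores)
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value

print(f"mean = {mean_manual}, median = {median}, mode = {mode}")
```

In this example the single high score (15) pulls the mean above the median, which illustrates why the median is often preferred for skewed data.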

## Measure of variation: quantiles

• Quantiles: cutpoints that divide the sorted data into subsets of equal size
• Quartiles: three cutpoints that divide the data into four equal-sized sets
• $$q_1$$ (1st quartile): cutpoint between 1st and 2nd group
• $$q_2$$ (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
• $$q_3$$ (3rd quartile): cutpoint between 3rd and 4th group
• Percentiles: divide the data into one hundred equal-sized subsets
• $$q_1$$ = 25th percentile
• $$q_2$$ (= median) = 50th percentile
• A score at the $$n$$th percentile is better than $$n$$% of the scores
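
As an illustration, the quartiles of a small sample can be computed with `statistics.quantiles` from Python's standard library. The `data` list is invented for this sketch. Note that statistical packages use slightly different interpolation rules, so other software may report slightly different cutpoints for the same data.

```python
import statistics

# Hypothetical sample of nine values (invented for illustration)
data = [2, 4, 4, 5, 7, 8, 9, 10, 12]

# n=4 requests the three quartile cutpoints. method="inclusive" treats the
# sample as containing the smallest and largest possible values; the default
# method ("exclusive") interpolates differently.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(f"q1 = {q1}, q2 = {q2}, q3 = {q3}")

# The 2nd quartile coincides with the median
print(q2 == statistics.median(data))
```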

## Measure of variation: range

• Minimum, maximum: lowest and highest value
• Range: difference between minimum and maximum
• Interquartile range (IQR): $$q_3$$ - $$q_1$$

## Visualizing variation: box plot (box-and-whisker plot)

• A box plot is used to visualize variation of a variable
• Box (IQR): $$q_1$$ (bottom), median (thickest line), $$q_3$$ (top)
• (In example below, $$q_1$$ and median have the same value)
• Whiskers: maximum (top) and minimum (bottom) non-outlier value
• Circle(s): outliers (> 1.5 IQR distance from box)
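
The numbers behind a box plot can be computed without drawing it. The following sketch assumes an invented sample with one extreme value and uses the 1.5 IQR rule stated above. The variable names (`lower_fence`, `upper_fence`, and so on) are chosen for this example only.

```python
import statistics

# Hypothetical sample with one extreme value (invented for illustration)
data = [2, 4, 4, 5, 7, 8, 9, 10, 30]

# Box: first quartile, median, third quartile
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Values more than 1.5 IQR away from the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers: lowest and highest value that are not outliers
non_outliers = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(non_outliers), max(non_outliers)

print(f"range = {max(data) - min(data)}, IQR = {iqr}")
print(f"whiskers = ({whisker_low}, {whisker_high}), outliers = {outliers}")
```

The range is strongly affected by the single extreme value, whereas the IQR is not.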

## Important measure of variation: variance

• Deviation: difference between an individual value and the mean
• Variance: average squared deviation
• Squared in order to make negative differences positive
• Population variance: $\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2$
• As the sample mean ($$\bar{x}$$ or $$m$$) is only an approximation of the population mean ($$\mu$$), the sample variance formula divides by $$n-1$$ instead of $$n$$ (resulting in a slightly higher variance): $s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2$
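
The difference between the two formulas can be seen by applying both to the same numbers. This is a minimal sketch with an invented `data` list. Whether a dataset should be treated as a population or as a sample depends on the research question, not on the numbers themselves.

```python
import statistics

# Hypothetical data (invented for illustration)
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in data)

pop_variance = ss / n           # divide by n: data treated as the population
sample_variance = ss / (n - 1)  # divide by n - 1: data treated as a sample

print(f"population variance = {pop_variance}")
print(f"sample variance     = {sample_variance:.3f}")

# The standard library implements both formulas
print(statistics.pvariance(data), statistics.variance(data))
```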

## Important measure of variation: standard deviation

• Standard deviation is the square root of the variance

$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2}$

$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2}$
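
In Python's standard library the two standard deviations are available as `statistics.pstdev` (population, division by $$n$$) and `statistics.stdev` (sample, division by $$n-1$$). The sketch below applies both to an invented `data` list.

```python
import math
import statistics

# Hypothetical data (invented for illustration)
data = [2, 4, 4, 4, 5, 5, 7, 9]

sigma = statistics.pstdev(data)  # population standard deviation
s = statistics.stdev(data)       # sample standard deviation

# Both are simply the square root of the corresponding variance
sigma_from_variance = math.sqrt(statistics.pvariance(data))
s_from_variance = math.sqrt(statistics.variance(data))

print(f"sigma = {sigma:.3f}, s = {s:.3f}")
```

Unlike the variance, the standard deviation is expressed in the same unit as the original values, which makes it easier to interpret.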

## Normal distribution and standard deviation (1)

$$P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%$$     (34 + 34)
$$P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%$$     (34 + 34 + 13.5 + 13.5)
$$P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%$$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)
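
These percentages can be checked numerically. For a normal distribution, the probability of falling within $$k$$ standard deviations of the mean equals $$\mathrm{erf}(k/\sqrt{2})$$, where erf is the error function. The helper name `coverage` below is chosen for this sketch only.

```python
import math


def coverage(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.2%}")
```

The exact values differ slightly from the rounded 68%, 95% and 99.7% used on the slide.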

## Normal distribution and standard deviation (2)

$$P(85 \leq \mathrm{IQ} \leq 115) \approx 68\%$$     (34 + 34)
$$P(70 \leq \mathrm{IQ} \leq 130) \approx 95\%$$     (34 + 34 + 13.5 + 13.5)
$$P(55 \leq \mathrm{IQ} \leq 145) \approx 99.7\%$$     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

• IQ scores are normally distributed with mean 100 and standard deviation 15
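
The IQ intervals can be reproduced with `statistics.NormalDist` (Python 3.8 or later), using the mean of 100 and standard deviation of 15 given above. The helper `prob_between` is a name introduced for this sketch.

```python
from statistics import NormalDist

# IQ scores: normal distribution with mean 100 and standard deviation 15
iq = NormalDist(mu=100, sigma=15)


def prob_between(low: float, high: float) -> float:
    """Proportion of IQ scores between low and high."""
    return iq.cdf(high) - iq.cdf(low)


for low, high in [(85, 115), (70, 130), (55, 145)]:
    print(f"P({low} <= IQ <= {high}) = {prob_between(low, high):.1%}")
```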

## Standardized scores

• Standardization helps facilitate interpretation
• E.g., how to interpret: "Emma got a score of 112" and "Tom got a score of 105"
• Interpretation should be done with respect to mean $$\mu$$ and standard deviation $$\sigma$$
• Raw scores can be transformed to standardized scores ($$z$$-scores or $$z$$-values)

$z = \frac{x - \mu}{\sigma} = \frac{\mathrm{deviation}}{\mathrm{standard}\,\mathrm{deviation}}$

• Interpretation: difference of value from mean in number of standard deviations

## Calculating standardized values

• Suppose $$\mu = 108$$, $$\sigma = 4$$, then:

$z_{112} = \frac{x - \mu}{\sigma} = \frac{112 - 108}{4} = 1$

$z_{105} = \frac{x - \mu}{\sigma} = \frac{105 - 108}{4} = -0.75$

• $$z$$ shows distance from mean in number of standard deviations
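
The same calculation as a short Python sketch, using the values from the example ($$\mu = 108$$, $$\sigma = 4$$). The function name `z_score` is introduced here for illustration.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Distance of x from the mean mu, in numbers of standard deviations."""
    return (x - mu) / sigma


z_112 = z_score(112, mu=108, sigma=4)
z_105 = z_score(105, mu=108, sigma=4)

print(z_112)  # 1.0: one standard deviation above the mean
print(z_105)  # -0.75: three quarters of a standard deviation below the mean
```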

## Distribution of standardized variables

• If we transform all raw scores of a variable into $$z$$-scores using:

$z = \frac{x - \mu}{\sigma} = \frac{\mathrm{deviation}}{\mathrm{standard}\,\mathrm{deviation}}$

• We obtain a new transformed variable whose
• Mean is 0
• Standard deviation is 1
• In sum: $$z$$-score = distance from $$\mu$$ in $$\sigma$$'s
• $$z$$-scores are useful for interpretation and hypothesis testing
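
That the standardized variable has mean 0 and standard deviation 1 can be verified on any dataset. The sketch below uses an invented list `raw` and treats it as the whole population, so $$\mu$$ and $$\sigma$$ are computed with the population formulas.

```python
import statistics

# Hypothetical raw scores (invented for illustration), treated as a population
raw = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(raw)
sigma = statistics.pstdev(raw)  # population standard deviation

# Transform every raw score into a z-score
z = [(x - mu) / sigma for x in raw]

z_mean = statistics.mean(z)
z_sd = statistics.pstdev(z)

print(z)
print(f"mean of z = {z_mean}, standard deviation of z = {z_sd}")
```

If the sample standard deviation $$s$$ is used instead, the standardized values have a sample standard deviation of 1.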