Basic concepts

Martijn Wieling
University of Groningen

This lecture

  • Descriptive vs. inferential statistics
  • Sample vs. population
  • (Types of) variables
  • Distribution of a variable: central tendency and variation
  • Standardized scores
  • Checking for a normal distribution
  • Reasoning about the population using a sample
    • Relation between population (mean) and sample (mean)
    • Confidence interval for population mean based on sample mean
    • Testing a hypothesis about the population using a sample
    • Statistical significance
    • Error types

Question 1

Why use statistics?

  • Summarize data (descriptive statistics)
  • Assess relationships in data (inferential statistics)

Descriptive vs. inferential statistics

  • Descriptive statistics:
    • Statistics used to describe (sample) data without drawing further conclusions
      • Measures of central tendency: mean, median, mode
      • Measures of variation (or spread): range, IQR, variance, standard deviation
  • Inferential statistics:
    • Describe data of sample in order to infer patterns in the population
      • Statistical tests: \(t\)-test, \(\chi^2\)-test, etc.

Sample vs. population

Why study a sample?

  • Studying the whole population is (frequently) practically impossible
  • Sample is a (selected) subset of population and thus more accessible
    • Selection of representative sample is very important!

Types of variables

Characterizing nominal variables

Characterizing numerical variables: distribution

  • We are generally not interested in individual values of a variable, but rather all values and their frequency
  • This is captured by a distribution
    • Famous distribution: Normal distribution ("bell-shaped" curve): e.g., IQ scores

Interpreting a density curve

[Figure: example density curve]

  • The total area under a density curve is equal to 1
  • A density curve does not provide information about the frequency of one value
    • E.g., there might be no one who has a value of exactly 6.1
  • It only provides information about an interval
    • E.g., more than 50% of the values lie between 5.5 and 7.5
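
The three points above can be illustrated numerically. The following is a minimal sketch that assumes a normal density with mean 6.5 and standard deviation 1; these values and the helper `area` are chosen for illustration only and are not read from the figure.

```python
from statistics import NormalDist

# Hypothetical density: normal with mean 6.5 and standard deviation 1
# (assumed values, not taken from the lecture's figure).
dist = NormalDist(mu=6.5, sigma=1.0)

def area(lo, hi, steps=10_000):
    """Approximate the area under the density between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(dist.pdf(lo + (i + 0.5) * width) for i in range(steps)) * width

total_area = area(-3.5, 16.5)   # ten standard deviations on either side of the mean
interval_area = area(5.5, 7.5)  # proportion of values between 5.5 and 7.5
point_area = area(6.1, 6.1)     # an "interval" of width 0 has area 0

print(round(total_area, 4), round(interval_area, 4), point_area)
```

The area over a single value such as 6.1 is zero, which is why a density curve is only informative about intervals.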

Interpreting a density curve: normal distribution

[Figure: normal density curve with red and green shaded areas]

  • The normal distribution has convenient characteristics
    • Completely symmetric
    • Red area: (about) 68%
    • Red and green area: (about) 95%

Characterizing the distribution of numerical variables

  • A distribution can also be characterized by measures of center and variation
    • (skewness measures the symmetry of the distribution; not covered further)

Characterizing numerical variables: central tendency

  • Mode: most frequent element (for nominal data: only meaningful measure)
  • Median: middle value when the data is sorted from small to large (with an even number of values: the mean of the two middle values)
  • Mean: arithmetical average

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i\]
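
The three measures can be computed with Python's standard-library `statistics` module. This is a small sketch; the scores are made up for illustration.

```python
from statistics import mean, median, mode

# Made-up scores, for illustration only.
scores = [2, 3, 3, 5, 7, 10]

score_mode = mode(scores)      # most frequent value
score_median = median(scores)  # even number of values: mean of the two middle ones (3 and 5)
score_mean = mean(scores)      # sum of the values divided by their number

print(score_mode, score_median, score_mean)
```

In this example the single high value (10) pulls the mean above the median.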

Measure of variation: quantiles

  • Quantiles: cutpoints to divide the sorted data into subsets of equal size
    • Quartiles: three cutpoints to divide the data into four equal-sized sets
      • \(q_1\) (1st quartile): cutpoint between 1st and 2nd group
      • \(q_2\) (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
      • \(q_3\) (3rd quartile): cutpoint between 3rd and 4th group
    • Percentiles: divide data into one hundred equal-sized subsets
      • \(q_1\) = 25th percentile
      • \(q_2\) (= median) = 50th percentile
      • \(q_3\) = 75th percentile
      • Score at \(n\)th percentile is higher than \(n\)% of scores
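
As an illustration, quartiles and percentiles can be computed with `statistics.quantiles` from Python's standard library. Software packages interpolate cutpoints in slightly different ways; this sketch assumes Python's default (`method='exclusive'`) and uses made-up data.

```python
from statistics import median, quantiles

# Made-up data (11 values), for illustration only.
data = [7, 15, 1, 12, 4, 20, 8, 3, 13, 6, 10]

q1, q2, q3 = quantiles(data, n=4)     # three cutpoints: the quartiles
percentiles = quantiles(data, n=100)  # 99 cutpoints; index 24 is the 25th percentile

print(q1, q2, q3)
print(q2 == median(data))             # the 2nd quartile is the median
print(percentiles[24] == q1, percentiles[49] == q2, percentiles[74] == q3)
```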

Measure of variation: range

  • Minimum, maximum: lowest and highest value
  • Range: difference between minimum and maximum
  • Interquartile range (IQR): \(q_3\) - \(q_1\)

Visualizing variation: box plot (box-and-whisker plot)

  • A box plot is used to visualize variation of a variable
    • Box (IQR): \(q_1\) (bottom), median (thickest line), \(q_3\) (top)
      • (In example below, \(q_1\) and median have the same value)
    • Whiskers: maximum (top) and minimum (bottom) non-outlier value
    • Circle(s): outliers (> 1.5 IQR distance from box)

[Figure: example box plot]
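
The quantities shown in a box plot (together with the range and IQR) can be computed by hand. The following is a minimal sketch with made-up data; it assumes the quartile convention of Python's `statistics.quantiles`, which may differ slightly from the one used by plotting software.

```python
from statistics import quantiles

# Made-up data with one unusually high value, for illustration only.
data = [7, 15, 1, 12, 4, 40, 8, 3, 13, 6, 10]

q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1                         # interquartile range: height of the box
value_range = max(data) - min(data)   # range: maximum minus minimum

# Values more than 1.5 IQR away from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme values that are not outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(iqr, value_range, outliers, whisker_low, whisker_high)
```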

Important measure of variation: variance

  • Deviation: difference between individual value and mean
  • Variance: average squared deviation
    • Squared in order to make negative differences positive
    • Population variance: \[\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2\]
    • As sample mean (\(\bar{x}\) or \(m\)) is approximation of population mean (\(\mu\)), sample variance formula contains division by \(n-1\) (results in slightly higher variance): \[s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2\]
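
As a worked example (with made-up data, not taken from the lecture), take the eight values 2, 4, 4, 4, 5, 5, 7, 9. Their mean is 5, so the squared deviations are 9, 1, 1, 1, 0, 0, 4 and 16, which sum to 32. Treating the values as a complete population, or instead as a sample, gives:

\[\sigma^2 = \frac{9 + 1 + 1 + 1 + 0 + 0 + 4 + 16}{8} = \frac{32}{8} = 4 \qquad s^2 = \frac{32}{8 - 1} \approx 4.57\]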

Important measure of variation: standard deviation

  • Standard deviation is square root of variance \[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2}\] \[s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2}\]
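
The variance and standard deviation formulas translate directly into code. This sketch uses made-up data and compares the hand-computed values with the corresponding functions in Python's standard-library `statistics` module.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

# Made-up data, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]

pop_var = sum(squared_deviations) / n            # population variance: divide by n
sample_var = sum(squared_deviations) / (n - 1)   # sample variance: divide by n - 1
pop_sd, sample_sd = sqrt(pop_var), sqrt(sample_var)

print(pop_var, pvariance(data))     # population variance, by hand and via the library
print(sample_var, variance(data))   # sample variance (slightly higher)
print(pop_sd, pstdev(data))         # population standard deviation
print(sample_sd, stdev(data))       # sample standard deviation
```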

Normal distribution and standard deviation (1)

\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\)     (34 + 34)
\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\)     (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\)     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

Normal distribution and standard deviation (2)

\(P(85 \leq \mathrm{IQ} \leq 115) \approx 68\%\)     (34 + 34)
\(P(70 \leq \mathrm{IQ} \leq 130) \approx 95\%\)     (34 + 34 + 13.5 + 13.5)
\(P(55 \leq \mathrm{IQ} \leq 145) \approx 99.7\%\)     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

  • IQ scores are normally distributed with mean 100 and standard deviation 15
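
The percentages above can be checked with `statistics.NormalDist` from Python's standard library. This is a small sketch using the IQ distribution; the helper name `prob_within` is introduced here for illustration.

```python
from statistics import NormalDist

# IQ scores: normal distribution with mean 100 and standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

def prob_within(k):
    """P(mu - k*sigma <= IQ <= mu + k*sigma), computed from the cumulative distribution."""
    return iq.cdf(iq.mean + k * iq.stdev) - iq.cdf(iq.mean - k * iq.stdev)

for k in (1, 2, 3):
    low, high = iq.mean - k * iq.stdev, iq.mean + k * iq.stdev
    print(f"P({low:.0f} <= IQ <= {high:.0f}) = {100 * prob_within(k):.1f}%")
```

The computed values are close to the rounded 68%, 95% and 99.7%. The same percentages are obtained for any other mean and standard deviation.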

Standardized scores

  • Standardization facilitates interpretation
  • E.g., how to interpret: "Emma got a score of 112" and "Tom got a score of 105"
  • Interpretation should be done with respect to mean \(\mu\) and standard deviation \(\sigma\)
    • Raw scores can be transformed to standardized scores (\(z\)-scores or \(z\)-values) \[z = \frac{x - \mu}{\sigma} = \frac{\text{deviation}}{\text{standard deviation}}\]
    • Interpretation: difference of value from mean in number of standard deviations

Calculating standardized values

  • Suppose \(\mu = 108\), \(\sigma = 4\), then: \[z_{112} = \frac{x - \mu}{\sigma} = \frac{112 - 108}{4} = 1\] \[z_{105} = \frac{105 - 108}{4} = -0.75\]
  • \(z\) shows distance from mean in number of standard deviations
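
The same calculation as a short sketch in code; the function name `z_score` is chosen here for illustration.

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean mu, in units of the standard deviation sigma."""
    return (x - mu) / sigma

mu, sigma = 108, 4
z_emma = z_score(112, mu, sigma)  # Emma's score of 112
z_tom = z_score(105, mu, sigma)   # Tom's score of 105

print(z_emma, z_tom)
```

A positive \(z\)-score lies above the mean and a negative one below it.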

Distribution of standardized variables

  • If we transform all raw scores of a variable into \(z\)-scores using: \[z = \frac{x - \mu}{\sigma} = \frac{\text{deviation}}{\text{standard deviation}}\]
  • We obtain a new transformed variable whose
    • Mean is 0
    • Standard deviation is 1
  • In sum: \(z\)-score = distance from \(\mu\) in \(\sigma\)'s
  • \(z\)-scores are useful for interpretation and hypothesis testing
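
That a standardized variable has mean 0 and standard deviation 1 can be verified on a small example. This sketch uses made-up scores and treats them as the complete population, so \(\mu\) and \(\sigma\) are computed from the data themselves.

```python
from statistics import fmean, pstdev

# Made-up raw scores, treated as the whole population (an assumption for this sketch).
raw = [112, 105, 108, 104, 111]

mu, sigma = fmean(raw), pstdev(raw)
z = [(x - mu) / sigma for x in raw]   # standardize every raw score

z_mean, z_sd = fmean(z), pstdev(z)
print(round(z_mean, 10), round(z_sd, 10))  # 0 and 1, up to floating-point rounding
```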

Standard normal distribution