Statistiek I

Descriptive statistics

Martijn Wieling
University of Groningen

Last lecture

  • Why use statistics?
    • Summarize data (descriptive statistics)
    • Assess relationships in data (inferential statistics)
  • Introduction to RStudio and R
    • Variables and functions
    • Import, view and modify data in R
    • Some visualization and statistics in R

This lecture

  • Descriptive statistics vs. inferential statistics
  • Sample vs. population
  • (Types of) variables
  • Distribution of a variable
  • Measures of central tendency
  • Measures of variation
  • Standardized scores
  • Checking for a normal distribution

Descriptive vs. inferential statistics

  • Descriptive statistics:
    • Statistics used to describe (sample) data without further conclusions
      • Measures of central tendency: Mean, median, mode
      • Measures of variation (or spread): range, IQR, variance, standard deviation
  • Inferential statistics:
    • Investigate data of sample in order to infer patterns in the population
      • Statistical tests: \(t\)-test, \(\chi^2\)-test, etc.

Sample vs. population

Why study a sample?

  • Studying the whole population is (almost always) practically impossible
  • Sample is a (selected) subset of population and thus more accessible
    • Selection of representative sample is very important!
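
The idea of estimating a population value from a sample can be illustrated with a small simulation. This is only a sketch: the population below is simulated (hypothetical IQ-like scores), and the names `population` and `samp` are illustrative and not part of the course data.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical population: 10,000 simulated scores (mean 100, sd 15)
population <- rnorm(10000, mean = 100, sd = 15)

# Random sample of 50 cases, drawn without replacement
samp <- sample(population, size = 50)

mean(population)  # population mean (normally unknown in practice)
mean(samp)        # sample mean: an estimate of the population mean
```

A random sample like this tends to be representative; a sample selected in a biased way (e.g., only the highest scores) would not be.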

Variables and values

  • Statistics always involves variables
    • Relations: involve two variables (e.g., English grade vs. English score)
  • The values of the variables indicate properties of the cases (i.e. the individuals or entities you study)

Variable          Example values
Gender            male, female, ...
English grade     4.6, 5.5, 6.3, 7.2, ...
Year of birth     1990, 1991, 1993, ...
Native language   Dutch, German, English, ...

Tabular representation of data

  • Each case is shown in a row
  • Each variable in a column
  • For part of our data:
participant  year  gender  bl_edu  study  english_grade  english_score
405          2021  F       N       CIS    8              8.21
406          2021  F       N       LING   7              7.78
407          2021  F       N       OTHER  7              8.13
408          2021  F       N       OTHER  8              9.36
409          2021  M       N       IS     8              7.98
410          2021  F       N       LING   7              7.66
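
As a sketch of how such a table looks in R, the six rows above can be entered by hand as a data frame. The name `dat_part` is illustrative; in the lecture the full data set is available as `dat`.

```r
# The six rows shown above, entered by hand
dat_part <- data.frame(
    participant   = 405:410,
    year          = rep(2021, 6),
    gender        = c("F", "F", "F", "F", "M", "F"),
    bl_edu        = rep("N", 6),
    study         = c("CIS", "LING", "OTHER", "OTHER", "IS", "LING"),
    english_grade = c(8, 7, 7, 8, 8, 7),
    english_score = c(8.21, 7.78, 8.13, 9.36, 7.98, 7.66)
)

nrow(dat_part)   # number of cases (rows)
ncol(dat_part)   # number of variables (columns)
dat_part$study   # all values of one variable
dat_part[5, ]    # all values of one case
```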

Types of variables: nominal and ordinal

  • Nominal (categorical) scale: unordered categories
    • Gender (frequently binary: two categories), Native language, etc.
  • Ordinal: ordered (ranked) scale, but amount of difference unclear
    • Rank of English proficiency (in class), Likert scale ("Rate on a scale from 1 to 5...")

Types of variables: numerical

  • Interval scale: numerical with meaningful difference but no true 0
    • Year of birth, temperature in Celsius
  • Ratio scale: numerical with meaningful difference and true 0
    • Number of questions correct, age
  • Scale of variable determines possible statistics
    • E.g., mean age is possible, but not mean native language
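
The following sketch shows one possible way to represent the different scales in R and why the scale limits the available statistics. All values and variable names are hypothetical.

```r
# Nominal: unordered categories
native_language <- factor(c("Dutch", "German", "Dutch", "English"))

# Ordinal: ordered categories (hypothetical 1-5 Likert ratings)
rating <- factor(c(3, 5, 4, 3), levels = 1:5, ordered = TRUE)

# Ratio: numerical with a true zero
age <- c(19, 21, 20, 24)

mean(age)               # meaningful for a numerical variable
table(native_language)  # for nominal data we can only count categories
rating[1] < rating[2]   # order comparisons are possible for ordinal data

# Taking the mean of a nominal variable is not meaningful:
# R returns NA (with a warning, suppressed here)
suppressWarnings(mean(native_language))
```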

Types of variables: summary

Characterizing nominal variables: (relative) frequency table

table(dat$gender)  # absolute frequencies
# 
#   F   M 
# 346 154
prop.table(table(dat$gender))  # relative frequencies
# 
#     F     M 
# 0.692 0.308
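
To make the link between the two tables explicit, the relative frequencies can be recomputed by hand from the absolute frequencies reported above (a small sketch; `counts` and `n` are illustrative names).

```r
# Absolute frequencies as reported above
counts <- c(F = 346, M = 154)

n <- sum(counts)            # total number of cases: 500
counts / n                  # relative frequencies: 0.692 and 0.308
round(100 * counts / n, 1)  # the same as percentages: 69.2 and 30.8
```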

Characterizing nominal variables: visualization

par(mfrow = c(1, 2))
barplot(table(dat$gender), col = c("pink", "lightblue"), main = "abs.frq.")
barplot(prop.table(table(dat$gender)), col = c("pink", "lightblue"), main = "rel.frq.")

Bad practice: pie charts

pie(table(dat$study), col = c("red", "green", "blue", "yellow"))

Characterizing numerical variables: distribution

  • We are generally not interested in individual values of a variable, but rather all values and their frequency
  • This is captured by a distribution
    • Famous distribution: Normal distribution ("bell-shaped" curve): e.g., IQ scores

Distribution

  • The distribution of a variable shows how its values vary (i.e. how often each value occurs)
table(dat$english_grade)
# 
#   5 5.5   6 6.5   7 7.5   8 8.5   9 9.5 
#   6   2  76  12 191  21 148  15  26   3

Visualizing a distribution: histogram

  • Histogram shows the frequency of values grouped into intervals (bins) of the form \((a,b]\), i.e. excluding \(a\) but including \(b\)
    • Look for general pattern, symmetry, outliers
hist(dat$english_grade, xlab = "English grade", main = "")
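
To see the \((a,b]\) grouping at work, the sketch below rebuilds a grade vector from the frequency table shown earlier and asks `hist()` for the counts per bin. It assumes that the table lists the grades exactly; `values`, `freqs` and `grades` are illustrative names, and the bin boundaries are chosen by hand for the example.

```r
# Grades reconstructed from the frequency table shown earlier
values <- c(5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5)
freqs  <- c(6, 2, 76, 12, 191, 21, 148, 15, 26, 3)
grades <- rep(values, times = freqs)

# Explicit bin boundaries; plot = FALSE returns the counts without drawing
h <- hist(grades, breaks = seq(4.5, 9.5, by = 1), plot = FALSE)
h$breaks  # 4.5 5.5 6.5 7.5 8.5 9.5
h$counts  # 8 88 212 163 29
```

For instance, the bin \((6.5, 7.5]\) contains the 191 grades of 7 and the 21 grades of 7.5, giving 212.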

Visualizing a distribution: density curve

  • Density curve shows area proportional to the relative frequency
plot(density(dat$english_grade), main = "", xlab = "English grade")

Interpreting a density curve

  • The total area under a density curve is equal to 1
  • A density curve does not provide information about the frequency of one value
    • E.g., there might be no one who has scored a grade of exactly 6.1
  • It only provides information about an interval
    • E.g., more than 50% of the grades lie between 5.5 and 7.5
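
These points can be checked numerically. The sketch below again uses a grade vector rebuilt from the frequency table shown earlier (an assumption: the table is taken to list the grades exactly), and approximates the area under the density curve with a simple sum.

```r
# Grades reconstructed from the frequency table shown earlier
grades <- rep(c(5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5),
              times = c(6, 2, 76, 12, 191, 21, 148, 15, 26, 3))

# Proportion of grades in the interval from 5.5 to 7.5 (inclusive)
mean(grades >= 5.5 & grades <= 7.5)  # 302 / 500 = 0.604

# No grade is exactly 6.1, even though the curve is above zero there
sum(grades == 6.1)  # 0

# Approximate total area under the density curve (close to 1)
d <- density(grades)
sum(d$y) * diff(d$x[1:2])
```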

Interpreting a density curve: normal distribution

  • The normal distribution has convenient characteristics
    • Completely symmetric
    • Red area (within 1 standard deviation of the mean): (about) 68%
    • Red and green area (within 2 standard deviations of the mean): (about) 95%
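
The two percentages can be verified with `pnorm()`, the cumulative distribution function of the normal distribution in R. The sketch assumes that the red area is the region within 1 standard deviation of the mean and that red plus green is the region within 2 standard deviations.

```r
# Area within 1 standard deviation of the mean
pnorm(1) - pnorm(-1)  # about 0.683

# Area within 2 standard deviations of the mean
pnorm(2) - pnorm(-2)  # about 0.954
```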

Characterizing the distribution of numerical variables

  • A distribution can also be characterized by measures of center and variation
    • (skewness measures the symmetry of the distribution; not covered in this course)

Characterizing numerical variables: central tendency

  • Mode: most frequent element (for nominal data: only meaningful measure)
  • Median: the middle value when the data are sorted from small to large (for an even number of values: the mean of the two middle values)
  • Mean: arithmetic average

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i\]

  • (You need to be able to calculate the mean of a series of values by hand, so it is useful to remember the formula)
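
As a worked example of the formula, take the six English grades 8, 7, 7, 8, 8 and 7 (the english_grade values of the six participants in the table shown earlier), so \(n = 6\):

\[\bar{x} = \frac{8 + 7 + 7 + 8 + 8 + 7}{6} = \frac{45}{6} = 7.5\]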

Measures of center may have very different values
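
A small sketch with hypothetical, strongly right-skewed data (invented monthly incomes in hundreds of euros) illustrates how far the three measures can lie apart:

```r
# Hypothetical incomes with one extreme value
income <- c(20, 20, 25, 30, 35, 40, 500)

mean(income)    # 670 / 7, about 95.7: pulled up by the extreme value
median(income)  # 30: the middle (4th) of the 7 sorted values
names(which.max(table(income)))  # "20": the most frequent value (mode)
```

The median and mode are barely affected by the single extreme value, whereas the mean is.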

Central tendency in R

mean(dat$english_grade)  # arithmetic average
# [1] 7.2828
median(dat$english_grade)
# [1] 7
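# no built-in function to get the mode: define a new function
my_mode <- function(x) {
    counts <- table(x)  # frequency of each distinct value
    # name(s) of the most frequent value(s), returned as character
    names(which(counts == max(counts)))
}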
my_mode(dat$english_grade)
# [1] "7"
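
Two details of the functions above are worth a quick check: the mode is returned as text, and ties produce more than one mode. The sketch below uses a hypothetical variant, `my_mode_num()`, which converts the result back to numbers, and also shows how `median()` treats an even number of values.

```r
# Variant of my_mode() that returns numbers instead of character strings
my_mode_num <- function(x) {
    counts <- table(x)
    as.numeric(names(counts)[counts == max(counts)])
}

my_mode_num(c(6, 7, 7, 8))     # 7
my_mode_num(c(6, 6, 7, 7, 8))  # 6 and 7: ties give more than one mode

# With an even number of values, median() averages the two middle values
median(c(6, 7, 8, 9))          # 7.5
```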

Measure of variation: quantiles

  • Quantiles: cutpoints that divide the sorted data into subsets of equal size
    • Quartiles: three cutpoints that divide the data into four equal-sized sets
      • \(q_1\) (1st quartile): cutpoint between 1st and 2nd group
      • \(q_2\) (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
      • \(q_3\) (3rd quartile): cutpoint between 3rd and 4th group
    • Percentiles: divide the data into one hundred equal-sized subsets
      • \(q_1\) = 25th percentile
      • \(q_2\) (= median) = 50th percentile
      • \(q_3\) = 75th percentile
      • A score at the \(n\)th percentile is better than (about) \(n\)% of the scores
quantile(dat$english_grade)  # default: quartiles
#   0%  25%  50%  75% 100% 
#  5.0  7.0  7.0  8.0  9.5
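
Other cutpoints can be requested through the `probs` argument of `quantile()`. The sketch below uses a grade vector rebuilt from the frequency table shown earlier (assuming the table lists the grades exactly) and R's default quantile method.

```r
# Grades reconstructed from the frequency table shown earlier
grades <- rep(c(5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5),
              times = c(6, 2, 76, 12, 191, 21, 148, 15, 26, 3))

quantile(grades)                         # quartiles: 5.0 7.0 7.0 8.0 9.5
quantile(grades, probs = c(0.10, 0.90))  # 10th and 90th percentile: 6 and 8

# Proportion of grades at or below 8
mean(grades <= 8)  # 456 / 500 = 0.912
```

Because many students share the same grade, several percentiles coincide (here, both the 25th and 50th percentile equal 7).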

Measure of variation: range

  • Minimum, maximum: lowest and highest value
  • Range: difference between minimum and maximum
  • Interquartile range (IQR): \(q_3 - q_1\)
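
In R these measures are available directly. Note that `range()` returns the minimum and maximum rather than their difference. The sketch uses a grade vector rebuilt from the frequency table shown earlier (assuming the table lists the grades exactly).

```r
# Grades reconstructed from the frequency table shown earlier
grades <- rep(c(5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5),
              times = c(6, 2, 76, 12, 191, 21, 148, 15, 26, 3))

min(grades)          # 5
max(grades)          # 9.5
range(grades)        # returns c(min, max), not the difference
diff(range(grades))  # range as defined above: 9.5 - 5 = 4.5
IQR(grades)          # q3 - q1 = 8 - 7 = 1
```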

Visualizing variation: box plot (box-and-whisker plot)

  • A box plot is used to visualize variation of a variable
    • Box (IQR): \(q_1\) (bottom), median (thickest line), \(q_3\) (top)
      • (In example below, \(q_1\) and median have the same value)
    • Whiskers: maximum (top) and minimum (bottom) non-outlier value
    • Circle(s): outliers (more than 1.5 × IQR away from the box)
boxplot(dat$english_grade, col = "red")
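
The elements of the box plot can also be computed by hand. The sketch below applies the 1.5 × IQR rule to a grade vector rebuilt from the frequency table shown earlier (assuming the table lists the grades exactly); `lower_fence` and `upper_fence` are illustrative names.

```r
# Grades reconstructed from the frequency table shown earlier
grades <- rep(c(5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5),
              times = c(6, 2, 76, 12, 191, 21, 148, 15, 26, 3))

q <- unname(quantile(grades, probs = c(0.25, 0.75)))  # q1 = 7, q3 = 8
iqr <- q[2] - q[1]                                    # 1
lower_fence <- q[1] - 1.5 * iqr                       # 7 - 1.5 = 5.5
upper_fence <- q[2] + 1.5 * iqr                       # 8 + 1.5 = 9.5

# Outliers: values beyond the fences (here the six grades of 5)
outliers <- grades[grades < lower_fence | grades > upper_fence]
table(outliers)

# Whiskers: most extreme values that are not outliers (5.5 and 9.5)
range(grades[grades >= lower_fence & grades <= upper_fence])

# boxplot.stats() returns the five numbers that boxplot() draws; it uses
# hinges, which can differ slightly from the quartiles of quantile()
boxplot.stats(grades)$stats
```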