Statistiek I

Descriptive statistics

Martijn Wieling

Question 1: last lecture

Last lecture

  • Why use statistics?
    • Summarize data (descriptive statistics)
    • Assess relationships in data (inferential statistics)
  • Introduction to RStudio and R
    • Variables and functions
    • Import, view and modify data in R
    • Some visualization and statistics in R

This lecture

  • Descriptive statistics vs. inferential statistics
  • Sample vs. population
  • (Types of) variables
  • Distribution of a variable
  • Measures of central tendency
  • Measures of variation
  • Standardized scores
  • Checking for a normal distribution

Descriptive vs. inferential statistics

  • Descriptive statistics:
    • Statistics used to describe (sample) data without further conclusions
      • Measures of central tendency: Mean, median, mode
      • Measures of variation (or spread): range, IQR, variance, standard deviation
  • Inferential statistics:
    • Investigate data of sample in order to infer patterns in the population
      • Statistical tests: regression, Cronbach’s \(\alpha\), etc.

Question 2

Sample vs. population

Why study a sample?

  • Studying the whole population is (almost always) practically impossible
  • A sample is a (selected) subset of the population and thus more accessible
    • Selection of representative sample is very important!

Question 3

Variables and values

  • Statistics always involves variables
    • Relations: involve two variables (e.g., English grade vs. English score)
  • The values of the variables indicate properties of the cases (i.e. the individuals or entities you study)
Variable          Example values
Sex               male, female, …
English grade     4.6, 5.5, 6.3, 7.2, …
Year of birth     2000, 2001, 2003, …
Native language   Dutch, German, English, …

Tabular representation of data

  • Each case is shown in a row
  • Each variable in a column
  • For part of our data:
participant  year  sex  bl_edu  study  english_grade  english_score
        495  2024    F       N   LING            7.0           8.02
        496  2024    M       N     IS            8.0           7.54
        497  2024    F       N   LING            6.0           7.19
        498  2024    F       N   LING            6.5           6.42
        499  2024    M       N     IS            9.0           9.57
        500  2024    M       N   LING            6.0           6.17

Types of variables: nominal and ordinal

  • Nominal (categorical) scale: unordered categories
    • Sex (frequently binary: two categories), native language, etc.
  • Ordinal: ordered (ranked) scale, but exact difference unclear
    • Rank of English proficiency (in class), Likert scale (rate on a scale from 1 to 5, …)
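  • For illustration only (not from the slides): nominal and ordinal variables can be represented in R as (ordered) factors; the variable names below are made up
lang <- factor(c('Dutch', 'German', 'Dutch', 'English')) # nominal: unordered categories
levels(lang) # the levels are the (unordered) categories
likert <- factor(c(2, 5, 3, 4), levels = 1:5, ordered = TRUE) # ordinal: ordered categories
likert[1] < likert[2] # comparisons are meaningful for an ordered factor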

Types of variables: numerical

  • Interval scale: numerical values with meaningful difference but no true 0
    • Year of birth, temperature in Celsius
  • Ratio scale: numerical values with meaningful difference and true 0
    • Number of questions correct, age
  • Scale of variable determines possible statistics
    • E.g., mean age is possible, but not mean native language
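  • A quick illustration with made-up values: R averages numbers, but refuses to average a nominal variable
mean(c(19, 21, 20, 23)) # mean age: fine
mean(factor(c('Dutch', 'German', 'English'))) # mean native language: NA with a warning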

Question 4

Types of variables: summary

Question 5

Characterizing nom. variables: (relative) freq. table

table(dat$sex) # absolute frequencies

  F   M 
346 154 
prop.table( table(dat$sex) ) # relative frequencies

    F     M 
0.692 0.308 

Characterizing nominal variables: visualization

par(mfrow=c(1,2))
barplot(table(dat$sex),col=c('pink','lightblue'), main='abs. frq.')
barplot(prop.table(table(dat$sex)), col=c('pink','lightblue'),main='rel. frq.')

Bad practice: pie charts

pie(table(dat$study),col=c("red","cyan","blue","yellow"))

Question 6

Characterizing numerical variables: distribution

  • We are generally not interested in the individual values of a variable, but rather in all values and their frequencies
  • This is captured by a distribution
    • Famous distribution: Normal distribution (“bell-shaped” curve): e.g., IQ scores

Distribution

  • The distribution of a variable shows its variability (i.e., the frequency of each value)
table(dat$english_grade)

  5 5.5   6 6.5   7 7.5   8 8.5   9 9.5 
  6   2  76  12 191  21 148  15  26   3 

Visualizing a distribution: histogram

  • A histogram shows the frequency of values grouped into intervals (bins) of the form \((a,b]\)
    • Look for general pattern, symmetry, outliers
hist(dat$english_grade, xlab='English grade', main='')

Visualizing a distribution: density curve

  • A density curve shows the distribution as a smooth curve, with area proportional to relative frequency
plot(density(dat$english_grade),main='',xlab='English grade')

Interpreting a density curve

  • The total area under a density curve is equal to 1
  • A density curve does not provide information about the frequency of one value
    • E.g., there might be no one who has scored a grade of exactly 6.1
  • It only provides information about an interval
    • E.g., more than 50% of the grades lie between 5.5 and 7.5
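  • As a check (assuming the dataset dat from the earlier slides is loaded), the proportion of grades in this interval can be computed directly from the data
mean(dat$english_grade >= 5.5 & dat$english_grade <= 7.5) # roughly 0.6, given the frequency table shown earlier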

Interpreting a density curve: normal distribution

  • The normal distribution has convenient characteristics
    • Completely symmetric
    • Red area: (about) 68%
    • Red and green area: (about) 95%

Characterizing the distribution of numerical variables

  • A distribution can also be characterized by measures of center and variation
    • (skewness measures the asymmetry of the distribution; not covered in this course)

Characterizing numerical variables: central tendency

  • Mode: the most frequent value (for nominal data: the only meaningful measure)
  • Median: the middle value when the data are sorted from small to large
  • Mean: the arithmetic average

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i\]

  • (You need to be able to calculate the mean of a series of values by hand, so it is useful to remember the formula)
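  • A small worked example with made-up values, checked against R's built-in function
(4 + 6 + 8) / 3 # mean by hand: (x1 + x2 + x3) / n = 6
mean(c(4, 6, 8)) # same result with mean()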

Question 7

Measures of center may have very different values

Central tendency in R

mean(dat$english_grade) # arithmetic average
[1] 7.2828
median(dat$english_grade)
[1] 7
# no built-in function to get mode: new function
my_mode <- function(x) { 
    counts <- table(x)
    as.numeric(names(which(counts == max(counts))))
}
my_mode(dat$english_grade)
[1] 7

Measure of variation: quantiles

  • Quantiles: cutpoints that divide the sorted data into subsets of equal size
    • Quartiles: three cutpoints that divide the data into four equal-sized sets
      • \(q_1\) (1st quartile): cutpoint between 1st and 2nd group
      • \(q_2\) (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
      • \(q_3\) (3rd quartile): cutpoint between 3rd and 4th group
    • Percentiles: divide the data into a hundred equal-sized subsets
      • \(q_1\) = 25th percentile
      • \(q_2\) (= median) = 50th percentile
      • A score at the \(n\)th percentile is higher than \(n\)% of the scores
quantile(dat$english_grade) # default: quartiles
  0%  25%  50%  75% 100% 
 5.0  7.0  7.0  8.0  9.5 

Measure of variation: range

  • Minimum, maximum: lowest and highest value
  • Range: difference between minimum and maximum
  • Interquartile range (IQR): \(q_3\) - \(q_1\)

Visualizing variation: box plot (box-and-whisker plot)

  • A box plot is used to visualize variation of a variable
    • Box (IQR): \(q_1\) (bottom), median (thickest line), \(q_3\) (top)
      • (In example below, \(q_1\) and median have the same value)
    • Whiskers: maximum (top) and minimum (bottom) non-outlier value
    • Circle(s): outliers (> 1.5 IQR distance from box)
boxplot(dat$english_grade, col='red')

Important measure of variation: variance

  • Deviation: difference between an individual value and the mean
  • Variance: average squared deviation
    • Squared in order to make negative differences positive
    • Population variance (with \(\mu\) = population mean): \[\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2\]
    • As the sample mean (\(\bar{x}\)) is an approximation of the population mean (\(\mu\)), the sample variance formula divides by \(n-1\), resulting in a slightly higher variance: \[s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2\]
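  • A minimal check of the sample variance formula with made-up values
x <- c(4, 6, 8) # illustrative values
sum((x - mean(x))^2) / (length(x) - 1) # sample variance by hand: 4
var(x) # R's built-in sample variance: also 4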

Important measure of variation: standard deviation

  • Standard deviation is square root of variance \[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2}\] \[s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2}\]

Question 8

Measures of variation (or spread) in R (1)

min(dat$english_grade) # minimum value
[1] 5
max(dat$english_grade) # maximum value
[1] 9.5
range(dat$english_grade) # returns minimum and maximum value
[1] 5.0 9.5
diff(range(dat$english_grade)) # returns difference between min. and max.
[1] 4.5

Measures of variation (or spread) in R (2)

IQR(dat$english_grade) # interquartile range
[1] 1
var(dat$english_grade) # sample variance
[1] 0.75173
sd(dat$english_grade) # sample standard deviation
[1] 0.86702
sd(dat$english_grade) == sqrt(var(dat$english_grade)) # std. dev. = sqrt of var.?
[1] TRUE

Normal distribution and standard deviation (1)

\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\)     (34 + 34)
\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\)     (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\)     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

  • It is important to remember these characteristics of the normal distribution!
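  • These percentages can be verified with pnorm() (introduced later in this lecture), which gives the proportion of a standard normal distribution below a given value
pnorm(1) - pnorm(-1) # about 0.68
pnorm(2) - pnorm(-2) # about 0.95
pnorm(3) - pnorm(-3) # about 0.997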

Normal distribution and standard deviation (2)

\(P(85 \leq \rm{IQ} \leq 115) \approx 68\%\)     (34 + 34)
\(P(70 \leq \rm{IQ} \leq 130) \approx 95\%\)     (34 + 34 + 13.5 + 13.5)
\(P(55 \leq \rm{IQ} \leq 145) \approx 99.7\%\)     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

  • IQ scores are normally distributed with mean 100 and standard deviation 15

Standardized scores

  • Standardization facilitates interpretation
  • E.g., how should we interpret “Emma’s score is 112” and “David’s score is 105”?
  • Interpretation should be done with respect to mean \(\mu\) and standard deviation \(\sigma\)
    • Raw scores can be transformed to standardized scores (\(z\)-scores or \(z\)-values) \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
    • Interpretation: difference of value from mean in number of standard deviations
  • (Note that you have to be able to calculate \(z\)-scores!)

Calculating standardized values

  • Suppose \(\mu = 108\), \(\sigma = 4\), then: \[z_{112} = \frac{x - \mu}{\sigma} = \frac{112 - 108}{4} = 1\] \[z_{105} = \frac{105 - 108}{4} = -0.75\]
  • \(z\) shows distance from mean in number of standard deviations
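  • The same calculation in R (using the example values above)
(c(112, 105) - 108) / 4 # z-scores: 1 and -0.75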

Question 9

Distribution of standardized variables

  • If we transform all raw scores of a variable into \(z\)-scores using: \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
  • We obtain a new transformed variable whose
    • Mean is 0
    • Standard deviation is 1
  • In sum: \(z\)-score = distance from \(\mu\) in \(\sigma\)’s
  • \(z\)-scores are useful for interpretation and hypothesis testing (next lecture)

Standardizing a variable in R

dat$english_grade.z <- scale(dat$english_grade) # scale: calculates z-scores
mean(dat$english_grade.z) # should be 0
[1] 0
sd(dat$english_grade.z) # should be 1
[1] 1

Standard normal distribution

 
\(P(-1 \leq z \leq 1) \approx 68\%\)     (34 + 34)
\(P(-2 \leq z \leq 2) \approx 95\%\)     (34 + 34 + 13.5 + 13.5)
\(P(-3 \leq z \leq 3) \approx 99.7\%\)     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

For comparison: normal distribution

 
\(P(\mu - \sigma \leq x \leq \mu + \sigma) \approx 68\%\)     (34 + 34)
\(P(\mu - 2\sigma \leq x \leq \mu + 2\sigma) \approx 95\%\)     (34 + 34 + 13.5 + 13.5)
\(P(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 99.7\%\)     (34 + 34 + 13.5 + 13.5 + 2.35 + 2.35)

Percentiles and \(z\)-scores

  • If the distribution is normal, then \(z\)-scores correspond to percentiles
  • The function ‘qnorm’ returns the \(z\)-value associated with a given proportion (percentile / 100)
qnorm(95/100) # z-value associated with 95th percentile
[1] 1.6449

Question 10

\(z\)-scores and percentiles

  • The function ‘pnorm’ returns the proportion of data below a specified \(z\)-value
    • The percentile can be found by multiplying by 100
100 * pnorm(1.6449)
[1] 95

Some calculations with \(z\)-scores

  • What proportion of values have a \(z\)-value of at least 1.64?
    • \(P(z \geq 1.64)\)
1 - pnorm(1.64)
[1] 0.050503
  • What proportion of values are located between \(z\)-values of -2 and 2?
    • \(P(-2 \leq z < 2)\)
pnorm(2) - pnorm(-2)
[1] 0.9545

Step-by-step and visualization

pnorm(2)
[1] 0.97725
pnorm(-2)
[1] 0.02275

Question 11

Checking normality assumption

  • Some statistical tests require that data is (roughly) normally distributed
  • How to test this?
    • Using visual inspection of a normal quantile plot (or: quantile-quantile plot)
      • A straight line in this graph indicates a (roughly) normal distribution

Normal quantile plot: how it works

  • Sort the data from smallest to largest to determine quantiles (e.g., percentiles)
    • E.g., median for 50th percentile
  • Calculate \(z\)-values belonging to the quantiles (e.g., percentiles) of a standard normal distribution
    • E.g., \(z =\) 0 for 50th percentile, \(z \approx\) 2 for 97.5th percentile, etc.
  • Plot data values (\(y\)-axis) against normal quantile values (\(x\)-axis)
    • If points on (or close to) straight line: values normally distributed
  • (Note: you need to be able to interpret a quantile-quantile plot, but you don’t have to be able to construct the plot manually)
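  • As a sketch only (not required for the exam, and assuming dat is loaded), the coordinates of such a plot can be computed by hand; ppoints() gives the plotting positions used for the theoretical quantiles
n <- length(dat$english_grade)
theoretical <- qnorm(ppoints(n)) # z-values of the n quantiles of the standard normal distribution
observed <- sort(dat$english_grade) # sorted data values
plot(theoretical, observed) # comparable to what qqnorm() produces on the next slide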

Normal quantile plot in R: English grades

qqnorm(dat$english_grade)
qqline(dat$english_grade)

  • Distribution not normal for the English grades

Normal quantile plot in R: English scores

qqnorm(dat$english_score)
qqline(dat$english_score)

  • Distribution roughly normal for the English scores

Recap

  • In this lecture, we’ve covered
    • Descriptive statistics vs. inferential statistics
    • Sample vs. population
    • Four types of variables
    • Distribution of a variable
    • Measures of central tendency
    • Measures of variation
    • Standardized scores
    • How to check for a normal distribution
  • In the lab session, you will experiment with descriptive statistics
  • Next lecture: Sampling

Please evaluate this lecture!

Exam question

Questions?

Thank you for your attention!

 

https://www.martijnwieling.nl

m.b.wieling@rug.nl