(You need to be able to calculate the mean of a series of values by hand, so it is useful to remember the formula)
Question 7
Measures of center may have very different values
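A minimal sketch of why the measures can differ, using a small made-up vector (not the course data): one extreme value pulls the mean upwards, while the median hardly moves.

```r
# Hypothetical right-skewed data: five typical values and one extreme value
x <- c(20, 22, 25, 27, 30, 250)

mean(x)    # pulled upwards by the extreme value
median(x)  # middle of the sorted data, barely affected by the extreme value

# Dropping the extreme value changes the mean a lot, the median only a little
mean(x[x < 100])
median(x[x < 100])
```

For roughly symmetric data without extreme values, the mean and median would be expected to lie close together.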
Central tendency in R
mean(dat$english_grade) # arithmetic average
[1] 7.2828
median(dat$english_grade)
[1] 7
# no built-in function to get mode: new function
my_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(which(counts == max(counts))))
}
my_mode(dat$english_grade)
[1] 7
Measure of variation: quantiles
Quantiles: cutpoints to divide the sorted data in subsets of equal size
Quartiles: three cutpoints to divide the data in four equal-sized sets
\(q_1\) (1st quartile): cutpoint between 1st and 2nd group
\(q_2\) (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
\(q_3\) (3rd quartile): cutpoint between 3rd and 4th group
Percentiles: divide data in hundred equal-sized subsets
\(q_1\) = 25th percentile
\(q_2\) (= median) = 50th percentile
Score at \(n\)th percentile is better than \(n\)% of scores
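The quartiles and percentiles above can be obtained in R with quantile(). The sketch below uses a made-up vector called grades (an assumption, not the course data). Note that quantile() supports several calculation methods, so results may differ slightly from a calculation by hand.

```r
# Hypothetical grades (not the course data set)
grades <- c(5, 6, 6.5, 7, 7, 7.5, 8, 8.5, 9.5)

# Quartiles are the 25th, 50th and 75th percentiles
q <- quantile(grades, probs = c(0.25, 0.50, 0.75))
q

# The 2nd quartile coincides with the median
q[["50%"]] == median(grades)

# Proportion of scores strictly below a given score (here 8.5)
mean(grades < 8.5)
```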
A box plot is used to visualize variation of a variable
Box (IQR): \(q_1\) (bottom), median (thickest line), \(q_3\) (top)
(In example below, \(q_1\) and median have the same value)
Whiskers: maximum (top) and minimum (bottom) non-outlier value
Circle(s): outliers (> 1.5 IQR distance from box)
boxplot(dat$english_grade, col='red')
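The outlier rule used in the box plot can be sketched by hand. The vector grades below is made up for illustration. boxplot() itself derives the box from hinges rather than from quantile(), so its values may differ slightly for some data sets.

```r
# Hypothetical grades with one unusually low value
grades <- c(3, 6.5, 7, 7, 7, 7.5, 8, 8, 8.5)

q1  <- quantile(grades, 0.25, names = FALSE)
q3  <- quantile(grades, 0.75, names = FALSE)
iqr <- q3 - q1

# Values more than 1.5 IQR away from the box count as outliers
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
outliers <- grades[grades < lower | grades > upper]
outliers

# Whiskers: smallest and largest value that are not outliers
whiskers <- range(grades[grades >= lower & grades <= upper])
whiskers
```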
Important measure of variation: variance
Deviation: difference between mean and individual value
Variance: average squared deviation
Squared in order to make negative differences positive
Population variance (with \(\mu\) = population mean): \[\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2\]
Because the sample mean (\(\bar{x}\)) is only an estimate of the population mean (\(\mu\)), the sample variance formula divides by \(n-1\) instead of \(n\) (this results in a slightly higher variance): \[s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2\]
Important measure of variation: standard deviation
Standard deviation is square root of variance \[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2}\]\[s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n (x_i - \bar{x})^2}\]
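The variance and standard deviation formulas can be followed step by step in R. This is a sketch on a made-up vector x. For the population versions it assumes that x is the complete population, so that mean(x) plays the role of \(\mu\).

```r
# Hypothetical values (mean is 5)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

deviations <- x - mean(x)   # difference between each value and the mean
ss <- sum(deviations^2)     # sum of squared deviations

pop_var  <- ss / n          # population variance: divide by n
samp_var <- ss / (n - 1)    # sample variance: divide by n - 1
pop_sd   <- sqrt(pop_var)   # population standard deviation
samp_sd  <- sqrt(samp_var)  # sample standard deviation

c(pop_var = pop_var, samp_var = samp_var, pop_sd = pop_sd, samp_sd = samp_sd)

# var() and sd() in R use the sample (n - 1) versions
all.equal(samp_var, var(x))
all.equal(samp_sd, sd(x))
```

Unlike the variance, the standard deviation is expressed in the same unit as the original values, which makes it easier to interpret.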
Question 8
Measures of variation (or spread) in R (1)
min(dat$english_grade) # minimum value
[1] 5
max(dat$english_grade) # maximum value
[1] 9.5
range(dat$english_grade) # returns minimum and maximum value
[1] 5.0 9.5
diff(range(dat$english_grade)) # returns difference between min. and max.
[1] 4.5
Measures of variation (or spread) in R (2)
IQR(dat$english_grade) # interquartile range
[1] 1
var(dat$english_grade) # sample variance
[1] 0.75173
sd(dat$english_grade) # sample standard deviation
[1] 0.86702
sd(dat$english_grade) == sqrt(var(dat$english_grade)) # std. dev. = sqrt of var.?
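As a side note, comparing decimal numbers with == can be fragile because of rounding in floating-point arithmetic. A more tolerant check uses all.equal(), sketched here on a made-up vector x (an assumption, not the course data).

```r
# Hypothetical values
x <- c(5, 6.5, 7, 7, 7.5, 8, 9.5)

# Tolerant comparison: TRUE if both values agree up to a small tolerance
isTRUE(all.equal(sd(x), sqrt(var(x))))

# A well-known case where an exact comparison fails because of rounding
sqrt(2)^2 == 2
isTRUE(all.equal(sqrt(2)^2, 2))
```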
IQ scores are normally distributed with mean 100 and standard deviation 15
Standardized scores
Standardization facilitates interpretation
E.g., how to interpret: “Emma’s score is 112” and “David’s score is 105”
Interpretation should be done with respect to mean \(\mu\) and standard deviation \(\sigma\)
Raw scores can be transformed to standardized scores (\(z\)-scores or \(z\)-values) \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
Interpretation: difference of value from mean in number of standard deviations
(Note that you have to be able to calculate \(z\)-scores!)
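A worked sketch of the formula for the two example scores. It assumes that both are IQ scores, so that \(\mu = 100\) and \(\sigma = 15\). The variable names are chosen for illustration only.

```r
# Assumption: both scores are IQ scores (mean 100, standard deviation 15)
mu    <- 100
sigma <- 15
scores <- c(Emma = 112, David = 105)

# z = deviation from the mean divided by the standard deviation
z <- (scores - mu) / sigma
z
```

Under this assumption, Emma scores 12 / 15 = 0.8 standard deviations above the mean and David 5 / 15, about 0.33 standard deviations above the mean.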
\(z\) shows distance from mean in number of standard deviations
Question 9
Distribution of standardized variables
If we transform all raw scores of a variable into \(z\)-scores using: \[z = \frac{x - \mu}{\sigma} = \frac{\rm{deviation}}{\rm{standard}\,\rm{deviation}}\]
We obtain a new transformed variable whose
Mean is 0
Standard deviation is 1
In sum: \(z\)-score = distance from \(\mu\) in \(\sigma\)’s
\(z\)-scores are useful for interpretation and hypothesis testing (next lecture)
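The two properties above can be checked with a short sketch on a made-up vector x. It standardizes with the sample mean and sample standard deviation, which is also what scale() does by default. Since scale() returns a matrix, as.numeric() is used to turn the result into a plain vector.

```r
# Hypothetical values
x <- c(5, 6.5, 7, 7, 7.5, 8, 9.5)

# Standardize by hand: subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # zero, up to floating-point rounding
sd(z)    # one

# Same result with scale(), converted from a matrix to a vector
z_scale <- as.numeric(scale(x))
all.equal(z, z_scale)
```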
Standardizing a variable in R
dat$english_grade.z <- scale(dat$english_grade) # scale: calculates z-scores
mean(dat$english_grade.z) # should be 0