Regression

Martijn Wieling
University of Groningen

This lecture

  • Correlation
  • Regression
    • Linear regression
    • Multiple regression
    • Interpreting interactions
    • Regression assumptions and model criticism

Correlation

  • Quantify relation between two numerical variables (interval or ratio scale)
    • Correlation coefficient \(r\), with \(-1 \leq r \leq 1\): the absolute value indicates strength (effect size), the sign indicates direction
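
A minimal sketch of how \(r\) can be computed in R. The choice of `Sepal.Length` and `Petal.Length` from the built-in `iris` data is an illustrative assumption and is not taken from the slide.

```r
# Pearson correlation between two numerical variables
# (columns chosen for illustration only)
r <- cor(iris$Sepal.Length, iris$Petal.Length)
r  # sign gives the direction, absolute value the strength

# cor.test() additionally reports a confidence interval and a p-value
ct <- cor.test(iris$Sepal.Length, iris$Petal.Length)
ct$estimate
```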

Correlation: sensitivity to outliers
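
The effect of a single outlier can be sketched with made-up numbers (the values below are hypothetical and not from the lecture): ten points close to a straight line give a strong positive correlation, and adding one extreme point changes the coefficient substantially.

```r
# Ten points lying close to a straight line (made-up values)
x <- 1:10
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
r_clean <- cor(x, y)

# Add one extreme observation far away from the trend
x_out <- c(x, 30)
y_out <- c(y, -20)
r_outlier <- cor(x_out, y_out)

# Compare the two coefficients side by side
c(clean = r_clean, with_outlier = r_outlier)
```

In this constructed example the single added point is enough to flip the sign of the correlation.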

Correlation is not causation!

Linear regression

  • To assess the relationship between a numerical dependent variable and one (simple regression) or more (multiple regression) quantitative or categorical predictor variables
    • Measures the impact of each individual predictor on the dependent variable, while controlling for the other variables in the model
    • Note that regression is equivalent to ANOVA, but the focus is different: regression emphasizes relations between numerical variables, ANOVA emphasizes group comparisons

Linear regression: formula

  • Linear regression captures relationship between dependent variable and independent variables using a formula
    • \(y_i = \beta_1 x_i + \beta_0 + \epsilon_i\)
    • With \(y_i\): dependent variable, \(x_i\): independent variable, \(\beta_0\): intercept (predicted value of \(y_i\) when \(x_i\) equals 0), \(\beta_1\): coefficient (slope), i.e. the change in predicted \(y_i\) per unit increase in \(x_i\), and \(\epsilon_i\): error (residuals; assumed to follow a normal distribution with mean 0)

Residuals: difference between actual and fitted values
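
A minimal sketch of this definition in R, using a model on built-in `iris` columns (an illustrative choice, not the lecture's model): the residuals returned by `residuals()` are the observed values minus the fitted values.

```r
# Simple regression on built-in columns (illustrative choice)
m <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Residual = actual value minus fitted value
res_manual <- iris$Petal.Length - fitted(m)

# Same numbers as returned by residuals()
all.equal(unname(res_manual), unname(residuals(m)))

# With an intercept in the model, least-squares residuals average to zero
mean(residuals(m))
```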

Linear regression: slope and intercept

Visualization is essential!

Dataset for this lecture

head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
# 1          5.1         3.5          1.4         0.2  setosa       0.14
# 2          4.9         3.0          1.4         0.2  setosa       0.14
# 3          4.7         3.2          1.3         0.2  setosa       0.13
# 4          4.6         3.1          1.5         0.2  setosa       0.15
# 5          5.0         3.6          1.4         0.2  setosa       0.14
# 6          5.4         3.9          1.7         0.4  setosa       0.34
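
The `iris` data set shipped with R has no `Petal.Area` column, so it must have been added beforehand. The six rows shown are consistent with half the product of petal length and petal width. The sketch below reconstructs the column under that assumption, which the slides do not state.

```r
# Assumption (inferred from the rows shown, not stated in the lecture):
# Petal.Area = Petal.Length * Petal.Width / 2
iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width / 2

# First six values, for comparison with the head(iris) output
head(iris$Petal.Area)
```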

Fitting a simple regression model in R

m0 <- lm(Petal.Area ~ Sepal.Length, data = iris)
summary(m0)
# 
# Call:
# lm(formula = Petal.Area ~ Sepal.Length, data = iris)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -2.671 -0.794 -0.099  0.730  3.489 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   -11.357      0.711   -16.0   <2e-16 ***
# Sepal.Length    2.439      0.120    20.3   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 1.22 on 148 degrees of freedom
# Multiple R-squared:  0.735,   Adjusted R-squared:  0.733 
# F-statistic:  410 on 1 and 148 DF,  p-value: <2e-16
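
Two relations in this output can be checked by hand: each t value is the estimate divided by its standard error (e.g. 2.439 / 0.120 for the slope), and in simple regression R-squared equals the squared Pearson correlation. A minimal sketch, assuming `Petal.Area` is half the product of petal length and width (an inference from the rows shown earlier, not stated in the lecture):

```r
# Assumption: Petal.Area reconstructed as Petal.Length * Petal.Width / 2
iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width / 2
m0 <- lm(Petal.Area ~ Sepal.Length, data = iris)

tab <- coef(summary(m0))  # coefficient table as a matrix

# t value = Estimate / Std. Error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]
all.equal(t_manual, tab[, "t value"])

# In simple regression, R-squared equals the squared Pearson correlation
r2 <- summary(m0)$r.squared
all.equal(r2, cor(iris$Sepal.Length, iris$Petal.Area)^2)
```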

Visualization

library(visreg)  # package containing visualization function visreg
visreg(m0)  # visualize regression line together with data points

[Figure: visreg plot of m0, showing Petal.Area against Sepal.Length with the fitted regression line]

  • The blue regression line shows the predicted (fitted) values of the model

Numerical interpretation

  • \(y_i = \beta_1 x_i + \beta_0 + \epsilon_i\)
round(m0$coefficients, 2)
#  (Intercept) Sepal.Length 
#       -11.36         2.44
  • Petal.Area = 2.44 \(\times\) Sepal.Length \(-\) 11.36
  • For sepal length of 5.1, predicted (fitted) petal area: 2.44 \(\times\) 5.1 \(-\) 11.36 = 1.08
iris$FittedPA <- fitted(m0)
head(iris, 2)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area FittedPA
# 1          5.1         3.5          1.4         0.2  setosa       0.14  1.08376
# 2          4.9         3.0          1.4         0.2  setosa       0.14  0.59589
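
The same prediction can be obtained without rounding the coefficients. The sketch below compares the hand calculation with `predict()`; it assumes `Petal.Area` is half the product of petal length and width (inferred from the data shown, not stated in the lecture).

```r
# Assumption: Petal.Area reconstructed as Petal.Length * Petal.Width / 2
iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width / 2
m0 <- lm(Petal.Area ~ Sepal.Length, data = iris)

# Hand calculation: intercept + slope * x
b <- coef(m0)
manual <- b[["(Intercept)"]] + b[["Sepal.Length"]] * 5.1

# predict() applies the same formula to new data
pred <- predict(m0, newdata = data.frame(Sepal.Length = 5.1))

all.equal(manual, unname(pred))
```

Because the first row of `iris` has a sepal length of 5.1, the result also matches the first fitted value, `fitted(m0)[1]`.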

Interpretation of intercept

#  (Intercept) Sepal.Length 
#       -11.36         2.44