Regression

Martijn Wieling
University of Groningen

This lecture

  • Correlation
  • Regression
    • Linear regression
    • Multiple regression
    • Interpreting interactions
    • Regression assumptions and model criticism

Question 1

Correlation

  • Quantify relation between two numerical variables (interval or ratio scale)
    • \(-1 \leq r \leq 1\) indicates strength (effect size) and direction
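  • In R, the correlation between two numerical variables can be computed with cor() (a minimal illustration using two columns of the built-in iris data; the choice of variables here is just for the example)
cor(iris$Sepal.Length, iris$Petal.Length)  # Pearson correlation coefficient r
cor.test(iris$Sepal.Length, iris$Petal.Length)  # additionally gives a p-value and confidence interval for r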

[Figure: correlation examples]

Correlation: sensitivity to outliers

Correlation is not causation!

Linear regression

  • To assess the relationship between a numerical dependent variable and one (simple regression) or more (multiple regression) quantitative or categorical predictor variables
    • Measures the impact of each individual predictor on the dependent variable, while controlling for the other variables in the model
    • Note that regression is equivalent to ANOVA, but the focus is different: group comparisons (ANOVA) vs. the relation between numerical variables (regression)

Linear regression: formula

  • Linear regression captures the relationship between the dependent variable and the independent variable(s) using a formula
    • \(y_i = \beta_1 x_i + \beta_0 + \epsilon_i\)
    • With \(y_i\): dependent variable, \(x_i\): independent variable, \(\beta_0\): intercept (value of \(y_i\) when \(x_i\) equals 0), \(\beta_1\): coefficient (slope, identical for all \(x_i\)), and \(\epsilon_i\): error (residuals, which are assumed to follow a normal distribution with mean 0)

Residuals: difference between actual and fitted values

[Figure: residuals as differences between actual and fitted values]

Linear regression: slope and intercept

Visualization is essential!

[Figure: slope and intercept]

Dataset for this lecture

head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Area Species
# 1          5.1         3.5          1.4         0.2       0.14  setosa
# 2          4.9         3.0          1.4         0.2       0.14  setosa
# 3          4.7         3.2          1.3         0.2       0.13  setosa
# 4          4.6         3.1          1.5         0.2       0.15  setosa
# 5          5.0         3.6          1.4         0.2       0.14  setosa
# 6          5.4         3.9          1.7         0.4       0.34  setosa
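  • Note that Petal.Area is not a column of the standard iris dataset; judging from the values shown above, it was presumably derived from petal length and width, e.g. along these lines (a hypothetical reconstruction, not code from the lecture):
iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width / 2  # hypothetical; matches the values shown above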

Fitting a simple regression model in R

m0 <- lm(Petal.Area ~ Sepal.Length, data = iris)
summary(m0)
# 
# Call:
# lm(formula = Petal.Area ~ Sepal.Length, data = iris)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -2.671 -0.794 -0.099  0.730  3.489 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   -11.357      0.711   -16.0   <2e-16 ***
# Sepal.Length    2.439      0.120    20.3   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 1.22 on 148 degrees of freedom
# Multiple R-squared:  0.735,   Adjusted R-squared:  0.733 
# F-statistic:  410 on 1 and 148 DF,  p-value: <2e-16
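  • The quantities reported by summary() can also be extracted programmatically (a brief sketch using standard R accessor functions):
coef(m0)  # intercept and slope estimates
confint(m0)  # 95% confidence intervals for the coefficients
summary(m0)$r.squared  # proportion of variance explained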

Visualization

library(visreg)  # package containing visualization function visreg
visreg(m0)  # visualize regression line together with data points

[Figure: visreg output for m0]

  • The blue regression line shows the predicted (fitted) values of the model

Numerical interpretation

  • \(y_i = \beta_1 x_i + \beta_0 + \epsilon_i\)
round(m0$coefficients, 2)
#  (Intercept) Sepal.Length 
#       -11.36         2.44
  • Petal.Area = 2.44 \(\times\) Sepal.Length + -11.36
  • For sepal length of 5.1, predicted (fitted) petal area: 2.44 \(\times\) 5.1 + -11.36 = 1.08
iris$FittedPA <- fitted(m0)
head(iris, 2)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Area Species FittedPA
# 1          5.1         3.5          1.4         0.2       0.14  setosa  1.08376
# 2          4.9         3.0          1.4         0.2       0.14  setosa  0.59589
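  • The same fitted value can be obtained with predict(), and the residuals are indeed the difference between the observed and fitted values (a small check using the model m0 from above):
predict(m0, newdata = data.frame(Sepal.Length = 5.1))  # predicted petal area for a sepal length of 5.1
all.equal(iris$Petal.Area, as.numeric(fitted(m0) + resid(m0)))  # observed = fitted + residual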

Interpretation of intercept

#  (Intercept) Sepal.Length 
#       -11.36         2.44

[Figure: interpretation of the intercept]

  • For sepal length of 0, predicted petal area: 2.44 \(\times\) 0 + -11.36 = -11.36
  • The estimate for the intercept is an artificial result: there are no irises with a sepal length of 0!

Question 2

Refitting the model with a centered predictor

iris$Sepal.Length.c <- iris$Sepal.Length - mean(iris$Sepal.Length)  # centered Sepal.Length
m0.c <- lm(Petal.Area ~ Sepal.Length.c, data = iris)

library(mosaic)  # contains msummary function
msummary(m0.c)  # more concise summary (slide friendly!)
#                Estimate Std. Error t value Pr(>|t|)    
# (Intercept)      2.8970     0.0994    29.1   <2e-16 ***
# Sepal.Length.c   2.4394     0.1204    20.3   <2e-16 ***
# 
# Residual standard error: 1.22 on 148 degrees of freedom
# Multiple R-squared:  0.735,   Adjusted R-squared:  0.733 
# F-statistic:  410 on 1 and 148 DF,  p-value: <2e-16
  • Only the estimate of the intercept changed (the estimate for Sepal.Length in the uncentered model was also 2.4394)
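  • Centering can also be done with R's scale() function (an equivalent alternative, shown here as a sketch; the [, 1] drops the matrix structure that scale() returns, and the column name Sepal.Length.c2 is only used for this comparison)
iris$Sepal.Length.c2 <- scale(iris$Sepal.Length, center = TRUE, scale = FALSE)[, 1]
all.equal(iris$Sepal.Length.c, iris$Sepal.Length.c2)  # identical to the manually centered predictor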

Fitted values are identical

iris$FittedPA.c <- fitted(m0.c)
head(iris[, c("Sepal.Length", "Sepal.Length.c", "FittedPA", "FittedPA.c")])
#   Sepal.Length Sepal.Length.c FittedPA FittedPA.c
# 1          5.1       -0.74333  1.08376    1.08376
# 2          4.9       -0.94333  0.59589    0.59589
# 3          4.7       -1.14333  0.10801    0.10801
# 4          4.6       -1.24333 -0.13593   -0.13593
# 5          5.0       -0.84333  0.83982    0.83982
# 6          5.4       -0.44333  1.81558    1.81558

Interpretation of intercept with centered predictor

#    (Intercept) Sepal.Length.c 
#           2.90           2.44
  • For a centered sepal length of 0 (an average sepal length of about 5.8), the predicted petal area is: 2.44 \(\times\) 0 + 2.9 = 2.9
  • The estimate for the intercept is now more natural: it reflects the value of the dependent variable (petal area) when the predictor (sepal length) is at its average value!
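  • This can be verified with predict() (a small check, not part of the original slides):
mean(iris$Sepal.Length)  # the average sepal length, about 5.8
predict(m0.c, newdata = data.frame(Sepal.Length.c = 0))  # equals the intercept, about 2.9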

[Figure: interpretation of the intercept with a centered predictor]

Multiple regression

  • \(y = \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \beta_0 + \epsilon\)
    • (Note that the \(i\) subscripts were dropped here for simplicity)
    • \(\beta_1\) should be interpreted as the effect of \(x_1\), while controlling for the other variables (\(x_2\) ... \(x_n\)) in the model
    • \(\beta_2\) should be interpreted as the effect of \(x_2\), while controlling for the other variables in the model
    • \(\beta_0\) (Intercept) is the value of \(y\) when all of \(x_1\), \(x_2\), ..., \(x_n\) are 0
      • This might be a value which has no meaning!
      • To give it meaning, centering the numerical predictors (subtracting their mean) helps
      • When the predictors are centered, the intercept corresponds to the value of the dependent variable when all predictors are at their mean value

Question 3

Fitting a multiple regression model
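
  • The centered sepal width used below is presumably created analogously to the centered sepal length (a sketch; this step is not shown in the slides):
iris$Sepal.Width.c <- iris$Sepal.Width - mean(iris$Sepal.Width)  # centered Sepal.Width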

m1 <- lm(Petal.Area ~ Sepal.Length.c + Sepal.Width.c, data = iris)
msummary(m1)
#                Estimate Std. Error t value Pr(>|t|)    
# (Intercept)      2.8970     0.0928   31.23  < 2e-16 ***
# Sepal.Length.c   2.3757     0.1132   20.99  < 2e-16 ***
# Sepal.Width.c   -1.0285     0.2150   -4.78  4.1e-06 ***
# 
# Residual standard error: 1.14 on 147 degrees of freedom
# Multiple R-squared:  0.771,   Adjusted R-squared:  0.768 
# F-statistic:  247 on 2 and 147 DF,  p-value: <2e-16
  • Note that we use centered variables to facilitate interpretation: when both sepal length and width are at their average value, the predicted petal area equals 2.897
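  • Whether adding sepal width improves the model can also be tested by comparing the two nested models (a sketch using R's anova() for nested model comparison; not part of the original slides):
anova(m0.c, m1)  # F-test comparing the single-predictor model to the two-predictor model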

Visualization

visreg(m1)

[Figure: visreg output for m1]

2D visualization

(use plot.type = 'rgl' in visreg2d to obtain an interactive plot)

visreg2d(m1, 'Sepal.Length.c', 'Sepal.Width.c')

[Figure: visreg2d output for m1]
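  • As noted above, an interactive 3D version can be obtained via plot.type = 'rgl' (assuming the rgl package is installed):
visreg2d(m1, 'Sepal.Length.c', 'Sepal.Width.c', plot.type = 'rgl')  # interactive 3D surface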