# Regression

Martijn Wieling
University of Groningen

## This lecture

• Correlation
• Regression
• Linear regression
• Multiple regression
• Interpreting interactions
• Regression assumptions and model criticism

## Correlation

• Quantify relation between two numerical variables (interval or ratio scale)
• $$-1 \leq r \leq 1$$ indicates strength (effect size) and direction

## Linear regression

• To assess relationship between numerical dependent variable and one (simple regression) or more (multiple regression) quantitative or categorical predictor variables
• Measures impact of each individual variable on dependent variable, while controlling for other variables in the model
• Note that regression is equivalent to ANOVA, but the focus is different: group comparisons vs. relation between numerical variables

## Linear regression: formula

• Linear regression captures relationship between dependent variable and independent variables using a formula
• $$y_i = \beta_1 x_i + \beta_0 + \epsilon_i$$
• With $$y_i$$: dependent variable, $$x_i$$: independent variable, $$\beta_0$$: intercept (value of $$y_i$$ when $$x_i$$ equals 0), $$\beta_1$$: coefficient (slope) for all $$x_i$$, and $$\epsilon_i$$: error (residuals; all residuals follow normal distribution with mean 0)

## Dataset for this lecture

head(iris)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Area Species
# 1          5.1         3.5          1.4         0.2       0.14  setosa
# 2          4.9         3.0          1.4         0.2       0.14  setosa
# 3          4.7         3.2          1.3         0.2       0.13  setosa
# 4          4.6         3.1          1.5         0.2       0.15  setosa
# 5          5.0         3.6          1.4         0.2       0.14  setosa
# 6          5.4         3.9          1.7         0.4       0.34  setosa


## Fitting a simple regression model in R

m0 <- lm(Petal.Area ~ Sepal.Length, data = iris)
summary(m0)

#
# Call:
# lm(formula = Petal.Area ~ Sepal.Length, data = iris)
#
# Residuals:
#    Min     1Q Median     3Q    Max
# -2.671 -0.794 -0.099  0.730  3.489
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)   -11.357      0.711   -16.0   <2e-16 ***
# Sepal.Length    2.439      0.120    20.3   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 1.22 on 148 degrees of freedom
# Multiple R-squared:  0.735,   Adjusted R-squared:  0.733
# F-statistic:  410 on 1 and 148 DF,  p-value: <2e-16


## Visualization

library(visreg)  # package containing visualization function visreg
visreg(m0)  # visualize regression line together with data points


• The blue regression line shows the predicted (fitted) values of the model

## Numerical interpretation

• $$y_i = \beta_1 x_i + \beta_0 + \epsilon_i$$
round(m0$coefficients, 2)  # (Intercept) Sepal.Length # -11.36 2.44  • Petal.Area = 2.44 $$\times$$ Sepal.Length + -11.36 • For sepal length of 5.1, predicted (fitted) petal area: 2.44 $$\times$$ 5.1 + -11.36 = 1.08 iris$FittedPA <- fitted(m0)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Area Species FittedPA
# 1          5.1         3.5          1.4         0.2       0.14  setosa  1.08376
# 2          4.9         3.0          1.4         0.2       0.14  setosa  0.59589


## Interpretation of intercept

#  (Intercept) Sepal.Length
#       -11.36         2.44


• For sepal length of 0, predicted petal area: 2.44 $$\times$$ 0 + -11.36 = -11.36
• The estimate for the intercept is an artificial result: there are no irises with a sepal length of 0!

iris$Sepal.Length.c <- iris$Sepal.Length - mean(iris$Sepal.Length) # centered Sepal.Length m0.c <- lm(Petal.Area ~ Sepal.Length.c, data = iris) library(mosaic) # contains msummary function msummary(m0.c) # more concise summary (slide friendly!)  # Estimate Std. Error t value Pr(>|t|) # (Intercept) 2.8970 0.0994 29.1 <2e-16 *** # Sepal.Length.c 2.4394 0.1204 20.3 <2e-16 *** # # Residual standard error: 1.22 on 148 degrees of freedom # Multiple R-squared: 0.735, Adjusted R-squared: 0.733 # F-statistic: 410 on 1 and 148 DF, p-value: <2e-16  • Only the estimate of the intercept changed (the estimate for Sepal.Length was 2.4394) ## Fitted values are identical iris$FittedPA.c <- fitted(m0.c)

#   Sepal.Length Sepal.Length.c FittedPA FittedPA.c
# 1          5.1       -0.74333  1.08376    1.08376
# 2          4.9       -0.94333  0.59589    0.59589
# 3          4.7       -1.14333  0.10801    0.10801
# 4          4.6       -1.24333 -0.13593   -0.13593
# 5          5.0       -0.84333  0.83982    0.83982
# 6          5.4       -0.44333  1.81558    1.81558


## Interpretation of intercept with centered predictor

#    (Intercept) Sepal.Length.c
#           2.90           2.44

• For a centered sepal length of 0 (an average sepal length of about 5.8), the predicted petal area is: 2.44 $$\times$$ 0 + 2.9 = 2.9
• The estimate for the intercept is now more natural: it reflects the value of the dependent variable (petal area) when the predictor (sepal length) is at its average value!

## Multiple regression

• $$y = \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \beta_0 + \epsilon$$
• (Note that the $$i$$ subscripts were dropped here for simplicity)
• $$\beta_1$$ should be interpreted as the effect of $$x_1$$, while controlling for the other variables ($$x_2$$ ... $$x_n$$) in the model
• $$\beta_2$$ should be interpreted as the effect of $$x_2$$, while controlling for the other variables in the model
• $$\beta_0$$ (Intercept) is the value when all $$x_1$$, $$x_2$$, ..., $$x_n$$ are 0
• This might be a value which has no meaning!
• To give it meaning, centering (subtracting mean) numerical variables helps
• When the variables are centered, the intercept corresponds with the value of the dependent variable when all predictor values are at their mean value

## Fitting a multiple regression model

m1 <- lm(Petal.Area ~ Sepal.Length.c + Sepal.Width.c, data = iris)
msummary(m1)

#                Estimate Std. Error t value Pr(>|t|)
# (Intercept)      2.8970     0.0928   31.23  < 2e-16 ***
# Sepal.Length.c   2.3757     0.1132   20.99  < 2e-16 ***
# Sepal.Width.c   -1.0285     0.2150   -4.78  4.1e-06 ***
#
# Residual standard error: 1.14 on 147 degrees of freedom
# Multiple R-squared:  0.771,   Adjusted R-squared:  0.768
# F-statistic:  247 on 2 and 147 DF,  p-value: <2e-16

• Note that we use centered variables to facilitate interpretation: when both sepal length and width are at their average value, the predicted petal area equals 2.897

## Visualization

visreg(m1)


## 2D visualization

#### (use plot.type = 'rgl' in visreg2d to obtain interactive plot)

visreg2d(m1, 'Sepal.Length.c', 'Sepal.Width.c')