# Regression

Martijn Wieling
University of Groningen

## This lecture

• Correlation
• Regression
• Linear regression
• Multiple regression
• Interpreting interactions
• Regression assumptions
• Logistic regression

## Correlation

• Quantifies the relation between two numerical variables (interval or ratio scale)
• $$-1 \leq r \leq 1$$ indicates strength (effect size) and direction (a small sketch follows below)
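• Not part of the original slides: a minimal sketch (using the standard iris data, also used below) showing that $$r$$ is the covariance of the two variables scaled by both standard deviations:

data(iris)  # standard dataset with iris measurements
r_manual <- cov(iris$Sepal.Length, iris$Petal.Length) / (sd(iris$Sepal.Length) * sd(iris$Petal.Length))
r_manual  # identical to cor(iris$Sepal.Length, iris$Petal.Length): about 0.87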

## Obtain significance of correlation

data(iris)  # standard dataset: contains petal and sepal length and width and species of irises
cor(iris$Sepal.Length, iris$Petal.Length)  # provides r

# [1] 0.87175

cor.test(iris$Sepal.Length, iris$Petal.Length)  # provides statistical test

#
#   Pearson's product-moment correlation
#
# data:  x and y
# t = 21.6, df = 148, p-value <2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.82704 0.90551
# sample estimates:
#     cor
# 0.87175
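
• Not in the original slides: the $$t$$-value reported above follows from $$r$$ and the degrees of freedom via $$t = r \sqrt{N - 2} / \sqrt{1 - r^2}$$:

r <- cor(iris$Sepal.Length, iris$Petal.Length)
r * sqrt(nrow(iris) - 2) / sqrt(1 - r^2)  # about 21.6, matching the t-value above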


## Correlation for ordinal data: Spearman $$\rho$$

# Also used when the residuals are not normally distributed
cor.test(iris$Sepal.Length, iris$Petal.Length, method = "spearman")

#
#   Spearman's rank correlation rho
#
# data:  x and y
# S = 66400, p-value <2e-16
# alternative hypothesis: true rho is not equal to 0
# sample estimates:
#    rho
# 0.8819

# Similar result
cor.test(rank(iris$Sepal.Length), rank(iris$Petal.Length))$estimate

#    cor
# 0.8819

## Reporting results

• The Pearson correlation between the sepal length and petal length was positive with $$r = 0.87$$, $$df = 148$$, $$p_{two-tailed} < 0.001$$.
• Note that the number of degrees of freedom for a correlation is: $$N - 2$$

## Visualizing multiple correlations

library(corrgram)
corrgram(iris[, c("Sepal.Width", "Sepal.Length", "Petal.Length")], lower.panel = panel.shade,
    upper.panel = panel.pie)

## Correlation: sensitivity to outliers

#### http://eolomea.let.rug.nl/Correlation (login: f112300 and ShinyDem0)

## Correlation is not causation!

## Linear regression

• To assess the relationship between a numerical dependent variable and one (simple regression) or more (multiple regression) quantitative or categorical predictor variables
• Measures the impact of each individual variable on the dependent variable, while controlling for the other variables in the model
• Note that regression is equivalent to ANOVA, but the focus is different: group comparisons vs. the relation between numerical variables

## Linear regression: formula

• Linear regression captures the relationship between the dependent variable and the independent variables using a formula
• $$y_i = \beta_1 x_i + \beta_0 + \epsilon_i$$
• With $$y_i$$: dependent variable, $$x_i$$: independent variable, $$\beta_0$$: intercept (value of $$y_i$$ when $$x_i$$ equals 0), $$\beta_1$$: coefficient (slope) for all $$x_i$$, and $$\epsilon_i$$: error (residuals; all residuals follow a normal distribution with mean 0)

## Residuals: difference between actual and fitted values

## Linear regression: slope and intercept

## Visualization is essential!

## Dataset for this lecture

head(iris)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Area Species
# 1          5.1         3.5          1.4         0.2       0.14  setosa
# 2          4.9         3.0          1.4         0.2       0.14  setosa
# 3          4.7         3.2          1.3         0.2       0.13  setosa
# 4          4.6         3.1          1.5         0.2       0.15  setosa
# 5          5.0         3.6          1.4         0.2       0.14  setosa
# 6          5.4         3.9          1.7         0.4       0.34  setosa

## Fitting a simple regression model in R

m0 <- lm(Petal.Area ~ Sepal.Length, data = iris)
summary(m0)

#
# Call:
# lm(formula = Petal.Area ~ Sepal.Length, data = iris)
#
# Residuals:
#    Min     1Q Median     3Q    Max
# -2.671 -0.794 -0.099  0.730  3.489
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)   -11.357      0.711   -16.0   <2e-16 ***
# Sepal.Length    2.439      0.120    20.3   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 1.22 on 148 degrees of freedom
# Multiple R-squared:  0.735,  Adjusted R-squared:  0.733
# F-statistic:  410 on 1 and 148 DF,  p-value: <2e-16

## Visualization

library(visreg)  # package containing visualization function visreg
visreg(m0)  # visualize regression line together with data points

• The blue regression line shows the predicted (fitted) values of the model

## Interpretation

• $$y_i = \beta_1 x_i + \beta_0 + \epsilon_i$$

round(m0$coefficients, 2)

#  (Intercept) Sepal.Length
#       -11.36         2.44

• Petal.Area = 2.44 $$\times$$ Sepal.Length - 11.36
• For a sepal length of 5.1, the predicted (fitted) petal area is: 2.44 $$\times$$ 5.1 - 11.36 = 1.08
iris$FittedPA <- fitted(m0)
head(iris, 2)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Area Species FittedPA
# 1          5.1         3.5          1.4         0.2       0.14  setosa  1.08376
# 2          4.9         3.0          1.4         0.2       0.14  setosa  0.59589

## Interpretation of intercept

#  (Intercept) Sepal.Length
#       -11.36         2.44

• For a sepal length of 0, the predicted petal area is: 2.44 $$\times$$ 0 - 11.36 = -11.36
• The estimate for the intercept is an artificial result: there are no irises with a sepal length of 0!

## Question 2

## Refitting the model with a centered predictor

iris$Sepal.Length.c <- iris$Sepal.Length - mean(iris$Sepal.Length)  # centered Sepal.Length
m0.c <- lm(Petal.Area ~ Sepal.Length.c, data = iris)

library(mosaic)  # contains msummary function
msummary(m0.c)  # more concise summary (slide friendly!)

#                Estimate Std. Error t value Pr(>|t|)
# (Intercept)      2.8970     0.0994    29.1   <2e-16 ***
# Sepal.Length.c   2.4394     0.1204    20.3   <2e-16 ***
#
# Residual standard error: 1.22 on 148 degrees of freedom
# Multiple R-squared:  0.735,   Adjusted R-squared:  0.733
# F-statistic:  410 on 1 and 148 DF,  p-value: <2e-16

• Only the estimate of the intercept changed (the slope estimate for Sepal.Length, 2.4394, is identical to that of the uncentered model)
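• Not in the original slides: the new intercept is the old model's prediction at the mean sepal length, i.e. $$\beta_0' = \beta_0 + \beta_1 \bar{x}$$, which can be verified directly:

coef(m0)[1] + coef(m0)[2] * mean(iris$Sepal.Length)  # about 2.897, the intercept of m0.c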

## Fitted values are identical

iris\$FittedPA.c <- fitted(m0.c)

#   Sepal.Length Sepal.Length.c FittedPA FittedPA.c
# 1          5.1       -0.74333  1.08376    1.08376
# 2          4.9       -0.94333  0.59589    0.59589
# 3          4.7       -1.14333  0.10801    0.10801
# 4          4.6       -1.24333 -0.13593   -0.13593
# 5          5.0       -0.84333  0.83982    0.83982
# 6          5.4       -0.44333  1.81558    1.81558
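
• A further check (not in the original slides): centering also leaves the residuals, and therefore the model fit, unchanged:

all.equal(fitted(m0), fitted(m0.c))  # TRUE
all.equal(resid(m0), resid(m0.c))  # TRUE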


## Interpretation of intercept with centered predictor

#    (Intercept) Sepal.Length.c
#           2.90           2.44

• For a centered sepal length of 0 (an average sepal length of about 5.8), the predicted petal area is: 2.44 $$\times$$ 0 + 2.9 = 2.9
• The estimate for the intercept is now more natural: it reflects the value of the dependent variable (petal area) when the predictor (sepal length) is at its average value!
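• A small verification (not in the original slides): asking the centered model for a prediction at a centered sepal length of 0 indeed returns the intercept:

predict(m0.c, newdata = data.frame(Sepal.Length.c = 0))  # 2.897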

## Multiple regression

• $$y = \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \beta_0 + \epsilon$$
• (Note that the $$i$$ subscripts were dropped here for simplicity)
• $$\beta_1$$ should be interpreted as the effect of $$x_1$$, while controlling for the other variables ($$x_2$$ ... $$x_n$$) in the model
• $$\beta_2$$ should be interpreted as the effect of $$x_2$$, while controlling for the other variables in the model
• $$\beta_0$$ (Intercept) is the value of $$y$$ when all of $$x_1$$, $$x_2$$, ..., $$x_n$$ are 0
• This might be a value that has no meaning!
• To give it meaning, centering the numerical variables (subtracting their mean) helps (see the sketch below)
• When the variables are centered, the intercept corresponds to the value of the dependent variable when all predictors are at their mean value
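• Not in the original slides: R's scale() function with scale = FALSE performs the same mean-subtraction, which can be handy when centering several predictors:

all.equal(scale(iris$Sepal.Length, scale = FALSE)[, 1], iris$Sepal.Length.c)  # TRUE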

## Fitting a multiple regression model

iris$Sepal.Width.c <- iris$Sepal.Width - mean(iris$Sepal.Width)  # centered Sepal.Width
m1 <- lm(Petal.Area ~ Sepal.Length.c + Sepal.Width.c, data = iris)
msummary(m1)

#                Estimate Std. Error t value Pr(>|t|)
# (Intercept)      2.8970     0.0928   31.23  < 2e-16 ***
# Sepal.Length.c   2.3757     0.1132   20.99  < 2e-16 ***
# Sepal.Width.c   -1.0285     0.2150   -4.78  4.1e-06 ***
#
# Residual standard error: 1.14 on 147 degrees of freedom
# Multiple R-squared:  0.771,   Adjusted R-squared:  0.768
# F-statistic:  247 on 2 and 147 DF,  p-value: <2e-16

• Note that we use centered variables to facilitate interpretation: when both sepal length and width are at their average value, the predicted petal area equals 2.897
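• Not in the original slides: a fitted value can be computed by hand from these estimates, e.g. for a flower 0.5 cm above average in both sepal length and width:

coef(m1)["(Intercept)"] + coef(m1)["Sepal.Length.c"] * 0.5 + coef(m1)["Sepal.Width.c"] * 0.5  # about 3.57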

## Visualization

visreg(m1)  # one panel per predictor; the other predictor is held constant (at its median by default)


## 2D visualization

visreg2d(m1, "Sepal.Length.c", "Sepal.Width.c")  # predicted petal area as a function of both predictors