# Regression

Martijn Wieling
University of Groningen

## This lecture

• Correlation
• Regression
• Linear regression
• Multiple regression
• Interpreting interactions
• Regression assumptions
• Logistic regression

## Dataset for this lecture

head(iris)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
# 1          5.1         3.5          1.4         0.2  setosa       0.14
# 2          4.9         3.0          1.4         0.2  setosa       0.14
# 3          4.7         3.2          1.3         0.2  setosa       0.13
# 4          4.6         3.1          1.5         0.2  setosa       0.15
# 5          5.0         3.6          1.4         0.2  setosa       0.14
# 6          5.4         3.9          1.7         0.4  setosa       0.34
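Note that Petal.Area is not a column of the built-in iris data. The values shown above are consistent with half the product of petal length and petal width; the snippet below recreates the column under that assumption (how the column was actually derived is not stated in these slides):

```r
# Recreate the Petal.Area column used in this lecture
# (assumed derivation: length * width / 2, matching the values above)
data(iris)
iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width / 2
round(head(iris$Petal.Area), 2)  # 0.14 0.14 0.13 0.15 0.14 0.34
```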


## Correlation

• Quantify relation between two numerical variables (interval or ratio scale)
• $$-1 \leq r \leq 1$$ indicates strength (effect size) and direction

## Obtain significance of correlation

data(iris)  # load iris data set
cor(iris$Sepal.Length, iris$Petal.Length)  # provides r

# [1] 0.87175

cor.test(iris$Sepal.Length, iris$Petal.Length)  # provides statistical test

#
#   Pearson's product-moment correlation
#
# data:  x and y
# t = 21.6, df = 148, p-value <2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.82704 0.90551
# sample estimates:
#     cor
# 0.87175
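The $$t$$ value reported by cor.test follows directly from $$r$$ and the degrees of freedom via $$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$$. A quick check against the output above:

```r
# Reconstruct the t value of the correlation test from r and df = N - 2
r <- cor(iris$Sepal.Length, iris$Petal.Length)
n <- nrow(iris)  # 150 observations
t <- r * sqrt(n - 2) / sqrt(1 - r^2)
round(t, 1)  # 21.6, with df = 148
```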


## Correlation for ordinal data: Spearman $$\rho$$

# Also used when the residuals are not normally distributed
cor.test(iris$Sepal.Length, iris$Petal.Length, method = "spearman")

#
#   Spearman's rank correlation rho
#
# data:  x and y
# S = 66429, p-value <2e-16
# alternative hypothesis: true rho is not equal to 0
# sample estimates:
#    rho
# 0.8819

# Similar result
cor.test(rank(iris$Sepal.Length), rank(iris$Petal.Length))$estimate  # Pearson r of the ranks

#    cor
# 0.8819


## Reporting results

• The Pearson correlation between the sepal length and petal length was positive with $$r = 0.87$$, $$df = 148$$, $$p_{two-tailed} < 0.001$$
• Note that the number of degrees of freedom for a correlation is $$N - 2$$

## Visualizing multiple correlations

library(corrgram)
corrgram(iris[, c("Sepal.Width", "Sepal.Length", "Petal.Length")], lower.panel = panel.shade,
    upper.panel = panel.pie)

## Correlation: sensitivity to outliers

#### https://eolomea.let.rug.nl/Correlation (login: f112300 and ShinyDem0)

## Correlation is not causation!

## Linear regression

• To assess the relationship between a numerical dependent variable and one (simple regression) or more (multiple regression) quantitative or categorical predictor variables
• Measures the impact of each individual variable on the dependent variable, while controlling for the other variables in the model
• Note that regression is equivalent to ANOVA, but the focus is different: the relation between numerical variables vs. group comparisons

## Linear regression: formula

• Linear regression captures the relationship between the dependent variable and the independent variables using a formula
• $$y_i = \beta_1 x_i + \beta_0 + \epsilon_i$$
• With $$y_i$$: dependent variable, $$x_i$$: independent variable, $$\beta_0$$: intercept (value of $$y_i$$ when $$x_i$$ equals 0), $$\beta_1$$: coefficient (slope) for all $$x_i$$, and $$\epsilon_i$$: error (residuals, which follow a normal distribution with mean 0)

## Residuals: difference between actual and fitted values

## Linear regression: slope and intercept

## Visualization is essential!

## Fitting a simple regression model in R

m0 <- lm(Petal.Area ~ Sepal.Length, data = iris)
summary(m0)

#
# Call:
# lm(formula = Petal.Area ~ Sepal.Length, data = iris)
#
# Residuals:
#    Min     1Q Median     3Q    Max
# -2.671 -0.794 -0.099  0.730  3.489
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)   -11.357      0.711   -16.0   <2e-16 ***
# Sepal.Length    2.439      0.120    20.3   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 1.22 on 148 degrees of freedom
# Multiple R-squared:  0.735, Adjusted R-squared:  0.733
# F-statistic:  410 on 1 and 148 DF,  p-value: <2e-16

## Visualization

library(visreg)  # package containing the visualization function visreg
visreg(m0)  # visualize the regression line together with the data points

• The blue regression line shows the predicted (fitted) values of the model

## Interpretation

• $$y_i = \beta_1 x_i + \beta_0 + \epsilon_i$$

round(m0$coefficients, 2)

#  (Intercept) Sepal.Length
#       -11.36         2.44

• Petal.Area = 2.44 $$\times$$ Sepal.Length $$-$$ 11.36
• For a sepal length of 5.1, the predicted (fitted) petal area is 2.44 $$\times$$ 5.1 $$-$$ 11.36 = 1.08

iris$FittedPA <- fitted(m0)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area FittedPA

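The fitted values stored above also let us reconstruct the residuals by hand, as the difference between the actual and fitted values. A self-contained sketch (Petal.Area is recomputed here as petal length $$\times$$ width / 2, the assumed derivation from the data slide):

```r
# Residuals = actual values minus fitted values
data(iris)
iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width / 2  # assumed derivation
m0 <- lm(Petal.Area ~ Sepal.Length, data = iris)
res <- iris$Petal.Area - fitted(m0)  # by hand
all.equal(res, resid(m0))  # TRUE: identical to the residuals stored in the model
```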