Correlation and regression

Martijn Wieling
BCN Statistics Course, University of Groningen

This lecture

  • Correlation
  • Linear regression
  • Logistic regression

Correlation

  • Quantify relation between two numerical variables (interval or ratio scale)
    • \(-1 \leq r \leq 1\) indicates strength (effect size) and direction
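    • For reference, Pearson's \(r\) is the covariance of \(x\) and \(y\) scaled by their standard deviations:
      • \(r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}\)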

[figure: example scatterplots illustrating the strength and direction of correlations]

Residuals: difference between actual and fitted values

[figure: scatterplot with regression line and residuals shown as deviations from the line]

Correlation: sensitivity to outliers

Obtain significance of correlation

data(iris)  # standard dataset: contains petal and sepal length and width and species of irises
cor.test(iris$Sepal.Length, iris$Petal.Length)  # cor(A,B) just gives the correlation
# 
#   Pearson's product-moment correlation
# 
# data:  iris$Sepal.Length and iris$Petal.Length
# t = 21.6, df = 148, p-value <2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.82704 0.90551
# sample estimates:
#     cor 
# 0.87175

Question 1

Correlation for ordinal data: Spearman \(\rho\)

# Also used when the residuals are not normally distributed
cor.test(iris$Sepal.Length, iris$Petal.Length, method = "spearman")
# 
#   Spearman's rank correlation rho
# 
# data:  iris$Sepal.Length and iris$Petal.Length
# S = 66400, p-value <2e-16
# alternative hypothesis: true rho is not equal to 0
# sample estimates:
#    rho 
# 0.8819
# Equivalent: Pearson's r computed on the ranks gives the same result
iris$Sepal.Length.rank <- rank(iris$Sepal.Length)
iris$Petal.Length.rank <- rank(iris$Petal.Length)
cor.test(iris$Sepal.Length.rank, iris$Petal.Length.rank)$estimate
#    cor 
# 0.8819

Reporting results

  • The Pearson correlation between the sepal length and petal length was positive with \(r = 0.87\), \(df = 148\), \(p_{two-tailed} < 0.001\).
    • Note that the degrees of freedom for a correlation are \(N - 2\)
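
These reported values can be extracted directly from the cor.test() object (a small sketch using the iris example above):

ct <- cor.test(iris$Sepal.Length, iris$Petal.Length)
unname(ct$estimate)  # r = 0.87 (rounded), cf. the output above
unname(ct$parameter)  # df = 148
ct$p.value < 0.001  # TRUE, hence reported as p < 0.001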

Visualizing multiple correlations

library(corrgram)
corrgram(iris[, c("Sepal.Width", "Sepal.Length", "Petal.Length")], lower.panel = panel.shade, 
    upper.panel = panel.pie)

[figure: corrgram of Sepal.Width, Sepal.Length, and Petal.Length]

Correlation is not causation!

Linear regression

  • To assess relationship between numerical dependent variable and one (simple regression) or more (multiple regression) quantitative or categorical predictor variables
    • Measures impact of each individual variable on dependent variable, while controlling for other variables in the model
    • ANOVA and regression are equivalent
      • Only the focus is different: ANOVA focuses on group comparison, regression on numerical relationships, but ANCOVA can handle numerical predictors and regression can handle group comparisons
      • I prefer regression, as it prevents thinking about the data in terms of groups: dichotomization of continuous variables is bad practice

Question 2

Linear regression: formula

  • Linear regression captures relationship between dependent variable and independent variables using a formula
    • \(y_i = \beta_1 x_i + \beta_0 + \epsilon_i\)
    • With \(y_i\): dependent variable, \(x_i\): independent variable, \(\beta_0\): intercept, \(\beta_1\): coefficient (slope) for all \(x_i\), and \(\epsilon_i\): error (residuals; all residuals follow distribution \(N(0,\sigma^2)\))

Linear regression: slope and intercept

Fitting a simple regression model

iris$Petal.Area <- iris$Petal.Width * iris$Petal.Length
model0 <- lm(Petal.Area ~ Sepal.Length, data = iris)
summary(model0)
# 
# Call:
# lm(formula = Petal.Area ~ Sepal.Length, data = iris)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -5.343 -1.588 -0.198  1.459  6.978 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   -22.714      1.421   -16.0   <2e-16 ***
# Sepal.Length    4.879      0.241    20.3   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.43 on 148 degrees of freedom
# Multiple R-squared:  0.735,   Adjusted R-squared:  0.733 
# F-statistic:  410 on 1 and 148 DF,  p-value: <2e-16
  • Effect size: (adjusted) \(R^2\) ~ explained variance (here: about 73%)

Interpretation

  • \(y_i = \beta_1 x_i + \beta_0 + \epsilon_i\)
round(model0$coefficients, 2)
#  (Intercept) Sepal.Length 
#       -22.71         4.88
  • Petal.Area = 4.88 \(\times\) Sepal.Length + -22.71
  • For sepal length of 5.1, predicted (fitted) petal area: 4.88 \(\times\) 5.1 + -22.71 = 2.17
iris$FittedPA <- fitted(model0)
head(iris, 2)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area FittedPA
# 1          5.1         3.5          1.4         0.2  setosa       0.28   2.1675
# 2          4.9         3.0          1.4         0.2  setosa       0.28   1.1918

Multiple regression

  • \(y = \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \beta_0 + \epsilon\)
    • (Note that the \(i\) subscripts were dropped here for simplicity)
    • \(\beta_1\) should be interpreted as the effect of \(x_1\), while controlling for the other variables (\(x_2\) ... \(x_n\)) in the model
    • \(\beta_2\) should be interpreted as the effect of \(x_2\), while controlling for the other variables in the model
    • \(\beta_0\) (Intercept) is the value when all \(x_1\), \(x_2\), ..., \(x_n\) are 0
      • This might be a value which has no meaning!
      • To give it meaning, centering (subtracting mean) numerical variables helps
      • When the variables are centered, the intercept corresponds with the value of the dependent variable when all predictor values are at their mean value

Fitting a multiple regression model

model1 <- lm(Petal.Area ~ Sepal.Length + Sepal.Width, data = iris)
summary(model1)
# 
# Call:
# lm(formula = Petal.Area ~ Sepal.Length + Sepal.Width, data = iris)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -4.416 -1.491 -0.298  1.178  7.535 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   -15.681      1.980   -7.92  5.4e-13 ***
# Sepal.Length    4.751      0.226   20.99  < 2e-16 ***
# Sepal.Width    -2.057      0.430   -4.78  4.1e-06 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.27 on 147 degrees of freedom
# Multiple R-squared:  0.771,   Adjusted R-squared:  0.768 
# F-statistic:  247 on 2 and 147 DF,  p-value: <2e-16

Interpretation

  • \(y = \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \beta_0 + \epsilon\)
round(model1$coefficients, 2)
#  (Intercept) Sepal.Length  Sepal.Width 
#       -15.68         4.75        -2.06
  • Petal.Area = 4.75 \(\times\) Sepal.Length + -2.06 \(\times\) Sepal.Width + -15.68

  • For sepal length of 5.1 and sepal width of 3.5, predicted (fitted) petal area:
    4.75 \(\times\) 5.1 + -2.06 \(\times\) 3.5 + -15.68 = 1.35

iris$FittedPA <- fitted(model1)
head(iris, 1)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area FittedPA
# 1          5.1         3.5          1.4         0.2  setosa       0.28   1.3515

Model comparison: more complex model necessary?

anova(model0, model1)  # model comparison
# Analysis of Variance Table
# 
# Model 1: Petal.Area ~ Sepal.Length
# Model 2: Petal.Area ~ Sepal.Length + Sepal.Width
#   Res.Df RSS Df Sum of Sq    F  Pr(>F)    
# 1    148 877                              
# 2    147 759  1       118 22.9 4.1e-06 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model0) - AIC(model1)  # AIC decrease >= 2: more complex model should be used
# [1] 19.703
  • Evidence ratio: the model with lower AIC is \(e^{\frac{\Delta AIC}{2}}\) times more likely than the model with higher AIC
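
For the comparison above (\(\Delta AIC \approx 19.7\)), the evidence ratio can be computed directly:

exp((AIC(model0) - AIC(model1))/2)  # about 19,000: model1 is vastly more likely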

Multiple regression: adding an interaction

  • \(y = \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \beta_0 + \epsilon\)
    • \(\beta_{12}\) is the coefficient for the interaction effect

Modeling an interaction

summary(model2 <- lm(Petal.Area ~ Sepal.Length + Sepal.Width + Sepal.Length:Sepal.Width, 
    data = iris))  # Shorthand: Sepal.Length * Sepal.Width
# 
# Call:
# lm(formula = Petal.Area ~ Sepal.Length + Sepal.Width + Sepal.Length:Sepal.Width, 
#     data = iris)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -4.713 -1.379 -0.223  1.176  7.240 
# 
# Coefficients:
#                          Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                14.047     10.708    1.31  0.19167    
# Sepal.Length               -0.488      1.869   -0.26  0.79438    
# Sepal.Width               -11.596      3.405   -3.41  0.00085 ***
# Sepal.Length:Sepal.Width    1.686      0.597    2.82  0.00543 ** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.22 on 146 degrees of freedom
# Multiple R-squared:  0.783,   Adjusted R-squared:  0.778 
# F-statistic:  175 on 3 and 146 DF,  p-value: <2e-16

Plotting an interaction

library(rockchalk)
plotSlopes(model2, Sepal.Length, modx = Sepal.Width)

[figure: effect of Sepal.Length on Petal.Area for different values of Sepal.Width]

Model comparison: interaction necessary?

anova(model1, model2)  # model comparison
# Analysis of Variance Table
# 
# Model 1: Petal.Area ~ Sepal.Length + Sepal.Width
# Model 2: Petal.Area ~ Sepal.Length + Sepal.Width + Sepal.Length:Sepal.Width
#   Res.Df RSS Df Sum of Sq    F Pr(>F)   
# 1    147 759                            
# 2    146 720  1      39.3 7.97 0.0054 **
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model1) - AIC(model2)  # AIC decrease >= 2: more complex model should be used
# [1] 5.9696
  • The more complex model is necessary (\(e^\frac{6}{2} \approx\) 20 times as likely)

Intermezzo: ANOVA gives identical results

library(car)
Anova(aov(Petal.Area ~ Sepal.Length * Sepal.Width, data = iris), type = 3)  # identical to: Anova(model2, type=3)
# Anova Table (Type III tests)
# 
# Response: Petal.Area
#                          Sum Sq  Df F value  Pr(>F)    
# (Intercept)                   8   1    1.72 0.19167    
# Sepal.Length                  0   1    0.07 0.79438    
# Sepal.Width                  57   1   11.59 0.00085 ***
# Sepal.Length:Sepal.Width     39   1    7.97 0.00543 ** 
# Residuals                   720 146                    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model2)$coef
#                           Estimate Std. Error t value   Pr(>|t|)
# (Intercept)               14.04668   10.70844  1.3117 0.19166689
# Sepal.Length              -0.48811    1.86945 -0.2611 0.79438330
# Sepal.Width              -11.59565    3.40542 -3.4051 0.00085441
# Sepal.Length:Sepal.Width   1.68611    0.59737  2.8226 0.00542907

Numerical interpretation

  • \(y = \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \beta_0 + \epsilon\)
#      (Intercept) Sepal.Length Sepal.Width Sepal.Length:Sepal.Width
# [1,]       14.05        -0.49       -11.6                     1.69
  • Petal.Area = -0.49 \(\times\) Sepal.Length + -11.6 \(\times\) Sepal.Width + 1.69 \(\times\) Sepal.Length \(\times\) Sepal.Width + 14.05
  • For sepal length of 5.1 and sepal width of 3.5, predicted (fitted) petal area:
    -0.49 \(\times\) 5.1 + -11.6 \(\times\) 3.5 + 1.69 \(\times\) 5.1 \(\times\) 3.5 + 14.05 = 1.07
iris$FittedPA <- fitted(model2)
head(iris, 1)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area FittedPA
# 1          5.1         3.5          1.4         0.2  setosa       0.28   1.0696

Question 3

Interpreting an interaction

  • When interpreting an interaction (A:B), the estimates for the terms involved in it (A and B) are not main effects
    • If the interaction A:B is significant, the significance of A indicates that the effect of A when B equals 0 differs significantly from 0
    • Similarly, the significance of B (when A:B is significant) indicates that the effect of B when A equals 0 differs significantly from 0
    • When A or B can never equal 0 (e.g., height and weight of a person), this may be an artificial result
      • Centering A or B gives a different meaning to the value 0 and therefore changes these results

Results with centered variables

iris$Sepal.Length.c <- iris$Sepal.Length - mean(iris$Sepal.Length)
iris$Sepal.Width.c <- iris$Sepal.Width - mean(iris$Sepal.Width)
model2b <- lm(Petal.Area ~ Sepal.Length.c + Sepal.Width.c + Sepal.Length.c:Sepal.Width.c, 
    data = iris)
round(summary(model2b)$coefficients, 2)  # now everything is significant
#                              Estimate Std. Error t value Pr(>|t|)
# (Intercept)                      5.87       0.18   32.05     0.00
# Sepal.Length.c                   4.67       0.22   20.91     0.00
# Sepal.Width.c                   -1.74       0.43   -4.01     0.00
# Sepal.Length.c:Sepal.Width.c     1.69       0.60    2.82     0.01
round(summary(model2)$coefficients, 2)
#                          Estimate Std. Error t value Pr(>|t|)
# (Intercept)                 14.05      10.71    1.31     0.19
# Sepal.Length                -0.49       1.87   -0.26     0.79
# Sepal.Width                -11.60       3.41   -3.41     0.00
# Sepal.Length:Sepal.Width     1.69       0.60    2.82     0.01

Prediction does not change

iris$FittedPA <- fitted(model2)
iris$FittedPA.c <- fitted(model2b)
head(iris[, c("Sepal.Length", "Sepal.Width", "FittedPA", "FittedPA.c")], 1)
#   Sepal.Length Sepal.Width FittedPA FittedPA.c
# 1          5.1         3.5   1.0696     1.0696

Categorical predictors

  • Categorical predictors (\(n\) levels) are modeled with (\(n-1\)) dummy variables
    • Dummy variable: either 0 or 1
    • Reference level: when all dummy variables are 0
      • Consequently, the reference level is incorporated in the Intercept (\(\beta_0\)) of a regression model
    • Dummy (treatment) coding done automatically in regression

Dummy coding

levels(iris$Species)  # factor levels of Species variable (by default ordered alphabetically)
# [1] "setosa"     "versicolor" "virginica"
contrasts(iris$Species)  # show internal dummy coding
#            versicolor virginica
# setosa              0         0
# versicolor          1         0
# virginica           0         1
iris$Species <- relevel(iris$Species, "versicolor")  # change reference level
contrasts(iris$Species)  # show new dummy coding
#            setosa virginica
# versicolor      0         0
# setosa          1         0
# virginica       0         1

Fitting a model with a categorical predictor

model <- lm(Petal.Area ~ Species, data = iris)
summary(model)
# 
# Call:
# lm(formula = Petal.Area ~ Species, data = iris)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -3.796 -0.520 -0.066  0.671  4.574 
# 
# Coefficients:
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)         5.720      0.209    27.4   <2e-16 ***
# Speciessetosa      -5.355      0.296   -18.1   <2e-16 ***
# Speciesvirginica    5.576      0.296    18.9   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 1.48 on 147 degrees of freedom
# Multiple R-squared:  0.903,   Adjusted R-squared:  0.902 
# F-statistic:  683 on 2 and 147 DF,  p-value: <2e-16

Question 4

Manually creating dummy variables: identical results

iris$SpeciesSetosa <- (iris$Species == "setosa") * 1
iris$SpeciesVirginica <- (iris$Species == "virginica") * 1
summary(lm(Petal.Area ~ SpeciesSetosa + SpeciesVirginica, data = iris))
# 
# Call:
# lm(formula = Petal.Area ~ SpeciesSetosa + SpeciesVirginica, data = iris)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -3.796 -0.520 -0.066  0.671  4.574 
# 
# Coefficients:
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)         5.720      0.209    27.4   <2e-16 ***
# SpeciesSetosa      -5.355      0.296   -18.1   <2e-16 ***
# SpeciesVirginica    5.576      0.296    18.9   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 1.48 on 147 degrees of freedom
# Multiple R-squared:  0.903,   Adjusted R-squared:  0.902 
# F-statistic:  683 on 2 and 147 DF,  p-value: <2e-16

Post-hoc tests for factorial predictors

library(multcomp)
summary(glht(model, linfct = mcp(Species = "Tukey")))
# 
#    Simultaneous Tests for General Linear Hypotheses
# 
# Multiple Comparisons of Means: Tukey Contrasts
# 
# 
# Fit: lm(formula = Petal.Area ~ Species, data = iris)
# 
# Linear Hypotheses:
#                             Estimate Std. Error t value Pr(>|t|)    
# setosa - versicolor == 0      -5.355      0.296   -18.1   <2e-16 ***
# virginica - versicolor == 0    5.576      0.296    18.9   <2e-16 ***
# virginica - setosa == 0       10.931      0.296    37.0   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# (Adjusted p values reported -- single-step method)

Question 5

Interpretation

  • \(y = \beta_1 x_1 + \beta_2 x_2 + \beta_0 + \epsilon\)
#      (Intercept)    Speciessetosa Speciesvirginica 
#             5.72            -5.35             5.58
  • Petal.Area = -5.35 \(\times\) Speciessetosa + 5.58 \(\times\) Speciesvirginica + 5.72
    • Petal area for species setosa: -5.35 \(\times\) 1 + 5.58 \(\times\) 0 + 5.72 = 0.37
    • Petal area for species versicolor: -5.35 \(\times\) 0 + 5.58 \(\times\) 0 + 5.72 = 5.72
    • Petal area for species virginica: -5.35 \(\times\) 0 + 5.58 \(\times\) 1 + 5.72 = 11.3
iris$FittedPA <- round(fitted(model),2); unique(iris[,c("Species","FittedPA")])
#        Species FittedPA
# 1       setosa     0.37
# 51  versicolor     5.72
# 101  virginica    11.30

Combining categorical and numerical predictors

model3 <- lm(Petal.Area ~ Sepal.Length + Sepal.Width + Species, data = iris)
library(mosaic)
msummary(model3)  # more concise summary
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)        -5.056      1.146   -4.41  2.0e-05 ***
# Sepal.Length        1.399      0.221    6.34  2.7e-09 ***
# Sepal.Width         0.892      0.334    2.67   0.0085 ** 
# Speciessetosa      -4.641      0.439  -10.57  < 2e-16 ***
# Speciesvirginica    4.482      0.264   17.00  < 2e-16 ***
# 
# Residual standard error: 1.17 on 145 degrees of freedom
# Multiple R-squared:  0.94,    Adjusted R-squared:  0.939 
# F-statistic:  571 on 4 and 145 DF,  p-value: <2e-16
confint(model3)  # 95% confidence intervals
#                     2.5 %  97.5 %
# (Intercept)      -7.32117 -2.7901
# Sepal.Length      0.96294  1.8351
# Sepal.Width       0.23141  1.5531
# Speciessetosa    -5.50889 -3.7728
# Speciesvirginica  3.96071  5.0025

Regression assumptions

  • Independent observations (✓)
  • Response at least interval-scaled (✓)
  • Relation between dependent and independent variables is linear
  • Homoscedasticity of variance (variability of residuals is constant over dependent and independent variables)
  • No strong multicollinearity
  • Residuals are not autocorrelated (mostly occurs for time series data)
  • Residuals are normally distributed

Evaluating linearity: component-residual plot

par(mfrow = c(1, 2))
crPlot(model3, var = "Sepal.Length")  # from library(car)
crPlot(model3, var = "Sepal.Width")  # suggests non-linear pattern (see lecture 5)

[figure: component-residual plots for Sepal.Length and Sepal.Width]

Homoscedasticity of error variance

plot(model3, which = 1)

[figure: residuals vs. fitted values for model3]

  • Clear heteroscedasticity (more variance for larger fitted values)

Non-constant variance test

ncvTest(model3)  # from library(car)
# Non-constant Variance Score Test 
# Variance formula: ~ fitted.values 
# Chisquare = 29.075    Df = 1     p = 6.9628e-08
  • Significant heteroscedasticity (\(p\)-value < 0.05)

Solution: power transformation

library(MASS)
boxcox(model3)  # shows optimal transformation at around 0.3

[figure: Box-Cox log-likelihood profile, peaking near \(\lambda = 0.3\)]

No heteroscedasticity anymore

altmodel3 <- lm(Petal.Area^0.3 ~ Sepal.Length + Sepal.Width + Species, data = iris)
msummary(altmodel3)
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)        0.8400     0.0899    9.35  < 2e-16 ***
# Sepal.Length       0.1005     0.0173    5.81  3.8e-08 ***
# Sepal.Width        0.0867     0.0262    3.31   0.0012 ** 
# Speciessetosa     -0.9174     0.0344  -26.64  < 2e-16 ***
# Speciesvirginica   0.3015     0.0207   14.59  < 2e-16 ***
# 
# Residual standard error: 0.0915 on 145 degrees of freedom
# Multiple R-squared:  0.975,   Adjusted R-squared:  0.975 
# F-statistic: 1.44e+03 on 4 and 145 DF,  p-value: <2e-16
ncvTest(altmodel3)
# Non-constant Variance Score Test 
# Variance formula: ~ fitted.values 
# Chisquare = 2.1766    Df = 1     p = 0.14013

No heteroscedasticity per explanatory variable

ncvTest(altmodel3, ~Sepal.Length)
# Non-constant Variance Score Test 
# Variance formula: ~ Sepal.Length 
# Chisquare = 1.4761    Df = 1     p = 0.22439
ncvTest(altmodel3, ~Sepal.Width)
# Non-constant Variance Score Test 
# Variance formula: ~ Sepal.Width 
# Chisquare = 2.1062    Df = 1     p = 0.1467
ncvTest(altmodel3, ~Species)
# Non-constant Variance Score Test 
# Variance formula: ~ Species 
# Chisquare = 2.3906    Df = 2     p = 0.30262

Assessing multicollinearity: variance inflation factors

car::vif(altmodel3)  # Preferably VIF < 5
#                GVIF Df GVIF^(1/(2*Df))
# Sepal.Length 3.6484  1          1.9101
# Sepal.Width  2.3215  1          1.5237
# Species      6.0044  2          1.5654
cor(iris$Sepal.Length, iris$Sepal.Width)  # Not caused by this pair
# [1] -0.11757

Relationship between species and sepal dimensions

msummary(lm(Sepal.Width ~ Species, data = iris))
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)        2.7700     0.0480   57.66   <2e-16 ***
# Speciessetosa      0.6580     0.0679    9.69   <2e-16 ***
# Speciesvirginica   0.2040     0.0679    3.00   0.0031 ** 
# 
# Residual standard error: 0.34 on 147 degrees of freedom
# Multiple R-squared:  0.401,   Adjusted R-squared:  0.393 
# F-statistic: 49.2 on 2 and 147 DF,  p-value: <2e-16
msummary(lm(Sepal.Length ~ Species, data = iris))  # Highest R^2: potentially problematic
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)        5.9360     0.0728   81.54  < 2e-16 ***
# Speciessetosa     -0.9300     0.1030   -9.03  8.8e-16 ***
# Speciesvirginica   0.6520     0.1030    6.33  2.8e-09 ***
# 
# Residual standard error: 0.515 on 147 degrees of freedom
# Multiple R-squared:  0.619,   Adjusted R-squared:  0.614 
# F-statistic:  119 on 2 and 147 DF,  p-value: <2e-16

Excluding a variable to reduce collinearity

# Exclude length (best choice)
AIC(m3 <- lm(Petal.Area^0.3 ~ Sepal.Width + Species, data = iris))
# [1] -255.34
# Or exclude species (worst choice)
AIC(m3b <- lm(Petal.Area^0.3 ~ Sepal.Width + Sepal.Length, data = iris))
# [1] 14.039
car::vif(m3)  # OK
#               GVIF Df GVIF^(1/(2*Df))
# Sepal.Width 1.6688  1          1.2918
# Species     1.6688  2          1.1366

Simplified model

summary(m3)
# 
# Call:
# lm(formula = Petal.Area^0.3 ~ Sepal.Width + Species, data = iris)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.26966 -0.06716  0.00023  0.05874  0.29275 
# 
# Coefficients:
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)        1.2130     0.0696   17.43  < 2e-16 ***
# Sepal.Width        0.1675     0.0246    6.81  2.4e-10 ***
# Speciessetosa     -1.0640     0.0259  -41.04  < 2e-16 ***
# Speciesvirginica   0.3506     0.0209   16.80  < 2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.101 on 146 degrees of freedom
# Multiple R-squared:  0.97,    Adjusted R-squared:  0.969 
# F-statistic: 1.56e+03 on 3 and 146 DF,  p-value: <2e-16

Autocorrelation in residuals

durbinWatsonTest(m3)  # from library(car)
#  lag Autocorrelation D-W Statistic p-value
#    1       -0.016722        2.0218    0.99
#  Alternative hypothesis: rho != 0
acf(resid(m3))  # autocorrelation only expected for time-series data

[figure: autocorrelation function of the residuals of m3]

Normally distributed residuals

shapiro.test(resid(m3))$p.value
# [1] 0.65129
plot(m3, which = 2)

[figure: normal Q-Q plot of the residuals of m3]

Question 6

Linear regression: diagnostics

Model building

  • With a clear hypothesis, you should only fit the model reflecting the hypothesis
  • For exploratory research (without a clear hypothesis), model comparison is necessary to determine the optimal model
    • Stepwise model comparison may be problematic, as the importance of additional predictors is assessed when other predictors are already in the model and some variables may be correlated
    • For exploratory models, I generally start from the simpler model and then build a more complex model
      • The advantage of this forward selection approach is that it also builds up your knowledge of the model
      • But backwards selection (starting with the most complex model) has lower risk of missing out on important variables
      • R contains functions to automatically evaluate a series of models
        • But remember: data analyst knows more than the computer

Automatic model fitting

complexmodel <- lm(Petal.Area^0.3 ~ Sepal.Width * Species, data = iris)
bestmodel <- step(complexmodel)
msummary(bestmodel)
#                              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                    0.9498     0.1236    7.68  2.2e-12 ***
# Sepal.Width                    0.2625     0.0444    5.92  2.3e-08 ***
# Speciessetosa                 -0.4505     0.1770   -2.55   0.0120 *  
# Speciesvirginica               0.4616     0.1788    2.58   0.0108 *  
# Sepal.Width:Speciessetosa     -0.1972     0.0576   -3.42   0.0008 ***
# Sepal.Width:Speciesvirginica  -0.0438     0.0619   -0.71   0.4799    
# 
# Residual standard error: 0.0974 on 144 degrees of freedom
# Multiple R-squared:  0.972,   Adjusted R-squared:  0.971 
# F-statistic: 1.01e+03 on 5 and 144 DF,  p-value: <2e-16

Visualization of interaction

plotSlopes(bestmodel, Sepal.Width, modx = Species)

[figure: effect of Sepal.Width on Petal.Area per species]

Interpretation

#      (Intercept) Sepal.Width Speciessetosa Speciesvirginica Sepal.Width:Speciessetosa
# [1,]        0.95        0.26         -0.45             0.46                      -0.2
#      Sepal.Width:Speciesvirginica
# [1,]                        -0.04
  • Petal.Area\(^{0.3}\) = 0.26 \(\times\) Sepal.Width + -0.45 \(\times\) Speciessetosa + 0.46 \(\times\) Speciesvirginica + -0.2 \(\times\) Sepal.Width \(\times\) Speciessetosa + -0.04 \(\times\) Sepal.Width \(\times\) Speciesvirginica + 0.95
    • Transformed petal area for species setosa with sepal width of 3.5: 0.26 \(\times\) 3.5 + -0.45 \(\times\) 1 + 0.46 \(\times\) 0 + -0.2 \(\times\) 3.5 \(\times\) 1 + -0.04 \(\times\) 3.5 \(\times\) 0 + 0.95 \(\approx\) 0.73 (with unrounded coefficients)
head(iris[, c("Sepal.Length", "Sepal.Width", "Species", "FittedPA")], 1)
#   Sepal.Length Sepal.Width Species FittedPA
# 1          5.1         3.5  setosa  0.72783

Same model, but different questions answered

similarmodel <- lm(Petal.Area^0.3 ~ Species + Sepal.Width:Species, data = iris)
msummary(similarmodel)
#                               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                     0.9498     0.1236    7.68  2.2e-12 ***
# Speciessetosa                  -0.4505     0.1770   -2.55    0.012 *  
# Speciesvirginica                0.4616     0.1788    2.58    0.011 *  
# Speciesversicolor:Sepal.Width   0.2625     0.0444    5.92  2.3e-08 ***
# Speciessetosa:Sepal.Width       0.0653     0.0367    1.78    0.077 .  
# Speciesvirginica:Sepal.Width    0.2187     0.0432    5.07  1.2e-06 ***
# 
# Residual standard error: 0.0974 on 144 degrees of freedom
# Multiple R-squared:  0.972,   Adjusted R-squared:  0.971 
# F-statistic: 1.01e+03 on 5 and 144 DF,  p-value: <2e-16
AIC(similarmodel) - AIC(bestmodel)
# [1] 0
  • Effect of sepal width now for each species, rather than in comparison with the coefficient of sepal width for the reference level

Multiple regression: visualization

Model criticism

  • Idea: check outlier characteristics and refit model with large outliers excluded to verify your effects are not "carried" by these outliers
    • Important: no a priori exclusion of outliers without a clear reason
      • A good reason is not that the value is over 2.5 SD above the mean
      • A good reason is, for example, that the response is faster than physically possible (so it must have been a lucky guess)
    • See Baayen (2008)

Distribution of residuals

plot(bestmodel, which = 2)  # some outliers in residuals

[figure: normal Q-Q plot of the residuals of bestmodel]

Trimmed model

iris2 <- iris[abs(scale(resid(bestmodel))) < 2, ]  # exclude outliers
(noutliers <- sum(abs(scale(resid(bestmodel))) >= 2))  # 7 outliers
# [1] 7
bestmodel2 <- lm(Petal.Area^0.3 ~ Sepal.Width * Species, data = iris2)
msummary(bestmodel2)  # only small changes in p-values: same pattern as before
#                              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                    0.7921     0.1096    7.23  3.1e-11 ***
# Sepal.Width                    0.3139     0.0391    8.04  3.8e-13 ***
# Speciessetosa                 -0.3735     0.1552   -2.41  0.01745 *  
# Speciesvirginica               0.5412     0.1556    3.48  0.00068 ***
# Sepal.Width:Speciessetosa     -0.2268     0.0505   -4.49  1.5e-05 ***
# Sepal.Width:Speciesvirginica  -0.0712     0.0537   -1.33  0.18710    
# 
# Residual standard error: 0.082 on 137 degrees of freedom
# Multiple R-squared:  0.98,    Adjusted R-squared:  0.98 
# F-statistic: 1.37e+03 on 5 and 137 DF,  p-value: <2e-16

Checking residuals of the trimmed model

par(mfrow = c(1, 2))
plot(bestmodel, which = 2, main = "original model")
plot(bestmodel2, which = 2, main = "trimmed model")

[figure: normal Q-Q plots of the original and trimmed models]

Question 7

Alternative to identifying outliers: influence plot

# Format of influence plot:
# x-axis: hat values (leverage) - potential of an observation to influence the fitted values
# y-axis: standardized residuals
# circle size: Cook's distance - effect of removing an observation on coefficients & fitted values
outliers <- influencePlot(bestmodel) # from library(car)

[figure: influence plot of bestmodel]

Checking for overfitting

library(rms)
bm.val <- ols(Petal.Area^0.3 ~ Sepal.Width * Species, data = iris, x = TRUE, y = TRUE)  # specify model as ols
validate(bm.val, B = 1000)
#           index.orig training   test optimism index.corrected    n
# R-square      0.9724   0.9732 0.9712   0.0020          0.9703 1000
# MSE           0.0091   0.0088 0.0095  -0.0008          0.0099 1000
# g             0.6208   0.6172 0.6206  -0.0034          0.6241 1000
# Intercept     0.0000   0.0000 0.0017  -0.0017          0.0017 1000
# Slope         1.0000   1.0000 0.9987   0.0013          0.9987 1000
  • Large optimism values indicate overfitting
  • Slope optimism should be below 0.05: OK

Bootstrap sampling

bestmodel.boot <- Boot(bestmodel, R = 1000)  # from library(car)
summary(bestmodel.boot)
#                                 R original  bootBias bootSE bootMed
# (Intercept)                  1000   0.9498 -0.004888 0.1335  0.9468
# Sepal.Width                  1000   0.2625  0.001803 0.0456  0.2641
# Speciessetosa                1000  -0.4505  0.001771 0.1967 -0.4409
# Speciesvirginica             1000   0.4616  0.006549 0.1854  0.4639
# Sepal.Width:Speciessetosa    1000  -0.1972 -0.000931 0.0625 -0.1985
# Sepal.Width:Speciesvirginica 1000  -0.0438 -0.002436 0.0616 -0.0454
confint(bestmodel.boot, type = "norm")
# Bootstrap quantiles, type =  normal 
# 
#                                  2.5 %    97.5 %
# (Intercept)                   0.693062  1.216255
# Sepal.Width                   0.171389  0.350029
# Speciessetosa                -0.837843 -0.066748
# Speciesvirginica              0.091734  0.818301
# Sepal.Width:Speciessetosa    -0.318811 -0.073731
# Sepal.Width:Speciesvirginica -0.162137  0.079332

Effect size of individual predictors (\(\eta^2_p\))

library(lmSupport)
modelEffectSizes(bestmodel)
# lm(formula = Petal.Area^0.3 ~ Sepal.Width * Species, data = iris)
# 
# Coefficients
#                        SSR df pEta-sqr dR-sqr
# (Intercept)         0.5602  1   0.2907     NA
# Sepal.Width         0.3325  1   0.1956 0.0067
# Species             0.2415  2   0.1501 0.0049
# Sepal.Width:Species 0.1304  2   0.0871 0.0026
# 
# Sum of squared errors (SSE): 1.4
# Sum of squared total  (SST): 49.5

Reporting results (example)

  • In our exploratory analysis, we determined the optimal model via model comparison on the basis of AIC. To counter heteroscedasticity, we transformed the dependent variable using a power transformation (\(\lambda = 0.3\), determined using a Box-Cox profile-likelihood procedure). After determining the optimal model specification, we applied model criticism by excluding all observations with absolute standardized residuals larger than 2 (2% of the observations; see Baayen, 2008). The final model (see Table 1) reflects the results on the basis of this trimmed dataset. Table 1 shows that the variables of interest are significant (all \(|t|\)'s > 2, \(p\) < 0.05), together with their effect sizes (\(\eta^2_p\)). The interpretation of the effects is as follows: ... [take note of the interpretation of interactions]. The residuals followed a normal distribution and showed neither autocorrelation nor heteroscedasticity. Table 1 also shows the 95% confidence intervals, which were determined via bootstrap sampling. The adjusted \(R^2\) of the model was 0.75.

Logistic regression

  • Dependent variable is binary (1: success, 0: failure), not continuous
  • Transform to continuous variable via log odds: \(\log(\frac{p}{1-p}) = \textrm{logit}(p)\)
    • Done automatically in regression by using glm with family="binomial" or using lrm
    • Transformation of dependent variable: generalized linear model
  • Interpret coefficients w.r.t. success as logits; convert back to probabilities in R with plogis(x)

[figure: the logistic function mapping logits to probabilities]
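
A small numerical illustration of the logit and its inverse (using the base-R functions qlogis and plogis):

p <- 0.8
qlogis(p)  # logit: log(0.8/0.2) = log(4), about 1.39
plogis(qlogis(p))  # inverse logit: back to 0.8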

Logistic regression example

m <- glm(SpeciesVirginica ~ Petal.Area, data = iris, family = "binomial")  # dependent variable equals 0 or 1
msummary(m)
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)  -19.492      5.241   -3.72  0.00020 ***
# Petal.Area     2.440      0.665    3.67  0.00024 ***
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 190.954  on 149  degrees of freedom
# Residual deviance:  22.333  on 148  degrees of freedom
# AIC: 26.33
# 
# Number of Fisher Scoring iterations: 9

Alternative function (more diagnostics)

(malt <- lrm(SpeciesVirginica ~ Petal.Area, data = iris))  # from library(rms)
# 
# Logistic Regression Model
# 
# lrm(formula = SpeciesVirginica ~ Petal.Area, data = iris)
#                       Model Likelihood     Discrimination    Rank Discrim.    
#                          Ratio Test            Indexes          Indexes       
# Obs           150    LR chi2     168.62    R2       0.938    C       0.997    
#  0            100    d.f.             1    g       13.075    Dxy     0.994    
#  1             50    Pr(> chi2) <0.0001    gr  476765.314    gamma   0.995    
# max |deriv| 7e-08                          gp       0.444    tau-a   0.445    
#                                            Brier    0.025                     
# 
#            Coef     S.E.   Wald Z Pr(>|Z|)
# Intercept  -19.4916 5.2409 -3.72  0.0002  
# Petal.Area   2.4401 0.6648  3.67  0.0002
  • Index of concordance \(C\) is a good measure of discrimination (0.5: no discrimination; 0.7-0.8: acceptable; 0.8-0.9: excellent; >0.9 outstanding)

Interpreting logit coefficients I

# probability that an iris with a
# petal area of 9 is a
# Virginica iris
logit = coef(m)["(Intercept)"] + 
    9 * coef(m)["Petal.Area"]
logit
# [1] 2.4692
plogis(logit)
# [1] 0.92195

[figure: fitted logistic curve with petal area of 9 marked]

Interpreting logit coefficients II

# probability that an iris with a
# petal area of 7 is a
# Virginica iris
logit = coef(m)["(Intercept)"] + 
    7 * coef(m)["Petal.Area"]
logit
# [1] -2.411
plogis(logit)
# [1] 0.08234

[figure: fitted logistic curve with petal area of 7 marked]

Logistic regression assumptions

  • Independent observations (✓)
  • Relation between logit and quantitative predictors is linear
    • Function crPlot from library(car) can be used to assess this
  • No strong multicollinearity between predictors
    • Function vif from library(car) can be used to assess this
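
For example (a sketch reusing the logistic model m fitted earlier; the two-predictor model m2 is introduced here only because vif() requires at least two predictors):

crPlot(m, var = "Petal.Area")  # from library(car): linearity of the logit in Petal.Area
m2 <- glm(SpeciesVirginica ~ Sepal.Length + Sepal.Width, data = iris, family = "binomial")
car::vif(m2)  # preferably < 5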

Final remarks about logistic regression

  • Overfitting can be evaluated using the function validate() from the rms package, applied to a model fitted with lrm()
  • influencePlot() can be applied to a fitted glm model
  • Instead of having a dependent variable with value 0 or 1, it is also possible to have a dependent variable which contains the number of successes and the number of failures
    • For example, rather than having a column "correct" in a dataframe which indicates if someone answered a specific question correctly (1) or not (0), it may contain the number of correctly and incorrectly answered questions on the whole test in two columns "ncorrect" and "nincorrect"
    • The model specification then becomes (sketched below):
      glm(cbind(ncorrect,nincorrect) ~ ...) as opposed to
      glm(correct ~ ...)
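
A minimal sketch with made-up data (the columns ncorrect and nincorrect are the hypothetical counts from the example above):

d <- data.frame(ncorrect = c(18, 12, 15, 9), nincorrect = c(2, 8, 5, 11), age = c(21, 35, 28, 60))
m.agg <- glm(cbind(ncorrect, nincorrect) ~ age, data = d, family = "binomial")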

Question 8

Alternative regression models

  • If the dependent variable consists of count data, Poisson regression is appropriate
    • This is not covered in this course, but you can fit these types of models with glm() using family='poisson' (a minimal sketch follows after this list)
  • If the dependent variable has more than two levels, multinomial (polytomous) logistic regression can be used
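
A minimal sketch with made-up count data (variables invented for illustration):

d <- data.frame(nerrors = c(0, 2, 1, 5, 3), hours.awake = c(8, 14, 10, 20, 16))
m.pois <- glm(nerrors ~ hours.awake, data = d, family = "poisson")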

Recap

  • In this lecture, we've covered:
    • Correlation
    • Multiple regression
    • Logistic regression
  • Your turn to try!
  • Next lecture: mixed-effects regression

Evaluation

Questions?

Thank you for your attention!

http://www.martijnwieling.nl
m.b.wieling@rug.nl