Quantify relation between two numerical variables (interval or ratio scale)
\(-1 \leq r \leq 1\) indicates strength (effect size) and direction
Obtain significance of correlation
data(iris) # load iris data set
cor(iris$Sepal.Length, iris$Petal.Length) # provides r
[1] 0.87175
cor.test(iris$Sepal.Length, iris$Petal.Length) # provides statistical test
Pearson's product-moment correlation
data: x and y
t = 21.6, df = 148, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.82704 0.90551
sample estimates:
cor
0.87175
Correlation for ordinal data: Spearman \(\rho\)
# Also when residuals not normally distributed
cor.test(iris$Sepal.Length, iris$Petal.Length, method='spearman')
Spearman's rank correlation rho
data: x and y
S = 66429, p-value <2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8819
# Similar result
cor.test(rank(iris$Sepal.Length), rank(iris$Petal.Length))$estimate
cor
0.8819
Reporting results
The Pearson correlation between the sepal length and petal length was positive with \(r = 0.87\), \(df = 148\), \(p_{two-tailed} < 0.001\).
Note that the number of degrees of freedom for a correlation is: \(N - 2\)
To assess relationship between numerical dependent variable and one (simple regression) or more (multiple regression) quantitative or categorical predictor variables
Measures impact of each individual variable on dependent variable, while controlling for other variables in the model
Note that regression is equivalent to ANOVA, but the focus is different: relation between numerical variables vs. group comparisons
Linear regression: formula
Linear regression captures a linear relationship between dependent variable (DV) and independent variable(s) (IV)
Regression formula: \(y = \beta_0 + \beta_1 x + \epsilon\)
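As a minimal illustration of this formula in R (the model name m.simple is only used for this sketch; it predicts petal length from sepal length, as in the correlation example above):
m.simple <- lm(Petal.Length ~ Sepal.Length, data=iris) # beta_0: intercept, beta_1: slope
coef(m.simple) # estimated intercept and slope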
Interpretation of intercept with centered predictor
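The coefficients below can be reproduced along these lines (a sketch: Petal.Area is a derived petal-area variable assumed to be constructed on an earlier slide, and m0.c is the model name used in the AIC comparison further down):
iris$Sepal.Length.c <- iris$Sepal.Length - mean(iris$Sepal.Length) # centered predictor
m0.c <- lm(Petal.Area ~ Sepal.Length.c, data=iris)
coef(m0.c)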
(Intercept) Sepal.Length.c
2.90 2.44
For a centered sepal length of 0 (an average sepal length of about 5.8), the predicted petal area is: 2.44 \(\times\) 0 + 2.9 = 2.9
The estimate for the intercept is now more natural: it reflects the value of the dependent variable (petal area) when the predictor (sepal length) is at its average value!
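The multiple regression summarized below adds centered sepal width; a sketch (msummary() is from library(mosaic), as used later in these slides; summary() shows the same coefficient table):
iris$Sepal.Width.c <- iris$Sepal.Width - mean(iris$Sepal.Width) # centered predictor
m1 <- lm(Petal.Area ~ Sepal.Length.c + Sepal.Width.c, data=iris)
msummary(m1)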
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8970 0.0928 31.23 < 2e-16 ***
Sepal.Length.c 2.3757 0.1132 20.99 < 2e-16 ***
Sepal.Width.c -1.0285 0.2150 -4.78 4.1e-06 ***
Residual standard error: 1.14 on 147 degrees of freedom
Multiple R-squared: 0.771, Adjusted R-squared: 0.768
F-statistic: 247 on 2 and 147 DF, p-value: <2e-16
Note that we use centered variables to facilitate interpretation: when both sepal length and width are at their average value, the predicted petal area equals 2.897
For a centered sepal length of about -0.7 (sepal length of 5.1) and a centered sepal width of about 0.4 (sepal width of 3.5), the predicted (fitted) petal area is: 2.38 \(\times\) -0.7 + -1.03 \(\times\) 0.4 + 2.9 = 0.68
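The model comparison table below contrasts the single-predictor model with the two-predictor model; a sketch assuming the model names m0.c and m1 from the AIC comparison further down:
anova(m0.c, m1) # does adding Sepal.Width.c significantly reduce the residual sum of squares?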
Analysis of Variance Table
Model 1: Petal.Area ~ Sepal.Length.c
Model 2: Petal.Area ~ Sepal.Length.c + Sepal.Width.c
Res.Df RSS Df Sum of Sq F Pr(>F)
1 148 219
2 147 190 1 29.5 22.9 4.1e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m0.c) - AIC(m1) # AIC decrease >= 2: more complex model should be used
[1] 19.703
Evidence ratio: the model with lower AIC is \(e^{\frac{\Delta AIC}{2}}\) times more likely than the model with higher AIC (\(e^{9.85} \approx\) 19,000)
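The evidence ratio can be checked directly, and the next comparison adds the interaction between the two predictors; a sketch (m2 is the model name used in the AIC comparison below):
exp((AIC(m0.c) - AIC(m1)) / 2) # evidence ratio, approximately 19,000
m2 <- lm(Petal.Area ~ Sepal.Length.c * Sepal.Width.c, data=iris) # includes the interaction
anova(m1, m2)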
Analysis of Variance Table
Model 1: Petal.Area ~ Sepal.Length.c + Sepal.Width.c
Model 2: Petal.Area ~ Sepal.Length.c * Sepal.Width.c
Res.Df RSS Df Sum of Sq F Pr(>F)
1 147 190
2 146 180 1 9.82 7.97 0.0054 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m1) - AIC(m2) # AIC decrease >= 2: more complex model should be used
[1] 5.9696
The more complex model is necessary (\(e^\frac{6}{2} \approx\) 20 times as likely)
Model comparison when including an interaction
Note that a model with an interaction term (A:B) should be compared to the best model without that term
This is the model including A + B, if both terms are significant (determined via model comparison)
This is the model including A, if A is significant and B is not
This is the model including B, if B is significant and A is not
This is the model without A and B, if neither is significant
It is important to get this comparison right: when neither A nor B is significant, the interaction model may appear to improve on the model including A + B, while it is not an improvement over the model without A and B
In that case one should stick with the model without A, B, or their interaction
For a centered sepal length of about -0.7 (sepal length of 5.1) and a centered sepal width of about 0.4 (sepal width of 3.5), the predicted (fitted) petal area is: 2.33 \(\times\) -0.7 + -0.87 \(\times\) 0.4 + 0.84 \(\times\) -0.7 \(\times\) 0.4 + 2.93 = 0.53
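The summary below is for a model that adds Species as a categorical predictor; a sketch (the coefficient names suggest versicolor is the reference level, so the releveling here is an assumption; m5 is the name used in the diagnostics further down):
iris$Species <- relevel(iris$Species, ref='versicolor') # assumed coding: versicolor as reference level
m5 <- lm(Petal.Area ~ Sepal.Width.c + Species, data=iris)
msummary(m5)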
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.150 0.104 30.38 < 2e-16 ***
Sepal.Width.c 1.008 0.160 6.32 3.1e-09 ***
Speciessetosa -3.341 0.168 -19.85 < 2e-16 ***
Speciesvirginica 2.582 0.135 19.06 < 2e-16 ***
Residual standard error: 0.657 on 146 degrees of freedom
Multiple R-squared: 0.924, Adjusted R-squared: 0.922
F-statistic: 589 on 3 and 146 DF, p-value: <2e-16
Note that the estimate for Sepal.Width.c is now positive: including Species controls for the fact that setosa irises have the highest sepal width but the lowest petal area
Homoscedasticity of error variance
plot(fitted(m5), resid(m5))
Clear heteroscedasticity (more variance for larger fitted values)
Non-constant variance test
ncvTest(m5) # from library(car)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 32.18, Df = 1, p = 1.41e-08
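Given this clear heteroscedasticity, a common remedy is a power transformation of the dependent variable; a sketch of how the model summarized below may have been obtained (\(\lambda = 0.3\), as mentioned in the reporting example later; Petal.Area.0.3 and m6 are the names used on these slides, while Petal.Area itself is assumed to exist):
iris$Petal.Area.0.3 <- iris$Petal.Area^0.3 # power-transformed dependent variable
m6 <- lm(Petal.Area.0.3 ~ Sepal.Width.c + Species, data=iris)
msummary(m6)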
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4012 0.0130 108.01 < 2e-16 ***
Sepal.Width.c 0.1360 0.0200 6.81 2.4e-10 ***
Speciessetosa -0.8642 0.0211 -41.04 < 2e-16 ***
Speciesvirginica 0.2848 0.0169 16.80 < 2e-16 ***
Residual standard error: 0.0823 on 146 degrees of freedom
Multiple R-squared: 0.97, Adjusted R-squared: 0.969
F-statistic: 1.56e+03 on 3 and 146 DF, p-value: <2e-16
ncvTest(m6)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.48356, Df = 1, p = 0.487
No heteroscedasticity per explanatory variable
ncvTest(m6, ~ Sepal.Width.c)
Non-constant Variance Score Test
Variance formula: ~ Sepal.Width.c
Chisquare = 0.63821, Df = 1, p = 0.424
ncvTest(m6, ~ Species)
Non-constant Variance Score Test
Variance formula: ~ Species
Chisquare = 0.31587, Df = 2, p = 0.854
Normally distributed residuals
shapiro.test(resid(m6))$p.value
[1] 0.65129
qqnorm(resid(m6))
qqline(resid(m6))
Question 5
Linear regression: diagnostics
Model building and hypothesis testing
With a clear hypothesis, you should only fit the model reflecting the hypothesis
For exploratory research (without a clear hypothesis), model comparison is necessary to determine the optimal model
Stepwise model comparison may be problematic, as the importance of additional predictors is assessed when other predictors are already in the model and some variables may be correlated
A hypothesis-testing analysis may also be followed by an exploratory analysis, but make this explicit in your report
My approach
For exploratory models, I generally start from the simpler model and then build a more complex model
The advantage of this forward selection approach is that it also builds up your knowledge of the model
But backwards selection (starting with the most complex model) has a lower risk of missing out on important variables
In practice, I tend to combine both approaches by also assessing if variables excluded at an earlier stage need to be included after other variables have been added
Best model
Using model comparison in an exploratory analysis, we obtain the best model, which only includes the two selected predictors (referred to as bestmodel below)
\(\eta^2_p\): 0.01 (small), 0.06 (medium), > 0.14 (large); however, I generally tend to interpret effect sizes in terms of actual influence on the dependent variable
The effects of individual predictors can also be compared directly when all numerical predictors are \(z\)-transformed
The effect size of the full model is the adjusted \(R^2\): 0.9714 (shown in model summary)
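A sketch of how \(\eta^2_p\) per predictor can be computed from the model's ANOVA decomposition (assuming the fitted model object is called bestmodel, the name used further down; note that anova() uses sequential sums of squares):
aov_tab <- anova(bestmodel)
ss <- aov_tab[['Sum Sq']]
eta_p2 <- ss[-length(ss)] / (ss[-length(ss)] + ss[length(ss)]) # SS_effect / (SS_effect + SS_residual)
names(eta_p2) <- rownames(aov_tab)[-nrow(aov_tab)]
round(eta_p2, 2)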
Intermezzo: visualizing a three-way interaction
tmp <- lm(Petal.Area.0.3 ~ Species * Sepal.Length.c * Sepal.Width.c, data=iris) # model used for demonstration purposes
par(mfrow=c(1,3))
visreg2d(tmp, 'Sepal.Length.c', 'Sepal.Width.c', cond=list(Species='versicolor'), plot.type='persp', main='versicolor', nn=50)
visreg2d(tmp, 'Sepal.Length.c', 'Sepal.Width.c', cond=list(Species='setosa'), plot.type='persp', main='setosa', nn=50)
visreg2d(tmp, 'Sepal.Length.c', 'Sepal.Width.c', cond=list(Species='virginica'), plot.type='persp', main='virginica', nn=50)
Model criticism for final model
Idea: check outlier characteristics and refit model with large outliers excluded to verify your effects are not “carried” by these outliers
Important: no a priori exclusion of outliers without a clear reason
A good reason is not that the value is over 2.5 SD above the mean
A good reason would be, for example, that a response is faster than physically possible (so it must have been a lucky guess)
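A sketch of the trimming step used below (the 2-SD threshold follows the reporting example later in these slides; iris2 and bestmodel are the names used on these slides):
std_res <- scale(resid(bestmodel))[, 1] # standardized residuals of the untrimmed model
iris2 <- iris[abs(std_res) < 2, ] # exclude observations with |standardized residual| >= 2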
bestmodel2 <- lm(Petal.Area.0.3 ~ Sepal.Width.c * Species, data=iris2)
msummary(bestmodel2) # only small changes in p-values: same pattern as before
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4228 0.0129 110.25 < 2e-16 ***
Sepal.Width.c 0.2549 0.0317 8.04 3.8e-13 ***
Speciessetosa -0.8666 0.0186 -46.50 < 2e-16 ***
Speciesvirginica 0.2629 0.0162 16.24 < 2e-16 ***
Sepal.Width.c:Speciessetosa -0.1842 0.0410 -4.49 1.5e-05 ***
Sepal.Width.c:Speciesvirginica -0.0578 0.0436 -1.33 0.19
Residual standard error: 0.0666 on 137 degrees of freedom
Multiple R-squared: 0.98, Adjusted R-squared: 0.98
F-statistic: 1.37e+03 on 5 and 137 DF, p-value: <2e-16
Distribution of residuals before and after model criticism
Question 6
Alternative to identifying outliers: influence plot
# x-axis: hat values (leverage) - the potential influence of an observation on its own fitted value
# y-axis: standardized residuals
# circle size: Cook's distance - effect of removing the observation on coefficients & fitted values
# influential outliers: observations with large absolute residuals and large Cook's distance
outliers <- influencePlot(bestmodel) # from library(car)
Validation: bootstrap sampling
summary(bestmodel.boot <- Boot(bestmodel2, R=1000)) # from library(car), 1000 bootstrap samples
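The bootstrap object can also be used to extract confidence intervals for the coefficients (car provides a confint() method for objects returned by Boot(); the call below is a sketch using its defaults):
confint(bestmodel.boot) # bootstrap confidence intervals for the coefficients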
In our exploratory analysis, we determined the optimal model via model comparison on the basis of AIC. To counter heteroscedasticity, we transformed the dependent variable using a power transformation (\(\lambda = 0.3\), determined using a Box-Cox profile-likelihood procedure). After determining the optimal model specification, we applied model criticism by excluding all observations with absolute residuals larger than 2 standard deviations above the mean (2% of the observations; see Baayen, 2008). The final model (see Table 1) reflects the results on the basis of this trimmed dataset. Table 1 shows that the variables of interest are significant at a significance threshold \(\alpha\) of 0.05 (all \(|t|\)'s > 2, \(p\) < 0.05), together with their effect sizes (\(\eta^2_p\)). The interpretation of the effects is as follows: … [take note of the interpretation of interactions]. The residuals followed a normal distribution and showed no autocorrelation or heteroscedasticity. Table 1 also shows the 95% confidence intervals, which were determined via bootstrap sampling. The adjusted \(R^2\) of the model was 0.98.
Logistic regression
Dependent variable is binary (1: success, 0: failure), not continuous
Transform to continuous variable via log odds: \(\log(\frac{p}{1-p})\) = logit\((p)\)
This happens automatically in R by using glm() (or lrm() from library(rms)) with family="binomial"
Transforming the dependent variable in this way yields a generalized linear model
Coefficients are interpreted as logits (log odds of success); in R, a logit x can be converted back to a probability with plogis(x)
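A sketch of the model behind the output below (the binary variable SpeciesVirginica also appears in the lrm() call further down, but its construction here and the model name m.log are assumptions):
iris$SpeciesVirginica <- as.numeric(iris$Species == 'virginica') # 1: virginica, 0: other species
m.log <- glm(SpeciesVirginica ~ Petal.Area, data=iris, family='binomial')
plogis(predict(m.log, newdata=data.frame(Petal.Area=4))) # convert the predicted logit to a probability
summary(m.log)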
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -19.49 5.24 -3.72 0.00020 ***
Petal.Area 4.88 1.33 3.67 0.00024 ***
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 190.954 on 149 degrees of freedom
Residual deviance: 22.333 on 148 degrees of freedom
AIC: 26.33
Number of Fisher Scoring iterations: 9
Logistic Regression Model
lrm(formula = SpeciesVirginica ~ Petal.Area, data = iris)
                       Model Likelihood     Discrimination    Rank Discrim.
                             Ratio Test            Indexes          Indexes
Obs           150    LR chi2     168.62     R2        0.938    C        0.997
 0            100    d.f.             1     R2(1,150) 0.673    Dxy      0.994
 1             50    Pr(> chi2) <0.0001     R2(1,100) 0.813    gamma    0.995
max |deriv| 3e-08                           Brier     0.025    tau-a    0.445
Coef S.E. Wald Z Pr(>|Z|)
Intercept -19.4916 5.2409 -3.72 0.0002
Petal.Area 4.8802 1.3296 3.67 0.0002
Index of concordance \(C\) is a good measure of discrimination (0.5: no discrimination; 0.7 - 0.8: acceptable; 0.8 - 0.9: excellent; > 0.9 outstanding)
Logistic regression assumptions
Independent observations (✓)
Relation between logit and quantitative predictors is linear
Function crPlot from library(car) can be used to assess this
No strong multicollinearity between predictors
Function vif from library(car) can be used to assess this
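A sketch of these two checks, using the hypothetical m.log model from the earlier sketch:
library(car)
crPlot(m.log, variable='Petal.Area') # component + residual plot: is the logit linear in the predictor?
# vif(m.log) # variance inflation factors; only meaningful with two or more predictors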
Final remarks about logistic regression
Overfitting can be evaluated with bootstrap analysis
Outliers can be identified with the function influencePlot()
Instead of having a dependent variable with value 0 or 1, a dependent variable containing the number of successes and the number of failures can be used
For example, rather than having a column “correct” in a dataframe which indicates if someone answered a specific question correctly (1) or not (0), it may contain the number of correctly and incorrectly answered questions on the whole text in two columns “ncorrect” and “nincorrect”
The model specification then becomes: glm(cbind(ncorrect,nincorrect) ~ …) as opposed to glm(correct ~ …)
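A minimal sketch of this aggregated specification with hypothetical data (all column names and values below are made up):
d <- data.frame(ncorrect = c(8, 5, 9, 4), nincorrect = c(2, 5, 1, 6), age = c(21, 34, 28, 45))
m.agg <- glm(cbind(ncorrect, nincorrect) ~ age, data=d, family='binomial')
summary(m.agg)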
Question 7
Alternative regression models
If the dependent variable consists of count data, Poisson regression is necessary
This is not covered in this lecture, but you can fit these types of models with glm() using family="poisson"
If the dependent variable has more than two levels, multinomial (polytomous) logistic regression may be used
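A minimal sketch of the Poisson call pattern mentioned above (the data frame and variable names are hypothetical):
d <- data.frame(errors = c(0, 2, 1, 4, 3), age = c(21, 34, 28, 45, 52)) # hypothetical count data
m.pois <- glm(errors ~ age, data=d, family='poisson')
summary(m.pois)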