Statistiek I

Introduction to linear regression

Martijn Wieling

Question 1: last lecture

Last lecture

  • How to reason about the population using a sample (CLT)
  • Calculating the standard error (\(SE\))
    • Standard error: used when reasoning about the population using a sample
    • Standard deviation: used when comparing an individual to the population
  • Calculating a confidence interval
  • Specifying a concrete testable hypothesis based on a research question
  • Specifying the null (\(H_0\)) and alternative hypothesis (\(H_a\))
  • Conducting a \(z\)-test and using the results to evaluate a hypothesis
  • Definition of a \(p\)-value: probability of obtaining data at least as extreme as the observed data, given that \(H_0\) is true
  • Evaluating the statistical significance given \(p\)-value and \(\alpha\)-level
  • Difference between a one-tailed and a two-tailed test
  • Type I and II errors

This lecture

  • Introduction to data used in this lecture
  • Correlation as a descriptive statistic
  • Introduction to simple linear regression
  • Effect size
  • How to report on a regression analysis

Introduction to data used in this lecture

Lowlands Science 2018 data

  • RQ: Does alcohol (blood alcohol concentration, BAC) influence L1 and L2 language proficiency differently?
    • Hypothesis (alternative): BAC has a stronger negative effect on L1 than on L2
      • Note: not possible to evaluate in this lecture (but later in this course)
      • Instead, we assess here whether there is a negative effect of BAC on L1 only
  • BAC measurements and L1 and L2 speech samples were collected for about 90 native Dutch speakers
  • Dutch speech samples were rated by native Dutch speakers (at Lowlands)
  • English speech samples were rated by native American English speakers (online)
  • For each speaker, the ratings were averaged per language
  • Final ratings were \(z\)-transformed (subtract mean and divide by standard deviation)
  • Other variables we collected included sex, education level, etc.
  • We evaluate the influence of BAC on L1 proficiency for 43 speakers with BAC > 0

Relation between two numerical variables

  • We can visualize this relationship in a scatter plot
plot(lls$BAC, lls$ratingNL, xlab='BAC', ylab='rating (NL)')

Quantifying the relationship descriptively

  • We can quantify the relationship between two variables \(x\) and \(y\) by calculating the Pearson correlation coefficient \(r\)
  • Values range from \(-1\) to \(1\) (\(-1 \leq r \leq 1\)) and indicate the strength and direction of the relationship
cor(lls$BAC, lls$ratingNL) # NA due to missing BAC values
[1] NA
sum(is.na(lls$BAC)) # the NA above is due to 4 missing BAC values in the data
[1] 4
lls <- lls[!is.na(lls$BAC),] # remove rows with missing BAC values

Correlation results

cor(lls$BAC, lls$ratingNL)
[1] -0.31931
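  • As a quick check (a sketch, not part of the lecture code): Pearson's \(r\) equals the covariance of the two variables divided by the product of their standard deviations
cov(lls$BAC, lls$ratingNL) / (sd(lls$BAC) * sd(lls$ratingNL)) # same value as cor()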

Question 2

Correlation demo: sensitivity to outliers

Question 3

(Cor)relation is not causation (1)

(Cor)relation is not causation (2)

  • Shoe size and reading ability are correlated
  • Is there a dependency between the two?
  • No: the hidden variable age affects both shoe size and reading ability!
  • Experiments are necessary to control for potential causes (and confounding factors)

From correlation to regression

  • In this course, we are using correlation only descriptively
  • We use linear regression to statistically assess the relationship between one or more independent variables (predictors) and a dependent variable
  • \(H_0\) for linear regression: no effect of independent variable(s) on dependent variable
  • We only focus on numerical variables in this lecture

Linear regression: formula

  • Linear regression captures a linear relationship between dependent variable (DV) and independent variable(s) (IV)
    • Regression formula \(y = \beta_0 + \beta_1 x + \epsilon\)
      • \(y\): dependent variable (= predicted or fitted values), \(x\): independent variable
      • \(\beta_0\): intercept (value of \(y\) when \(x = 0\)), \(\beta_1\): influence (slope) of \(x\) on \(y\)
      • \(\epsilon\): errors (residuals)
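  • A minimal sketch with simulated (hypothetical) data: generate \(y\) from known coefficients plus normal errors, and let lm() (introduced below) recover them
set.seed(1)                         # reproducible hypothetical example
x <- runif(100)                     # independent variable
y <- 2 + 3*x + rnorm(100, sd=0.5)   # beta0 = 2, beta1 = 3, normal errors
coef(lm(y ~ x))                     # estimates should be close to 2 and 3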

Question 4

Residuals: difference between actual and fitted values

  • Regression minimizes the residual error (the sum of the squared residuals)
    • Thereby determining the optimal coefficients for intercept and slope(s)
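  • A sketch of this idea: write the sum of squared errors as a function of candidate coefficients and minimize it numerically; the result approximates the coefficients lm() reports below
sse <- function(b) sum((lls$ratingNL - (b[1] + b[2]*lls$BAC))^2) # sum of squared errors
optim(c(0, 0), sse)$par # numerical minimum: approx. the lm() estimates below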

Linear regression: minimizing the residuals

Visualization is essential!

  • Linear regression is only suitable in the top-left situation

Fitting our first simple linear regression model in R

m <- lm(ratingNL ~ 1 + BAC, data=lls) # "1 +" (Intercept) is included by default
summary(m)

Call:
lm(formula = ratingNL ~ 1 + BAC, data = lls)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0653 -0.3496  0.0867  0.2953  1.0466 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.3497     0.0995    3.51   0.0011 **
BAC          -0.3505     0.1624   -2.16   0.0369 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.456 on 41 degrees of freedom
Multiple R-squared:  0.102, Adjusted R-squared:  0.0801 
F-statistic: 4.66 on 1 and 41 DF,  p-value: 0.0369

Visualizing the results

library(visreg) # load visreg package to visualize regression results
visreg(m, xlab = "BAC", ylab = "rating (NL)") # including 95% CI bands
abline(h=0, v=0, lty=2) # add x- and y-axis

Interpretation

summary(m)$coef # show coefficients only (saves space)
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.34970   0.099513  3.5141 0.0010909
BAC         -0.35049   0.162448 -2.1576 0.0368728
  • Summary shows \(\beta_\textrm{bac} =\) -0.35
    • For every increase of BAC by 1, ratingNL (the dependent variable) decreases by 0.35
  • Summary shows \(\beta_0\) (intercept) \(=\) 0.35
    • Someone with a BAC of 0 is expected to have a ratingNL of 0.35
  • Fitted (predicted) values of the model can be determined using the regression formula
    • \(y = \beta_0 + \beta_1 x \implies\) ratingNL = 0.35 + -0.35 * BAC

Using the regression formula

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.34970   0.099513  3.5141 0.0010909
BAC         -0.35049   0.162448 -2.1576 0.0368728
  • For BAC of \(0\) fitted ratingNL of: \(0.35 +\,\) \(-0.35\, *\) \(0\) \(=\) \(0.35\) (= Intercept)
  • For BAC of \(0.5\) fitted ratingNL of: \(0.35 +\,\) \(-0.35\, *\) \(0.5\) \(=\) \(0.175\)
  • For BAC of \(1.5\) fitted ratingNL of: \(0.35 +\,\) \(-0.35\, *\) \(1.5\) \(=\) \(-0.175\)
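  • These fitted values can also be obtained directly with predict() (a quick check):
predict(m, newdata=data.frame(BAC=c(0, 0.5, 1.5))) # matches the manual calculations above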

Towards a more meaningful intercept

summary(m)$coef # only show coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.35     0.0995    3.51  0.00109
BAC            -0.35     0.1624   -2.16  0.03687
  • The intercept indicates that people with a BAC of 0 have an expected L1 rating of 0.35
    • As sober people were excluded when fitting the model, this value is not very informative
  • To make it more meaningful, it is useful to center BAC by subtracting the mean value

Centering the independent variable

lls$BAC.c = lls$BAC - mean(lls$BAC) # center BAC
summary(mc <- lm(ratingNL ~ BAC.c, data=lls))$coef # results with centered BAC
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.196     0.0695    2.82  0.00736
BAC.c         -0.350     0.1624   -2.16  0.03687
  • The intercept now shows that people with an average BAC (BAC = 0.44, so BAC.c = 0) have an expected L1 rating of 0.196
    • This is a meaningful value in the context of our data
    • (Usually the intercept is not involved in hypotheses, however)
  • Note that the coefficient (\(\beta\)) and \(p\)-value of the slope did not change due to centering!
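  • A quick check (sketch): centering shifts the intercept, but the fitted values stay identical
all.equal(fitted(m), fitted(mc)) # TRUE: both models make the same predictions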

Question 5

Visual comparison: shifted \(y\)-axis

Creating a more meaningful slope

  • Interpretation of the slope: if BAC increases by 1, ratingNL decreases by 0.35
  • Not always easy to interpret as scales of independent variables may be very different
  • Standardization (\(z\)-transforming) helps interpretation (and facilitates comparison)
lls$BAC.z = (lls$BAC - mean(lls$BAC)) / sd(lls$BAC) # z-transform BAC values
summary(mz <- lm(ratingNL ~ BAC.z, data=lls))$coef # results with z-transformed BAC
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.196     0.0695    2.82  0.00736
BAC.z         -0.152     0.0704   -2.16  0.03687
  • Interpretation of the slope (for \(z\)-transformed BAC): if BAC increases by 1 standard deviation (\(\approx\) 0.43), the L1 rating decreases by 0.152
  • Note that the \(p\)-value of the independent variable remains unchanged
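  • A quick check (sketch): the standardized slope is the original slope multiplied by the standard deviation of BAC
coef(m)["BAC"] * sd(lls$BAC) # approx. -0.152, matching coef(mz)["BAC.z"]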

Visual comparison of the three models

visreg(m, xlab = "BAC (original)", ylab = "rating (NL)") 
visreg(mc, xlab = "BAC (centered)", ylab = "rating (NL)") # note the shifted y-axis
visreg(mz, xlab = "BAC (z-transformed)", ylab = "rating (NL)") # note the different scale

Testing hypotheses using a regression model

summary(mz)$coef # only show coefficients (model with BAC z-transformed)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.196     0.0695    2.82  0.00736
BAC.z         -0.152     0.0704   -2.16  0.03687
  • Slope coefficients are relevant for assessing hypotheses
  • Here: \(H_0\): \(\beta_\textrm{bac} = 0\), \(H_a\): \(\beta_\textrm{bac} < 0\)
  • Summary shows \(\beta_\textrm{bac} =\) -0.15
  • Associated two-sided \(p\)-value is 0.037
    • The hypothesis is one-sided and the coefficient is in line with \(H_a\), so the \(p\)-value \(= \frac{0.037}{2} \approx 0.0185\)
    • \(p\)-value < 0.05 (\(\alpha\)) \(\implies\) reject \(H_0\) and accept \(H_a\)
      • There is a significant negative effect of BAC on L1 pronunciation
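  • A sketch of this computation in R (halving the reported two-sided \(p\)-value):
summary(mz)$coef["BAC.z", "Pr(>|t|)"] / 2 # one-sided p-value, approx. 0.018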

Background: how are \(p\)-values determined in regression?

  • Last lecture: \(z\)-values can be used to determine a \(p\)-value
    • E.g., \(z \geq 2 \implies p \approx 0.025\)
    • But \(z\)-values can only be determined when \(\sigma\) is known
  • When \(\sigma\) is not known, we can estimate it using the sample standard deviation \(s\)
  • Using \(s\), we can determine the \(t\)-value, which is comparable to the \(z\)-value

Determining the \(t\)-value

  • Recall the formula to determine the \(z\)-value for a sample:

\[z = \frac{m - \mu}{\sigma / \sqrt{n}}\]

  • The only difference between the \(t\)-value and the \(z\)-value is the way the standard error \(SE\) is computed:

\[SE = s / \sqrt{n}\]

  • For regression, we always assume \(\mu\) to be 0 (\(H_0\): no effect of a predictor), and for \(m\) we use the estimated regression coefficient \(\beta\), yielding the following formula for \(t\):

\[t = \frac{\beta}{SE}\]

Manually checking the \(t\)-values

summary(m)$coef # show coefficients 
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.35     0.0995    3.51  0.00109
BAC            -0.35     0.1624   -2.16  0.03687
(tval_intercept = 0.34970 / 0.099513) # t-value for intercept 
[1] 3.51
(tval_BAC = -0.35049 / 0.162448) # t-value for BAC
[1] -2.16

Obtaining \(p\)-values on the basis of \(t\)-values

  • To determine a \(p\)-value, \(z\)-values are compared to the standard normal distribution
  • But \(t\)-values are compared to the \(t\)-distribution
  • \(t\)-distributions look similar to the standard normal distribution
    • But their shape depends on the number of degrees of freedom (dF)

What are degrees of freedom?

  • There are five balloons, each with a different color
  • There are five students (\(n = 5\)) who need to select a balloon
    • If 4 students have selected a balloon (dF = 4), student nr. 5 gets the last balloon (fixed)

  • Similarly, if we have a fixed slope and a fixed intercept with 43 observations
    • 41 values may vary freely, but the 42nd and 43rd are then fixed: dF = 43 - 1 - 1 = 41
  • For regression, dF = N - V - 1 (with N: nr. of obs., V: nr. of independent variables)
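  • R reports this value directly (a quick check):
df.residual(m) # 43 observations - 1 predictor - 1 intercept = 41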

Question 6

\(t\)-distribution vs. normal distribution

  • Difference between normal distribution and \(t\)-distribution is large for small dFs
  • When dF \(\geq\) 100, the difference is negligible
  • As the shape differs, the \(p\)-value associated with a certain \(t\)-value also changes

Visualizing \(t\)-distributions

  • For significance (given \(\alpha\)), higher (absolute) \(t\)-values are needed than \(z\)-values (but only when dF < 100; otherwise \(z\) and \(t\) are practically equal)
qt(0.025, df=5, lower.tail=F) # critical t-value (alpha = 0.025) for dF = 5
[1] 2.57
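  • A sketch (base R graphics) of how the \(t\)-distribution approaches the standard normal as dF grows:
curve(dnorm(x), from=-4, to=4, lty=1) # standard normal distribution
curve(dt(x, df=5), add=TRUE, lty=2)   # t-distribution with dF = 5: heavier tails
curve(dt(x, df=100), add=TRUE, lty=3) # t-distribution with dF = 100: nearly normal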

Question 7

Answer to question 7

pt(2, 10, lower.tail=F) * 2 # two-sided p-value = 2 * one-sided p-value
[1] 0.0734

  • Dark gray area: \(p\) < 0.05 (2-tailed)

Calculating \(p\)-values manually for our model

            Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.35     0.0995    3.51  0.00109
BAC            -0.35     0.1624   -2.16  0.03687
pt(abs(3.5141), df=41, lower.tail=F) * 2 # p-value in regression always two-tailed
[1] 0.00109
pt(abs(-2.1576), df=41, lower.tail=F) * 2 # p-value for slope of BAC
[1] 0.0368693
  • Manual computation is not needed as R does it for us
    • But you have to understand the idea behind it!

Other components of the regression model summary

summary(m)

Call:
lm(formula = ratingNL ~ 1 + BAC, data = lls)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0653 -0.3496  0.0867  0.2953  1.0466 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.3497     0.0995    3.51   0.0011 **
BAC          -0.3505     0.1624   -2.16   0.0369 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.456 on 41 degrees of freedom
Multiple R-squared:  0.102, Adjusted R-squared:  0.0801 
F-statistic: 4.66 on 1 and 41 DF,  p-value: 0.0369
  • You can ignore the remaining parts, except for the line with the R-squared values

Effect size: \(R^2\) of the model

  • \(R^2\): proportion of variation of the DV explained by the IV(s) in the model
    • \(R^2\) ranges from 0 to 1, with higher being better (perfect prediction: 1)
    • Effect size magnitudes: small (\(R^2 \approx 0.01\)), medium (\(R^2 \approx 0.1\)), large (\(R^2 \gtrapprox 0.25\))
  • When the model contains a single IV, \(R^2\) is the effect size of that variable
  • (Adjusted \(R^2\) is used instead when multiple independent variables are present)
summary(m)$r.squared # R^2: 10.2% of the variation in ratingNL scores explained by BAC
[1] 0.102

Correlation and \(R^2\)

  • \(R^2\) of a linear regression model with 1 independent variable (IV) and 1 dependent variable (DV) equals the square of Pearson’s correlation
cor(lls$ratingNL, lls$BAC) # Pearson's correlation
[1] -0.31931
cor(lls$ratingNL, lls$BAC)^2 # square of Pearson's correlation
[1] 0.10196
summary(m)$r.squared # R^2
[1] 0.10196

Model assumptions for linear regression

  • Two important requirements:
    • Distribution of residuals is assumed to follow a normal distribution
    • Variance of residuals should be equal along the regression line (homoscedastic)

Does our model satisfy the required assumptions?

par(mfrow=c(1,3))
qqnorm(resid(m))
qqline(resid(m)) # approximately normal
plot(lls$BAC, resid(m), main='Variance of residuals (vs predictor)') # approx. constant
plot(fitted(m), resid(m), main='Residual variance (vs fitted)')      # = homoscedastic

Reporting results

Our RQ was: Does alcohol have a negative influence on L1 language proficiency? Our hypotheses were: \(H_0\): \(\beta_{bac} = 0\), \(H_a\): \(\beta_{bac} < 0\). We obtained blood alcohol concentrations (BAC) and L1 proficiency ratings for 43 individuals. We fitted a regression model with the L1 ratings as the dependent variable (DV) and BAC as the independent variable (IV). We verified that the required assumptions were satisfied (i.e. normally distributed and homoscedastic residuals). The regression coefficient for BAC was -0.35, with an associated one-tailed \(p\)-value of 0.018. The effect size, measured as \(R^2\), was 0.102 (medium). This indicates that the predictor BAC accounts for about 10% of the variance in the L1 proficiency ratings. See Table X and Figure Y for the exact pattern. As the \(p\)-value was lower than our significance threshold \(\alpha\) of 0.05, we reject the null hypothesis and accept the alternative hypothesis that there is a significant negative influence of alcohol on L1 language proficiency.

Summary: assessing significance and reporting results

  1. Describe RQ, \(H_0\) and \(H_a\)
  2. Indicate what data were collected, how they were collected, and the sample size
  3. Indicate the dependent variable (DV) and the independent variable (IV)
  4. Report whether the requirements for the residuals are satisfied (normality and homoscedasticity)
  5. Report the estimates and \(p\)-values (and provide all details in a table)
  6. Visualize the results
  7. Report effect size
  8. Observe the \(p\)-value (halve it for a one-tailed hypothesis), compare it to \(\alpha\) (report!) and determine your conclusion
  • \(p\)-value \(< \alpha\), significant result: reject \(H_0\) and accept \(H_a\)
  • \(p\)-value \(\geq \alpha\), non-significant result: retain \(H_0\)

Practice this in laboratory exercises!

Recap

  • In this lecture, we’ve covered
    • Correlation
    • Linear regression with a single numerical variable
  • Next lecture: Multiple linear regression (including nominal variables)

Please evaluate this lecture!

Exam question

Questions?

Thank you for your attention!


https://www.martijnwieling.nl

m.b.wieling@rug.nl