Martijn Wieling

University of Groningen

- Non-parametric tests:
- Mann-Whitney U test (alternative to independent samples \(t\)-test)
- Wilcoxon signed-rank test (alternative to paired and single sample \(t\)-test)
- Sign test (alternative to Wilcoxon signed-rank test)

- Reporting statistical analysis

- \(p\)-value: probability of the observed (or more extreme) data given that \(H_0\) true
- Two-tailed test: very low and very high values evidence against \(H_0\)
- One-tailed test: only low (or high) values evidence against \(H_0\)

- In
`R`

: parameter`alternative="two-tailed"`

yields \(p\)-value for two-tailed test`alternative="less"`

(or`"greater"`

): \(p\)-value for one-tailed test- (No need to divide/multiply by 2, in contrast to using
`pnorm()`

or`pt()`

) - \(p\)-value of a one-tailed test (appropriate side) will be
**half**of \(p\)-value of two-tailed test (due to symmetry)

- \(p\)-value is compared to the \(\alpha\)-value (set before analysis)
- One-tailed test is more likely to yield significant result (as \(p\)-value is lower)

- Assessing the relationship between variables of the same type
- Two categorical variables: \(\chi^2\)-test
- Two numerical variables: correlation

- Reliability of questionnaires: Cronbach's \(\alpha\)

- In earlier lectures, we were interested in the relationship between a categorical variable (two groups) and a numerical variable (e.g., English grades)
- But what about the relationship between two categorical variables?
- Examples:
- Social group and /r/ pronunciation
- Authorship and use of personal pronouns
- Gender and having had bilingual education

```
(counts <- table(dat$gender, dat$bl_edu))
```

```
#
# Y N
# F 20 202
# M 10 83
```

```
prop.table(counts) # as proportions
```

```
#
# Y N
# F 0.063492 0.641270
# M 0.031746 0.263492
```

```
prop.table(counts, 1) # proportions per row
```

```
#
# Y N
# F 0.09009 0.90991
# M 0.10753 0.89247
```

```
prop.table(counts, 2) # proportions per column
```

```
#
# Y N
# F 0.66667 0.70877
# M 0.33333 0.29123
```

```
barplot(counts, legend = c("F", "M"), beside = T, ylim = c(0, 380), main = "Beside")
barplot(counts, legend = c("F", "M"), ylim = c(0, 380), main = "Segmented")
```

**\(\chi^2\)**-test is used for assessing the relationship between two categorical variables- Is the distribution of one variable affected by the distribution of another?
- \(P(A|B) \neq P(A)\)?
- Example: \(P(\textrm{bilingual education}|\textrm{gender = M}) \neq P(\textrm{bilingual education})\)?

- We can test this by
**counting**

- Is the distribution of one variable affected by the distribution of another?

- The
**\(\chi^2\)**-test compares the**actual frequencies**to the**expected frequencies**(assuming that the two variables are unrelated, i.e. independent)- Large differences: likely related
- Small differences: most likely not related

- Note:
**not for paired data**(i.e. repeated measures for one individual)!- Then the McNemar test is appropriate (not covered in this course)

- Data randomly selected from population
- Independent observations
- Each observation can be classified into exactly one category per variable
- Expected frequency in each cell \(\geq\) 5
- Otherwise we use
**Fisher's exact test**(covered later this lecture)

- Otherwise we use

- Actual frequencies (including column and row sums)

Y | N | Total | |
---|---|---|---|

F | 20 | 202 | 222 |

M | 10 | 83 | 93 |

Total | 30 | 285 | 315 |

- Expected frequencies: multiply column total with row proportions

Y | N | Proportion | |
---|---|---|---|

F | 0.7 * 30 = 21.1 | 0.7 * 285 = 200.9 | 0.7 |

M | 0.3 * 30 = 8.9 | 0.3 * 285 = 84.1 | 0.3 |

Total | 30 | 285 |

\[\chi^2 = \Sigma \frac{(O - E)^2}{E}\]

Y | N | |
---|---|---|

F | $(20-21.1)^2 / 21.1 \approx 0.06$ | $(202-200.9)^2 / 200.9 \approx 0.01$ |

M | $(10-8.9)^2 / 8.9 \approx 0.15$ | $(83-84.1)^2 / 84.1 \approx 0.02$ |

- \(\chi^2\) = 0.06 + 0.01 + 0.15 + 0.02 \(\approx\) 0.23
- Degrees of freedom: (R - 1) \(\times\) (C - 1) = (2 - 1) \(\times\) (2 - 1) = 1
- With R: number of rows, C: number of columns
- If row and column totals are known together with (R - 1) \(\times\) (C -1) values, all other values can be calculated

- \(p\)-value associated with \(\chi^2\) value is dependent on the shape of the distribution
- Shape of the \(\chi^2\) distribution is dependent on number of degrees of freedom (dF)

```
chisq.test(counts, correct = FALSE) # chi-square test, no continuity correction
```

```
#
# Pearson's Chi-squared test
#
# data: counts
# X-squared = 0.231, df = 1, p-value = 0.63
```

- Using an \(\alpha\)-level of 0.05, this association is not significant
- Note that we always set
`correct = FALSE`

, as using Yates' correction is not recommended (lowers dF: too conservative)

- Are handedness and having had bilingual education related?

```
(counts <- table(dat$bl_edu, dat$hand)) # actual frequencies
```

```
#
# R L
# Y 25 5
# N 252 33
```

```
chisq.test(counts)$expected # expected frequencies
```

```
#
# R L
# Y 26.381 3.619
# N 250.619 34.381
```

- The \(\chi^2\)-test will give a warning with expected frequencies < 5

```
chisq.test(counts, correct = F)
```

```
# Warning in chisq.test(counts, correct = F): Chi-squared approximation may be
# incorrect
```

```
#
# Pearson's Chi-squared test
#
# data: counts
# X-squared = 0.662, df = 1, p-value = 0.42
```

- In this course, we use Fisher's exact test when low expected frequencies occur
- But note that it is conservative, and alternatives exist (e.g., Fisher-Boschloo test)

```
fisher.test(counts)
```

```
#
# Fisher's Exact Test for Count Data
#
# data: counts
# p-value = 0.38
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.22484 2.34374
# sample estimates:
# odds ratio
# 0.65576
```

- Flexible measure for effect size of the relation between two variables: Cramer's \(V\)
- Cramer's \(V\): scaled \(\chi^2\) value (between 0 and 1)
- Small effect: \(V = 0.1\), medium effect: \(V = 0.3\), large effect: \(V = 0.5\)

(for tables with 2 rows or 2 columns)

- Another measure of effect size is the odds ratio: see Levshina, p. 208

```
library(vcd) # install with: install.packages('vcd')
assocstats(counts)
```

```
# X^2 df P(> X^2)
# Likelihood Ratio 0.60777 1 0.43563
# Pearson 0.66231 1 0.41575
#
# Phi-Coefficient : 0.046
# Contingency Coeff.: 0.046
# Cramer's V : 0.046
```

- The effect size for the relation between handedness and education is negligible

- We are interested in assessing if gender and education level are significantly (with \(\alpha\) = 0.05) related (i.e. dependent)
- Hypotheses:
- \(H_0\): gender and education level are independent
- \(H_a\): gender and education level are dependent

```
(counts <- table(dat$gender, dat$edu_level))
```

```
#
# HAVO VWO HBO
# F 30 137 48
# M 13 69 8
```

- Data randomly selected from population ?
- Independent observations ✓
- Each observation can be classified into exactly one category per variable ✓
- Expected frequency in each cell \(\geq\) 5 ?

```
round(chisq.test(counts)$expected)
```

```
#
# HAVO VWO HBO
# F 30 145 39
# M 13 61 17
```

- All expected frequencies \(\geq\) 5: ✓

```
barplot(counts, legend = c("F", "M"), col = c("pink", "lightblue"))
```

```
chisq.test(counts, correct = F)
```

```
#
# Pearson's Chi-squared test
#
# data: counts
# X-squared = 7.82, df = 2, p-value = 0.02
```

- \(p\)-value < \(\alpha\)-value: reject \(H_0\) and accept \(H_a\)

```
assocstats(counts)
```

```
# X^2 df P(> X^2)
# Likelihood Ratio 8.7109 2 0.012837
# Pearson 7.8235 2 0.020005
#
# Phi-Coefficient : NA
# Contingency Coeff.: 0.158
# Cramer's V : 0.16
```

- The effect size is small to medium

```
chisq.test(counts)$expected # expected frequencies
```

```
#
# HAVO VWO HBO
# F 30.311 145.213 39.475
# M 12.689 60.787 16.525
```

```
chisq.test(counts)$observed # observed frequencies
```

```
#
# HAVO VWO HBO
# F 30 137 48
# M 13 69 8
```

```
prop.table(counts, 2) # column proportions (2nd index)
```

```
#
# HAVO VWO HBO
# F 0.70 0.67 0.86
# M 0.30 0.33 0.14
```

```
prop.table(counts, 1) # row proportions (1st index)
```

```
#
# HAVO VWO HBO
# F 0.14 0.64 0.22
# M 0.14 0.77 0.09
```

- Standardized residuals: \(z\)-scores indicating where the largest differences are
- (Standardization against expected frequency and row and column proportion)
- Only useful when one of the variables has > 2 levels
- Otherwise all absolute values equal

```
chisq.test(counts)$stdres # z-scored differences
```

```
#
# HAVO VWO HBO
# F -0.11237 -2.20225 2.76432
# M 0.11237 2.20225 -2.76432
```

- The HBO education shows the largest deviation from the expected gender distribution

We tested whether gender (two levels: male and female) and education level (three levels: HAVO, VWO and HBO) were related (with \(\alpha\) = 0.05). Our hypotheses were \(H_0\): gender and education level are independent, and \(H_a\): gender and education level are dependent. In our sample of 315 students of the *Statistiek I* course, we obtained the students' gender and education level. Table 1 shows their relation. As both variables are nominal, and the expected frequency of each combination was greater than 5, we assessed the dependence of the two variables using the \(\chi^2\) test: \(\chi^2(2) = 7.82\), \(p = 0.02\). As the \(p\)-value was lower than the \(\alpha\)-level, the result was significant. We reject the null hypothesis and accept the alternative hypothesis that gender and education level are dependent. The difference from the expected distribution was greatest for the HBO group: there were many more women than expected. The effect size was small to medium (Cramer's \(V\) = 0.16).

- Avoid overuse of \(\chi^2\):
- When tables have very large frequencies (e.g., corpus linguistics) results are almost always significant

- \(\chi^2\) is not very sensitive, so use numerical variables where appropriate
- I.e. don't use: This course was:
*useless / somewhat useful / useful* - But use
**Likert scales**:- On a scale from 1-5, this course was: ...
*(1: useless; 5: useful)*

- On a scale from 1-5, this course was: ...

- I.e. don't use: This course was:

- Patient survival after surgery:

Hospital A | Hospital B | |
---|---|---|

Deceased | 63 (3%) | 16 (2%) |

Survived | 2037 | 784 |

Total | 2100 | 800 |

- With the original information,
*Hospital B*appears better - But consider:

Hosp. A (ill) | Hosp. B (ill) | Hosp. A (very ill) | Hosp. B (very ill) | |
---|---|---|---|---|

Deceased | 6 (1%) | 8 (1.3%) | 57 (3.8%) | 8 (4%) |

Survived | 594 | 592 | 1433 | 192 |

Total | 600 | 600 | 1500 | 200 |

- Hospital A is better
*in both categories*!- But it attracts more difficult patients (perhaps because it's better), so its overall survival rate is lower

**Simpson's Paradox**: the distribution of a nominal variable can be reversed under aggregation- This is caused by
*hidden variables*, variables which are important, but neglected in the analysis

- This is caused by

- In contrast to two nominal variables, it is much easier to assess the relationship between two numerical variables
- We can visualize such a relationship in a
**scatter plot**

```
plot(dat$english_grade, dat$english_score)
```

- We can quantify the relationship between two variables \(x\) and \(y\) by calculating the Pearson
**correlation**coefficient \(r\) - After standardizing (i.e. converting the values of each variable to \(z\)-scores), we can calculate the correlation between \(z_x\) and \(z_y\): \[r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i}\]
- Note: division by \(n-1\) to match with the \(n-1\) in formula for sample standard deviation (used in \(z\)-score calculation)
- Values range between \(-1 \leq r \leq 1\): strength and direction of the effect

```
sum( scale(dat$english_score) * scale(dat$english_grade) ) / ( nrow(dat) - 1 ) # manual
```

```
# [1] 0.74346
```

```
cor(dat$english_score, dat$english_grade) # using the cor function (easier)
```

```
# [1] 0.74346
```

- With NA values, parameter
`use`

needs to be set to`"pairwise.complete.obs"`

```
dat[1, ]$english_score = NA # set 1 value to NA (only for illustration)
cor(dat$english_score, dat$english_grade) # will result in NA
```

```
# [1] NA
```

```
cor(dat$english_score, dat$english_grade, use = "pairwise.complete.obs")
```

```
# [1] 0.74177
```

- Shoe size and reading ability are correlated
- Is there a dependency between the two?
- No: the
**hidden variable***age*affects both shoe size and reading ability! - Experiments are necessary to control for potential causes (and confounding factors)

- In this course, we are not assessing whether a correlation is statistically significant
- But it is possible: function
`cor.test()`

conducts this test- However, there are several required assumptions (see Levshina 6.2.3)

- Or you can use
**regression**: covered in*Statistiek II* - Non-parametric alternatives to the correlation are also available
- For example: Spearman's \(\rho\) (correlation of ranks, see Levshina 6.3)

- But it is possible: function

- Questionnaires are an easy way to obtain much data
- But:
**how to ask the right question?** - For example: How do students feel about statistics?
- What
*type*of statistics? - What
*kind*of feelings?

- What
- For this reason, researchers ask several questions all aimed at acquiring similar information (i.e. questions are combined in a so-called
*scale*)

**Validity**: are you measuring what you intend to measure?- This can be checked by consulting experts, comparing to similar measures, etc.

**Reliability**: is your measure consistent?- I.e. are results repeatable when obtained given the same conditions?
- This can be statistically assessed using
**Cronbach's \(\alpha\)**

- Reliability \(\neq\) validity!

- Underlying idea:
- If we split the questions, how well do two halves of the questionnaire agree?
- And if there are lots of questions, how well would they agree if we looked at all the ways of splitting?

- More information: https://www.ijme.net/archive/2/cronbachs-alpha.pdf

- Cronbach's \(\alpha\) depends on:
- The average correlation of all variables (i.e. questions) involved
- The number of questions

- Removing a problematic question may increase Cronbach's \(\alpha\)
- Cronbach's \(\alpha\): > 0.7 is acceptable, 0.8 is good, 0.9 is very good
- With Cronbach's \(\alpha\) > 0.7: mean of items can be used as summary of the scale

- Questions (in English): https://goo.gl/efxtre

```
load("sats.rda")
head(sats) # scores 1-7: negative - positive (some inverted)
```

```
# Q3 Q4 Q14 Q15 Q18 Q19 Q28
# 1 7 1 7 1 1 7 1
# 2 4 4 7 5 3 3 5
# 3 7 2 6 2 2 6 1
# 4 4 4 3 4 4 4 4
# 5 3 3 6 6 5 2 6
# 6 5 4 6 4 3 6 2
```

- Note that I've included one question which is part of a different scale
- We'll try to identify this question later

```
library(psych)
result <- alpha(sats)
```

```
# Some items ( Q3 Q19 ) were negatively correlated with the total scale and
# probably should be reversed.
# To do this, run the function again with the 'check.keys=TRUE' option
```

```
summary(result)
```

```
#
# Reliability analysis
# raw_alpha std.alpha G6(smc) average_r S/N ase mean sd
# 0.32 0.3 0.61 0.057 0.43 0.057 4.1 0.66
```

- Reliability is very low, but we ignored some questions having an inverted scale

- The instructions of the questionnaire show that questions 4, 15, 18 and 28 should be inverted (1 \(\rightarrow\) 7, 2 \(\rightarrow\) 6, 3 \(\rightarrow\) 5, 4 \(\rightarrow\) 4, etc.)

```
result <- alpha(sats, keys = c("Q4", "Q15", "Q18", "Q28")) # keys: scales inverted
summary(result) # much better reliability
```

```
#
# Reliability analysis
# raw_alpha std.alpha G6(smc) average_r S/N ase mean sd
# 0.8 0.76 0.82 0.32 3.2 0.015 4.2 1
```

```
result$alpha.drop
```

```
# raw_alpha std.alpha G6(smc) average_r S/N alpha se
# Q3 0.75 0.70 0.74 0.28 2.28 0.02
# Q4- 0.76 0.71 0.78 0.29 2.50 0.02
# Q14 0.85 0.85 0.86 0.49 5.67 0.01
# Q15- 0.79 0.74 0.81 0.32 2.80 0.02
# Q18- 0.75 0.70 0.76 0.28 2.36 0.02
# Q19 0.76 0.70 0.75 0.28 2.32 0.02
# Q28- 0.74 0.69 0.76 0.27 2.27 0.02
```

- Dropping Q14 would yield a higher Cronbach's \(\alpha\)

```
sats2 <- sats[, -3] # drop third column (Q14)
result2 <- alpha(sats2, keys = c("Q4", "Q15", "Q18", "Q28")) # keys: scales inverted
summary(result2)
```

```
#
# Reliability analysis
# raw_alpha std.alpha G6(smc) average_r S/N ase mean sd
# 0.85 0.85 0.86 0.49 5.7 0.013 4 1.2
```

- In this lecture, we've covered:
- \(\chi^2\)-test for assessing relationship between 2 nominal variables
- Pearson correlation for assessing relationship between 2 numerical variables
- Cronbach's \(\alpha\) for assessing reliability of multiple numerical variables
- Hidden variables
- A (cor)relation does not imply causation
- Reliability is no validity

- Experiment with correlation: http://eolomea.let.rug.nl/Statistiek-I/HC6
- Next lecture:
**practice exam**

Thank you for your attention!