# Statistiek I

## Relating same-type variables

Martijn Wieling
University of Groningen

## Last lecture

• Non-parametric tests:
• Mann-Whitney U test (alternative to independent samples $t$-test)
• Wilcoxon signed-rank test (alternative to paired and single sample $t$-test)
• Sign test (alternative to Wilcoxon signed-rank test)
• Reporting statistical analysis

## Important: one-tailed and two-tailed $p$-values

• $p$-value: probability of the observed (or more extreme) data given that $H_0$ true
• Two-tailed test: very low and very high values evidence against $H_0$
• One-tailed test: only low (or high) values evidence against $H_0$
• In R: parameter alternative="two-tailed" yields $p$-value for two-tailed test
• alternative="less" (or "greater"): $p$-value for one-tailed test
• (No need to divide/multiply by 2, in contrast to using pnorm() or pt())
• $p$-value of a one-tailed test (appropriate side) will be half of $p$-value of two-tailed test (due to symmetry)
• $p$-value is compared to the $\alpha$-value (set before analysis)
• One-tailed test is more likely to yield significant result (as $p$-value is lower)

## This lecture

• Assessing the relationship between variables of the same type
• Two categorical variables: $\chi^2$-test
• Two numerical variables: correlation
• Reliability of questionnaires: Cronbach's $\alpha$

## Relationship between categorical variables

• In earlier lectures, we were interested in the relationship between a categorical variable (two groups) and a numerical variable (e.g., English grades)
• But what about the relationship between two categorical variables?
• Examples:
• Social group and /r/ pronunciation
• Authorship and use of personal pronouns
• Gender and having had bilingual education

## Showing a categorical relationship: cross table

(counts <- table(dat$gender, dat$bl_edu))

#
#       Y   N
#   F  20 202
#   M  10  83

prop.table(counts)  # as proportions

#
#            Y        N
#   F 0.063492 0.641270
#   M 0.031746 0.263492


## Proportions per row and per column

prop.table(counts, 1)  # proportions per row

#
#           Y       N
#   F 0.09009 0.90991
#   M 0.10753 0.89247

prop.table(counts, 2)  # proportions per column

#
#           Y       N
#   F 0.66667 0.70877
#   M 0.33333 0.29123


## Visualizing a categorical relationship

barplot(counts, legend = c("F", "M"), beside = T, ylim = c(0, 380), main = "Beside")
barplot(counts, legend = c("F", "M"), ylim = c(0, 380), main = "Segmented")


## Assessing statistical dependency (1)

• $\chi^2$-test is used for assessing the relationship between two categorical variables
• Is the distribution of one variable affected by the distribution of another?
• $P(A|B) \neq P(A)$?
• Example: $P(\textrm{bilingual education}|\textrm{gender = M}) \neq P(\textrm{bilingual education})$?
• We can test this by counting

## Assessing statistical dependency (2)

• The $\chi^2$-test compares the actual frequencies to the expected frequencies (assuming that the two variables are unrelated, i.e. independent)
• Large differences: likely related
• Small differences: most likely not related
• Note: not for paired data (i.e. repeated measures for one individual)!
• Then the McNemar test is appropriate (not covered in this course)

## Assumptions of $\chi^2$-test

• Data randomly selected from population
• Independent observations
• Each observation can be classified into exactly one category per variable
• Expected frequency in each cell $\geq$ 5
• Otherwise we use Fisher's exact test (covered later this lecture)

## Expected frequencies?

• Actual frequencies (including column and row sums)
Y N Total
F 20 202 222
M 10 83 93
Total 30 285 315
• Expected frequencies: multiply column total with row proportions
Y N Proportion
F 0.7 * 30 = 21.1 0.7 * 285 = 200.9 0.7
M 0.3 * 30 = 8.9 0.3 * 285 = 84.1 0.3
Total 30 285

## $\chi^2$ calculation

$\chi^2 = \Sigma \frac{(O - E)^2}{E}$

Y N
F $(20-21.1)^2 / 21.1 \approx 0.06$ $(202-200.9)^2 / 200.9 \approx 0.01$
M $(10-8.9)^2 / 8.9 \approx 0.15$ $(83-84.1)^2 / 84.1 \approx 0.02$
• $\chi^2$ = 0.06 + 0.01 + 0.15 + 0.02 $\approx$ 0.23
• Degrees of freedom: (R - 1) $\times$ (C - 1) = (2 - 1) $\times$ (2 - 1) = 1
• With R: number of rows, C: number of columns
• If row and column totals are known together with (R - 1) $\times$ (C -1) values, all other values can be calculated

## $\chi^2$ distribution

• $p$-value associated with $\chi^2$ value is dependent on the shape of the distribution
• Shape of the $\chi^2$ distribution is dependent on number of degrees of freedom (dF)

## $\chi^2$ in R

chisq.test(counts, correct = FALSE)  # chi-square test, no continuity correction

#
#   Pearson's Chi-squared test
#
# data:  counts
# X-squared = 0.231, df = 1, p-value = 0.63

• Using an $\alpha$-level of 0.05, this association is not significant
• Note that we always set correct = FALSE, as using Yates' correction is not recommended (lowers dF: too conservative)

## Another example

• Are handedness and having had bilingual education related?
(counts <- table(dat$bl_edu, dat$hand))  # actual frequencies

#
#       R   L
#   Y  25   5
#   N 252  33

chisq.test(counts)$expected # expected frequencies  # # R L # Y 26.381 3.619 # N 250.619 34.381  ## $\chi^2$: problematic for low frequencies • The $\chi^2$-test will give a warning with expected frequencies < 5 chisq.test(counts, correct = F)  # Warning in chisq.test(counts, correct = F): Chi-squared approximation may be # incorrect  # # Pearson's Chi-squared test # # data: counts # X-squared = 0.662, df = 1, p-value = 0.42  ## Alternative: Fisher's exact test • In this course, we use Fisher's exact test when low expected frequencies occur fisher.test(counts)  # # Fisher's Exact Test for Count Data # # data: counts # p-value = 0.38 # alternative hypothesis: true odds ratio is not equal to 1 # 95 percent confidence interval: # 0.22484 2.34374 # sample estimates: # odds ratio # 0.65576  ## Effect size • Flexible measure for effect size of the relation between two variables: Cramer's $V$ • Cramer's $V$: scaled $\chi^2$ value (between 0 and 1) • Small effect: $V = 0.1$, medium effect: $V = 0.3$, large effect: $V = 0.5$ (for tables with 2 rows or 2 columns) • Another measure of effect size is the odds ratio: see Levshina, p. 208 ## Cramer's $V$ in R library(vcd) # install with: install.packages('vcd') assocstats(counts)  # X^2 df P(> X^2) # Likelihood Ratio 0.60777 1 0.43563 # Pearson 0.66231 1 0.41575 # # Phi-Coefficient : 0.046 # Contingency Coeff.: 0.046 # Cramer's V : 0.046  • The effect size for the relation between handedness and education is negligible ## Question 3 ## Worked out example: gender vs. education level • We are interested in assessing if gender and education level are significantly (with $\alpha$ = 0.05) related (i.e. dependent) • Hypotheses: • $H_0$: gender and education level are independent • $H_a$: gender and education level are dependent (counts <- table(dat$gender, dat$edu_level))  # # HAVO VWO HBO # F 30 137 48 # M 13 69 8  ## Step 1: $\chi^2$ assumptions met? • Data randomly selected from population ? • Independent observations ✓ • Each observation can be classified into exactly one category per variable ✓ • Expected frequency in each cell $\geq$ 5 ? ## Expected frequencies $\geq$ 5? round(chisq.test(counts)$expected)

#
#     HAVO VWO HBO
#   F   30 145  39
#   M   13  61  17

• All expected frequencies $\geq$ 5: ✓

## Step 2: visualization

barplot(counts, legend = c("F", "M"), col = c("pink", "lightblue"))


## Step 3: calculation of $\chi^2$-value and $p$-value

chisq.test(counts, correct = F)

#
#   Pearson's Chi-squared test
#
# data:  counts
# X-squared = 7.82, df = 2, p-value = 0.02

• $p$-value < $\alpha$-value: reject $H_0$ and accept $H_a$

## Step 4: calculation of effect size

assocstats(counts)

#                     X^2 df P(> X^2)
# Likelihood Ratio 8.7109  2 0.012837
# Pearson          7.8235  2 0.020005
#
# Phi-Coefficient   : NA
# Contingency Coeff.: 0.158
# Cramer's V        : 0.16

• The effect size is small to medium

## Step 5: where are the largest differences? (1)

chisq.test(counts)$expected # expected frequencies  # # HAVO VWO HBO # F 30.311 145.213 39.475 # M 12.689 60.787 16.525  chisq.test(counts)$observed  # observed frequencies

#
#     HAVO VWO HBO
#   F   30 137  48
#   M   13  69   8


## Proportions are insightful

prop.table(counts, 2)  # column proportions (2nd index)

#
#     HAVO  VWO  HBO
#   F 0.70 0.67 0.86
#   M 0.30 0.33 0.14

prop.table(counts, 1)  # row proportions (1st index)

#
#     HAVO  VWO  HBO
#   F 0.14 0.64 0.22
#   M 0.14 0.77 0.09


## Step 5: where are the largest differences? (2)

• Standardized residuals: $z$-scores indicating where the largest differences are
• (Standardization against expected frequency and row and column proportion)
• Only useful when one of the variables has > 2 levels
• Otherwise all absolute values equal
chisq.test(counts)$stdres # z-scored differences  # # HAVO VWO HBO # F -0.11237 -2.20225 2.76432 # M 0.11237 2.20225 -2.76432  • The HBO education shows the largest deviation from the expected gender distribution ## Reporting results: example We tested whether gender (two levels: male and female) and education level (three levels: HAVO, VWO and HBO) were related (with $\alpha$ = 0.05). Our hypotheses were $H_0$: gender and education level are independent, and $H_a$: gender and education level are dependent. In our sample of 315 students of the Statistiek I course, we obtained the students' gender and education level. Table 1 shows their relation. As both variables are nominal, and the expected frequency of each combination was greater than 5, we assessed the dependence of the two variables using the $\chi^2$ test: $\chi^2(2) = 7.82$, $p = 0.02$. As the $p$-value was lower than the $\alpha$-level, the result was significant. We reject the null hypothesis and accept the alternative hypothesis that gender and education level are dependent. The difference from the expected distribution was greatest for the HBO group: there were many more women than expected. The effect size was small to medium (Cramer's $V$ = 0.16). ## $\chi^2$-test: some caveats • Avoid overuse of $\chi^2$: • When tables have very large frequencies (e.g., corpus linguistics) results are almost always significant • $\chi^2$ is not very sensitive, so use numerical variables where appropriate • I.e. don't use: This course was: useless / somewhat useful / useful • But use Likert scales: • On a scale from 1-5, this course was: ... (1: useless; 5: useful) ## Puzzle about relations • Patient survival after surgery: Hospital A Hospital B Deceased 63 (3%) 16 (2%) Survived 2037 784 Total 2100 800 ## Question 5 ## Hidden variables • With the original information, Hospital B appears better • But consider: Hosp. A (ill) Hosp. B (ill) Hosp. A (very ill) Hosp. B (very ill) Deceased 6 (1%) 8 (1.3%) 57 (3.8%) 8 (4%) Survived 594 592 1433 192 Total 600 600 1500 200 • Hospital A is better in both categories! • But it attracts more difficult patients (perhaps because it's better), so its overall survival rate is lower ## Simpson's Paradox • Simpson's Paradox: the distribution of a nominal variable can be reversed under aggregation • This is caused by hidden variables, variables which are important, but neglected in the analysis ## Relation between two numerical variables • In contrast to two nominal variables, it is much easier to assess the relationship between two numerical variables • We can visualize such a relationship in a scatter plot plot(dat$english_grade, dat$english_score)  ## Quantifying the relationship • We can quantify the relationship between two variables $x$ and $y$ by calculating the Pearson correlation coefficient $r$ • After standardizing (i.e. converting the values of each variable to $z$-scores), we can calculate the correlation between $z_x$ and $z_y$: $r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i}$ • Note: division by $n-1$ to match with the $n-1$ in formula for sample standard deviation (used in $z$-score calculation) • Values range between $-1 \leq r \leq 1$: strength and direction of the effect ## Correlation in R (1) sum( scale(dat$english_score) * scale(dat$english_grade) ) / ( nrow(dat) - 1 ) # manual  # [1] 0.74346  cor(dat$english_score, dat$english_grade) # using the cor function (easier)  # [1] 0.74346  ## Correlation in R (2) • With NA values, parameter use needs to be set to "pairwise.complete.obs" dat[1, ]$english_score = NA  # set 1 value to NA (only for illustration)
cor(dat$english_score, dat$english_grade)  # will result in NA

# [1] NA

cor(dat$english_score, dat$english_grade, use = "pairwise.complete.obs")

# [1] 0.74177


## (Cor)relation is no causation (2)

• Shoe size and reading ability are correlated
• Is there a dependency between the two?
• No: the hidden variable age affects both shoe size and reading ability!
• Experiments are necessary to control for potential causes (and confounding factors)

• In this course, we are not assessing whether a correlation is statistically significant
• But it is possible: function cor.test() conducts this test
• However, there are several required assumptions (see Levshina 6.2.3)
• Or you can use regression: covered in Statistiek II
• Non-parametric alternatives to the correlation are also available
• For example: Spearman's $\rho$ (correlation of ranks, see Levshina 6.3)

## Questionnaires

• Questionnaires are an easy way to obtain much data
• But: how to ask the right question?
• For example: How do students feel about statistics?
• What type of statistics?
• What kind of feelings?
• For this reason, researchers ask several questions all aimed at acquiring similar information (i.e. questions are combined in a so-called scale)

## Reliability and validity

• Validity: are you measuring what you intend to measure?
• This can be checked by consulting experts, comparing to similar measures, etc.
• Reliability: is your measure consistent?
• I.e. are results repeatable when obtained given the same conditions?
• This can be statistically assessed using Cronbach's $\alpha$
• Reliability $\neq$ validity!

## Reliability: Cronbach's $\alpha$ (1)

• Underlying idea:
• If we split the questions, how well do two halves of the questionnaire agree?
• And if there are lots of questions, how well would they agree if we looked at all the ways of splitting?

## Reliability: Cronbach's $\alpha$ (2)

• Cronbach's $\alpha$ depends on:
• The average correlation of all variables (i.e. questions) involved
• The number of questions
• Removing a problematic question may increase Cronbach's $\alpha$
• Cronbach's $\alpha$: > 0.7 is acceptable, 0.8 is good, 0.9 is very good
• With Cronbach's $\alpha$ > 0.7: mean of items can be used as summary of the scale

## Questionnaire: your feelings toward statistics

load("sats.rda")
head(sats)  # scores 1-7: negative - positive (some inverted)

#   Q3 Q4 Q14 Q15 Q18 Q19 Q28
# 1  7  1   7   1   1   7   1
# 2  4  4   7   5   3   3   5
# 3  7  2   6   2   2   6   1
# 4  4  4   3   4   4   4   4
# 5  3  3   6   6   5   2   6
# 6  5  4   6   4   3   6   2

• Note that I've included one question which is part of a different scale
• We'll try to identify this question later

## Cronbach's $\alpha$ in R (1)

library(psych)
result <- alpha(sats)

# Some items ( Q3 Q19 ) were negatively correlated with the total scale and
# probably should be reversed.
# To do this, run the function again with the 'check.keys=TRUE' option

summary(result)

#
# Reliability analysis
#  raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd
#       0.32       0.3    0.61     0.057 0.43 0.057  4.1 0.66

• Reliability is very low, but we ignored some questions having an inverted scale

## Taking into account inverted questions

• The instructions of the questionnaire show that questions 4, 15, 18 and 28 should be inverted (1 $\rightarrow$ 7, 2 $\rightarrow$ 6, 3 $\rightarrow$ 5, 4 $\rightarrow$ 4, etc.)
result <- alpha(sats, keys = c("Q4", "Q15", "Q18", "Q28"))  # keys: scales inverted
summary(result)  # much better reliability

#
# Reliability analysis
#  raw_alpha std.alpha G6(smc) average_r S/N   ase mean sd
#        0.8      0.76    0.82      0.32 3.2 0.015  4.2  1


## Can we improve reliability by dropping an item?

result\$alpha.drop

#      raw_alpha std.alpha G6(smc) average_r  S/N alpha se
# Q3        0.75      0.70    0.74      0.28 2.28     0.02
# Q4-       0.76      0.71    0.78      0.29 2.50     0.02
# Q14       0.85      0.85    0.86      0.49 5.67     0.01
# Q15-      0.79      0.74    0.81      0.32 2.80     0.02
# Q18-      0.75      0.70    0.76      0.28 2.36     0.02
# Q19       0.76      0.70    0.75      0.28 2.32     0.02
# Q28-      0.74      0.69    0.76      0.27 2.27     0.02

• Dropping Q14 would yield a higher Cronbach's $\alpha$

## Reliability without Q14

sats2 <- sats[, -3]  # drop third column (Q14)
result2 <- alpha(sats2, keys = c("Q4", "Q15", "Q18", "Q28"))  # keys: scales inverted
summary(result2)

#
# Reliability analysis
#  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
#       0.85      0.85    0.86      0.49 5.7 0.013    4 1.2


## Recap of this lecture

• In this lecture, we've covered:
• $\chi^2$-test for assessing relationship between 2 nominal variables
• Pearson correlation for assessing relationship between 2 numerical variables
• Cronbach's $\alpha$ for assessing reliability of multiple numerical variables
• Hidden variables
• A (cor)relation does not imply causation
• Reliability is no validity
• Experiment with correlation: http://eolomea.let.rug.nl/Statistiek-I/HC6
• Next lecture: practice exam