 # Statistiek I

## Relating same-type variables

Martijn Wieling
University of Groningen

## Last lecture

• Non-parametric tests:
• Mann-Whitney U test (alternative to independent samples $$t$$-test)
• Wilcoxon signed-rank test (alternative to paired and single sample $$t$$-test)
• Sign test (alternative to Wilcoxon signed-rank test)
• Reporting statistical analysis

## Important: one-tailed and two-tailed $$p$$-values

• $$p$$-value: probability of the observed (or more extreme) data given that $$H_0$$ true
• Two-tailed test: very low and very high values evidence against $$H_0$$
• One-tailed test: only low (or high) values evidence against $$H_0$$
• In R: parameter alternative="two-tailed" yields $$p$$-value for two-tailed test
• alternative="less" (or "greater"): $$p$$-value for one-tailed test
• (No need to divide/multiply by 2, in contrast to using pnorm() or pt())
• $$p$$-value of a one-tailed test (appropriate side) will be half of $$p$$-value of two-tailed test (due to symmetry)
• $$p$$-value is compared to the $$\alpha$$-value (set before analysis)
• One-tailed test (on appropriate side) is more likely to yield significant result (as $$p$$-value is lower)

## This lecture

• Assessing the relationship between variables of the same type
• Two categorical variables: $$\chi^2$$-test and Fisher's exact test
• Two numerical variables: Pearson correlation
• Reliability of questionnaires: Cronbach's $$\alpha$$

## Lowlands Science 2018 data

• RQ: Is there a different influence of BAC on L1 vs. L2 language proficiency?
• Note: not conclusively answered in this course (regression necessary: Statistiek II)
• BAC and L1 and L2 speech samples collected for about 90 native Dutch speakers
• Dutch speech samples were rated by native Dutch speakers (at Lowlands)
• English speech samples were rated by native American English speakers (online)
• For each speaker, the ratings were averaged per language
• Other variables we collected included gender, education level, drug use, etc.
• Relationships between these variables will be looked at in this lecture
• First: gender and alcohol use (BAC > 0 vs. BAC = 0)

## Relationship between categorical variables

• In earlier lectures, we were interested in the relationship between a categorical variable (two groups) and a numerical variable (e.g., English grades)
• But what about the relationship between two categorical variables?
• Examples:
• Social group and /r/ pronunciation
• Authorship and use of personal pronouns
• Gender and drinking alcohol or not

## Showing a categorical relationship: cross table

(counts <- table(llsNL$gender, llsNL$alcohol))

#
#          Alcohol No-alcohol
#   Female      21         31
#   Male        26         13

prop.table(counts)  # as proportions

#
#          Alcohol No-alcohol
#   Female 0.23077    0.34066
#   Male   0.28571    0.14286


## Proportions per row and per column

prop.table(counts, 1)  # proportions per row

#
#          Alcohol No-alcohol
#   Female 0.40385    0.59615
#   Male   0.66667    0.33333

prop.table(counts, 2)  # proportions per column

#
#          Alcohol No-alcohol
#   Female 0.44681    0.70455
#   Male   0.55319    0.29545


## Visualizing a categorical relationship

barplot(counts, legend = c("F", "M"), beside = T, ylim = c(0, 60), main = "Beside")
barplot(counts, legend = c("F", "M"), ylim = c(0, 60), main = "Segmented") ## Assessing statistical dependency (1)

• $$\chi^2$$-test is used for assessing the relationship between two categorical variables
• Is the distribution of one variable affected by the distribution of another?
• $$P(A|B) \neq P(A)$$?
• Example: $$P(\textrm{having had alcohol}|\textrm{gender = M}) \neq P(\textrm{having had alcohol})$$?
• We can test this by counting

## Assessing statistical dependency (2)

• The $$\chi^2$$-test compares the actual frequencies to the expected frequencies (assuming that the two variables are unrelated, i.e. independent)
• Large differences: likely related
• Small differences: most likely not related
• Note: not for paired data (i.e. repeated measures for one individual)!
• Then the McNemar test is appropriate (not covered in this course)

## Assumptions of $$\chi^2$$-test

• Data randomly selected from population
• Independent observations
• Each observation can be classified into exactly one category per variable
• Expected frequency in each cell $$\geq$$ 5
• Otherwise we use Fisher's exact test (covered later this lecture)

## Expected frequencies?

• Actual frequencies (including column and row sums)
Alcohol No-alcohol Total
Female 21 31 52
Male 26 13 39
Total 47 44 91
• Expected frequencies: multiply column total with row proportions
Alcohol No-alcohol Proportion
Female 0.57 * 47 = 26.9 0.57 * 44 = 25.1 0.57
Male 0.43 * 47 = 20.1 0.43 * 44 = 18.9 0.43
Total 47 44

## $$\chi^2$$ calculation

$\chi^2 = \Sigma \frac{(O - E)^2}{E}$

Alcohol No-alcohol
Female $(21-26.9)^2 / 26.9 \approx 1.28$ $(31-25.1)^2 / 25.1 \approx 1.36$
Male $(26-20.1)^2 / 20.1 \approx 1.7$ $(13-18.9)^2 / 18.9 \approx 1.82$
• $$\chi^2$$ = 1.28 + 1.36 + 1.7 + 1.82 $$\approx$$ 6.16
• Degrees of freedom: (R - 1) $$\times$$ (C - 1) = (2 - 1) $$\times$$ (2 - 1) = 1
• With R: number of rows, C: number of columns
• If row and column totals are known together with (R - 1) $$\times$$ (C -1) values, all other values can be calculated

## $$\chi^2$$ distribution

• $$p$$-value associated with $$\chi^2$$ value is dependent on the shape of the distribution
• Shape of the $$\chi^2$$ distribution is dependent on number of degrees of freedom (dF) ## $$\chi^2$$ in R

chisq.test(counts, correct = FALSE)  # chi-square test, no continuity correction

#
#   Pearson's Chi-squared test
#
# data:  counts
# X-squared = 6.16, df = 1, p-value = 0.013

• Using an $$\alpha$$-level of 0.05, this association (gender and alcohol use) is significant
• Note that we always set correct = FALSE, as using Yates' correction is not recommended (lowers dF: too conservative)

## Effect size

• Flexible measure for effect size of the relation between two variables: Cramer's $$V$$
• Cramer's $$V$$: scaled $$\chi^2$$ value (between 0 and 1)
• Negligible: $$V < 0.1$$, small: $$V < 0.3$$, medium: $$V < 0.5$$, large: $$V \geq 0.5$$
(for tables with 2 rows or 2 columns)
• Another measure of effect size is the odds ratio: see Levshina, p. 208

## Cramer's $$V$$ in R

library(vcd)  # install with: install.packages('vcd')
assocstats(counts)

#                     X^2 df P(> X^2)
# Likelihood Ratio 6.2536  1 0.012394
# Pearson          6.1642  1 0.013036
#
# Phi-Coefficient   : 0.26
# Contingency Coeff.: 0.252
# Cramer's V        : 0.26

• The effect size for the relation between alcohol use and gender is small

## Another example

• Are gender and having had (soft or hard) drugs related?
(counts <- table(llsNL$gender, llsNL$drugs))  # actual frequencies

#
#          No Soft Hard
#   Female 49    1    2
#   Male   37    1    1

chisq.test(counts)$expected # expected frequencies  # # No Soft Hard # Female 49.143 1.14286 1.7143 # Male 36.857 0.85714 1.2857  ## $$\chi^2$$: problematic for low frequencies • The $$\chi^2$$-test will give a warning with expected frequencies < 5 chisq.test(counts, correct = F)  # Warning in chisq.test(counts, correct = F): Chi-squared approximation may be # incorrect  # # Pearson's Chi-squared test # # data: counts # X-squared = 0.154, df = 2, p-value = 0.93  ## Alternative: Fisher's exact test • In this course, we use Fisher's exact test when low expected frequencies occur fisher.test(counts)  # # Fisher's Exact Test for Count Data # # data: counts # p-value = 1 # alternative hypothesis: two.sided  ## Effect size assocstats(counts)  # X^2 df P(> X^2) # Likelihood Ratio 0.15618 2 0.92488 # Pearson 0.15375 2 0.92601 # # Phi-Coefficient : NA # Contingency Coeff.: 0.041 # Cramer's V : 0.041  • The effect size for the relation between having had drugs and gender is negligible ## Question 3 ## Worked out example: gender vs. education level • We are interested in assessing whether gender and education level are significantly (with $$\alpha$$ = 0.05) related (i.e. dependent) • Hypotheses: • $$H_0$$: gender and education level are independent • $$H_a$$: gender and education level are dependent (counts <- table(llsNL$gender, llsNL$edu))  # # PVMBO HBO WO # Female 10 9 33 # Male 10 14 15  ## Step 1: $$\chi^2$$ assumptions met? • Data randomly selected from population ? • Independent observations ✓ • Each observation can be classified into exactly one category per variable ✓ • Expected frequency in each cell $$\geq$$ 5 ? ## Expected frequencies $$\geq$$ 5? round(chisq.test(counts)$expected)

#
#          PVMBO HBO WO
#   Female    11  13 27
#   Male       9  10 21

• All expected frequencies $$\geq$$ 5: ✓

## Step 2: visualization

barplot(counts, legend = c("F", "M"), col = c("pink", "lightblue"), ylim = c(0, 70)) ## Step 3: calculation of $$\chi^2$$-value and $$p$$-value

chisq.test(counts, correct = F)

#
#   Pearson's Chi-squared test
#
# data:  counts
# X-squared = 6.1, df = 2, p-value = 0.047

• $$p$$-value < $$\alpha$$-value: reject $$H_0$$ and accept $$H_a$$

## Step 4: calculation of effect size

assocstats(counts)

#                     X^2 df P(> X^2)
# Likelihood Ratio 6.1500  2 0.046189
# Pearson          6.1044  2 0.047255
#
# Phi-Coefficient   : NA
# Contingency Coeff.: 0.251
# Cramer's V        : 0.259

• The effect size is small

## Step 5: where are the largest differences? (1)

chisq.test(counts)$expected # expected frequencies  # # PVMBO HBO WO # Female 11.4286 13.1429 27.429 # Male 8.5714 9.8571 20.571  chisq.test(counts)$observed  # observed frequencies

#
#          PVMBO HBO WO
#   Female    10   9 33
#   Male      10  14 15


## Proportions are insightful

prop.table(counts, 2)  # column proportions (2nd index)

#
#          PVMBO  HBO   WO
#   Female  0.50 0.39 0.69
#   Male    0.50 0.61 0.31

prop.table(counts, 1)  # row proportions (1st index)

#
#          PVMBO  HBO   WO
#   Female  0.19 0.17 0.63
#   Male    0.26 0.36 0.38


## Step 5: where are the largest differences? (2)

• Standardized residuals: $$z$$-scores indicating where the largest differences are
• (Standardization against expected frequency and row and column proportion)
• Only useful when ≥ 1 of the variables has > 2 levels: otherwise |values| equal
chisq.test(counts)$stdres # z-scored differences  # # PVMBO HBO WO # Female -0.73078 -2.01934 2.36396 # Male 0.73078 2.01934 -2.36396  • The WO education level shows the largest deviation from the expected gender distribution ## Reporting results: example We tested whether gender (two levels: male and female) and education level (three levels: PVMBO, HBO and WO) were related (with $$\alpha$$ = 0.05). Our hypotheses were $$H_0$$: gender and education level are independent, and $$H_a$$: gender and education level are dependent. In our sample of 91 participants from the Lowlands Science experiment, we obtained the participants' gender and education level. Table 1 shows their relation and Figure 1 shows a bar plot visualizing the results of the table. As both variables are nominal, and the expected frequency of each combination was greater than 5, we assessed the dependence of the two variables using the $$\chi^2$$ test: $$\chi^2(2) = 6.1$$, $$p = 0.047$$. As the $$p$$-value was lower than the $$\alpha$$-level, the result was significant. We reject the null hypothesis and accept the alternative hypothesis that gender and education level are dependent. The difference from the expected distribution was greatest for the WO group: there were many more women than expected. The effect size was small (Cramer's $$V$$ = 0.26). ## $$\chi^2$$-test: some caveats • Avoid overuse of $$\chi^2$$: • When tables have very large frequencies (e.g., corpus linguistics) results are almost always significant • $$\chi^2$$ is not very sensitive, so use numerical variables where appropriate • I.e. don't use: This course was: useless / somewhat useful / useful • But use Likert scales: • On a scale from 1-5, this course was: ... (1: useless; 5: useful) ## Puzzle about relations • Patient survival after surgery: Hospital A Hospital B Deceased 63 (3%) 16 (2%) Survived 2037 784 Total 2100 800 ## Question 5 ## Hidden variables • With the original information, Hospital B appears better • But consider: Hosp. A (ill) Hosp. B (ill) Hosp. A (very ill) Hosp. B (very ill) Deceased 6 (1%) 8 (1.3%) 57 (3.8%) 8 (4%) Survived 594 592 1433 192 Total 600 600 1500 200 • Hospital A is better in both categories! • But it attracts more difficult patients (perhaps because it's better), so its overall survival rate is lower ## Simpson's Paradox • Simpson's Paradox: the distribution of a nominal variable can be reversed under aggregation • This is caused by hidden variables, variables which are important, but neglected in the analysis ## Relation between two numerical variables • In contrast to two nominal variables, it is much easier to assess the relationship between two numerical variables • We can visualize such a relationship in a scatter plot plot(llsNL$BAC, llsNL$rating) ## Quantifying the relationship • We can quantify the relationship between two variables $$x$$ and $$y$$ by calculating the Pearson correlation coefficient $$r$$ • After standardizing (i.e. converting the values of each variable to $$z$$-scores), we can calculate the correlation between $$z_x$$ and $$z_y$$: $r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i}$ • Note: division by $$n-1$$ to match with the $$n-1$$ in formula for sample standard deviation (used in $$z$$-score calculation) • Values range between $$-1 \leq r \leq 1$$: strength and direction of the effect ## Correlation in R sum( scale(llsNL$BAC) * scale(llsNL$rating) ) / ( nrow(llsNL) - 1 ) # manual  #  -0.34045  cor(llsNL$BAC, llsNL$rating) # using the cor function (easier)  #  -0.34045  ## Lowlands Science 2018 results cor(llsNL$BAC, llsNL$rating) # when speaking Dutch  #  -0.34045  cor(llsEN$BAC, llsEN$rating) # when speaking English  #  -0.11235  ## Correlation in R with missing values • With NA values, parameter use needs to be set to "pairwise.complete.obs" llsNL[1, ]$rating = NA  # set 1 value to NA (only for illustration)
cor(llsNL$BAC, llsNL$rating)  # will result in NA

#  NA

cor(llsNL$BAC, llsNL$rating, use = "pairwise.complete.obs")

#  -0.36322


## (Cor)relation is no causation (1) ## (Cor)relation is no causation (2)

• Shoe size and reading ability are correlated
• Is there a dependency between the two?
• No: the hidden variable age affects both shoe size and reading ability!
• Experiments are necessary to control for potential causes (and confounding factors)

• In this course, we are not assessing whether a correlation is statistically significant
• But it is possible: function cor.test() conducts this test
• However, there are several required assumptions (see Levshina 6.2.3)
• Or you can use regression: covered in Statistiek II
• Non-parametric alternatives to the correlation are also available
• For example: Spearman's $$\rho$$ (correlation of ranks, see Levshina 6.3)

## Questionnaires

• Questionnaires are an easy way to obtain much data
• But: how to ask the right question?
• For example: How do students feel about statistics?
• What type of statistics?
• What kind of feelings?
• For this reason, researchers ask several questions all aimed at acquiring similar information (i.e. questions are combined in a so-called scale)

## Reliability and validity

• Validity: are you measuring what you intend to measure?
• This can be checked by consulting experts, comparing to similar measures, etc.
• Reliability: is your measure consistent?
• I.e. are results repeatable when obtained given the same conditions?
• This can be statistically assessed using Cronbach's $$\alpha$$
• Reliability $$\neq$$ validity!

## Reliability and validity ## Reliability: Cronbach's $$\alpha$$ (1)

• Underlying idea:
• If we split the questions, how well do two halves of the questionnaire agree?
• And if there are many questions, how well would they agree if we looked at all possible ways of splitting?

## Reliability: Cronbach's $$\alpha$$ (2)

• Cronbach's $$\alpha$$ depends on:
• The average correlation of all variables (i.e. questions) involved
• The number of questions
• Removing a problematic question may increase Cronbach's $$\alpha$$
• Cronbach's $$\alpha$$: > 0.7 is acceptable, 0.8 is good, 0.9 is very good
• With Cronbach's $$\alpha$$ > 0.7: mean of items can be used as summary of the scale

## Cronbach's $$\alpha$$ depends on $$n$$ and $$r$$ ## Questionnaire: your feelings toward statistics

load("sats.rda")
head(sats)  # scores 1-7: negative - positive (some inverted)

#   Q3 Q4 Q14 Q15 Q18 Q19 Q28
# 1  7  2   6   2   2   6   1
# 2  4  4   3   4   4   4   4
# 3  3  3   6   6   5   2   6
# 4  5  4   6   4   3   6   2
# 5  4  5   6   4   4   2   5
# 6  4  2   6   2   3   5   2

• Note that I have included one question which is part of a different scale
• We will try to identify this question later

## Cronbach's $$\alpha$$ in R (1)

library(psych)
result <- alpha(sats)

# Some items ( Q3 Q19 ) were negatively correlated with the total scale and
# probably should be reversed.
# To do this, run the function again with the 'check.keys=TRUE' option

summary(result)

#
# Reliability analysis
#  raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
#        0.4      0.37    0.63     0.078 0.59 0.039  4.2 0.69    0.044

• Reliability is very low, but we ignored some questions having an inverted scale

## Taking into account inverted questions

• The instructions of the questionnaire show that questions 4, 15, 18 and 28 should be inverted (1 $$\rightarrow$$ 7, 2 $$\rightarrow$$ 6, 3 $$\rightarrow$$ 5, 4 $$\rightarrow$$ 4, etc.)
result <- alpha(sats, keys = c("Q4", "Q15", "Q18", "Q28"))  # keys: scales inverted
summary(result)  # much better reliability

#
# Reliability analysis
#  raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
#       0.79      0.75    0.81       0.3   3 0.013  4.3 0.98     0.36


## Can we improve reliability by dropping an item?

result\$alpha.drop

#      raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r med.r
# Q3        0.74      0.68    0.73      0.26 2.16     0.02  0.10  0.36
# Q4-       0.74      0.69    0.76      0.27 2.28     0.02  0.09  0.36
# Q14       0.84      0.83    0.85      0.46 5.04     0.01  0.02  0.38
# Q15-      0.77      0.73    0.80      0.31 2.65     0.01  0.11  0.36
# Q18-      0.73      0.68    0.75      0.27 2.17     0.02  0.09  0.34
# Q19       0.74      0.69    0.73      0.27 2.19     0.02  0.10  0.35
# Q28-      0.71      0.67    0.73      0.25 2.05     0.02  0.07  0.34

• Dropping Q14 would yield a higher Cronbach's $$\alpha$$

## Reliability without Q14

sats2 <- sats[, -3]  # drop third column (Q14)
result2 <- alpha(sats2, keys = c("Q4", "Q15", "Q18", "Q28"))  # keys: scales inverted
summary(result2)

#
# Reliability analysis
#  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd median_r
#       0.84      0.83    0.85      0.46   5 0.011    4 1.2     0.38


## Wrap-up: overview of statistical tests ## Recap of this lecture

• In this lecture, we've covered:
• $$\chi^2$$-test for assessing relationship between 2 nominal variables
• Pearson correlation for assessing relationship between 2 numerical variables
• Cronbach's $$\alpha$$ for assessing reliability of multiple numerical variables
• Hidden variables
• A (cor)relation does not imply causation
• Reliability is no validity
• Experiment with correlation: http://eolomea.let.rug.nl/Statistiek-I/HC6
• Next lecture: practice exam
• (We run many experiments and outreach events in my lab, for example using our SPRAAKLAB mobile laboratory, contact me if you would like to get involved!)