Seminar in Methodology and Statistics

For students in the Linguistics Research Master's Program and Linguistics PhD students (& others by agreement)

Course under development--more or less permanently!

Spring, 2016 (under construction)

Lecture/Seminar: Tues. 16:00-17:45, Room A902K (little bldg at corner of Broerstraat & Oude Kijk in Jatstraat, just west of the Academiegebouw.)
R Lab w. Annelot de Rechteren van Hemert: Fri. 9:00-10:45, Let 1313 Multimediazaal 1, beginning Feb. 12! anne.recht@gmail.com

Instructor: John Nerbonne (see site for email, phone, geographic coordinates and for office hours)

Announcements

Sheets for the first half of the course (multivariate techniques) available here
Lab Sessions begin Fri. Feb. 12. See the labs' web site for an idea of what the labs will involve.
No lecture 2 Feb., due to Dr. Marjolijn Verspoor's inaugural lecture, which JN will introduce.

Description

The first half of the course will be an introduction to multivariate statistical techniques commonly used in linguistics and communications. It will be accompanied by lab sessions using the statistical package R. The rest of the course revolves around seminar presentations by participants. Presentations primarily concern statistical or methodological issues in the research of the participants. See below for some of the topics presented and discussed from 2004 through 2014.

Topics for the second half include a selection from permutation tests; bootstrapping; analysis of variance and analysis of covariance; regression including multiple regression and hierarchical (multi-level) regression; dimension reduction techniques including factor analysis, principal component analysis, multidimensional scaling and/or latent semantic analysis; analysis of nominal data including association strength, Cohen's kappa, binomial or multinomial models, Fisher's exact test, odds ratios, information-theoretic inspired measures such as pointwise mutual information, or logistic regression. Other topics have regularly been presented, mostly at the request of the participants.
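To give a rough idea of what a presentation on the first of these topics might work through, here is a minimal sketch of a two-sided permutation test for a difference in means. It is only an illustration under stated assumptions: the labs use R, whereas this sketch uses Python's standard library; the function name permutation_test and its parameters are hypothetical; and the reaction times are invented, not taken from any participant's data.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference and an approximate p-value.
    (Hypothetical helper written for illustration only.)
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Invented reaction times (ms) for two hypothetical conditions.
condition_a = [512, 498, 530, 545, 501, 520]
condition_b = [560, 575, 549, 590, 566, 581]

observed_diff, p = permutation_test(condition_a, condition_b)
print(f"observed difference: {observed_diff:.1f} ms, p = {p:.4f}")
```

The p-value is estimated from random reshufflings rather than from all possible splits, so it varies slightly with the seed and with the number of permutations.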

Prerequisites

Participants in the course should have completed a basic course in statistics covering topics such as descriptive statistics but also basic hypothesis testing using z-tests, t-tests, and χ². Participation in the first semester Research Master course on statistics and corpus linguistics (Wander Lowie and Gertjan van Noord) is strongly recommended for anyone who's never taken statistics. It is required that you know the statistics from that course, so if you've never taken such a course, that's a good basis. The statistics course given in the European Master's in Clinical Linguistics is also good.

Requirements

All participants, including auditors not taking the course for credit, must present at least one hour-long session (30 min. presentation plus 15 min. discussion and question session) on a statistical analysis technique. In addition, students taking the course for credit in the research master must turn in an 8-10 pp. (2,000-2,500 wd.) paper, which may be on the same statistical analysis technique, but which may also be on another. In the paper and presentation it is important to embed the discussion in the analysis of concrete data, to explain how the analysis works, under what conditions it may be applied, and what its shortcomings are. The emphasis is on the statistical technique, but the research question should be explained along with the background theory. Graphical presentations of the data attempting to illustrate the tendency that is to be proven or disproven are definitely valuable.

It is fine with me if you turn in a paper and/or presentation reporting on work for another course as long as the paper turned in focuses on the statistical analysis. If you want to turn in the same paper for two courses, make sure that both instructors know this and agree to it.

Ph.D. candidates from BCN or the Graduate School in the Humanities have received credit for this course in the past if they presented one topic (45 min. -- (30 + 15)) and participated regularly. I assume that that will continue to be the case.

Books

In general, we will try to use the following:

Natalia Levshina (2015) How to do linguistics with R. Data exploration and statistical analysis. John Benjamins: Amsterdam.
Good for R tips, R Studio, nice focus on linguistic problems.
Andy Field, Jeremy Miles & Zoë Field (2012) Discovering Statistics using R. Sage: London.
Good for R tips, covers basics OK, informal (wordy) style.
David S. Moore and George McCabe (1993) Introduction to the Practice of Statistics 5th edition. Freeman: New York.
We assume the materials in chapters 1-9, 12, and 14 (subject of Introductory Statistics). More advanced chapters such as those on permutation tests, bootstrapping, or regression models might be subjects of presentations and discussion. Excellent introduction!
The Moore & McCabe book is in the library of the faculty of Behavioral and Social Sciences (Grote Kruisstr. 2/1). At least one copy is kept there and is not lent but must be used there. Filed under usoc 014D 073 ex.5
A nice alternative seems to be Alan Agresti & Barbara Finlay's Statistical Methods for the Social Sciences 4th ed. Pearson: Upper Saddle River, NJ, 2009. I haven't used it yet, but it has a good selection of material.
Toni Rietveld and Roeland van Hout (1993) Statistical Techniques for the Study of Language and Language Behavior. Mouton De Gruyter: Berlin.
For many years, the text for statistics in linguistics, and still excellent. But see below.
Available electronically from the University Library in Groningen!

Other references that have also been found useful are the following:

Alan Agresti (1996) An Introduction to Categorical Data Analysis. Wiley: New York.
Barbara Tabachnick and Linda Fidell (2001) Using Multivariate Statistics, Pearson: Needham Heights, MA.
Comprehensive, and aimed at analysis, as opposed to those interested in mathematical underpinnings or those interested in developing statistical theory further.
Chris Manning and Hinrich Schütze (1999) Foundations of Statistical Natural Language Processing, MIT Press: Cambridge, MA.
Focus on computational linguistics, naturally, but lots on appropriate stats, including statistical modeling, information theory.
Chris Manning, Prabhakar Raghavan and Hinrich Schütze (2008) Introduction to Information Retrieval, Cambridge University Press: Cambridge, UK.
Focus on IR, CL, naturally, but lots on stats, including singular value decomposition, latent semantic indexing.

Many also find the more "how-to" books useful. The following books have been found especially valuable as they provide very practical instructions for doing analysis in SPSS or R.

Dennis Howitt and Duncan Cramer (2008) Introduction to SPSS in Psychology For Version 16 and earlier. 4th ed. Pearson: Essex.
Excellent continuation for topics too difficult for the Field book.
Harald Baayen (2008) Analyzing Linguistic Data. A Practical Introduction to Statistics using R. Cambridge University Press: Cambridge. There is also an online version available.
The book by Baayen may be the best book ever written on linguistic statistics. Especially if you are using large data sets (corpus frequencies), R is the way to go.
Keith Johnson (2008) Quantitative Methods in Linguistics. There is an online version as well.
I confess that I still haven't read this (1/2015) through, but I've read sections, and based on those and on Johnson's work in general, I expect it to be good. Like Baayen's, this book is R-based.

The books are on reserve, most at the Letteren library, reserve shelf. Please note that books on reserve at the Letteren library that normally belong there are not moved to the reserve shelf. Instead, they're kept at their normal places (use the catalogue), but are on reserve and cannot be loaned out.

Schedule for Seminars/Lectures 2016 (Tentative!)

The schedule for Spring 2016. Meetings are Tues. 4-6 pm in A902

Week	Date	Theme	Readings	Leader
1	9 Feb.	Organizational		John Nerbonne
2	16 Feb.	ANOVA, Factorial ANOVA	Levshina, Ch.8.2-3	John Nerbonne
3	23 Feb.	Repeated Measures	Levshina, Ch.8.4; Rietveld/van Hout, Ch.4.6	John Nerbonne
4	1 Mar.	Simple Linear Regression	Levshina, Ch.6 or Field 6
5	8 Mar.	Mult. Regr.	Levshina, Ch.7 or Field 7	John Nerbonne
6	15 Mar.	Logistic Regression	Levshina, Ch.12; Field, Ch.8	John Nerbonne
	21/3-8/4	Exam period	no meetings
7	12 Apr.	Mixed Effects Models	Baayen, Ch.7	John Nerbonne
8	19 Apr.	Quantifier Interpretation	Mixed Effects Log. Regr.	Isolde van Dorst
9	26 Apr.	Final Voicing in Whisper	Rep. Meas. ANOVA	Marita Everhardt
		Final Voicing, cont.	Mixed Effects Regr.	Liqin Zhang
10	3 May	Permutation Tests	Moore & McCabe, Ch. 14	John Nerbonne
		Bootstrap Sampling	Moore & McCabe, Ch. 14	John Nerbonne
11	10 May	Instructor absent	PhD defense, Freiburg
12	17 May	Code Switching	Logistic Regression	Masha Medvedeva
13	24 May	Chat-like dialogues	Logistic Regression	Lotte Verheijen
13	24 May	Chat-like dialogues	Mixed Effects Log. Regr.	Guanghao You

Materials

Click on the lecture (etc.) title to see more.

Course Materials 2016

John Nerbonne's lectures on various ANOVA & regression models (weeks 2-7).
John Nerbonne presenting Martijn Wieling's sheets on Mixed Effects Regression.
Isolde van Dorst on analyzing Quantifier Interpretations using Mixed Effect Regression
Marita Everhardt on Analyzing final voicing in whispered speech using Factorial ANOVA
Liqin Zhang on Mixed Effects Modeling of Final Devoicing in Whispered Speech (with the same data as in Marita Everhardt's presentation; see above).
John Nerbonne on Permutation Testing.
John Nerbonne presenting Bootstrap clustering and noisy clustering, also using Jelena Prokić's sheets on Clustering and the Bootstrap.
Masha Medvedeva on Predicting Code Switches in Udmurt/Russian (using logistic regression).
Lotte Verheijen on Logistic Regression to analyze chat-like dialogues.
Guanghao You on Mixed Effects Logistic Regression to analyze chat-like dialogues.

Course Materials 2011

John Nerbonne's lectures on various ANOVA & regression models (weeks 2-6).
Martijn Wieling on Mixed Models
J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian) "semi-communication".
Stefan Evert's page on statistics and software for measuring collocation strength.
Association Strength talks
1. Simon Šuster on Mutual Information and Collocations
2. Laura Handojo on Odds Ratios and Collocations
3. Jelke Bloem on Fisher's Exact Test to Detect Animacy
4. Igor Tytyk on Minimum Sensitivity and Collostructions
Stefan Evert on statistical association strength and multi-word expressions.
Repeated Measures vs. Mixed Models
1. Connie Lahmann on 'Higher Language Cognition' and Grammaticality Verification
2. Laura Bos on Repeated Measures ANOVA & Permutation Statistics
3. Ruggero Montalto on Repeated Measures vs. Mixed Models
4. Oscar Strik on the Akaike Information Criterion
Dimension Reduction
1. Martin Boros on Cluster Analysis and Silhouette Width
2. Ke Tran on Principal Component Analysis (and Face Recognition!)
3. Lubomir Zlatkov on Multi-Dimensional Scaling
4. Kristel Uiboaed on Correspondence Analysis
Mona Timmermeister & Caitlin Mignella on Validating a Pronunciation Difference Measure
Jurriën Schuurman on Min F in Psycholinguistics
Jet Vonk on Cochran's Q

Course Materials 2010

Eliza Magaretha on regression used to evaluate the quality of inducing pronunciation distances from empirical data.
Nynke van der Vliet on Cohen's κ used to measure the agreement between annotators of hierarchically structured material.
Edgar Weiffenbach on Log Linear Models of Contingency used to analyse corpus frequencies.
Nick Ruiz on Logistic Regression used to analyse corpus frequencies.
Rahmad Mahendra on Cross Entropy used to measure model quality in computational analyses.
Nadine Glas on (Log) Odds Ratios used to test the statistical independence of categorical variables (with a comparison to χ²).
Seid Tvica on Ordinal Regression used to measure comprehensibility of foreign speech.

Course Materials 2009

J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian) "semi-communication".
- Thomas Landauer & Susan Dumais. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge Psychological Review 104(2), 211-240.
- Tim Van de Cruys Latent Semantic Analysis and some further dimension reduction techniques.
- Therese Leinonen Principal Component Analysis and Factor Analysis in Comparing Large Dialect Collections of Vowels
Thomas Zastrow Entropy in Dialectometry and P. Nabende on Cross Entropy and Model Comparison
Xuchen Yao Bayesian vs. Frequentist Approaches to Statistics
Çagri Çöltekin Hierarchical Bayesian Networks as Learning Models. For background reading see Wagenmakers, E.-J., Lee, M. D., Lodewyckx, T., & Iverson, G. (2008). Bayesian versus frequentist inference. In H. Hoijtink, I. Klugkist, and P. A. Boelen (Eds.), Bayesian Evaluation of Informative Hypotheses, pp. 181-207. Springer: New York.
Ma Jianqiang Permutation Tests and Monte Carlo Sampling
Jelena Prokic Clustering and the Bootstrap
Arjen Versloot Using Late Medieval Sources for Linguistic Reconstructions (and regression)
Gulsen Yilmaz Using Multiple Regression to Understand Language Attrition
Harwintha Anjarningsih Repeated Measures ANOVA applied to ERP data.
Natalia Ergorova Repeated Measures ANOVA applied to ERP data, Example II.
Anja Schüppert (Binary) Logistic Regression applied to foreign comprehension data.
Ankelien Schippers (Multinomial) Logistic Regression applied to historical syntax.
Karin Beijering Loglinear Analysis of Contingency Tables applied to historical syntax.
Ildikó Berzlanovich Intercoder Agreement in Discourse Analysis (Cohen's κ)
Myrte Faber Annotating Turn Competition in Multi-Party Conversations (Cohen's κ)
Martijn Wieling Bipartite Spectral Graph Clustering (applied to dialectal variation)
Proscovia Olango Naive Bayes (applied to disambiguation)

Course Materials 2008

J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian) "semi-communication".
E. Rossi on Normal Distributions and Sampling and on hypothesis testing and t-tests
V. Koukoulioti on Nonparametric Fallback Tests
E. Rossi on Single ANOVA
Th. Mehotcheva on Kruskal-Wallis
H. Loerts on Multivariate ANOVA and Repeated Measures
H. Ahmed on χ² and Fisher's Exact Test
A. Lobanova on (Log) Odds Ratios and Word Order Studies
J. Nerbonne on regression and multiple regression
N. Haque on Principal Component Analysis
S. van Ommen on Clustering

Course Materials 2007

J. Nerbonne on regression and multiple regression
B. Szmrecsányi "Language users as creatures of habit: a corpus-linguistic analysis of persistence in spoken English" Corpus Linguistics and Linguistic Theory 1(1): 113-150.
L. Stowe on Analysis of Variance, incl. Multiple Analysis of Variance.
A. Banga and Tam Ho on Repeated Measures, (ANOVA)
S. Berends on Assumptions of ANOVA
T. Caspi on Windowing, Correlations, and Dynamic Systems Theory
Th. Leinonen on Regression in Phonetics and Computational Modeling
J. Nerbonne on Multiple Regression Models.
V. Baaijen on Applying Nonparametric Statistics to Analyse Writing (Comparing Think-Aloud Protocols and Keystroke Logging)
M. Knippers and R. Montalto on Dealing with Nonnormal Distributions in a Repeated Measures Design
M. Spruit on Search for Associations among Variables
G. Korfiatis on Principal Components Analysis
T. Van de Cruys on Dimensionality Reduction for Similarity Detection (Singular Value Decomposition, Non-Negative Matrix Factorization)

Course Materials 2006

Nerbonne on Factor Analysis
Wiersma on Permutation Tests
Zinger on n-gram models
Ruffle on syntactic differences in Old English
Van de Cruys on Latent Semantic Analysis
Vasishth on Mixed Effects Models
Moberg on conditional entropy used to model comprehensibility.
Mur on binomial models, esp. the paired sign test.
Heeringa on bootstrapping.
Ruffle on Log Odds
Xiaoyan Xu on Multivariate nature of Language Attrition
Kwant on "Delphi" techniques for identifying variables

Course Materials 2005

Nerbonne on χ²
Villada on Association Statistics for Recognizing Multi-Word Units
Featherston on Magnitude Estimation. (Various papers, of which Featherston's "Decathlon Model" is perhaps the best starting point.)
Donkers on ANOVA, repeated measures.
Ruffle and Trofimova on Fisher's Exact Test.
Bouma on Corpora and Counting.
Van Noord on Search in Automatically Analysed Corpora.
Smits and Rossi on Binomial Chances.
Kremers on Log Odds Ratios.
Nerbonne on Logistic Regression.
Deunk on Analysing Qualitative Results via Multi-Level Regression.
van der Plas on Clustering.
van der Beek on Entropy as Measure of Syntactic Influence.
Fahmi on Identifying Terminology.

Course Materials 2004

Student Projects

Melanie Hof's 2012 paper "Questionnaire Evaluation with Factor Analysis and Cronbach's Alpha"
Gerrit Jan Kootstra's 2004 project on exploratory Factor Analysis applied to foreign language learning.
