Seminar in Methodology and Statistics

For students in the Linguistics Research Master's Program and Linguistics PhD students

Course under development--more or less permanently!

Spring, 2011

When: Tues. 9:15-11
Where: Harmonie H15.0036

Instructor: John Nerbonne

Announcements

Description

The structure of the course revolves around seminar presentations by participants. Presentations primarily concern statistical or methodological issues in the research of the participants. See below for some of the topics presented and discussed in 2004 through 2008.

Topics include a selection from permutation tests; bootstrapping; analysis of variance and analysis of covariance; regression including multiple regression and hierarchical (multi-level) regression; dimension reduction techniques including factor analysis, principal component analysis, multidimensional scaling and/or latent semantic analysis; analysis of nominal data including association strength, Cohen's kappa, binomial or multinomial models, Fisher's exact test, odds ratios, entropy, or logistic regression. Other topics have regularly been presented, mostly at the request of the participants.

Prerequisites

Participants in the course should have completed a basic course in statistics covering topics such as descriptive statistics and basic hypothesis testing using z-tests, t-tests, and χ². Participation in the first-semester Research Master course on statistics and corpus linguistics (Wander Lowie and Gertjan van Noord) is strongly recommended for anyone who has never taken statistics. You are required to know the statistics from that course, so it is a good basis if you have never taken such a course. The statistics course given in the European Master's in Clinical Linguistics is also good.

Requirements

Students taking the course for credit in the research master must present at least one hour-long session (30 min. presentation plus 15 min. discussion and question session) on a statistical analysis technique. In addition they must turn in an 8-10 pp. paper (2,000-2,500 wd.), which may be on the same statistical analysis technique or on another. In the paper it is important to embed the discussion in the analysis of concrete data, and to explain how the analysis works, under what conditions it may be applied, and what its shortcomings are. Graphical presentations of the data that illustrate the tendency to be proven or disproven are definitely valuable.

It is fine with me if you turn in a paper reporting on work for another course as long as the paper I receive focuses on the statistical analysis. If you want to turn in the same paper for two courses, make sure that both instructors know this.

Ph.D. candidates from BCN or the Graduate School in the Humanities have received credit for this course in the past if they presented one topic (45 min.: 30 + 15) and participated regularly. I assume that this will continue to be the case.

Books

In general, we will try to use the following:
Other references that have also been found useful are the following:
Many also find the more "how-to" books useful. The following books have been found especially valuable as they provide very practical instructions for doing analysis in SPSS or R.
Articles may also be used from time to time. The books are on reserve, most at the Letteren library, reserve shelf. Please note that books on reserve at the Letteren library that normally belong there are not moved to the reserve shelf. Instead, they're kept at their normal places (use the catalogue), but are on reserve and cannot be loaned out.

Schedule 2011 (Tentative!)

The schedule for Spring 2011. Meetings are Tues. 9-11 in H1315.036
| Week | Date | Theme | Readings | Leader |
|---|---|---|---|---|
| 1 | 15 Feb. | Organizational | | John Nerbonne |
| 2-3 | 22 Feb.-1 Mar. | ANOVA, Factorial ANOVA | M&M Chap. 12-13 | John Nerbonne |
| 4 | 8 Mar. | Repeated Measures | Rietveld/van Hout, Ch. 4.6; Field 13 | John Nerbonne |
| 5-6 | 15-22 Mar. | Regression, Mult. Regr. | M&M 10-11 | John Nerbonne |
| 7 | 29 Mar. | Mixed Models | Baayen, Ch. 7 | Martijn Wieling |
| 8 | 5 April | no class - break | | |
| 9 | 12 Apr. | Identify topics | Optional | John Nerbonne |
| 10 | 19 Apr. | Logistic Regression | M&M Chap. 15 | John Nerbonne |
| 11 | 26 Apr. | Association Strength, PMI | Wiechmann slides | Simon Šuster |
| | | Association Strength, Odds Ratios | Agresti Chap. 2.3 | Laura Handojo |
| | | Assoc. Strength & Fisher's Exact | Pedersen 1996 | Jelke Bloem |
| | | Association Strength, Minimum Sensitivity | Wiechmann paper | Igor Tytyk |
| 12 | 3 May | May Break | | |
| 13 | 10 May | Association Strength | | Stefan Evert |
| 14 | 17 May | Repeated Measures | Field 13; Rietveld & van Hout Ch. 4.6 | Laura Bos |
| | | | | Connie Lahmann |
| | | ... vs. Mixed Models | | Ruggero Montalto |
| | | | | Oscar Strik |
| 15 | 24 May | Clustering | Johnson, Ch. 6.1-6.4 | Martin Boroš |
| | | Principal Components | Tabachnick & Fidell, Ch. 13 | Ke Tranh |
| | | Multi-Dimensional Scaling | Johnson, Ch. 6.5 | Ljubomir Žlatkov |
| | | Correspondence Analysis | | Kristel Uiboaed |
| Extra | 26 May | Validation | | Kaitlin Mignella |
| | | | | Mona Zimmermeister |
| | | minF | | Jurriën Schuurman |
| 16 | 31 May | No class | | JN in Kampala |
| 17 | 7 June | Logistic Regression | M&M Ch. 15 | Vincenzo Tabacco |
| | | Cochran Q | Field, 15.6.3 | Jet Vonk |
| Guest Lecture | | Sentiment in Texts | Clustering, PMI, Naive Bayes | Tony Mullen |

Materials

Course Materials 2011

  1. John Nerbonne's lectures on various ANOVA & regression models (weeks 2-6).
  2. Martijn Wieling on Mixed Models
  3. J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping in Scandinavian "semi-communication".
  4. Stefan Evert's page on statistics and software for measuring collocation strength.
  5. Association Strength talks
    1. Simon Šuster on Mutual Information and Collocations
    2. Laura Handojo on Odds Ratios and Collocations
    3. Jelke Bloem on Fisher's Exact Test to Detect Animacy
    4. Igor Tytyk on Minimum Sensitivity and Collostructions
  6. Stefan Evert on statistical association strength and multi-word expressions.
  7. Repeated Measures vs. Mixed Models
    1. Connie Lahmann on 'Higher Language Cognition' and Grammaticality Verification
    2. Laura Bos on Repeated Measures ANOVA & Permutation Statistics
    3. Ruggero Montalto on Repeated Measures vs. Mixed Models
    4. Oscar Strik on the Akaike Information Criterion
  8. Dimension Reduction
    1. Martin Boros on Cluster Analysis and Silhouette Width
    2. Ke Tran on Principal Component Analysis (and Face Recognition!)
    3. Lubomir Zlatkov on Multi-Dimensional Scaling
    4. Kristel Uiboaed on Correspondence Analysis
  9. Mona Timmermeister & Caitlin Mignella on Validating a Pronunciation Difference Measure
  10. Jurriën Schuurman on Min F in Psycholinguistics
  11. Jet Vonk on Cochran's Q

Course Materials 2010

  1. Eliza Magaretha on regression used to evaluate the quality of inducing pronunciation distances from empirical data.
  2. Nynke van der Vliet on Cohen's κ used to measure the agreement between annotators of hierarchically structured material.
  3. Edgar Weiffenbach on Log Linear Models of Contingency used to analyse corpus frequencies.
  4. Nick Ruiz on Logistic Regression used to analyse corpus frequencies.
  5. Rahmad Mahendra on Cross Entropy used to measure model quality in computational analyses.
  6. Nadine Glas on (Log) Odds Ratios used to assess the statistical independence of categorical variables (with a comparison to χ²).
  7. Seid Tvica on Ordinal Regression used to measure comprehensibility of foreign speech.

Course Materials 2009

  1. J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping in Scandinavian "semi-communication".
  2. Thomas Zastrow on Entropy in Dialectometry and P. Nabende on Cross Entropy and Model Comparison
  3. Xuchen Yao on Bayesian vs. Frequentist Approaches to Statistics
  4. Çagri Çöltekin on Hierarchical Bayesian Networks as Learning Models. For background reading see Wagenmakers, E.-J., Lee, M. D., Lodewyckx, T., & Iverson, G. (2008). Bayesian versus frequentist inference. In H. Hoijtink, I. Klugkist, and P. A. Boelen (Eds.), Bayesian Evaluation of Informative Hypotheses, pp. 181-207. Springer: New York.
  5. Ma Jianqiang on Permutation Tests and Monte Carlo Sampling
  6. Jelena Prokic on Clustering and the Bootstrap
  7. Arjen Versloot on Using Late Medieval Sources for Linguistic Reconstructions (and regression)
  8. Gulsen Yilmaz on Using Multiple Regression to Understand Language Attrition
  9. Harwintha Anjarningsih on Repeated Measures ANOVA applied to ERP data.
  10. Natalia Ergorova on Repeated Measures ANOVA applied to ERP data, Example II.
  11. Anja Schüppert on (Binary) Logistic Regression applied to foreign comprehension data.
  12. Ankelien Schippers on (Multinomial) Logistic Regression applied to historical syntax.
  13. Karin Beijering on Loglinear Analysis of Contingency Tables applied to historical syntax.
  14. Ildikó Berzlanovich on Intercoder Agreement in Discourse Analysis (Cohen's κ)
  15. Myrte Faber on Annotating Turn Competition in Multi-Party Conversations (Cohen's κ)
  16. Martijn Wieling on Bipartite Spectral Graph Clustering (applied to dialectal variation)
  17. Proscovia Olango on Naive Bayes (applied to disambiguation)

Course Materials 2008

  1. J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping in Scandinavian "semi-communication".
  2. E. Rossi on Normal Distributions and Sampling and on hypothesis testing and t-tests
  3. V. Koukoulioti on Nonparametric Fallback Tests
  4. E. Rossi on Single ANOVA
  5. Th. Mehotcheva on Kruskal-Wallis
  6. H. Loerts on Multivariate ANOVA and Repeated Measures
  7. H. Ahmed on χ² and Fisher's Exact Test
  8. A. Lobanova on (Log) Odds Ratios and Word Order Studies
  9. J. Nerbonne on regression and multiple regression
  10. N. Haque on Principal Component Analysis
  11. S. van Ommen on Clustering

Course Materials 2007

  1. J. Nerbonne on regression and multiple regression
  2. B. Szmrecsányi, "Language users as creatures of habit: a corpus-linguistic analysis of persistence in spoken English", Corpus Linguistics and Linguistic Theory 1(1): 113-150.
  3. L. Stowe on Analysis of Variance, incl. Multiple Analysis of Variance.
  4. A. Banga and Tam Ho on Repeated Measures (ANOVA)
  5. S. Berends on Assumptions of ANOVA
  6. T. Caspi on Windowing, Correlations, and Dynamic Systems Theory
  7. Th. Leinonen on Regression in Phonetics and Computational Modeling
  8. J. Nerbonne on Multiple Regression Models.
  9. V. Baaijen on Applying Nonparametric Statistics to Analyse Writing (Comparing Think-Aloud Protocols and Keystroke Logging)
  10. M. Knippers and R. Montalto on Dealing with Nonnormal Distributions in a Repeated Measures Design
  11. M. Spruit on Search for Associations among Variables
  12. G. Korfiatis on Principal Components Analysis
  13. T. Van de Cruys on Dimensionality Reduction for Similarity Detection (Singular Value Decomposition, Non-Negative Matrix Factorization)

Course Materials 2006

  1. Nerbonne on Factor Analysis
  2. Wiersma on Permutation Tests
  3. Zinger on n-gram models
  4. Ruffle on syntactic differences in Old English
  5. Van de Cruys on Latent Semantic Analysis
  6. Vasishth on Mixed Effects Models
  7. Moberg on conditional entropy used to model comprehensibility.
  8. Mur on binomial models, esp. the paired sign test.
  9. Heeringa on bootstrapping.
  10. Ruffle on Log Odds
  11. Xiaoyan Xu on Multivariate nature of Language Attrition
  12. Kwant on "Delphi" techniques for identifying variables

Course Materials 2005

  1. Nerbonne on χ²
  2. Villada on Association Statistics for Recognizing Multi-Word Units
  3. Featherston on Magnitude Estimation. (Various papers, of which Featherston's "Decathlon Model" is perhaps the best starting point.)
  4. Donkers on ANOVA, repeated measures.
  5. Ruffle and Trofimova on Fisher's Exact Test.
  6. Bouma on Corpora and Counting.
  7. Van Noord on Search in Automatically Analysed Corpora.
  8. Smits and Rossi on Binomial Chances.
  9. Kremers on Log Odds Ratios.
  10. Nerbonne on Logistic Regression.
  11. Deunk on Analysing Qualitative Results via Multi-Level Regression.
  12. van der Plas on Clustering.
  13. van der Beek on Entropy as Measure of Syntactic Influence.
  14. Fahmi on Identifying Terminology.

Course Materials 2004

  1. Nerbonne on Logistic Regression
  2. Siedle on Hierarchical Cluster Analysis
  3. Hopp on Magnitude Estimation
  4. Lichte on Association Measures
  5. Kootstra on Exploratory Factor Analysis
  6. Rossi on Odds Ratios in Aphasiology

Student Projects

  1. Gerrit Jan Kootstra's 2004 project on exploratory Factor Analysis applied to foreign language learning.