René van der Ark (2008)
Comparing Languages and Dialects in Central Asia
Master's thesis, Rijksuniversiteit Groningen.
[ Paper (PDF, 1140 kb) ]


This thesis is the result of a collaboration between the Paris anthropological research institute Musée de l'Homme and the University of Groningen. The purpose of the larger research project (led by Evelyne Heyer of the Musée de l'Homme) is to analyse the common traits of genetic and linguistic markers in Uzbekistan, Tajikistan and Kyrgyzstan. The task of the University of Groningen was to quantify the differences in the linguistic data, collected and preclassified by Philippe Mennecier, by means of the Levenshtein algorithm, a procedure that measures the distance between two strings of tokens. What makes the data unique is that they span multiple languages and even two larger language groups.
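As an illustration of what "distance between two strings of tokens" means, the following is a minimal Python sketch of the standard Levenshtein algorithm with unit costs. It is not the implementation used in the thesis; the function names and the length normalisation in `normalized_levenshtein` are assumptions made for this example only.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions needed
    to turn `source` into `target`. Both arguments are sequences of
    tokens (a plain string works, as does a list of phonetic segments)."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j tokens of `target`.
    previous = list(range(len(target) + 1))
    for i, s_tok in enumerate(source, start=1):
        current = [i]
        for j, t_tok in enumerate(target, start=1):
            cost = 0 if s_tok == t_tok else 1
            current.append(min(
                previous[j] + 1,         # delete s_tok
                current[j - 1] + 1,      # insert t_tok
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(source, target):
    """Assumed normalisation: divide by the length of the longer
    sequence so that words of different lengths are comparable."""
    longest = max(len(source), len(target))
    return levenshtein(source, target) / longest if longest else 0.0
```

For example, `levenshtein("kitten", "sitting")` returns 3: two substitutions and one insertion. Passing lists instead of strings lets each token be a multi-character phonetic segment.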

This thesis has two main research questions. The first is whether the methods developed by Wilbert Heeringa, which combine Levenshtein distance with aggregate statistics, can be applied to the linguistic data provided by Mennecier; the second is whether Levenshtein distance is suitable for the automatic detection of loanwords in linguistic data. An additional goal, pursued under both questions, is to test whether reducing the wordlist used in the analysis has a significant impact on the results.

The first thesis can be formulated as follows. The Levenshtein measure has been shown to be effective in separating dialects within a single language; we aim to show that it works equally well in separating two different language groups. We assume this to be a straightforward task and will use the preclassifications made by Mennecier to support this claim.

Secondly, we aim to discover whether Levenshtein distance can be used to detect loanwords automatically in phonetic transcriptions. To this end, Precision/Recall analysis is applied to the data in a novel way, using the preclassification made by Mennecier. Next, the distribution of edit distances for each pair of respondents is analysed as a mixture of distributions by means of the EM algorithm, which is also a novel approach to the analysis of phonetic transcriptions.
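To make the Precision/Recall idea concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a word pair is flagged as a loanword when its distance falls at or below a threshold, and that the preclassification supplies a true/false loanword label per pair. The thesis may define the decision rule differently; the function name and the toy values are invented for this example.

```python
def precision_recall(distances, is_loan, threshold):
    """Precision and recall of the rule 'distance <= threshold means
    loanword', scored against gold labels (assumed decision rule)."""
    predicted = [d <= threshold for d in distances]
    pairs = list(zip(predicted, is_loan))
    true_pos = sum(1 for p, g in pairs if p and g)
    false_pos = sum(1 for p, g in pairs if p and not g)
    false_neg = sum(1 for p, g in pairs if not p and g)
    # Guard against empty denominators when nothing is flagged or
    # when the gold standard contains no loanwords.
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Toy values, invented for illustration only.
toy_distances = [0.1, 0.2, 0.6, 0.9]
toy_labels = [True, False, True, False]
print(precision_recall(toy_distances, toy_labels, threshold=0.3))
# prints (0.5, 0.5)
```

Sweeping the threshold over a range of values and recording the resulting pairs gives a precision/recall curve, which shows how well any single cut-off on the distance separates loanwords from other words.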

The tertiary goal, addressed under both main theses, is to test whether reducing the full wordlist used in this research to only the words in the Swadesh-100 and Swadesh-200 wordlists significantly affects the results of the analysis. This is clarified further in the body of the thesis.