This thesis has two main research questions. The first applies the methods developed by Wilbert Heeringa, which combine Levenshtein distance with aggregate statistics, to the linguistic data provided by Mennecier. The second tests whether Levenshtein distance is suitable for the automatic detection of loanwords in linguistic data. A secondary goal, pursued within both main questions, is to test whether reducing the wordlist used in the analysis has a significant impact on the results.
The first question can be formulated as follows. The Levenshtein measure has been shown to be effective in separating dialects within a single language; we aim to show that it works equally well in separating two different language groups. We expect this to be a straightforward task and use the preclassifications made by Mennecier to support this claim.
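As an illustration of the kind of measure involved, the following is a minimal sketch of unit-cost Levenshtein distance and a simple aggregate over a wordlist. It is not Heeringa's exact procedure: the unit operation costs, the normalisation by the length of the longer transcription, the function names, and the toy transcriptions in the usage note are all assumptions made for this example.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between two sequences of segments."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                          # deleting the first i segments of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1       # substitution cost (assumed 0 or 1)
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def aggregate_distance(words_a, words_b):
    """Mean length-normalised distance over the concepts both speakers share.

    words_a and words_b map a concept label to a transcription.
    Normalising by the longer transcription is an assumption of this sketch.
    """
    shared = [c for c in words_a if c in words_b]
    if not shared:
        raise ValueError("the two wordlists share no concepts")
    total = 0.0
    for c in shared:
        a, b = words_a[c], words_b[c]
        total += levenshtein(a, b) / max(len(a), len(b), 1)
    return total / len(shared)
```

For example, with made-up transcriptions (not taken from Mennecier's data), `aggregate_distance({"water": "su", "hand": "qol"}, {"water": "su", "hand": "kol"})` averages a distance of 0 for the first concept and 1/3 for the second. Aggregating over many concepts in this way yields one distance per pair of respondents, which is the quantity a clustering step would then operate on.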
Secondly, we aim to discover whether Levenshtein distance can be used to detect loanwords automatically in phonetic transcriptions. To this end, Precision/Recall analysis is applied to the data in a novel way, using the preclassification made by Mennecier as the reference. Next, the distribution of edit distances for each pair of respondents is analysed as a mixture of distributions by means of the EM algorithm, which is also a novel approach to the analysis of phonetic transcriptions.
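The two analyses named above can be sketched as follows. This is an illustrative sketch only, under several assumptions that the text does not confirm: a word pair is flagged as a loanword when its normalised distance falls at or below a threshold, the mixture has two Gaussian components, and the function names and parameters are hypothetical.

```python
import math


def precision_recall(distances, is_loan, threshold):
    """Precision and recall of the rule 'distance <= threshold means loanword'.

    is_loan holds the reference labels (here standing in for a preclassification).
    The direction of the rule is an assumption of this sketch.
    """
    predicted = [d <= threshold for d in distances]
    tp = sum(p and g for p, g in zip(predicted, is_loan))
    fp = sum(p and not g for p, g in zip(predicted, is_loan))
    fn = sum((not p) and g for p, g in zip(predicted, is_loan))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def em_two_gaussians(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM (assumed model family)."""
    n = len(xs)
    mean = sum(xs) / n
    var0 = sum((x - mean) ** 2 for x in xs) / n or 1e-6
    mu = [min(xs), max(xs)]        # deterministic initialisation at the extremes
    var = [var0, var0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each observation
        resp = []
        for x in xs:
            p = [w[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = sum(p) or 1e-300   # guard against underflow to zero
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            w[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return w, mu, var
```

Sweeping `threshold` over a range of values and recording the resulting precision and recall gives a Precision/Recall curve. Under the assumptions above, the fitted mixture offers a complementary view: the component with the lower mean would correspond to unusually similar word pairs, which are candidate loanwords.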
The secondary goal, addressed within both main questions, is to test whether reducing the full wordlist used in this research to only the words in the Swadesh-100 and Swadesh-200 wordlists significantly affects the results of the analysis. This is clarified further in the body of the thesis.