You can draw really nice maps with RuG/L04, but the question is: how
well do those maps reflect the actual situation? The software can only
display what is present in the data. In the end, you will have to judge
the results against other sources and other research.
You do have some choices to make in RuG/L04. Which comparison method
should be applied? How should you use the data? Which clustering method
is the most appropriate? The last question is discussed elsewhere.
For the other questions, RuG/L04 offers one tool: local incoherence.
Local incoherence means something like the lack of coherence on a local
scale. It is a formula that expresses the quality of a dialect
measurement as a single numeric value. It is based on the idea that the
dialect in one location differs less from the dialect of a location in
the near vicinity than from the dialect of a location that is still in
the vicinity, but a bit further away. Differences between locations that
are geographically far apart are discarded, because at that scale chance
takes over.
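The idea can be made concrete with a small sketch. The Python function below is an illustration only and is not claimed to be the formula used by linc (the manual gives the exact definition). The function name, the neighbourhood size k, and the exponential rank weighting are assumptions made for this example. For each location it compares the geographic distances to the k linguistically nearest locations with the geographic distances to the k geographically nearest ones; the more the two differ, the higher the score.

```python
import math

def local_incoherence(ling, geo, k=8, decay=0.5):
    """Illustrative local-incoherence score; lower means more coherent.

    ling[i][j]: dialect difference between locations i and j.
    geo[i][j]:  geographic distance between locations i and j.
    k and decay are assumed values, not taken from the linc manual.
    """
    n = len(ling)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Neighbours ordered by dialect difference, and by geography.
        by_ling = sorted(others, key=lambda j: ling[i][j])[:k]
        by_geo = sorted(others, key=lambda j: geo[i][j])[:k]
        # Nearer ranks weigh more; ranks beyond k are discarded.
        weights = [math.exp(-decay * (r + 1)) for r in range(len(by_geo))]
        d_ling = sum(w * geo[i][j] for w, j in zip(weights, by_ling))
        d_geo = sum(w * geo[i][j] for w, j in zip(weights, by_geo))
        total += (d_ling - d_geo) / d_geo
    return total / n

# Made-up example: four locations on a line.
positions = [0.0, 1.0, 2.0, 3.0]
geo = [[abs(a - b) for b in positions] for a in positions]
reversed_ling = [[3.0 - d if d else 0.0 for d in row] for row in geo]
print(local_incoherence(geo, geo, k=2))            # dialect follows geography
print(local_incoherence(reversed_ling, geo, k=2))  # dialect contradicts geography
```

In this sketch a measurement whose differences follow geography exactly scores zero, and any disagreement between linguistic and geographic neighbourhoods pushes the score up.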
You can calculate the local incoherence with the linc
program. How to use the program, and the exact definition of local
incoherence, are explained in the manual of the program.
You can only use local incoherence to compare multiple measurements for
one and the same area, because the result depends heavily on the
geography of the area and the exact geographic distribution of the
locations. And of course, one dialect area is not like another. For
instance, if you add locations to a previously analysed area, the local
incoherence can go either up or down, but this says nothing about the
relative reliability of the extended set of locations.
Finally, you should note that local incoherence is a simple method. Generally
speaking, of two measurements, the one with the lower local incoherence
will be the better measurement. But this doesn't have to be
true in each and every case.
If you ran all the examples of part 2 and part 3 of this tutorial, then
you now have four tables of dialect differences for the state of
Pennsylvania. You can calculate the local incoherence of these like this:
linc -L fon.dif PA.coo
linc -L lex-lev.dif PA.coo
linc -L lex-bin.dif PA.coo
linc -L lex-giw.dif PA.coo
You will get these results:
phonetic, Levenshtein: 0.728728
lexical, Levenshtein: 1.32183
lexical, binary: 1.31965
lexical, G.I.W.: 1.2249
Smaller values mean better measurements. As you can see here, of the lexical
methods, the Gewichteter Identitätswert gives the best result.
The phonetic measurement has a much better score than any of the lexical
measurements. For several reasons, it is quite likely that a phonetic
comparison is much more precise than a lexical one. But that doesn't
mean you should discard the lexical measurements. They may be less
accurate than phonetic measurements, but they can still bring to
light details that are not expressed as phonetic differences.
The local incoherence is a useful tool for fine-tuning a measurement, such as
determining what parameter settings to use to get the best result. Here
is an example.
Data contains noise: impurities. You may assume that words that are extremely
rare in the data set are largely noise. Suppose you
only use words that occur at least twice. Does the result of the
measurement improve? And if it does: how often should a word occur
before you include it in your measurements? Twice? Thrice? Ten times?
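What such a frequency filter does can be sketched in a few lines of Python. This is a minimal illustration, not how leven is implemented: the function name, the in-memory data layout, and the sample words are all invented for the example, and it assumes that frequency is counted over all answers in the data set.

```python
from collections import Counter

def drop_rare_words(answers, min_count=2):
    """Keep only words occurring at least min_count times in the whole data set.

    answers maps a location name to the list of words recorded there
    (a hypothetical in-memory layout, not the .lex file format).
    """
    counts = Counter(word for words in answers.values() for word in words)
    return {
        place: [w for w in words if counts[w] >= min_count]
        for place, words in answers.items()
    }

# Made-up data for illustration only.
answers = {
    "A": ["creek", "run"],
    "B": ["creek"],
    "C": ["crick", "creek"],
}
print(drop_rare_words(answers, min_count=2))
```

Raising min_count removes more potential noise, but it also removes more genuine rare variants, which is why the best limit has to be found by experiment.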
The leven program has an option to exclude
infrequent words. Let's do a measurement of lexical differences, using
the Levenshtein method, including only words that occur at least twice
in the data set (option: -f 2).
Afterwards, we determine the local incoherence:
leven -f 2 -n 67 -l PA.lbl -o lex-lev02.dif lex/*.lex
linc -L lex-lev02.dif PA.coo
Local incoherence decreased from 1.32183 to 1.23576, quite an improvement. Try
higher limits. What limit gives the best result? Make a cluster map of
the best result, and compare it to the original map. Are there visible
differences?
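Trying a series of limits by hand gets tedious. As a convenience, the Python sketch below only prints the command lines for a range of limits, reusing the options and the file naming pattern (lex-lev02.dif) from the example above. The helper name and the range of limits are arbitrary choices for this illustration.

```python
def sweep_commands(limits, n_locations=67, labels="PA.lbl", coords="PA.coo"):
    """Return the shell command lines for each frequency limit.

    Output files are named after the tutorial's pattern lex-levNN.dif.
    """
    lines = []
    for limit in limits:
        dif = f"lex-lev{limit:02d}.dif"
        lines.append(
            f"leven -f {limit} -n {n_locations} -l {labels} -o {dif} lex/*.lex"
        )
        lines.append(f"linc -L {dif} {coords}")
    return lines

# Limits 2 to 5 are an arbitrary starting range.
for line in sweep_commands(range(2, 6)):
    print(line)
```

The printed lines can be pasted into a shell; the limit with the lowest local incoherence is the one to map.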
Run some tests with removing infrequent words using binary measurements or
G.I.W. What is the optimal limit in these cases?
Try the -F option (uppercase F), and see how this
affects things. Does it always improve things, or never, or does it vary?
There are a few variants of the Levenshtein algorithm that are generally
applicable, including to phonetic differences. Try the effect of the options
listed below. Pay attention to differences in local incoherence and the
visible effects in the cluster map and MDS map.
- Use one of the alternative normalisation functions of the Levenshtein
  distance: option -N
- Character-based G.I.W.: option -g
- Setting the cost of an indel equal to the cost of a substitution:
  option -e
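To see why the relative cost of indels and substitutions matters, here is a plain Levenshtein implementation in Python with both costs as parameters. It is a generic textbook sketch, not the code of leven; the default cost values and the sample words are assumptions made for the illustration.

```python
def levenshtein(a, b, indel=1.0, subst=2.0):
    """Edit distance between sequences a and b with configurable costs.

    The default costs here are illustrative assumptions, not leven's defaults.
    """
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel,                            # delete x
                curr[j - 1] + indel,                        # insert y
                prev[j - 1] + (0.0 if x == y else subst),   # match or substitute
            ))
        prev = curr
    return prev[-1]

print(levenshtein("creek", "crick"))             # a substitution costs two indels
print(levenshtein("creek", "crick", subst=1.0))  # indel and substitution cost the same
```

When a substitution costs as much as a deletion plus an insertion, replacing a sound is never cheaper than dropping it and adding another; with equal costs, substitutions become the preferred edit, which changes the distances between word pairs.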