RuG/L04

leven

Description

Using the Levenshtein algorithm, derive a difference matrix for a set of locations, based on sets of strings for each location.

Warning

Older versions of this program had a bug in the calculation of Cronbach Alpha. See bugtest for details.

Synopsis

leven -n number [-d filename] [-i filename | -s filename | -g] [-l filename] [-L] [-o filename] [-p filename] [-b percentage] [-B] [-c [-x]] [-e] [-h] [-r] [-R float] [-f number] [-F] [-m] [-N number] [-q | -Q] [-t] [-S filename] [-a float] [-v float] [-z float] datafile(s)

leven -P -n number [-C number] [-i filename | -s filename ] [-o filename] [-B] [-e] [-h] [-r] [-R float] [-N number] [-q | -Q] [-S filename] [-a float] [-v float] [-z float] datafile

Options

-a float
weight learning: parameter alpha
-b percentage
do bootstrapping with given percentage (usually: 100)
-B
binary comparison instead of Levenshtein
-c
Cronbach Alpha
-C
select column (this option can be used more than once)
-d filename
output file for standard deviation
-e
equal weight for indel and subst
-f int
minimum number of occurrences required for each variant
-F
skip files with fewer than two variants, depends on option -f
-g
character-based G.I.W.
-h
Hamming distance instead of Levenshtein
-i filename
file with indel values
-l filename
file with location labels
-L
skip locations that are not listed in the file with location labels
-m
calculate median difference instead of mean difference
-n int
number of locations
-N int
select another normalisation function
-o filename
output file
-p filename
file defining a linkage partition
-P
do pairwise calculations
-r
numeric distance instead of Levenshtein
-R float
with -r: Minkowski value (default: 2)
-s filename
file with substitution and indel values
-q
quiet
-Q
a bit quiet (no listing of input files)
-t
test all input files
-S filename
weight learning: output file with updated substitution and indel values
-v float
weight learning: with -S, increment for all alignments
-x
also give Cronbach Alpha using the old (incorrect) formula
-z float
weight learning: with -S, exponent in PMI formula

Testing

Before doing any time-consuming calculations, it is recommended to run the program in test mode, using option -t, to check if all input files are syntactically correct.
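
For example, to check the syntax of a set of datafiles (the number of locations and the file names here are hypothetical):

    leven -t -n 237 *.fon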

Cancelling

The program will abort if a file _CANCEL_.L04 exists in the current directory, or if it is created while the program is running. This is useful for stopping long-running calculations from a GUI, such as pyL04.
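
For example, from another terminal, in the program's working directory:

    touch _CANCEL_.L04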

Number of locations

The total number of locations used in the datafiles must be specified with the option -n. This number must be identical to the highest index number in the location file, if used.

Weights for indels and substitutions

By default, all indels have the weight 1, and all substitutions have weight 2. There are two ways to alter this behaviour:

Specifying weights for indels only

Use the option -i to specify a file that lists the indel weights. Substitution weights are calculated by adding the indel weights for each pair of character codes.

The indel file should be written as follows:

    PRINT "F:"   # That is, a string starting with F: without quotes
    NEWLINE      # This line should only be present if the file
                 #     is to be used with leven-r
    PRINT max
    NEWLINE
    FOR i = 1 TO max
        PRINT w[i]
        NEWLINE
max is the highest character code number used in the dialect data.

w[i] is the weight for inserting or deleting a character with code i.
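
For example, a minimal indel file for data whose highest character code is 3, with integer weights for the plain leven version (the values here are invented for illustration):

    3
    1
    1
    2

With this file, substituting a character with code 1 for a character with code 3 would get the weight 1 + 2 = 3.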

Note: There are three alternative versions of the program: leven, leven-r, and leven-s. For leven, all weights should be integers in the range 0 through 65535. For leven-s (compiled with LEVEN_SMALL defined), all weights should be integers in the range 0 through 255. For leven-r (compiled with LEVEN_REAL defined), all weights should be reals, in the range from 0 to MAX_FLT (a platform-dependent limit).

Specifying weights for indels and substitutions

Use the option -s to specify a file that lists both indel weights and substitution weights.

The substitution file should be written as follows:

    PRINT "F:"   # That is, a string starting with F: without quotes
    NEWLINE      # This line should only be present if the file
                 #     is to be used with leven-r
    PRINT max
    NEWLINE
    FOR i = 1 TO max
         FOR j = 0 TO i - 1
             PRINT w[i,j]
             NEWLINE
max is the highest character code number used in the dialect data.

w[i,j], for j > 0, is the weight for substituting a character with code i for a character with code j, or vice versa.

w[i,0] is the weight for inserting or deleting a character with code i.
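
For example, a minimal substitution file for max = 2, for the plain leven version (the values here are invented for illustration):

    2
    1
    1
    2

Here w[1,0] = 1 and w[2,0] = 1 are the indel weights for character codes 1 and 2, and w[2,1] = 2 is the weight for substituting one for the other.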

Note: There are three alternative versions of the program: leven, leven-r, and leven-s. For leven, all weights should be integers in the range 0 through 65535. For leven-s (compiled with LEVEN_SMALL defined), all weights should be integers in the range 0 through 255. For leven-r (compiled with LEVEN_REAL defined), all weights should be reals, in the range from 0 to MAX_FLT (a platform-dependent limit).

Specifying labels for locations

Use the option -l to specify a file with labels for locations. If this option is absent, numbers are used in the output file instead. You can't use labels in the data files if you omit this option.

Datafiles

Each additional datafile should hold dialect data, one file for each dialect item. All words in a single file are compared for each pair of locations.

A weight can be assigned to a datafile (default: 1), written as a number preceded by an asterisk. This can be used to alter the relative importance of one datafile compared to another. Only the last weight specified in a particular file will be used, and it applies to the whole datafile, even to those strings preceding the weight specification.

Each location is introduced with a location number (counting from 1 upward) or a colon followed by a location label.

Words can be specified in two ways: with a leading minus as an ASCII string, or with a leading plus as a sequence of decimal numbers.

Example:

    * 1.2
    5
    - fiets
    - fIts
    + 102 116 115
    : GR7
    - fietse
    - fiets
    - fiets
    5
    - flits
As you can see, each location can have more than one word. In this example there are two locations: location number 5 (note that this location is mentioned twice), which has four words, of which the third is specified as a sequence of numbers; and the location with label GR7, which has three words.

The order of the locations is immaterial. Not every location needs to be present in each data file. In the end, the sum of all differences between two locations is divided by the number of times that averaged difference was calculated, taking any weights specified in the datafiles into account.

The rest of this section is a bit technical. You can skip to the next section, if you want.

For this file, the actual difference between location number 5 and location GR7 is calculated by choosing a (near-)optimal one-by-one alignment of the strings of these locations. Optimal, meaning: the sum of the pair-wise differences is minimised. The actual difference is the average of these pair-wise differences. If two locations have different numbers of strings, all strings of one or both locations are duplicated once or more, until the locations have the same number of strings. In this case, the strings for location number 5 are duplicated three times, and those of location GR7 four times, resulting in twelve strings at each location.

The rationale for this procedure is as follows. Look at this example, in which location P has the variants B, A, A, and location Q has the variants A, B, B, A, A, A. What would happen if you calculated the average difference of all possible pairs? You would get:

        Q:  A  B  B  A  A  A
    P:
    B       1  0  0  1  1  1
    A       0  1  1  0  0  0
    A       0  1  1  0  0  0

This gives 8 / 18 = .444 between P and Q for this word. But if you look at how the variant strings of this word are distributed among P and Q, you can see that they are in identical proportions. So the difference should be zero. The (near-)optimal one-by-one alignment gives you this value:

    P:  B A A B A A    (variants repeated twice)
        | | | | | |
    Q:  B A A B A A    (variants sorted to line up with first row)

    differences: 0 0 0 0 0 0

This gives 0 / 6 = 0.

Note that determining the optimal one-by-one alignment is a variant of the Travelling Salesman Problem, which is computationally expensive. A heuristic is used that gives you a reasonable approximation.
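
To illustrate, here is a minimal sketch of one possible greedy heuristic; it is not necessarily the heuristic that leven actually uses:

    /* A sketch of a greedy one-by-one alignment, not necessarily the
       heuristic leven actually uses. After duplication, both locations
       have n strings; dist[i][j] is the difference between string i of
       the one location and string j of the other. */
    double greedy_alignment (int n, double dist[n][n])
    {
        int i, j, k, bi, bj;
        int used1[n], used2[n];
        double best, total = 0.0;

        for (i = 0; i < n; i++)
            used1[i] = used2[i] = 0;

        /* in each round, pair up the cheapest remaining pair of strings */
        for (k = 0; k < n; k++) {
            best = -1.0;
            bi = bj = 0;
            for (i = 0; i < n; i++) {
                if (used1[i])
                    continue;
                for (j = 0; j < n; j++)
                    if (! used2[j] && (best < 0.0 || dist[i][j] < best)) {
                        best = dist[i][j];
                        bi = i;
                        bj = j;
                    }
            }
            used1[bi] = used2[bj] = 1;
            total += best;
        }

        return total / n;    /* the average of the pair-wise differences */
    }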

Unicode

A datafile can contain the following line:

    %utf8

The remainder of the file will be interpreted as UTF-8 (data only, labels are not affected). See Unicode.

Defining a linkage partition

If you are interested in only a small subset of the difference matrix, you can use a linkage partition file. Such a file looks like this:
    : 1 2
    : 2 3
Lines not starting with a colon are ignored.

The numbers are location index numbers, numbered from 1 to max. You cannot use labels in this file. In this example, the differences between locations 1 and 2, and between locations 2 and 3, are calculated, but not between locations 1 and 3.

Output files

By default, results are written to the output file in the form of a difference matrix. If no labels are available, numbers from 1 to max are used as labels, where max is the number of locations.

The top of the output file contains comments that summarise the options used, and list some properties about the input data. Here is a short fragment:

    #   Items  Locations  File
    #                     Variants: frequency(repeats)
    #     345      237    clearing_up.fon
    #                     268: 1(229) 2(24) 3(7) 4(2) 5(3) 6 7 11
Explanation: the input file was clearing_up.fon. There were 345 items used, from 237 locations. There were 268 variant items in the input file: 229 variants occurred only once, 24 variants occurred twice, seven occurred three times, two occurred four times, three occurred five times, and a single variant each occurred six, seven, and eleven times.

By default, the difference between two locations is the (weighted) mean of all differences calculated between those two locations. If the option -m is used, the difference between two locations is the median of the calculated differences. Weights in the data files will be ignored. Note: this requires a lot of memory.

If a linkage partition is used, differences are written to file as a value followed by two index numbers, followed by -- if available -- two location labels. Example:

    .487308    2    1    Calcutta      Bombay
    .835438    3    2    "New Delhi"   Calcutta
Note: the order used in the linkage partition file is lost.

If the option -d is used then, for each pair of locations, the standard deviation of the calculated differences between those two locations is written, in the form of a difference matrix, to the file specified by the option -d. Weights in the data files will be ignored.

Normalisation

At the top of the source code leven.c is a function normalise(), which is currently defined as follows:

    double normalise (double f, int len1, int len2, int minlen, int maxlen)
    {
        double
            result = 0;

        switch (normalise_function) {
            case 1:
                /* divide by the sum of the two string lengths (the default) */
                result = f / (len1 + len2);
                break;
            case 2:
                /* divide by the length of the longer string */
                result = f / ((len1 > len2) ? len1 : len2);
                break;
            case 3:
                /* square root of the default normalisation */
                result = sqrt (f / (len1 + len2));
                break;
            case 4:
                /* divide by the minimum length of the least-cost alignment(s) */
                result = f / minlen;
                break;
            case 5:
                /* divide by the maximum length of the least-cost alignment(s) */
                result = f / maxlen;
                break;
            case 6:
                /* no normalisation */
                result = f;
                break;
            default:
                errit ("Normalisation function number %i undefined", normalise_function);
        }

        return result;
    }
This function is called as soon as the Levenshtein algorithm has been applied to two strings. The arguments to normalise() are the value calculated by the Levenshtein algorithm, the lengths of both strings, and the minimum and maximum length of the least-cost alignment(s). (There may be several such alignments, of unequal length.)

The purpose of the function normalise() is to normalise the Levenshtein distance. This is especially needed in cases where not all pairs of strings are available for all pairs of locations, and the longer strings that are available would skew the overall results, if normalisation were omitted.

By default (case 1), the normalisation function normalises by dividing by the sum of the lengths of the two strings. This makes sense, because in cases where the two strings don't have any character in common, the Levenshtein distance is proportional to the sum of the lengths of the two strings.

You might want to experiment with alternative normalisation functions. You can select another predefined normalisation with the command line option -N (case 2 ... case 6). Check the source code to see what types of normalisation are available. You can easily add your own types of normalisation (case 7, etc.), and select these with the -N option. If you do add to the definition of the function, make sure you save and compile the source under a different file name.
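
For instance, a hypothetical case 7 that divides by the length of the shorter string (not part of the distributed program) could be added to the switch statement like this, and selected with -N 7:

            case 7:
                /* hypothetical: divide by the length of the shorter string */
                result = f / ((len1 < len2) ? len1 : len2);
                break;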

Cronbach Alpha

If you use the option -c, Cronbach Alpha will be calculated. The result is printed on standard output and included as a comment at the top of the output file.

Using this option may require a lot of memory, depending on the number and size of the datafiles. If the program runs out of memory, it will try to continue without Cronbach Alpha.

It may be a good idea to limit the amount of memory available to the program (for instance, with ulimit -v), because if the program starts using swap space on disk, it slows down to a crawl.
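
For example, to limit the current shell to roughly two gigabytes of virtual memory (ulimit -v takes a value in kilobytes; the number here is only an illustration):

    ulimit -v 2000000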

In versions of the leven program older than 1.80, and versions of the giw program older than 0.26, a different formula was used for calculating Cronbach Alpha. That formula was incorrect. If you use the option -x in addition to the option -c, you get both the old (incorrect) and the new (correct) value, so you can see how much 'damage' the old formula did.

Equal weight for indel and substitution

Usually, with a plain Levenshtein comparison, the weight of an indel is half the weight of a substitution. With the option -e the weight of an indel is set equal to the weight of a substitution.
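
As an illustration of this weighting scheme, here is a minimal dynamic-programming sketch of the Levenshtein distance with the default weights (indel 1, substitution 2); it is not the actual leven source, which also supports the weight files described above:

    #include <string.h>

    #define INDEL 1    /* default indel weight */
    #define SUBST 2    /* default substitution weight; with -e it would equal INDEL */

    /* A sketch only, not the actual leven code */
    int levenshtein (const char *s1, const char *s2)
    {
        int i, j, n1 = (int) strlen (s1), n2 = (int) strlen (s2);
        int d[n1 + 1][n2 + 1];    /* C99 variable-length array */

        /* distance to/from the empty string: all indels */
        for (i = 0; i <= n1; i++)
            d[i][0] = i * INDEL;
        for (j = 0; j <= n2; j++)
            d[0][j] = j * INDEL;

        for (i = 1; i <= n1; i++)
            for (j = 1; j <= n2; j++) {
                int del = d[i - 1][j] + INDEL;
                int ins = d[i][j - 1] + INDEL;
                int sub = d[i - 1][j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : SUBST);
                d[i][j] = del < ins ? del : ins;
                if (sub < d[i][j])
                    d[i][j] = sub;
            }

        return d[n1][n2];
    }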

Character-based G.I.W.

G.I.W. or Gewichteter Identitätswert is a measure to express distance based on frequency of elements. If an element is rare, the co-occurrence of that element in two places suggests a strong correlation between those two places.

A word-based distance of G.I.W. can be calculated with the giw program. That method is only suitable when there are relatively few word variants in a dataset.

Character-based G.I.W. (cGIW) is a modification of the plain Levenshtein distance. The costs of a substitution are listed below. (The cost of an indel is 0.5 in both cases, or 1 if the option -e was used.)

                plain   cGIW
    equal       0       f / F
    different   1       1

f
The number of times the character in question occurs in the datafile being processed
F
The total count of all characters in the datafile being processed
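
In code, the cGIW substitution cost could be sketched as follows (the names freq and total are invented; they correspond to f and F above):

    /* Sketch of the cGIW substitution cost; freq[c] is how often character
       code c occurs in the datafile being processed, total is the total
       count of all characters in that datafile (f and F above). Indels
       are handled separately, as described above. */
    double cgiw_subst (int c1, int c2, const long *freq, long total)
    {
        if (c1 == c2)
            return (double) freq[c1] / (double) total;    /* equal: f / F */
        return 1.0;                                       /* different: 1 */
    }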

Binary comparison

If the option -B is used, a binary comparison will be made instead of a Levenshtein distance calculation. If two strings are not the same, the difference is set to 1 (or to the weight of the datafile, if specified).

Hamming distance

If the option -h is used, the Hamming distance will be calculated, instead of the Levenshtein distance. Strings in a single data file must be of the same length.
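
A minimal sketch of this calculation:

    /* Sketch: Hamming distance between two strings of equal length len,
       i.e. the number of positions at which they differ */
    int hamming (const char *s1, const char *s2, int len)
    {
        int i, d = 0;

        for (i = 0; i < len; i++)
            if (s1[i] != s2[i])
                d++;

        return d;
    }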

Numeric distance

If the option -r is used, the numeric distance will be calculated instead of the Levenshtein distance. For this measurement, a string is interpreted as a list of integer values, using the ASCII code of each character. Strings in a single data file must be of the same length.

The numeric distance between strings s1 and s2, with Minkowski value R, is:

    ( Σ_i |s1[i] - s2[i]|^R )^(1/R)

The option -R determines the Minkowski value. The default is 2 (Euclidean distance).
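
A minimal sketch of this calculation (each character is converted to its ASCII code, as described above):

    #include <math.h>

    /* Sketch: numeric distance between two strings of equal length len,
       with Minkowski value R (R = 2 gives the Euclidean distance) */
    double numeric_distance (const char *s1, const char *s2, int len, double R)
    {
        int i;
        double sum = 0.0;

        for (i = 0; i < len; i++)
            sum += pow (fabs ((double) s1[i] - (double) s2[i]), R);

        return pow (sum, 1.0 / R);
    }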

Minimum number of occurrences required for each variant

The option -f can be used to set the minimum number of times each variant form must be present in the data before it is actually used. If you set this value to 2, each variant that occurs only once will be ignored. The rationale behind this is that infrequent variants may represent just noise.

This option is mainly useful for comparison of lexical variants. With phonetic data, variants that occur only once are very common, and usually don't represent noise.

Skip files with fewer than two variants

If the option -F is used, all data files that contain only one variant are skipped. This affects not only processing speed, but the calculated differences as well. If there is only one variant form, it won't add to the sum of individual differences, but these data are still counted, and used in the calculation of the average of the differences. This effect is noticeable unless the datafile contains an equal number of strings for each and every location.

Note that this option is affected by the option -f. A datafile may have more than one variant, but only one left once infrequent variants have been removed with the option -f. If you use the option -F as well, such a file will be skipped completely.

Pairwise calculations

When the option -P is used, the program processes data in a different input format, and in a different manner.

Let's suppose this is all the input data you are working with, listed in a schematic way:

        LOC1   LOC2   LOC3
   Wa    a1     a2     a3
   Wb    b1     b2     b3
   Wc    c1            c3

You have data from three locations, LOC1, LOC2, and LOC3, and for three different words, Wa, Wb, Wc. Item a1 is the variant of word Wa in LOC1, etc. You can have zero or one variant for each location and word, but not more than one. In this example, the item c2 is missing.

The program processes this data to give you this:

        LOC1_LOC2    LOC1_LOC3    LOC2_LOC3
   Wa   L(a1,a2)     L(a1,a3)     L(a2,a3)
   Wb   L(b1,b2)     L(b1,b3)     L(b2,b3)
   Wc   NA           L(c1,c3)     NA
Where L(a1,a2) = Levenshtein difference between strings a1 and a2, etc.
NA means not available.

The input file should actually look like this:

    : Wa
    [1]- a1
    [2]- a2
    [3]- a3

    : Wb
    [1]- b1
    [2]- b2
    [3]- b3

    : Wc
    [1]- c1
    [3]- c3
Empty lines are optional, as well as comment lines.

In this case, the labels are not location labels but word labels. They are not actually used, other than as a sign that a new row of data begins. Instead of labels you can also use numbers (without a leading colon). These labels or numbers are copied to the output file.

There are no place labels in the file, nor are they read from a separate label file.

The numbers in square brackets indicate the column numbers. As you can see above, you can skip numbers: there is no number 2 for word Wc.

Instead of starting the actual word variant with a minus, you can use a plus, followed by a list of numbers, just like in the normal datafiles.

You still need to use the option -n to set the number of locations. In the above example it should be 3, because there are three columns.
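
A matching invocation might look like this (file names invented for illustration):

    leven -P -n 3 -o results.txt words.dat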

This is what the output file could look like:

    : Wa
    1 2 0.23
    1 3 0.84
    2 3 0.49

    : Wb
    1 2 0.04
    1 3 0.73
    2 3 0.58

    : Wc
    1 2 NA
    1 3 0.37
    2 3 NA
The labels indicate the start of a new row. The pair of numbers in front of every value indicates the column numbers that were compared.

You can have as many columns (locations) as you want. Usually, each column is compared with every other column. If you have 16 input columns, you would get 120 output columns. You can limit the output (and the computation time) if you are interested in only certain column combinations. With the option -C you can select a column (by number) that should be tested against all other columns. You can use the option -C more than once to have more combinations tested. Unselected columns are not compared with other unselected columns (unless you select no columns at all).

For instance, if you select column 2, the output above would look like this instead:

    : Wa
    1 2 0.23
    2 3 0.49

    : Wb
    1 2 0.04
    2 3 0.58

    : Wc
    1 2 NA
    2 3 NA

The combination 1 3 is missing.

Weight learning by means of Pointwise Mutual Information

Use of the option -S (upper case) initiates the learning of substitution values. Initial substitution values must be provided in a file using the option -s (lower case). These values are optimised using Pointwise Mutual Information. When the optimisation is completed, the updated substitution values are saved to file, and the program continues its normal operation using these optimised values.

Three more options modify the behaviour of the learning process:

-a float
The learning parameter alpha determines the speed at which values are updated in each iteration. This can be 1 (full update, fast, less accurate) or less (partial update, slower, more accurate).
Default: 0.3
-v float
A tiny increment for each alignment count, so there are no zero counts left.
Default: 0.001
-z float
Exponent Z in the PMI formula:
log2 ( p(x,y)^Z / (p(x) × p(y)) )
Default: 1
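
A hypothetical invocation that combines these options (the file names and the number of locations are invented; the -a, -v, and -z values shown are the defaults):

    leven -n 237 -s initial.sub -S learned.sub -a 0.3 -v 0.001 -z 1 -o output.dif data.fon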