leven -P -n number [-C number] [-i filename | -s filename ] [-o filename] [-B] [-e] [-h] [-r] [-R float] [-N number] [-q | -Q] [-S filename] [-a float] [-v float] [-z float] datafile
Specifying weights for indels only
Use the option -i to specify a file that lists the indel weights. Substitution weights are calculated by adding the indel weights for each pair of character codes.
The indel file should be written as follows:
    PRINT "F:" NEWLINE   # That is, a string starting with F:, without quotes.
                         # This line should only be present if the file
                         # is to be used with leven-r
    PRINT max NEWLINE
    FOR i = 1 TO max
        PRINT w[i] NEWLINE

max is the highest character code number used in the dialect data.
w[i] is the weight for inserting or deleting a character with code i.
Note: There are three alternative versions of the program: leven, leven-r, and leven-s. For leven, all weights should be integers in the range 0 through 65535. For leven-s (compiled with LEVEN_SMALL defined), all weights should be integers in the range 0 through 255. For leven-r (compiled with LEVEN_REAL defined), all weights should be reals, in the range from 0 to MAX_FLT (a platform dependent limit).
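As an illustration of the file format above, an indel file for the plain integer version of leven could be generated like this. The function name and the uniform weight of 1 are assumptions for the sketch, not part of the toolset:

```c
#include <stdio.h>

/* Sketch: write an indel weight file for the plain integer leven.
   Character codes 1..max each get a uniform weight of 1 here;
   real data would use tuned weights.
   The "F:" first line is omitted because it is only required
   for files used with leven-r. */
int write_indel_file (const char *filename, int max)
{
    FILE *fp = fopen (filename, "w");
    if (!fp)
        return -1;
    fprintf (fp, "%i\n", max);          /* highest character code */
    for (int i = 1; i <= max; i++)
        fprintf (fp, "%i\n", 1);        /* w[i]: indel weight for code i */
    fclose (fp);
    return 0;
}
```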
Specifying weights for indels and substitutions
Use the option -s to specify a file that lists both indel weights and substitution weights.
The substitution file should be written as follows:
    PRINT "F:" NEWLINE   # That is, a string starting with F:, without quotes.
                         # This line should only be present if the file
                         # is to be used with leven-r
    PRINT max NEWLINE
    FOR i = 1 TO max
        FOR j = 0 TO i - 1
            PRINT w[i,j] NEWLINE

max is the highest character code number used in the dialect data.
w[i,j], for j > 0, is the weight for substituting a character with code i for a character with code j, or vice versa.
w[i,0] is the weight for inserting or deleting a character with code i.
Note: There are three alternative versions of the program: leven, leven-r, and leven-s. For leven, all weights should be integers in the range 0 through 65535. For leven-s (compiled with LEVEN_SMALL defined), all weights should be integers in the range 0 through 255. For leven-r (compiled with LEVEN_REAL defined), all weights should be reals, in the range from 0 to MAX_FLT (a platform dependent limit).
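The relation between the two file formats can be sketched by deriving a substitution file from a table of indel weights, mirroring the rule from the previous section (a substitution weight is the sum of the two indel weights). The function name and the 1-based weight array are assumptions for illustration:

```c
#include <stdio.h>

/* Sketch: derive a substitution file from indel weights, following
   the rule the manual gives for the -i option:
   w[i,0] = indel weight of code i (insertion/deletion),
   w[i,j] = indel[i] + indel[j] for j > 0 (substitution).
   indel[] is assumed to be indexed 1..max. */
int write_subst_file (const char *filename, const int *indel, int max)
{
    FILE *fp = fopen (filename, "w");
    if (!fp)
        return -1;
    fprintf (fp, "%i\n", max);
    for (int i = 1; i <= max; i++)
        for (int j = 0; j < i; j++)
            fprintf (fp, "%i\n", j == 0 ? indel[i] : indel[i] + indel[j]);
    fclose (fp);
    return 0;
}
```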
A weight can be specified for a datafile (default 1), preceded by an asterisk. This can be used to alter the relative importance of one datafile compared to another. Only the last weight specified in a particular file is used, and it applies to the entire datafile, even to strings preceding the weight specification.
Each location is introduced with a location number (counting from 1 upward) or a colon followed by a location label.
Words can be specified in two ways: with a leading minus as an ASCII string, or with a leading plus as a sequence of decimal numbers.
Example:
    * 1.2

    5
    - fiets
    - fIts
    + 102 116 115

    : GR7
    - fietse
    - fiets
    - fiets

    5
    - flits

As you can see, each location can have more than one word. In this example there are two locations: location number 5 (note that this location is mentioned twice), which has four words, of which the third is specified as a sequence of numbers; and the location with label GR7, which has three words.
The order of the locations is immaterial. Not every location needs to be present in each data file. In the end, the sum of all differences between two locations is divided by the number of times that averaged difference was calculated, taking any weights specified in the datafiles into account.
The rest of this section is a bit technical. You can skip to the next section, if you want.
For this file, the actual difference between location number 5 and location GR7 is calculated by choosing a (near-)optimal one-by-one alignment of the strings of these locations. Optimal, meaning: the sum of the pair-wise differences is minimised. The actual difference is the average of these pair-wise differences. If two locations have different numbers of strings, all strings of one or both locations are duplicated once or more, until the locations have the same number of strings. In this case, the strings for location number 5 are duplicated three times, and those of location GR7 four times, resulting in twelve strings at each location.
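The number of strings after duplication is the least common multiple of the two string counts: for four and three strings, both locations end up with lcm(4, 3) = 12. A minimal sketch (the helper names are hypothetical, not taken from the program):

```c
/* Sketch of the duplication step: both locations are padded to the
   least common multiple of their string counts, so that a one-by-one
   alignment is possible. */
static int gcd (int a, int b)
{
    while (b) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

static int lcm (int a, int b)
{
    return a / gcd (a, b) * b;
}
```

With four strings at one location and three at the other, each string of the first location is repeated three times and each string of the second four times, giving twelve strings at each location.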
The rationale for this procedure is as follows. Look at this example:
The table gives the difference (0 = equal, 1 = different) between each variant of location P and each variant of location Q:

              Q:   A   B   B   A   A   A
    P:
    B              1   0   0   1   1   1
    A              0   1   1   0   0   0
    A              0   1   1   0   0   0

After duplication and alignment:

    P:   Q:
    B    B
    A    A
    A    A
    B    B
    A    A
    A    A

    (P: variants repeated twice; Q: variants sorted to line up with P)

    differences:   0   0   0   0   0   0
Note that determining the optimal one-by-one alignment is a variant of the Travelling Salesman Problem, which is computationally expensive. A heuristic is used that gives you a reasonable approximation.
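For small numbers of strings, the optimal alignment can still be found by exhaustive search. The following sketch is not the heuristic leven actually uses; it simply tries every one-to-one pairing and keeps the cheapest, which is only feasible for a handful of strings:

```c
#include <float.h>

/* Brute-force optimal one-by-one alignment: try all permutations of
   the second location's strings and return the minimal sum of
   pairwise differences. d[i][j] is the difference between string i
   of one location and string j of the other; "used" is a bitmask of
   columns already taken. Names and the NMAX limit are illustrative. */
#define NMAX 8

static double best_alignment (double d[NMAX][NMAX], int n, int used, int row)
{
    double best, c;
    int j;

    if (row == n)
        return 0.0;
    best = DBL_MAX;
    for (j = 0; j < n; j++)
        if (! (used & (1 << j))) {
            c = d[row][j] + best_alignment (d, n, used | (1 << j), row + 1);
            if (c < best)
                best = c;
        }
    return best;
}
```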
    %utf8

The remainder of this file will be interpreted as UTF-8 (data only; labels are not affected). See Unicode.
    : 1 2
    : 2 3

Lines not starting with a colon are ignored.
The numbers are location index numbers, numbered from 1 to max. You cannot use labels in this file. In the example, the differences between locations 1 and 2 and between locations 2 and 3 are calculated, but not between locations 1 and 3.
The top of the output file contains comments that summarise the options used, and list some properties about the input data. Here is a short fragment:
    # Items  Locations  File
    #   345        237  clearing_up.fon
    # Variants: frequency(repeats)
    #   268: 1(229) 2(24) 3(7) 4(2) 5(3) 6 7 11

Explanation: the input file was clearing_up.fon. There were 345 items used, from 237 locations. There were 268 variant items in the input file: 229 variants occurred only once, 24 variants occurred twice, 7 variants occurred three times, 2 four times, 3 five times, and single variants occurred six, seven, and eleven times.
By default, the difference between two locations is the (weighted) mean of all differences calculated between those two locations. If the option -m is used, the difference between two locations is the median of the calculated differences. Weights in the data files will be ignored. Note: this requires a lot of memory.
If a linkage partition is used, differences are written to file as a value followed by two index numbers, followed by -- if available -- two location labels. Example:
    .487308 2 1 Calcutta Bombay
    .835438 3 2 "New Delhi" Calcutta

Note: the order used in the linkage partition file is lost.
If the option -d is used, then, for each pair of locations, the standard deviation of the calculated differences between those two locations will be returned in the output file specified by the option -d in the form of a difference matrix. Weights in the data files will be ignored.
    double normalise (double f, int len1, int len2, int minlen, int maxlen)
    {
        double result = 0;
        switch (normalise_function) {
            case 1:
                result = f / (len1 + len2);
                break;
            case 2:
                result = f / ((len1 > len2) ? len1 : len2);
                break;
            case 3:
                result = sqrt (f / (len1 + len2));
                break;
            case 4:
                result = f / minlen;
                break;
            case 5:
                result = f / maxlen;
                break;
            case 6:
                result = f;
                break;
            default:
                errit ("Normalisation function number %i undefined", normalise_function);
        }
        return result;
    }

This function is called as soon as the Levenshtein algorithm has been applied to two strings. The arguments to normalise() are the value calculated by the Levenshtein algorithm, the lengths of both strings, and the minimum and maximum length of the least-cost alignment(s). (There may be several such alignments, of unequal length.)
The purpose of the function normalise() is to normalise the Levenshtein distance. This is especially needed in cases where not all pairs of strings are available for all pairs of locations, and the longer strings that are available would skew the overall results, if normalisation were omitted.
By default (case 1), the normalisation function normalises by dividing by the sum of the lengths of the two strings. This makes sense, because in cases where the two strings don't have any character in common, the Levenshtein distance is proportional to the sum of the lengths of the two strings.
You might want to experiment with alternate normalisation functions. You can select another predefined normalisation with the command line option -N (case 2 ... case 5). Check the source code to see what types of normalisation are available. You can easily add your own types of normalisation (case 6, etc), and select these with the -N option. If you do add to the definition of the function, make sure you save and compile the source under a different file name.
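As an example, a hypothetical normalisation by the average of the two string lengths could look like the function below. To use something like it, you would add the body as an extra case in the switch of normalise() shown above and select it with -N; the function name is an assumption for this sketch:

```c
/* Hypothetical custom normalisation: divide the Levenshtein value by
   the average of the two string lengths instead of their sum. This
   would be wired in as an additional case in normalise(). */
double normalise_by_average (double f, int len1, int len2)
{
    return 2.0 * f / (double) (len1 + len2);
}
```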
Using this option may require a lot of memory, depending on the number and size of the datafiles. When the program runs out of memory, it will try to continue without Cronbach Alpha.
It may be a good idea to limit the amount of memory available to the program (for instance with ulimit -v), because if the program starts using swap space on disk, it slows down to a crawl.
In versions older than version 1.80 of the leven program and version 0.26 of the giw program, a different formula was used for calculating Cronbach Alpha. That formula was incorrect. If you use option -x in addition to option -c, you get both the old (imprecise) and the new (correct) value, so you can see how much 'damage' the old formula did.
A word-based G.I.W. distance can be calculated with the giw program. That method is only suitable when there are relatively few word variants in a dataset.
Character-based G.I.W. (cGIW) is a modification of the plain Levenshtein distance. The costs of a substitution are listed below. (The cost of an indel is 0.5 in both cases, or 1 if the option -e was used.)
|           | plain | cGIW  |
|-----------|-------|-------|
| equal     | 0     | f / F |
| different | 1     | 1     |
The numeric distance between strings s1 and s2, with Minkowski value R is:
    ( Σ_i |s1_i − s2_i|^R ) ^ (1/R)

The option -R determines the Minkowski value. The default is 2 (Euclidean distance).
This option is mainly useful for comparison of lexical variants. With phonetic data, variants that occur only once are very common, and usually don't represent noise.
Note that this option is affected by the option -f. A datafile may have more than one variant, but only one will remain if infrequent variants are removed with the option -f. If you use option -F as well, the file will be skipped completely.
Let's suppose this is all the input data you are working with, listed in a schematic way:
          LOC1   LOC2   LOC3
    Wa    a1     a2     a3
    Wb    b1     b2     b3
    Wc    c1            c3
You have data from three locations, LOC1, LOC2, and LOC3, and for three different words, Wa, Wb, and Wc. Item a1 is the variant of word Wa in LOC1, etc. You can have zero or one variant for each location and word, but not more than one. In this example, the item c2 is missing.
The program processes this data to give you this:
          LOC1_LOC2   LOC1_LOC3   LOC2_LOC3
    Wa    L(a1,a2)    L(a1,a3)    L(a2,a3)
    Wb    L(b1,b2)    L(b1,b3)    L(b2,b3)
    Wc    NA          L(c1,c3)    NA

where L(a1,a2) is the Levenshtein difference between strings a1 and a2, etc.
The input file should actually look like this:
    : Wa
    [1]- a1
    [2]- a2
    [3]- a3

    : Wb
    [1]- b1
    [2]- b2
    [3]- b3

    : Wc
    [1]- c1
    [3]- c3

Empty lines are optional, as are comment lines.
In this case, the labels are not location labels but word labels. They are not actually used, other than as a marker that a new row of data begins. Instead of labels you can also use numbers (without a leading colon). These labels or numbers are copied to the output file.
There are no place labels in the file, nor are they read from a separate label file.
The numbers in square brackets indicate the column numbers. As you can see above, you can skip numbers: there is no number 2 for word Wc.
Instead of starting the actual word variant with a minus, you can use a plus, followed by a list of numbers, just like in the normal datafiles.
You still need to use the option -n to set the number of locations. In the above example it should be 3, because there are three columns.
This is what the output file could look like:
    : Wa
    1 2 0.23
    1 3 0.84
    2 3 0.49

    : Wb
    1 2 0.04
    1 3 0.73
    2 3 0.58

    : Wc
    1 2 NA
    1 3 0.37
    2 3 NA

The labels indicate the start of a new row. The pair of numbers in front of every value indicates the column numbers that were compared.
You can have as many columns (locations) as you want. Usually, each column is compared with every other column. If you have 16 input columns, you would get 120 output columns. You can limit the output (and the computation time) if you are interested in only certain column combinations. With the option -C you can select a column (by number) that should be tested against all other columns. You can use the option -C more than once to have more combinations tested. Unselected columns are not compared with other unselected columns (unless you select no columns at all).
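The number of compared pairs can be computed directly: with no selection it is n(n-1)/2, and with s selected columns only pairs containing at least one selected column remain. This is a sketch, not part of the program:

```c
/* Number of column pairs compared, following the rules for -C:
   with no columns selected (s == 0), all n*(n-1)/2 pairs;
   with s selected columns, every selected column is paired with all
   n-s unselected ones, plus the s*(s-1)/2 pairs among the selected
   columns themselves. */
int pairs_compared (int n, int s)
{
    if (s == 0)
        return n * (n - 1) / 2;
    return s * (n - s) + s * (s - 1) / 2;
}
```

For 16 input columns with no selection this gives 120 pairs; selecting one column out of three gives the 2 pairs shown in the example below.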
For instance, if you select column 2, the output above would look like this instead:
    : Wa
    1 2 0.23
    2 3 0.49

    : Wb
    1 2 0.04
    2 3 0.58

    : Wc
    1 2 NA
    2 3 NA
The combination 1 3 is missing.
Three more options modify the behaviour of the learning process:
    ²log ( p(x, y)^Z / (p(x) × p(y)) )

Default: 1