RuG/L04

features

2010/07/23
WARNING: There was a serious bug in this program prior to version 2.00

This bug was in the way pre-modifiers were handled. Instead of applying a pre-modifier to just the first following head, it was applied to all the remaining heads in the same word as well.

This means that if you have used pre-modifiers, your results were wrong. Sorry.

Description

feature difference calculation for Levenshtein measurement

Synopsis

features [-a] [-c] [-C int] [-d|-o|-x] [-g] [-l] [-T] [-e filename] [-f filename] [-t filename] configfile datafile(s)

Options

-a
Use ANSI escape sequences for error messages
-c
Continue on data error
-C int
Continue on data error, stop after this many errors
-d
Decimal codes in token list (option -t) and error messages
-e filename
Save to file: error log
-f filename
Save to file: list of all feature sets
-g
Don't map reals to integers
-l
Show labels in error messages
-o
Octal codes in token list (option -t) and error messages
-t filename
Save tokenized files
-T
Save to file: list of all token strings
-x
Hexadecimal codes in token list (option -t) and error messages

Purpose

This program is a preprocessor to the leven program. It translates dialect datafiles from one form into another.

Sequences of tokens in the data that represent one sound are combined into a set of feature values. Each unique set of feature values is replaced with a unique number. These values are written to the output files, which have the name of the input files with the extension .ftr appended.

In addition, the differences between all sets of feature values are calculated, and saved to the file features.table.out, which can be used by the leven program.

A typical usage is:

    features configfile data/*.txt
    leven -s features.table.out  (other options)  data/*.txt.ftr

The input datafiles should be in the same format as used by the leven program, except that all data must be in the form of ASCII strings preceded by a minus sign. Data in the form of sequences of numerals preceded by a plus sign is not allowed.

The new files will have data in numeral format.

As an example, part of an input datafile could have these lines:

    : Aachen
    - "t7n@stI_-S
In the output file, those lines could be translated to something like:
    : Aachen
    + 17 116 19 3 27 17 77 14
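
The replacement of each unique set of feature values by a unique number can be pictured with a small Python sketch. This is not the program's own code: the helper name `number_feature_sets`, the representation of a feature set as a tuple of (name, value) pairs, and numbering from 1 in order of first appearance are all assumptions made for illustration.

```python
def number_feature_sets(feature_sets):
    """Give every distinct feature set its own number (hypothetical helper)."""
    ids = {}       # feature set -> number
    numbers = []   # one number per input sound
    for fs in feature_sets:
        if fs not in ids:
            ids[fs] = len(ids) + 1   # assumption: count from 1, in order of appearance
        numbers.append(ids[fs])
    return numbers, ids

# Three sounds, the first and last with identical feature values.
sounds = [
    (("v_high", 1.5), ("v_long", 1.0)),
    (("v_high", 0.5), ("v_long", 1.0)),
    (("v_high", 1.5), ("v_long", 1.0)),
]
numbers, table = number_feature_sets(sounds)
```

Identical feature sets receive the same number, so two different token sequences that lead to the same feature values become indistinguishable in the output.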

See also

grammar: concise description of the configuration file format.

example: an example of a configuration file.

xstokens: a simpler but less accurate alternative to the features program.

The format of the configuration file

Symbols (keywords, numbers, etc.) are separated by spaces and/or tabs. Empty lines are ignored. The hash symbol (#) is the start of a comment that continues to the end of the line. However, it is possible to define an input string that starts with a hash. Spaces in input strings have to be coded in the configuration file as [[SP]]. The configuration file is case sensitive.

The configuration file has five parts, in a fixed order. All parts must be present, even if one is empty. The five parts start with a key word:

    DEFINES

    FEATURES

    TEMPLATES

    INDELS

    TOKENS

Configuration file, part 1: DEFINES

In this part, some variables can be set.

VERSION

Examples:

    VERSION 0       # you shouldn't use this

    VERSION 1

    VERSION 2
The program features version 1.00 fixes a methodological error of earlier versions. The program will run as before with old configuration files, or if you set VERSION 0. To use the fix, you need to set VERSION 1 or 2, and make some further changes to older configuration files in part 2: FEATURES.

TOP

Examples:

    TOP 255

    TOP 65535       # this is the default
The Levenshtein program leven reads differences as integer values. These are in the range from 0 to 65535. When the table of differences is very large, it may be necessary to use the alternative compiled program leven-s, which uses differences in the range from 0 to 255, and uses less memory.

The alternative compiled program leven-r uses differences as real values (and uses even more memory, and makes the program slower).

The program features maps the calculated differences between feature value sets onto the range from 0 to the value of TOP, unless you specify the option -g on the command line, in which case the result can be used with leven-r.

SUBSTMAX

Examples:

    SUBSTMAX 1.0    # this is the default

    SUBSTMAX 20
This value has two purposes:
  1. All distances between feature value sets are limited to the range from 0 to the value of SUBSTMAX.
  2. VERSION 0 only: If two feature value sets have no features in common (each feature has no value in at least one feature set), then the difference between those two sets is set to the value of SUBSTMAX.

INDEL

Examples:

    INDEL 0.5

    INDEL 10
This is the value of an indel, if it is not specified in another manner. The default is the value of SUBSTMAX divided by two.

METHOD

Examples:

    METHOD SUM            # equal to METHOD MINKOWSKI 1  (this is the default)

    METHOD SQUARE

    METHOD EUCLID         # equal to METHOD MINKOWSKI 2

    METHOD MINKOWSKI 1.4
This determines how, from the differences between individual features, the difference between two sets of features is calculated.

TOKENSTRING

Examples:

    TOKENSTRING RAW   # this is the default

    TOKENSTRING ESC

With TOKENSTRING ESC, tokens can be defined in the configuration file using escape sequences. See below.

START

Examples:

    START 0   # this is the default

    START 1

This defines the start condition of the mini-parser. See below.

RANGE

Example:

    RANGE 1 50 1
    RANGE 50 10000 2.1

You can use RANGE zero, one or more times.

This defines a final mapping from calculated value to output value. If a value falls in the range of the first two values (inclusive), then it is replaced by the third value.

If a calculated value falls inside one of these ranges, SUBSTMAX is ignored.
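
A minimal Python sketch of this final mapping follows. The function name `apply_ranges` is hypothetical. The manual does not say which RANGE wins when two ranges overlap (as at the value 50 in the example above), so the sketch assumes that the first matching RANGE is used. Values outside every range are clamped to 0..SUBSTMAX, as described under SUBSTMAX.

```python
def apply_ranges(value, ranges, substmax):
    """Map a calculated difference to its output value (illustrative only)."""
    for low, high, replacement in ranges:
        if low <= value <= high:      # both bounds are inclusive
            return replacement        # SUBSTMAX is ignored for values inside a range
    # No RANGE applies: limit the value to 0..SUBSTMAX.
    return min(max(value, 0.0), substmax)

# The two RANGE lines from the example above.
ranges = [(1, 50, 1), (50, 10000, 2.1)]
```

With these ranges and SUBSTMAX 20, `apply_ranges(700, ranges, 20)` gives 2.1 even though 700 exceeds SUBSTMAX, while `apply_ranges(0.5, ranges, 20)` falls outside both ranges and is returned unchanged.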

Configuration file, part 2: FEATURES

In this part, all features are defined.

There are three types of features, indicated by a letter B, N, or D:

B
bitmap (integer)
N
numeric (float)
D
discrete (integer)

If you have two sets of feature values, a and b, and a feature i, the difference between a[i] and b[i] is:

    B :    ( a[i] & b[i] ) ? 0.0 : 1.0
    N :    fabs( a[i] - b[i] )
    D :    ( a[i] == b[i] ) ? 0.0 : 1.0

(the following...

    A ? B : C
...is C-code shorthand for:
    if A does NOT return 0
      then do B
      else do C
)

In prose, for bitmaps: if a[i] and b[i] have at least one bit set to 1 in common, then the difference is 0. It is 1 otherwise. The difference between numeric features is the absolute difference between the two values. The difference between two discrete features is 0 if they are equal, and 1 otherwise.

The above values are multiplied with the weight of the feature. So you get:

    B :    ( a[i] & b[i] ) ? 0.0 : w
    N :    fabs( a[i] - b[i] ) * w
    D :    ( a[i] == b[i] ) ? 0.0 : w
The weight of each feature is defined with the definition of the feature itself. The default weight is 1.
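
The three weighted formulas can be written out as one Python function. This is a sketch for illustration, not the program's code: the name `feature_diff` and the single-letter `kind` argument are assumptions, and both values are taken to be defined.

```python
def feature_diff(kind, a, b, weight=1.0):
    """Weighted difference between two defined values of one feature."""
    if kind == "B":                      # bitmap: any shared bit means no difference
        return 0.0 if (a & b) else weight
    if kind == "N":                      # numeric: absolute difference
        return abs(a - b) * weight
    if kind == "D":                      # discrete: equal or not
        return 0.0 if a == b else weight
    raise ValueError("unknown feature type: " + kind)
```

For example, the bitmaps 6 (binary 110) and 4 (binary 100) share a bit, so their difference is 0 whatever the weight, while the bitmaps 2 and 4 share none, so their difference equals the weight.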

Here are some examples of feature definitions for VERSION 1 and 2 (see part DEFINES above):

    N 2 v_advancement   # numeric feature, with default difference 2.0, weight 1.0
    N 1 v_high          # three more numeric features, with default difference and weight 1.0
    N 1 v_long
    N 1 v_rounded

    D 1 .7 breathy      # a discrete feature, with default difference 1.0 and weight 0.7

    B 1 3 type          # a bitmap feature, with default difference 1.0 and weight 3.0

Differences between versions (set in part DEFINES above):

In VERSION 0, the first value in the lines above is missing. There can be at most one value between the first letter and the label. If there is a value, it sets the weight. In VERSION 0, if the difference between two feature sets needs to be calculated, and the feature is undefined in one or both feature sets, the difference is set to 0. That is probably not what you want, so you should use VERSION 1 or 2 instead.

In VERSION 1, the first value is the feature's default difference, to be used if two feature sets are compared and one or both has this feature undefined. The default difference is multiplied by the weight.

In VERSION 2, the default difference is used only if the feature is defined in one feature set, but not in the other. If the feature is undefined in both feature sets, the difference is set to 0.
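
The three behaviours can be compared side by side in a short Python sketch for a numeric feature. The function name `numeric_diff` and the use of `None` for an undefined value are assumptions. The sketch also assumes that in VERSION 2 the default difference is multiplied by the weight, as stated for VERSION 1. The SUBSTMAX rule for VERSION 0 is not modelled here.

```python
def numeric_diff(version, a, b, default_diff=1.0, weight=1.0):
    """Difference for one numeric feature; None stands for 'undefined'."""
    missing = (a is None) + (b is None)   # how many of the two sets lack the feature
    if missing == 0:
        return abs(a - b) * weight
    if version == 0:
        return 0.0                        # undefined in one or both: no difference
    if version == 1:
        return default_diff * weight      # undefined in one or both: default difference
    if version == 2:
        # default difference only when exactly one side is undefined
        return default_diff * weight if missing == 1 else 0.0
    raise ValueError("VERSION must be 0, 1 or 2")
```

With a default difference of 2, comparing two sets that both lack the feature gives 2 in VERSION 1 but 0 in VERSION 2.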

There are three predefined features:

    N 1 WEIGHT
    N 1 INDEL
    B 1 STATE
These features have a special meaning, explained below. They are not used to calculate the differences between feature value sets in the normal way. However, they can be handled (assigned to and modified) like normal features.

Configuration file, part 3: TEMPLATES

You can use templates so you don't have to write out the complete list of feature assignments for each input token. These templates help a great deal in keeping things well organised.

Examples:

    T vowel                # start of template 'vowel'
    F v_long = 1           # assign value 1 to feature 'v_long'
    F v_rounded = -.5

    T v_close              # start of template 'v_close'
    F v_high = 1.5

In this part of the configuration file, the letter T is used for the definition of a template. Here, the letter T can be used to start a single template only.

In the parts of the configuration file that follow, the letter T is used to execute a template, and it can be used with multiple templates at once.

Configuration file, part 4: INDELS

If there are feature assignments in this part of the configuration file, the resulting set of feature values is used to compare to other sets of feature values when they are used as indels. In this case, the value assigned to the variable INDEL in the first part of the configuration file is ignored.

Example:

    T consonant c_glottal c_fricative   # like the consonant h
    T vowel v_mid v_central             # like the vowel schwa
    F v_rounded = 0                     # between a rounded and unrounded vowel

Configuration file, part 5: TOKENS

This part defines tokens, substrings from the input datafiles, and the effect they have on feature values. Each token consists of one or more letters. A string is parsed from left to right, searching for a substring that matches one defined here. If multiple substrings match, the longest is used.
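
The left-to-right, longest-match rule can be sketched in a few lines of Python. The function name `tokenize` is hypothetical, and raising an exception for unmatched input is an assumption. The real program's error handling (see the options -c and -C) is not modelled.

```python
def tokenize(text, tokens):
    """Split text into defined tokens, always preferring the longest match."""
    result = []
    pos = 0
    while pos < len(text):
        # All defined tokens that match at the current position.
        candidates = [t for t in tokens if text.startswith(t, pos)]
        match = max(candidates, key=len, default=None)
        if match is None:
            raise ValueError("no token matches at position %d" % pos)
        result.append(match)
        pos += len(match)
    return result

tokens = ["a", "a:", ":", "t"]
```

With these definitions, the string `ta:t` is split into `t`, `a:`, `t`. The two-letter token `a:` wins over the shorter `a` followed by `:`.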

Tokens come in three flavours, indicated with the letters H, M or P:

H
head
M
modifier
P
pre-modifier

And there is one special token:

EOT
for the empty token at the end of a string, only useful in combination with the mini parser (see below)

One sound, one segment that is to be translated into a single set of feature values, consists of one or more tokens:

  1. 0, 1 or multiple pre-modifiers
  2. head
  3. 0, 1 or multiple modifiers

Each token can change feature values. Usually, the head assigns initial values to features, while modifiers change those values.

If the input consists of two pre-modifiers (P1, P2), a head (H), and two modifiers (M1, M2), like this:

    P1 P2 H M1 M2
... then the feature value changes are processed in this order:
    H M1 M2 P2 P1

So, the actions for the head are processed first, then the modifiers, and lastly, the pre-modifiers in reverse order.
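
As a small illustration, the ordering rule for one sound can be written as a Python function. The name `processing_order` and the list representation of the tokens are assumptions made for this sketch.

```python
def processing_order(pre_modifiers, head, modifiers):
    """Order in which the tokens of one sound apply their feature changes."""
    # Head first, then modifiers left to right, then pre-modifiers right to left.
    return [head] + list(modifiers) + list(reversed(pre_modifiers))

order = processing_order(["P1", "P2"], "H", ["M1", "M2"])
```

Because pre-modifiers are processed last, they can modify values that were assigned by the head, just as modifiers can.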

Each token is defined in the configuration file with a letter indicating the type, followed by calls to templates (T) or other feature value changes (F). Examples:

    H y
    T vowel v_close v_front v_rounded

    H @
    T vowel v_mid v_central
    F v_rounded = 0

    # the END OF TEXT token has no substring and sets no features:
    EOT

Examples of actions that can be performed on feature values:

    # Features of type bitmap, B
    F featB = 4     # assign the value 4 (integer)
    F featB - 3     # clear the bits from the value 3: new = old XOR (old AND 3)
    F featB + 3     # set the bits from the value:     new = old OR 3
    F featB ! 3     # flip the bits from the value 3:  new = old XOR 3
    F featB U       # make the bitmap undefined

    # Features of type numeric, N
    F featN = 4     # assign the value 4 (float)
    F featN - 3     # decrease by 3
    F featN + 3     # increase by 3
    F featN * 3     # multiply by 3
    F featN U       # make it undefined

    # features of type discrete, D
    F featD = 4     # assign the value 4 (integer)
    F featD U       # make it undefined

Note that, usually, you don't need to un-define a feature value. All feature values are undefined until a value is assigned. Also note that you can't modify a feature before you have assigned to it. (The features WEIGHT and STATE are the exceptions. WEIGHT is set to 1 as soon as a head is recognised. STATE is set to 0 at the start of each string, and changes are persistent until the end of the string.)
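
The three bitmap operations listed above correspond to the following integer expressions, shown here in Python for illustration. The function names are hypothetical.

```python
def clear_bits(old, value):
    """F featB - value : new = old XOR (old AND value)."""
    return old ^ (old & value)

def set_bits(old, value):
    """F featB + value : new = old OR value."""
    return old | value

def flip_bits(old, value):
    """F featB ! value : new = old XOR value."""
    return old ^ value
```

Clearing with `old XOR (old AND value)` gives the same result as `old AND NOT value`: bits that are set in both are removed, and all other bits of the old value are kept.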

Escape sequences

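If TOKENSTRING ESC is set in the DEFINES part of the configuration file, then you can use escape sequences to define token strings. This is useful if the data is not in a standard character set. Escape sequences take the forms used in the examples below:

    \dNNN    the character with decimal code NNN
    \NNN     the character with octal code NNN
    \xNN     the character with hexadecimal code NN
    \\

The last represents a single backslash.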

With TOKENSTRING ESC, these are equivalent:

    H A\\+
    H \d065\d092\d043
    H \101\134\053
    H \x41\x5C\x2B

With TOKENSTRING RAW, the same token can only be defined as:

    H A\+
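
The equivalence of the four ESC definitions can be checked with a small Python decoder. This is only a sketch: the function name `decode_token` is hypothetical, and the fixed digit counts (three decimal digits, three octal digits, two hexadecimal digits) are inferred from the examples above rather than taken from a specification. A backslash that does not start one of these forms is left untouched here. The real program may treat it as an error.

```python
import re

# \dNNN (decimal), \NNN (octal), \xNN (hexadecimal), \\ (backslash)
_ESCAPE = re.compile(r"\\(?:d(\d{3})|([0-7]{3})|x([0-9A-Fa-f]{2})|(\\))")

def decode_token(text):
    """Replace escape sequences in a token definition by the characters they stand for."""
    def replace(match):
        decimal, octal, hexadecimal, _backslash = match.groups()
        if decimal is not None:
            return chr(int(decimal))
        if octal is not None:
            return chr(int(octal, 8))
        if hexadecimal is not None:
            return chr(int(hexadecimal, 16))
        return "\\"                       # the sequence \\ is a single backslash
    return _ESCAPE.sub(replace, text)
```

All four definitions decode to the same three characters: `A`, a backslash, and `+`.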

Ignoring tokens

When tokens are defined with a letter I appended to the first letter, then no actions on features are performed. Examples:

    HI x
    MI _y
    PI ^

In case of a token of type head: the complete sound is ignored. There will be no token in the output sequence. Pre-modifiers and modifiers with this head will also be ignored. If STATE was already changed by a pre-modifier, that change will remain in effect.

In case of a token of type modifier or pre-modifier: no feature changes are made, including STATE.

Indel

If a value is assigned to the pseudo-feature INDEL, that will be the value of an indel for the current sound, ignoring what was defined in the main parts DEFINES and INDELS of the configuration file.

Configuration file: Mini parser

It is possible to pose conditions on the processing of input. This is done by putting an operator and a number before a definition. Conditions can be used on token definitions, on feature actions, and on calls to templates. Examples:
    H a

    :  7   H a
The first is the ordinary definition. The second is the conditional definition. The token is recognised only if the 'state', interpreted as a bitmap, has at least one of the bits of the number 7 set (7 = 1 + 2 + 4, or binary: 001 + 010 + 100). In other words, it matches if the "bitwise and" of the state and 7 is not zero.

When two token definitions of equal length match the input, one token defined with a conditional, the other without, the definition with conditional is used if the condition also matches, and the other definition is used if the condition doesn't match.

If two token definitions of equal length match, both with conditions, and both conditions match as well, it is undetermined which of the two definitions is used.

If a definition for a token has a condition on the token itself, and the condition doesn't match, the rule isn't used, and none of the actions on feature values are executed. But you can also use conditions on actions or the call to templates, so the token can match the input and only part of the actions executed. For example:

    ^: 3   F featA + 4
The value of feature featA is increased only if the 'state', interpreted as a bitmap, has no bit in common with the number 3. It matches if the "bitwise and" is zero.

     = 0   F featB = 1
The value of feature featB is set to 1 if the state is exactly 0.

    ^= 9   T template1 template2
Both templates, template1 and template2, are executed if the state does not match the value 9 exactly.
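
The four condition operators can be summarised in one Python function. The name `condition_holds` is hypothetical. The tests themselves follow the descriptions above.

```python
def condition_holds(operator, number, state):
    """Evaluate a mini-parser condition against the saved state."""
    if operator == ":":       # at least one bit in common
        return (state & number) != 0
    if operator == "^:":      # no bit in common
        return (state & number) == 0
    if operator == "=":       # exactly equal
        return state == number
    if operator == "^=":      # not exactly equal
        return state != number
    raise ValueError("unknown operator: " + operator)
```

For instance, with state 2 the condition `: 7` holds, because 2 and 7 share a bit, while with state 8 it does not.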

The state is an integer value. It can be changed by changing the value of the pre-defined 'pseudo-feature' STATE. For example:

    F STATE + 2    # set the non-zero bits from value 2 (new STATE = old STATE OR 2)

NOTE: Before the processing of each token, the state is saved. That state is used for all tests done for that token, both the token match itself, as well as the execution of templates or other changes of feature values. Changing the value of the feature STATE will have no effect until the next token is processed.

NOTE: Usually, actions for pre-modifiers are executed last, when the actions for head and modifiers have finished. However, changes to STATE have effect as soon as pre-modifiers are parsed. However (again), changes to other features are made under condition of the state at the time of parsing the corresponding token.

Schematically, with two pre-modifiers (P1, P2), head (H), and two modifiers (M1, M2):

    STATE1 = current STATE
    parse P1 : - change STATE if requested
    STATE2 = current STATE
    parse P2 : - change STATE if requested
    STATE3 = current STATE
    parse H  : - change features on condition of state STATE3
               - change STATE if requested
    STATE4 = current STATE
    parse M1 : - change features on condition of state STATE4
               - change STATE if requested
    STATE5 = current STATE
    parse M2 : - change features on condition of state STATE5
               - change STATE if requested
      - change features on condition of state STATE2 for P2
      - change features on condition of state STATE1 for P1

At the start of each string (a sequence of tokens making up one dialect item), STATE is set to the value of START as defined in the DEFINES section of the configuration file, or to 0 if START is not set.

For determining the feature set of an indel (part INDELS of the configuration file), STATE is set to 0.

Mini parser and EOT

Using EOT with a STATE condition enables you to check that the mini parser is in the right state at the end of the input string. Example:

    : 1 EOT

    ^: 4 EOT

It is OK if at the end of the input string, the mini parser matches state 1, or doesn't match state 4. Any other state causes an error.

If you don't use EOT, any state at the end of the string is acceptable. This is identical to using EOT without a condition. This is unnecessary:

    EOT

All (pre-)modifiers in combination with EOT are ignored.

Mini parser, an example

Stress is usually marked at the start of a syllable. It would make sense to have a feature 'stress' on a vowel. But there may be one or more consonants between the stress marker and the first vowel of the syllable. So the stress must be remembered until it can be translated into a feature. This is how this can work:

    TEMPLATES

    T vowel
    F stress = 0         # no stress
    : 1 F stress = 1.0   # primary stress
    : 2 F stress = 0.5   # secondary stress
    F STATE - 3          # clear stress bits



    TOKENS

    P "           # primary stress
    F STATE + 1

    P %           # secondary stress
    F STATE + 2


    : 4 EOT       # end of string is accepted when state matches 4
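
How the stress marker is carried from the pre-modifier to the next vowel can be traced with a simplified Python simulation. It is a sketch under several assumptions: every token is a single character, the set of vowels is hypothetical, consonants have no effect on STATE, and the EOT check is left out. The state is saved before each token and used for that token's conditions, as described in the notes above.

```python
def stress_per_vowel(tokens, vowels="aeiou", start=0):
    """Return the final STATE and the stress value assigned to each vowel."""
    state = start
    stresses = []
    for token in tokens:
        saved = state                 # conditions use the state saved before the token
        if token == '"':              # P "  ->  F STATE + 1
            state |= 1
        elif token == "%":            # P %  ->  F STATE + 2
            state |= 2
        elif token in vowels:         # template 'vowel'
            stress = 0.0              # F stress = 0
            if saved & 1:
                stress = 1.0          # : 1 F stress = 1.0
            if saved & 2:
                stress = 0.5          # : 2 F stress = 0.5
            stresses.append((token, stress))
            state ^= state & 3        # F STATE - 3
    return state, stresses

final_state, result = stress_per_vowel(list('"tate'))
```

In the string `"tate`, the primary stress marker precedes the consonant `t`, but the stress value ends up on the first vowel `a`. The second vowel `e` gets no stress, because the first vowel cleared the stress bits.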

Configuration file: Weights

There is a pseudo-feature WEIGHT, with default value 1. How is this used? An example:

Suppose you have three features, x, y, and z, with feature weights wx, wy, and wz, and you have two sets of feature values, A and B. In addition, both sets of feature values have a pseudo-feature WEIGHT. Suppose that in the part DEFINES of the configuration file, you have set METHOD SUM. The function d() determines the simple difference between two features, based on type of feature (bitmap, numeric, discrete). The difference F between sets A and B is now determined as follows:

    F = ( d(A[x], B[x]) * wx +
          d(A[y], B[y]) * wy +
          d(A[z], B[z]) * wz   )

    F = F * A[WEIGHT] * B[WEIGHT]

    if (F < 0)
        F = 0

    if (F > SUBSTMAX)
        F = SUBSTMAX

If you have used METHOD SQUARE, you get:

    F = ( (d(A[x], B[x]) * wx) ^ 2 +
          (d(A[y], B[y]) * wy) ^ 2 +
          (d(A[z], B[z]) * wz) ^ 2   )

    F = F * A[WEIGHT] * B[WEIGHT]

    if (F < 0)
        F = 0

    if (F > SUBSTMAX)
        F = SUBSTMAX

And with METHOD EUCLID, you get:

    F = ( (d(A[x], B[x]) * wx) ^ 2 +
          (d(A[y], B[y]) * wy) ^ 2 +
          (d(A[z], B[z]) * wz) ^ 2   )

    F = sqrt(F) * A[WEIGHT] * B[WEIGHT]

    if (F < 0)
        F = 0

    if (F > SUBSTMAX)
        F = SUBSTMAX

And with METHOD MINKOWSKI, with value rho, you get:

    F = ( (d(A[x], B[x]) * wx) ^ rho +
          (d(A[y], B[y]) * wy) ^ rho +
          (d(A[z], B[z]) * wz) ^ rho   )

    F = F^(1/rho) * A[WEIGHT] * B[WEIGHT]

    if (F < 0)
        F = 0

    if (F > SUBSTMAX)
        F = SUBSTMAX
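
The four variants differ only in how the weighted feature differences are combined, which a single Python function can show. This is a sketch, not the program's code: the name `set_difference` and its argument layout are assumptions, and the weighted differences d(A[i], B[i]) * wi are taken as already computed and not negative.

```python
def set_difference(weighted_diffs, weight_a=1.0, weight_b=1.0,
                   method="SUM", rho=1.0, substmax=1.0):
    """Combine weighted per-feature differences into one difference F."""
    if method == "SUM":
        total = sum(weighted_diffs)
    elif method == "SQUARE":
        total = sum(d ** 2 for d in weighted_diffs)
    elif method == "EUCLID":
        total = sum(d ** 2 for d in weighted_diffs) ** 0.5
    elif method == "MINKOWSKI":
        total = sum(d ** rho for d in weighted_diffs) ** (1.0 / rho)
    else:
        raise ValueError("unknown METHOD: " + method)
    total *= weight_a * weight_b          # the WEIGHT pseudo-features of both sets
    return min(max(total, 0.0), substmax) # limit to the range 0..SUBSTMAX
```

With weighted differences 3 and 4 and a large SUBSTMAX, METHOD SUM gives 7, METHOD SQUARE gives 25, and METHOD EUCLID gives 5. With the default SUBSTMAX of 1.0, all three are cut off at 1.0.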