RuG/L04

File formats

Definitions of file formats shared between programs
(except some files only used for drawing maps)

General rules for all files

Empty lines and lines containing only white space (spaces and tabs) are ignored.

Leading and trailing white space is ignored.

Lines starting with a hash (#) are ignored, so the hash symbol can be used to start comment lines. A comment cannot follow data on the same line; it must be on a line of its own. Examples:

    # This will be ignored

    no comment # This will NOT be ignored
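
The general rules can be sketched in Python as follows. This is only an illustration, not the code used by the Levenshtein software; the function name data_lines is hypothetical, and the sketch assumes that a hash counts as "starting the line" after leading white space has been removed.

```python
def data_lines(text):
    """Yield the meaningful lines of a file, following the general rules."""
    for raw in text.splitlines():
        line = raw.strip()          # leading and trailing white space is ignored
        if not line:                # empty line, or only spaces and tabs
            continue
        if line.startswith("#"):    # comment line (assumed: checked after stripping)
            continue
        yield line                  # a hash later in the line is kept as data


sample = "# This will be ignored\n\n \t \n  no comment # This will NOT be ignored  \n"
print(list(data_lines(sample)))
```

Only the last line of the sample survives, with its hash and everything after it still part of the data.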

Rules for space delimited data files

Space delimited data files are files where each line of data contains the same items, separated by spaces or tabs. These file types are:

Label file
Coordinate file
Indexed cluster group file

All these files have on each line one or more numeric items followed by a text item as the final item. These text items may contain spaces. The Levenshtein software knows how many items there are supposed to be on each line, and has no trouble with a final text item containing spaces. However, to enable import and export of these data files to and from other software packages, these final text items are put between double quotes when they contain spaces (or if they start with a quote).

Note: text items are never put within quotes in data files other than those three types listed above.

Examples:

    77.2  28.6  1  0  New Delhi
    77.2  28.6  1  0  "New Delhi"
Both lines are acceptable as input to the Levenshtein software, and they are interpreted identically. Only the second line will be given as output.

Within a quoted text item, a quote or a backslash must be escaped with a backslash:

    1000  "a string with a \" quote and a \\ backslash"
However, don't use escapes in unquoted strings:
    2000  unquoted"string
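
The quoting rules can be illustrated with a small Python sketch. The names quote_item, unquote_item and split_line are hypothetical, and the sketch assumes that a final item which both starts and ends with a double quote is a quoted item. It is not the parser used by the Levenshtein software.

```python
import re


def quote_item(text):
    """Format a final text item for output: quote it only when needed."""
    if " " in text or text.startswith('"'):
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return text


def unquote_item(item):
    """Return the plain text of a final item, quoted or not."""
    if len(item) >= 2 and item.startswith('"') and item.endswith('"'):
        # Inside quotes, a backslash escapes the character that follows it.
        return re.sub(r"\\(.)", r"\1", item[1:-1])
    return item  # unquoted strings are taken literally, no escapes


def split_line(line, n_numeric):
    """Split a line into n_numeric leading items and one final text item."""
    parts = line.strip().split(None, n_numeric)
    if len(parts) != n_numeric + 1:
        raise ValueError("expected %d items: %r" % (n_numeric + 1, line))
    return parts[:n_numeric], unquote_item(parts[n_numeric])


print(split_line('77.2  28.6  1  0  New Delhi', 4))
print(split_line('77.2  28.6  1  0  "New Delhi"', 4))
print(quote_item("New Delhi"), quote_item("Bombay"))
```

The numeric items are returned as strings, so the caller decides whether to convert them to integers or floating point numbers.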

Label file

See also: General rules for all files
See also: Rules for space delimited data files

This defines what labels are to be used in some other data set, and in what order. It is a set of index numbers followed by labels. Example:

    3  "New Delhi"
    1  Bombay
    2  Calcutta
Labels should be numbered from 1 to the maximum number of labels. No numbers may be skipped. No numbers may appear more than once. No labels may appear more than once.

The order of lines is not important. The numbering defines the order in which the labels are actually used.
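
The numbering rules for a label file can be checked with a sketch like the one below. The function name read_labels is hypothetical, and the handling of quoted labels is a simplified assumption based on the rules for space delimited data files.

```python
import re


def read_labels(text):
    """Return the labels of a label file in the order given by their numbers."""
    labels = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        number, label = line.split(None, 1)
        if len(label) >= 2 and label.startswith('"') and label.endswith('"'):
            label = re.sub(r"\\(.)", r"\1", label[1:-1])  # remove quotes and escapes
        index = int(number)
        if index in labels:
            raise ValueError("number %d appears more than once" % index)
        if label in labels.values():
            raise ValueError("label %r appears more than once" % label)
        labels[index] = label
    # Numbers must run from 1 to the number of labels, without gaps.
    if sorted(labels) != list(range(1, len(labels) + 1)):
        raise ValueError("labels must be numbered 1..%d with no gaps" % len(labels))
    return [labels[i] for i in range(1, len(labels) + 1)]


example = '3  "New Delhi"\n1  Bombay\n2  Calcutta\n'
print(read_labels(example))
```

For the example file this gives the labels in the order Bombay, Calcutta, New Delhi, regardless of the order of the lines.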

Coordinate file

See also: General rules for all files
See also: Rules for space delimited data files

This assigns map coordinates to names of places or locations. Example:

    77.2  28.6  1  0  "New Delhi"
The first two numbers define the X- and Y-coordinate (in that order). These coordinates can be defined as longitude (west is negative, east is positive) and latitude (south is negative, north is positive), as in this example. Alternatively, they can be expressed in a user-defined linear grid, possibly with different scales for X and Y. The type of coordinates used must be set as an option to the software.

The third and fourth numbers are used only by the programs that draw maps. The combination of 1 and 0 is a reasonable default. See map configuration file: markers for the meaning of these numbers.

Difference matrix file

See also: General rules for all files

A difference matrix file defines the differences between a set of items. It is produced by the following procedure:

    PRINT max
    NEWLINE
    FOR i = 1 TO max
        PRINT label[i]
        NEWLINE
    FOR i = 2 TO max
        FOR j = 1 TO i - 1
            PRINT diff[i,j]
            NEWLINE
max is the number of items.
label[i] is the label for item number i.
diff[i,j] is the difference between items with numbers i and j. diff[i,j] is equal to diff[j,i].
If diff[i,j] is unavailable, the text NA is printed.
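
As an illustration, the procedure above could be written in Python, together with a matching reader. The function names are hypothetical. The sketch uses 0-based indices, represents an unavailable difference as None, and assumes a diagonal of 0.0 when rebuilding the full matrix; the exact number formatting used by the Levenshtein software is not specified here.

```python
def write_difference_matrix(labels, diff):
    """Return the file text for labels and a matrix diff[i][j] (None = NA)."""
    lines = [str(len(labels))]
    lines.extend(labels)
    for i in range(1, len(labels)):        # i = 2 TO max, in 1-based terms
        for j in range(i):                 # j = 1 TO i - 1
            value = diff[i][j]
            lines.append("NA" if value is None else repr(float(value)))
    return "\n".join(lines) + "\n"


def read_difference_matrix(text):
    """Return (labels, full symmetric matrix) from the file text."""
    lines = [raw.strip() for raw in text.splitlines()]
    lines = [line for line in lines if line and not line.startswith("#")]
    n = int(lines[0])
    if len(lines) != 1 + n + n * (n - 1) // 2:
        raise ValueError("wrong number of lines for %d items" % n)
    labels = lines[1:n + 1]
    values = iter(lines[n + 1:])
    diff = [[0.0] * n for _ in range(n)]   # assumed: zero difference on the diagonal
    for i in range(1, n):
        for j in range(i):
            item = next(values)
            diff[i][j] = diff[j][i] = None if item == "NA" else float(item)
    return labels, diff


matrix = [[0.0, 0.5, None], [0.5, 0.0, 1.25], [None, 1.25, 0.0]]
text = write_difference_matrix(["A", "B", "C"], matrix)
print(text)
```

Because diff[i,j] equals diff[j,i], only the lower triangle is stored: a set of max items needs max * (max - 1) / 2 difference lines.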

Matrix transformation file

See also: General rules for all files

This defines how a new difference matrix file or coordinate file should be derived from an old one.

Example:

    : India
    - Bombay
    - Calcutta
    - New Delhi
    : Pakistan
    - Lahore
    - Karachi
    : Nepal
    - Nepal
This says we need a new set consisting of only three items: India, Pakistan, and Nepal. The first is derived from Bombay, Calcutta and New Delhi in the old set, the second is derived from Lahore and Karachi, and the third is used unchanged. Not all labels in the old set need be used to derive a new set.
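
A sketch of reading such a file into a mapping from new labels to old labels is given below. The function name read_transformation is hypothetical. The sketch only parses the file; how the new values are computed from the old ones is not covered.

```python
def read_transformation(text):
    """Return a dict mapping each new label to the list of old labels it uses."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        marker, label = line[0], line[1:].strip()
        if marker == ":":                    # start of a new item
            if label in groups:
                raise ValueError("new label %r defined twice" % label)
            current = groups[label] = []
        elif marker == "-":                  # old item belonging to the current new item
            if current is None:
                raise ValueError("'-' line before any ':' line")
            current.append(label)
        else:
            raise ValueError("unexpected line: %r" % line)
    return groups


example = """
    : India
    - Bombay
    - Calcutta
    - New Delhi
    : Pakistan
    - Lahore
    - Karachi
    : Nepal
    - Nepal
"""
print(read_transformation(example))
```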

Vector file

See also: General rules for all files

A vector set is a collection of items, each with a label and a fixed number of numeric values. An example:

    3
    New Delhi
    .84
    .53
    .66
    Calcutta
    .33
    .87
    .82
This vector set has two items, New Delhi and Calcutta, each with three values. The number 3 on the first line defines how many values there are for each item.
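
The layout can be read with a short Python sketch such as the following. The function name read_vectors is hypothetical, and the sketch assumes that every value is a plain number.

```python
def read_vectors(text):
    """Return a dict mapping each label to its list of values."""
    lines = [raw.strip() for raw in text.splitlines()]
    lines = [line for line in lines if line and not line.startswith("#")]
    size = int(lines[0])                  # number of values per item
    body = lines[1:]
    if len(body) % (size + 1) != 0:
        raise ValueError("each item needs a label and %d values" % size)
    vectors = {}
    for start in range(0, len(body), size + 1):
        label = body[start]
        vectors[label] = [float(v) for v in body[start + 1:start + 1 + size]]
    return vectors


example = "3\nNew Delhi\n.84\n.53\n.66\nCalcutta\n.33\n.87\n.82\n"
print(read_vectors(example))
```

Labels are not quoted in this file type: each label has a line of its own, so spaces in a label are not ambiguous.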

Hierarchical cluster definition file

See also: General rules for all files

A hierarchical clustering is defined as in this example:
    1 .12
    L Norwegian
    L Swedish

    2 .15
    C 1
    L Danish

    3 .3
    L Dutch
    L German

    4 .35 Nordic group
    L Icelandic
    C 2

    5 .7
    C 4
    C 3
There are five sub-clusters. Each sub-cluster is defined in three lines. Sub-clusters may be arranged in any order.

The first line of each sub-cluster starts with a positive integer value used to identify this cluster. Numbers can be arbitrarily chosen, as long as they are unique. Following the number is a value indicating the numeric size of the cluster. This value may optionally be followed by text describing the cluster (as in sub-cluster 4 in the example).

The second and third lines define the two components of this sub-cluster. A line starting with the letter C indicates another sub-cluster (daughter cluster); the number following the letter C identifies that sub-cluster. A line starting with the letter L defines a terminal node (leaf); the text following the letter L is the label for that node.
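
The three-line structure can be read with a sketch like the one below. The names read_clusters and leaves are hypothetical. Because empty lines are ignored under the general rules, the sketch groups the remaining lines in threes rather than relying on the blank lines between sub-clusters.

```python
def read_clusters(text):
    """Return {number: (value, description, [(kind, reference), (kind, reference)])}."""
    lines = [raw.strip() for raw in text.splitlines()]
    lines = [line for line in lines if line and not line.startswith("#")]
    if len(lines) % 3 != 0:
        raise ValueError("each sub-cluster needs exactly three lines")
    clusters = {}
    for k in range(0, len(lines), 3):
        head = lines[k].split(None, 2)
        number, value = int(head[0]), float(head[1])
        description = head[2] if len(head) > 2 else ""
        if number in clusters:
            raise ValueError("cluster number %d is not unique" % number)
        children = []
        for line in lines[k + 1:k + 3]:
            kind, rest = line.split(None, 1)
            if kind == "C":
                children.append(("C", int(rest)))   # daughter cluster, by number
            elif kind == "L":
                children.append(("L", rest))        # leaf, by label
            else:
                raise ValueError("unexpected line: %r" % line)
        clusters[number] = (value, description, children)
    return clusters


def leaves(clusters, number):
    """Return the leaf labels below a sub-cluster, in file order."""
    result = []
    for kind, ref in clusters[number][2]:
        result.extend(leaves(clusters, ref) if kind == "C" else [ref])
    return result


example = """
    1 .12
    L Norwegian
    L Swedish
    2 .15
    C 1
    L Danish
    3 .3
    L Dutch
    L German
    4 .35 Nordic group
    L Icelandic
    C 2
    5 .7
    C 4
    C 3
"""
clusters = read_clusters(example)
print(clusters[4])
print(leaves(clusters, 5))
```

Since sub-clusters may appear in any order, the daughter references are resolved only after the whole file has been read.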

Colour file

See also: General rules for all files
See also: Rules for space delimited data files

This file defines a set of colours. Colours are listed in order, one per line, each given as three values for its red, green, and blue components. Values should be in the range 0 to 1, or 0 to 255. Example with red, gray and blue:

    1    0    0
    0.5  0.5  0.5
    0    0    1
The same, using the larger range:
    255    0    0
    128  128  128
      0    0  255
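
A sketch of reading a colour file into values between 0 and 1 follows. The function name read_colours is hypothetical. How the software decides which range a file uses is not stated here, so the sketch makes its own assumption: if any value in the file is larger than 1, the whole file is taken to use the range 0 to 255. A file in the larger range that contains only the values 0 and 1 would be misread under this assumption.

```python
def read_colours(text):
    """Return a list of (red, green, blue) tuples scaled to the range 0 to 1."""
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            raise ValueError("expected three values: %r" % line)
        rows.append([float(p) for p in parts])
    # Assumed rule: any value above 1 means the file uses the range 0 to 255.
    scale = 255.0 if any(v > 1 for row in rows for v in row) else 1.0
    return [tuple(v / scale for v in row) for row in rows]


small = "1    0    0\n0.5  0.5  0.5\n0    0    1\n"
large = "255    0    0\n128  128  128\n  0    0  255\n"
print(read_colours(small))
print(read_colours(large))
```

The two examples do not give exactly the same gray: 128 on the larger range corresponds to 128/255, which is slightly more than 0.5.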