RuG/L04

perfiles

Description

Permute linguistic data

Synopsis

perfiles [-n int] [-r] inputfiles

Options

-n int : No labels. Fixed number of lines in all input files

-r : Raw input (no quotes, no escape sequences)

Purpose

Suppose you have a set of files with linguistic data. Each file contains all the linguistic items for a single location. The Levenshtein program requires its input to be grouped such that each input file contains one linguistic item in all its variations in all locations. This program can rearrange the data.

Schematically, suppose you have two files, Place1 and Place2:

Place1

    word1 pronunciation11a pronunciation11b  # two different pronunciations of the same word
    word2 pronunciation21
    word3 pronunciation31

Place2

    word1 pronunciation12
    # no pronunciation for the second word
    word3 pronunciation32

Run this command...

    perfiles Place1 Place2

... and you will end up with three files, word1.per, word2.per, and word3.per:

word1.per

    : Place1
    - pronunciation11a
    - pronunciation11b
    : Place2
    - pronunciation12

word2.per

    : Place1
    - pronunciation21

word3.per

    : Place1
    - pronunciation31
    : Place2
    - pronunciation32
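
The rearrangement in this example can be sketched in a few lines of Python. This is an illustration only, not the program's actual implementation: the names `regroup` and `render` are hypothetical, and the sketch ignores quoting and escape sequences and simply splits each line on white space.

```python
from collections import defaultdict


def regroup(places):
    """Turn {place: [lines]} into {label: {place: [pronunciations]}}."""
    items = defaultdict(dict)
    for place, lines in places.items():
        for line in lines:
            # Drop everything from the first hash on, then split on white space.
            fields = line.split("#", 1)[0].split()
            if len(fields) < 2:
                continue  # blank line, comment only, or label without data
            label, variants = fields[0], fields[1:]
            items[label].setdefault(place, []).extend(variants)
    return dict(items)


def render(per_place):
    """Format one item in the ': place' / '- pronunciation' layout shown above."""
    out = []
    for place, variants in per_place.items():
        out.append(": " + place)
        out.extend("- " + v for v in variants)
    return "\n".join(out)


places = {
    "Place1": [
        "word1 pronunciation11a pronunciation11b  # two pronunciations",
        "word2 pronunciation21",
        "word3 pronunciation31",
    ],
    "Place2": [
        "word1 pronunciation12",
        "# no pronunciation for the second word",
        "word3 pronunciation32",
    ],
}

items = regroup(places)
print(render(items["word1"]))
```

The printed text matches the contents of word1.per above. Place2 does not appear under word2 because it has no data for that item.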

Details

As a precaution, the program won't overwrite existing files. You therefore have to process all input files in a single run: give all file names as arguments, or use wildcards that match all input files.

As you can see in the example above, input files can have, for each item, any number of pronunciations. You don't have to include all pronunciations for a single item on one line. You can split them over several lines, as long as you start each line with the correct label.

Labels and pronunciations are all separated by white space (spaces or tabs).

Anything starting with a hash (#) is treated as a comment, and the remainder of the line is ignored.

In both labels and further data, there are a few special token combinations:

    input     output
    \"        "
    \#        #
    \\        \
    \space    space

You can put quotes around items (labels or further data), in which case you can put spaces inside items without having to escape them. These are all identical:

    abc\ def
    "abc def"
    "abc\ def"

Note that trailing spaces are always removed, so these are the same:

    "ghi"
    "ghi "

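One way to read the escape and quoting rules above is as a small tokenizer. The Python sketch below is an interpretation of those rules, not the program's own code. The function name `tokenize` is hypothetical, and two points are assumptions because the manual does not settle them: an unescaped hash ends the line even inside quotes, and a backslash before any other character is kept literally.

```python
ESCAPABLE = {'"', '#', '\\', ' '}


def tokenize(line):
    """Split one input line into items, applying escapes, quotes and comments."""
    tokens, current = [], []
    in_quotes = False
    started = False  # True once the current item has begun (even if empty)
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == '\\' and i + 1 < len(line) and line[i + 1] in ESCAPABLE:
            current.append(line[i + 1])  # \" \# \\ and backslash-space
            started = True
            i += 2
            continue
        if ch == '#':
            break  # assumption: an unescaped hash always starts a comment
        if ch == '"':
            in_quotes = not in_quotes
            started = True
        elif ch in ' \t' and not in_quotes:
            if started:
                tokens.append(''.join(current).rstrip(' '))
                current, started = [], False
        else:
            current.append(ch)
            started = True
        i += 1
    if started:
        tokens.append(''.join(current).rstrip(' '))  # trailing spaces removed
    return tokens


print(tokenize(r'abc\ def'))    # ['abc def']
print(tokenize('"abc def"'))    # ['abc def']
print(tokenize(r'"abc\ def"'))  # ['abc def']
print(tokenize('"ghi "'))       # ['ghi']
```

The first three calls give the same single item, and the last shows the trailing space being dropped, in line with the examples above.
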
Use of options

-n number : No labels

If the input files don't have labels, you must use this option. The number says how many lines there are in each input file. Each input file should have exactly this number of lines, and each file should have the data in the same order. If you don't have data for an item, include a blank line. If you have multiple pronunciations for one word, put them all on one line. Names for output files are made from input line numbers.
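
Without labels, items are matched purely by line position. The Python sketch below illustrates that idea under stated assumptions: the name `regroup_unlabeled` is hypothetical, quoting and escapes are ignored, a line holding only a comment is treated like a blank line, and output file names are not modelled at all because the manual only says they are made from line numbers.

```python
def regroup_unlabeled(places, n):
    """Turn {place: [n lines]} into {line_number: {place: [pronunciations]}}."""
    items = {k: {} for k in range(1, n + 1)}
    for place, lines in places.items():
        if len(lines) != n:
            raise ValueError(f"{place}: expected {n} lines, got {len(lines)}")
        for k, line in enumerate(lines, start=1):
            # Every field on the line is a pronunciation; there is no label.
            variants = line.split("#", 1)[0].split()
            if variants:
                items[k][place] = variants
    return items


places = {
    "Place1": ["pronunciation11a pronunciation11b", "pronunciation21", "pronunciation31"],
    "Place2": ["pronunciation12", "", "pronunciation32"],  # blank line: no data
}

items = regroup_unlabeled(places, 3)
print(items[2])  # {'Place1': ['pronunciation21']}
```

The blank second line in Place2 keeps the third line aligned with item 3, which is why a missing item needs a blank line rather than being left out.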

-r : Raw input

If you use this option, all input data will be treated as is. Quote and backslash have no special meaning and are copied to the output files unchanged. You cannot have spaces in an item. Note: anything starting with a hash (#) is still treated as a comment, and the remainder of the line will be ignored.
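
As a rough illustration of what raw input means, the Python sketch below splits a line the way this description suggests: comments are cut off, the rest is split on white space, and nothing else is interpreted. The name `tokenize_raw` is hypothetical and the sketch is not the program's own code.

```python
def tokenize_raw(line):
    """Split a raw-mode line: only the comment rule and white space matter."""
    return line.split("#", 1)[0].split()


# Quotes and backslashes are ordinary characters here, so the space still
# separates two items.
print(tokenize_raw(r'"abc\ def" ghi # note'))  # ['"abc\\', 'def"', 'ghi']
```

The quoted item from the earlier examples falls apart into two items, which is why items cannot contain spaces in raw mode.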