RuG/L04 - Manuals

Unicode

How to use Unicode in the programs leven, features, and xstokens

General

Most programs treat data as 8-bit strings, without assuming any particular encoding. There are two exceptions:

Labels are decoded as ISO-8859-1 when they are used to put strings in PostScript images.
The leven program optionally decodes dialect data (not the labels in the datafiles) as UTF-8.
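
The difference between the two decodings shows up as soon as the data contains non-ASCII characters. The following Python sketch is not part of the L04 package, and the sample word is made up; it only illustrates how the same raw bytes read under each decoding:

```python
# Illustration only: the same raw bytes decoded in two different ways.
raw = "schön".encode("utf-8")          # 6 bytes: "ö" takes two bytes in UTF-8

as_utf8 = raw.decode("utf-8")          # the bytes read as UTF-8
as_latin1 = raw.decode("iso-8859-1")   # the same bytes read as ISO-8859-1

print(len(raw), as_utf8, as_latin1)
```

Read as ISO-8859-1, the two bytes of "ö" turn into two separate characters. So a label stored as UTF-8 would likely come out garbled in a PostScript image.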

Using Unicode in program leven

You can put Unicode strings in datafiles as lists of numbers preceded by a plus sign, but this form is not easy for humans to read.
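
If you want to generate the numeric form from ordinary text, a small helper can do the conversion. This is a minimal sketch: the function name `to_code_points` is hypothetical, and the layout it produces (decimal code points separated by spaces) is an assumption, since the exact syntax is not specified here. Check what leven actually expects before relying on it.

```python
def to_code_points(text):
    """Render text as a plus sign followed by its Unicode code points.

    ASSUMPTION: decimal numbers separated by single spaces. The separator
    and number base that leven expects may differ.
    """
    return "+ " + " ".join(str(ord(ch)) for ch in text)


print(to_code_points("schön"))
```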

You can also put strings in datafiles encoded in UTF-8. You need to tell leven about this by including this line in each datafile affected:

    %utf8

This line affects the rest of the datafile, so be sure to put it at the top. Labels in datafiles will always be interpreted as raw 8-bit strings.
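
A quick check before running leven can catch a missing or misplaced directive. The sketch below is a hypothetical helper, not part of the L04 package. It assumes only that the directive is a line consisting of `%utf8`; it knows nothing else about the datafile format, so it cannot tell labels from dialect data and will also report a non-ASCII label that precedes the directive.

```python
def check_utf8_directive(data):
    """Return a list of problems found in the raw bytes of a datafile.

    An empty list means that a %utf8 line was found and that no non-ASCII
    bytes appear before it.
    """
    lines = data.splitlines()
    # Position of the first line that consists of the directive only
    first = next(
        (i for i, line in enumerate(lines) if line.strip() == b"%utf8"),
        None,
    )
    if first is None:
        return ["no %utf8 line found"]

    problems = []
    # Data before the directive is not covered by it
    for number, line in enumerate(lines[:first], start=1):
        if any(byte > 0x7F for byte in line):
            problems.append(f"line {number}: non-ASCII bytes before %utf8")
    return problems
```

For a file on disk, pass its raw contents, for example `check_utf8_directive(open(path, "rb").read())`, where `path` is the datafile to check.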

Using Unicode in program features

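The features program processes all files as 8-bit data, and this applies to the datafiles as well as to the feature definition file. So you can use any character encoding you like, single-byte or multi-byte. You cannot use UTF-7, however, because it encodes non-ASCII characters as sequences of plain ASCII characters, which byte-wise processing cannot distinguish from literal ASCII text.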
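
The problem with UTF-7 can be seen by comparing the bytes it produces with those of UTF-8. This Python sketch is an illustration only, with a made-up sample word; it does not show how the features program works internally.

```python
text = "schön"

utf8 = text.encode("utf-8")
utf7 = text.encode("utf-7")

# In UTF-8, every byte that belongs to a non-ASCII character is 0x80 or
# higher, so it can never be mistaken for an ASCII character.
non_ascii_utf8 = [byte for byte in utf8 if byte >= 0x80]

# UTF-7 output consists of ASCII bytes only: the non-ASCII character is
# written as a run of ASCII letters introduced by "+".
all_ascii_utf7 = all(byte < 0x80 for byte in utf7)

print(utf8, non_ascii_utf8)
print(utf7, all_ascii_utf7)
```

A program that compares bytes one at a time sees the UTF-7 run as ordinary ASCII characters, with no way of knowing that together they stand for a single other character.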

Using Unicode in program xstokens

For xstokens the same applies as for the features program.