At the moment, no example data is available for demonstrations. As soon as
this changes, this page will be updated.
The Levenshtein algorithm in its simplest form compares two sequences of
characters, and all that matters is whether or not two characters are the
same. When you want to measure phonetic differences, there is much room
for improvement: you want to compare sequences of sounds. There is a
relationship between a sequence of sounds and the sequence of characters
used to record it, so a simple Levenshtein measurement gives a reasonable
impression of dialect differences (see
part 2), but it
is not the most accurate method imaginable.
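As a baseline, the simple character-based measurement can be sketched in a few
lines of Python (the example words here are invented for illustration, not
taken from any dialect data):

```python
# A minimal character-level Levenshtein distance, as used in the simple
# measurement described above: all that matters is whether two characters
# are the same or not.

def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance between two sequences."""
    m, n = len(a), len(b)
    # dist[i][j] = distance between a[:i] and b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # i deletions
    for j in range(n + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # same character or not
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution
    return dist[m][n]

print(levenshtein("water", "vatten"))  # → 3
```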
There are two issues:
- Often, a sound is not recorded with a single character, but with a sequence
of characters.
- There is much variation in how much two sounds differ from each other.
Below are two words: on the left, transcribed in phonetic script; on the
right, transcribed in X-Sampa, a coding system for writing phonetic
script with a standard keyboard, using only the standard US-ASCII
character set.
(See: IPA/X-SAMPA chart by Andrew Mutchler)
Both words consist of a sequence of four sounds, but in electronic form they
are recorded as sequences of eight and seven characters, using one to
three characters per sound in this example. Not only are accents coded by
adding extra characters; basic sounds, too, are often coded with more
than one character, like the combination
p\ for
the second sound in the second word.
What happens if you compare these two sequences with the Levenshtein algorithm
is shown in the diagram below. (See also:
Levenshtein demo)
The first step towards a more accurate measurement is splitting the
character sequence into tokens: each group of characters that represents
one phonetic base symbol or one accent is taken as one token. You can use
the
xstokens program to do this.
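As a sketch, such a tokenisation step could look like this in Python. The
word and the symbol inventory are a tiny, invented subset; the real xstokens
program works from the full X-Sampa symbol table:

```python
# Greedy longest-match tokenisation of an X-Sampa string: multi-character
# symbols like p\ must be recognised as one token, not split into p and \.
# The inventory below is hypothetical, for illustration only.

SYMBOLS = ["p\\", "_o", "E", "a", "p", "f", "e", "l", ":"]

def tokenize(s):
    tokens = []
    i = 0
    while i < len(s):
        # try the longest symbol that matches at position i
        for sym in sorted(SYMBOLS, key=len, reverse=True):
            if s.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            raise ValueError(f"unknown symbol at position {i}: {s[i:]}")
    return tokens

print(tokenize("ap\\El"))  # → ['a', 'p\\', 'E', 'l']
```

Note that without the longest-match rule, `p\` would wrongly come out as two
tokens.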
After this tokenisation, Levenshtein runs like this:
Unfortunately in this case, there is no improvement. You have to go one step further.
Sounds have features. You could say that the more features are different
between two sounds, the more different those two sounds are. Instead of
comparing sounds as single units, you could split each sound into a
sequence of features, and thus, you could compare features.
Vowel sounds differ from each other, among other things, by the location in the
mouth of the highest point of the tongue: in front, in the middle, or
in the back of the mouth. You can code for this feature with a sequence
of two tokens:
tongue | coding
---|---
front | Ta1 Tb1
middle | Ta0 Tb1
back | Ta0 Tb0
This way, a front vowel differs one token from a middle vowel, and two tokens
from a back vowel.
You can do a similar thing for the jaw position:
jaw | coding
---|---
closed | Ja1 Jb1 Jc1
half closed | Ja0 Jb1 Jc1
half opened | Ja0 Jb0 Jc1
opened | Ja0 Jb0 Jc0
And for the position of the lips:
lips | coding
---|---
rounded | L1
unrounded | L0
So, each single vowel sound is replaced by a long sequence of tokens. For
example, the "i":
i: Ta1 Tb1 Ja1 Jb1 Jc1 L0
| | | | | |
| | | | | +---> unrounded
| | | | |
| | +---+---+---> closed
| |
+---+---> front
This differs only slightly (1 token) from the "y":
y: Ta1 Tb1 Ja1 Jb1 Jc1 L1
| | | | | |
| | | | | +---> rounded
| | | | |
| | +---+---+---> closed
| |
+---+---> front
But the difference with "o" is much bigger (4 tokens):
o: Ta0 Tb0 Ja0 Jb1 Jc1 L1
| | | | | |
| | | | | +---> rounded
| | | | |
| | +---+---+---> half closed
| |
+---+---> back
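The codings above can be verified with a few lines of Python: counting the
positions where two codings differ reproduces exactly the distances stated
for "i", "y", and "o":

```python
# The vowel codings from the tables above, written out as token sequences.
# The number of differing token positions is the (unweighted) distance
# between two vowels under this coding scheme.

VOWELS = {
    "i": ["Ta1", "Tb1", "Ja1", "Jb1", "Jc1", "L0"],  # front, closed, unrounded
    "y": ["Ta1", "Tb1", "Ja1", "Jb1", "Jc1", "L1"],  # front, closed, rounded
    "o": ["Ta0", "Tb0", "Ja0", "Jb1", "Jc1", "L1"],  # back, half closed, rounded
}

def token_difference(v1, v2):
    """Number of differing tokens between two equal-length codings."""
    return sum(t1 != t2 for t1, t2 in zip(VOWELS[v1], VOWELS[v2]))

print(token_difference("i", "y"))  # → 1
print(token_difference("i", "o"))  # → 4
```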
You can use the xstokens program to do all this
recoding, and then use leven to
measure the differences.
This method is reasonably simple, and more accurate than an ordinary
Levenshtein measurement based on the original character sequences. But
it isn't perfect:
- The effective weight of a feature is determined by the number of tokens
used to code it. So you have to strike the correct balance.
- The coding of basic sounds combined with the coding of accents can be
unfortunate. Suppose, you have a front vowel with an accent that
indicates that it is a little less fronted than it would have been
without the accent, and you have a middle vowel with an accent that
puts it little towards the front. These two sounds are thus less
different than two sounds written without these accents, but with our
coding scheme, they will look more different, not less.
- The coded sequences are much longer than the uncoded sequences, and since
the number of computations for the Levenshtein algorithm grows with
the product of the lengths of the two sequences, the overall measurement
can become very time-consuming.
We take another step. We split the sequence of characters into basic sounds and accents, and
then join each basic sound and its accents into a single token. A simple
Levenshtein measurement based on such tokens runs like this:
Two sequences of four sounds each, and none of the sounds in one word is
identical to any sound in the other word, so the measured difference is
maximal. That is not what we want. We don't want the substitution of two
different sounds to give the maximal value, but a value based on the
actual difference of both sounds. Like this:
First you get the comparison of two a-like sounds, then the deletion of a p,
then the comparison of two f-like sounds, then the comparison of two
e-like sounds, and finally the insertion of an l.
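A Levenshtein variant with graded substitution costs produces exactly this
kind of alignment. The sketch below uses invented token names and distances;
in the real procedure the distances come from the feature definitions:

```python
# Levenshtein with graded substitution costs: replacing one sound by a
# similar sound costs less than replacing it by a very different one.
# Token names and distance values are hypothetical, for illustration.

def sound_dist(s1, s2, distances):
    if s1 == s2:
        return 0.0
    # unlisted pairs get the maximal cost of 1.0
    return distances.get((s1, s2), distances.get((s2, s1), 1.0))

def weighted_levenshtein(a, b, distances, indel=1.0):
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel
    for j in range(1, n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + indel,   # deletion
                          d[i][j - 1] + indel,   # insertion
                          d[i - 1][j - 1] + sound_dist(a[i - 1], b[j - 1], distances))
    return d[m][n]

# Hypothetical tokens mirroring the alignment described in the text:
# two a-like sounds, a deleted p, two f-like sounds, two e-like sounds,
# and an inserted l.
distances = {("a1", "a2"): 0.2, ("f1", "f2"): 0.3, ("e1", "e2"): 0.1}
word1 = ["a1", "p", "f1", "e1"]
word2 = ["a2", "f2", "e2", "l"]
print(round(weighted_levenshtein(word1, word2, distances), 2))  # → 2.6
```

The result 2.6 is the sum of the three partial substitution costs plus one
deletion and one insertion at cost 1.0 each.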
The procedure to get to such a measurement involves the following steps:
1. Making a detailed description of tokens and their associated features, and
of how the features of accents combine with the features of basic sounds.
2. Converting the files with dialect data, such that each token sequence
(one basic sound plus its accents) that represents a unique sound is
replaced by a single, unique token.
3. Determining the distances between tokens based on the differences of their
(phonetic) feature values. Steps 2 and 3 are done with the
features program, based on the
definition you wrote in step 1.
4. Determining the Levenshtein differences with the leven
program, of the recoded dialect data made in step 2, using the distances
between tokens that were calculated in step 3.
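The distance-computing part of step 3 can be sketched as follows, assuming a
simple numeric feature table (which is hypothetical here; in practice it comes
from the definition written in step 1 and is processed by the features
program):

```python
# Sketch of step 3: derive distances between sound tokens from numeric
# feature values. The feature table below is invented for illustration.

# feature values per sound: (tongue position, jaw opening, lip rounding)
FEATURES = {
    "i": (1.0, 0.0, 0.0),   # front, closed, unrounded
    "y": (1.0, 0.0, 1.0),   # front, closed, rounded
    "o": (0.0, 0.67, 1.0),  # back, half closed, rounded
}

def token_distance(s1, s2):
    """Manhattan distance over the feature values of two sounds."""
    return sum(abs(f1 - f2) for f1, f2 in zip(FEATURES[s1], FEATURES[s2]))

# Pairwise distances for all tokens, as input for the Levenshtein step (step 4):
for a in FEATURES:
    for b in FEATURES:
        if a < b:
            print(a, b, round(token_distance(a, b), 2))
```

The resulting table of pairwise token distances is what the leven program
then uses as substitution costs.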
Unfortunately, step 1 is quite complex. Whether this procedure justifies the
effort depends partly on the condition of the data. You could strive for
a feature/value definition that is as scientifically correct as possible,
but if the data wasn't collected very carefully, or if it contains too
much noise, then it doesn't pay to take your analysis this far.