Levenshtein algorithm

Levenshtein

Joseph B. Kruskal.
An overview of sequence comparison.
In David Sankoff and Joseph Kruskal, editors, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, chapter one. CSLI Publications, 1999.

William A. Gale and Kennneth W. Church.
A program for aligning sentences in bilingual corpora.
Computational Linguistics, 19(1):75-102, 1993.

Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of (single-phone) insertion, deletion and substitution. In the simplest versions substitutions cost two units except when the source and target are identical, in which case the cost is zero. Insertions and deletions costs half that of substitutions. This demonstration illustrates a simple algorithm which basically looks at all of the different ways for operations to transform one string to another.

The algorithm starts with the upper left-hand corner of an two-dimensional array indexed in rows by the letters of the source word, and in columns by the letters of the target. It fills out the rest of the array while finding all the distances between each initial prefix of the source on the one hand and each initial prefix of the target. Each [i,j] cell represents the (minimal) distance between the first i letters of the source word and the first j letters of the target.

You can fill in the value of a cell only in case the values of all its neighbours upward and to the left have been filled in.

Levenshtein distance(adresse, address)

		a	d	d	r	e	s	s
	0	1	2	...
a	1
d	2
r	...
e
s
s
e

Top horizontal row is always 1,2,... -- cost of insertions

Left vertical column is always 1,2,... -- cost of deletions

begin at upper left (<= 0)
to fill in a cell:

diag above
left min (above + delete,
diag + replace,
left + insert)
each lower right corner of table contains LevD (for strings transformed from its indices)

Once cells up and to the left have been filled in, we can fill in the adjoining cell. We simply calculate all three values which might appear there -- the result of deleting the letter indexing that column (added to the value directly above), the result of inserting the letter indexing the row (added to the value to the left), and the result of replacing the column letter by the row letter (added to the value diagonally upward and to the left. [lower diagram above]

We can complete the entire array in this way, and once we have, we know the least costly set of operations that map one string to another. This involves some tracing back through the array, but the relevant cells are coloured contrastively in the demonstration. By following where the least costly values were found, we can see which operations are assumed to have resulted in the target string.

The demonstration likewise lets you play with several refinements of the basic algorithm. You can assign "Spoonerism"-like swaps a special value by using the feature "swap". Try the `horse'/`ros' pair under the "Examples" button. You can also vary the costs of transformations by making them sensitive to the phonetic similarity of the elements replacing one another. Then replacing `d' by `t' will for example be less costly than replacing `d' by `u'. These variant uses phonetic features to refine transformation costs.

The direct results of the application of the Levenshtein algorithm is the distance calculated between the strings and the set of transformations that contributed to the least costly set responsible for that distance. But the algorithm is also used directly to align sequences, and the demonstration shows how this works as well. The alignment identifies pairs of letters from the source and target strings which correspond in that the optimal length computation identified them as involved in `substitutions'. Many of the corresponding pairs are identical in source and target (take a look at the result of comparing `industry' to `interest').

There are lots of applications of Levenshtein distance. It is used in biology to find similar sequences of nucleic acids in DNA or amino acids in proteins. It is used in some spell checkers to guess at which word (from a dictionary) is meant when an unknown word is encountered. Wilbert Heeringa's dialectology project uses Levenshtein distance to estimate the proximity of dialect pronunciations. And some translation assistance projects have used the alignment capability of the algorithm in order to discover (the approximate location of) good translation equivalents. This application, using potentially large texts, requires optimisations to run effectively.

diag	above
left	min (above + delete, diag + replace, left + insert)