Language Guesser

This program tries to guess the language of a text. It uses a very simple method that was developed in the early nineties (1). The method looks only at frequently used characters and short sequences of characters. Even so, it works quite well for longer input texts, as you can try for yourself. This demo uses the implementation of (2). It can recognize these 178 languages:

list of known languages
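
To make "frequently used characters and short sequences of characters" more concrete, here is a minimal Python sketch of the general idea described in (1): build a ranked list of the most frequent character sequences for a text, then compare that ranking with the ranking stored for each language. This is not the code of (2). The names ngram_profile and out_of_place, and the limits max_n and top, are assumptions made for this illustration, and the sketch splits on whitespace only.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top=300):
    """Return the `top` most frequent character n-grams, most frequent first.

    max_n and top are illustrative values, not settings taken from textcat.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark the start and end of each word
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Highest count first; ties are broken alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles (lower is better).

    An n-gram that is missing from the language profile gets the
    maximum penalty, here the length of the language profile.
    """
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


if __name__ == "__main__":
    # Tiny made-up samples; real language profiles are built from much more text.
    english = ngram_profile("the quick brown fox jumps over the lazy dog and the cat")
    dutch = ngram_profile("de snelle bruine vos springt over de luie hond en de kat")
    doc = ngram_profile("the dog and the fox")
    print("english:", out_of_place(doc, english))
    print("dutch:  ", out_of_place(doc, dutch))
```

The language whose profile gives the lowest score would be the best guess. With samples this small the scores mean little, which is one reason the method needs longer input texts.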

The algorithm always produces a winner, even for input languages it doesn't know about. How reliable is this method? The following parameters help to distinguish a reliable from an unreliable result.

There must be enough text to work with. When the text is shorter than MinDocSize, the program won't try to guess.

When the second-best guess has a score very close to the best score, it might actually be the right guess. Which languages should be considered good guesses? This is set by ThresholdValue. If this value is set to 1.03, and the best score is 100, then all languages with a score up to 103 (1.03 times 100) are considered valid guesses. (Lower scores are better than higher scores.)

When there are too many candidates with a good score (set by ThresholdValue), then the outcome isn't very useful. That's why there should be no more than MaxCandidates, or the program will tell you "Can't guess language".
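
The three parameters above could be combined roughly as in the following Python sketch. It is an assumption about how such a check might look, not the code of (2): the function pick_candidates, its arguments, and the values used in the example are made up for illustration, apart from the 1.03 threshold taken from the text. Whether the document size is counted in characters or bytes is not specified here.

```python
def pick_candidates(scores, doc_size, min_doc_size, threshold_value, max_candidates):
    """Return the languages considered valid guesses, best first.

    scores maps a language code to its score; lower scores are better.
    An empty list means that no guess is made.
    """
    # Not enough text to work with: don't try to guess.
    if doc_size < min_doc_size or not scores:
        return []

    best = min(scores.values())
    limit = best * threshold_value  # e.g. 100 * 1.03 = 103
    candidates = sorted(
        (lang for lang, score in scores.items() if score <= limit),
        key=lambda lang: scores[lang],
    )

    # Too many languages with a good score: the outcome isn't useful.
    if len(candidates) > max_candidates:
        return []
    return candidates


if __name__ == "__main__":
    # Hypothetical scores for a 500-character text.
    scores = {"en": 100.0, "nl": 102.0, "de": 110.0}
    result = pick_candidates(scores, doc_size=500, min_doc_size=25,
                             threshold_value=1.03, max_candidates=5)
    print(result or "Can't guess language")
```

In this example "en" and "nl" both fall within the limit of 103, so both are reported as valid guesses. Lowering max_candidates to 1 would turn the same input into "Can't guess language".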

There is a lot more information about language guessers to be found on the web, such as on Wikipedia (3). The language codes used in this demo link to the website (4). More information about the world's more than 6000 languages can be found at (5) and (6).

(1) paper by William B. Cavnar and John M. Trenkle that explains the algorithm: N-Gram-Based Text Categorization (PDF)
(2) the implementation used for this demo: textcat
(3) an article about language identification on Wikipedia
(4) ISO 639 Code Tables
(5) The World Atlas of Language Structures
(6) world map of languages at Phoible

made by P. Kleiweg
