TextCat

TextCat is an implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization", in Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994. The paper is no longer available at its original location. Now you can download it here.

I have applied the technique to implement a written language identification program. At the moment, the system knows about 69 natural languages (counting Esperanto as a natural language).
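
As a rough illustration of the technique in the Cavnar and Trenkle paper, here is a minimal Python sketch of rank-ordered n-gram profiles compared with the "out-of-place" measure. It is not the text_cat code (which is written in Perl): the function names ngram_profile, out_of_place and classify are hypothetical, the tokenisation is simplified, and the profile length of 300 and the n-gram lengths 1 to 5 are assumptions taken from the paper's description rather than from text_cat's settings.

```python
import re
from collections import Counter

PROFILE_SIZE = 300  # assumed profile length, not a text_cat setting


def ngram_profile(text, max_n=5, size=PROFILE_SIZE):
    """Return the most frequent character n-grams, most frequent first."""
    counts = Counter()
    # Keep runs of letters only; digits and punctuation act as separators.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:size]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else penalty
        for rank, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the key of the profile with the smallest out-of-place distance."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


if __name__ == "__main__":
    # Tiny made-up training samples; real language models need far more text.
    samples = {
        "english": "the quick brown fox jumps over the lazy dog and then the other dog",
        "dutch": "de snelle bruine vos springt over de luie hond en dan de andere hond",
    }
    profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
    print(classify("the dog jumps over the other fox", profiles))
```

With training samples this small the result is not reliable. The sketch is only meant to show the shape of the computation: build one profile per language, build a profile for the input, and pick the language whose profile is closest.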

The textcat programme is no longer actively maintained by me. However, the SpamAssassin spam filter programme includes a version of TextCat. Its developers have done further work on it, so you may prefer to get their version from http://spamassassin.apache.org.

Local links

I have removed the demo
The sources, cf. COPYING and Copyright
The list of languages supported
List of competitors

Installation

Edit the text_cat script in two places: make the first line point to your Perl binary, and make $opt_d point to the LM directory.

Usage

text_cat -h displays usage information.

Remotely related links

Survey on the State of the Art in Human Language Technology contains a chapter on language identification (both for spoken and written language).
LIFI: Language Identification From Images. Quote: The Language Identification From Images project (LIFI) is concerned with the automated identification of the script (alphabet) used in a document image. Our initial phase, from 1994 to 1995, focused on machine-printed documents. The second phase, from 1997 to 1998, focused on handwritten documents. We will soon begin a new project concerning how to use our script identification techniques to segment multi-script document images.
Automatic Spoken Language Identification Bibliography. This bibliography lists research on the automatic identification of spoken language. There are also some links on the identification of written language.
The World's Main Languages. Lots of information on languages and the internet.
Corpus building for minority languages

Interesting test cases

Staat men perplex, wil men eerst wat thee, of direct op visite in een bastion (erg vitaal detail: is er concreet theatraal of eerder absurd, abstract amusement)? (Hilverd Reker)
Shy Pakistani chaps' wan kin always aim to put bananas away and wink at (or chat up) hepatitic llama mamas at Jesus' pita chip snack shack by a quay in China. (Jon Azose)