TextCat

TextCat is an implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization", in Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994. This paper was available at:
  1. http://msen.com/~wei/JT-homepage.html
  2. http://spd.erim.org/jt_papers/
  3. John Trenkle's homepage, as papers/sdr94ps.gz.
Now you can download it here.

I have applied the technique to implement a written language identification program. At the moment, the system knows about 69 natural languages (counting Esperanto as a natural language).
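
As an illustration of the technique, here is a minimal Python sketch of n-gram rank profiles compared with an "out-of-place" distance, in the spirit of the Cavnar and Trenkle paper. It is not the text_cat code itself: the function names, the letters-only tokenization, the profile size of 300 and the tiny training samples are all assumptions made for the example, and the real script and its language models may differ in every one of these details.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first.

    Assumption: tokens are runs of letters, padded with '_' on both sides,
    and n-grams of length 1..max_n are counted.
    """
    counts = Counter()
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{token}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles.

    An n-gram missing from lang_profile costs a fixed maximum penalty
    (here: the length of lang_profile, an assumption of this sketch).
    """
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the key of the profile with the smallest distance to text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical toy training data; real profiles need far more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "dutch": "de snelle bruine vos springt over de luie hond en de kat",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(classify("the dog and the fox", PROFILES))
```

With profiles built from a reasonable amount of text per language, the language whose profile has the smallest distance to the document's profile is reported as the guess.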

The textcat programme is no longer actively maintained by me. However, the SpamAssassin spam filter programme includes a version of TextCat. Its developers have continued to work on it, so you may prefer to get their version from http://spamassassin.apache.org.

Local links

Installation

Edit the text_cat script so that its first line points to your Perl binary, and so that $opt_d points to the LM directory.

Usage

text_cat -h displays usage information.

Remotely related links

Interesting test cases