TextCat
TextCat is an implementation of the text categorization algorithm
presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text
Categorization", in Proceedings of Third Annual Symposium on Document
Analysis and Information Retrieval, Las Vegas, NV, UNLV
Publications/Reprographics, pp. 161-175, 11-13 April 1994.
This paper was available at:
- http://msen.com/~wei/JT-homepage.html
- http://spd.erim.org/jt_papers/
- John Trenkle's homepage, as papers/sdr94ps.gz.
Now you can download it here.
I have applied the technique to implement a written language
identification program. At the moment, the system knows about 69
natural languages (counting Esperanto as a natural language).
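The general idea of the Cavnar and Trenkle paper can be sketched in a few lines of Python: build a ranked profile of the most frequent character n-grams for each language, build the same kind of profile for the input text, and pick the language whose profile has the smallest "out-of-place" rank distance. This is only a minimal sketch and not the text_cat code itself. The function names, the underscore padding, the whitespace tokenization, and the profile size of 300 are assumptions made for illustration, and the training snippets in the usage lines are made up.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n),
    most frequent first. Tokenization is a simplification: split on
    whitespace and pad each token with '_' to mark word boundaries."""
    counts = Counter()
    for token in text.lower().split():
        padded = "_" + token + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                # Skip n-grams that consist of padding only.
                if gram.strip("_"):
                    counts[gram] += 1
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles. An n-gram missing
    from lang_profile is charged a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def classify(text, lang_profiles):
    """Return the key of the profile closest to the text's own profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Hypothetical usage with tiny, made-up training snippets.
profiles = {
    "english": ngram_profile("the cat sat on the mat and the dog sat with the cat"),
    "dutch": ngram_profile("de kat zat op de mat en de hond zat bij de kat"),
}
guess = classify("the dog and the cat", profiles)
```

Real language models are built from far larger samples than the two sentences used here; with very short training text the ranking is dominated by noise.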
The textcat programme is no longer actively maintained by me. However, the
SpamAssassin spam filter programme includes a version of TextCat. Its developers
have continued to work on it, so you may prefer to get their version from
http://spamassassin.apache.org.
Installation
Edit the text_cat script to have the first line point to your Perl binary.
Edit the text_cat script to have $opt_d point to the LM directory.
Usage
text_cat -h displays usage information.
Interesting test cases
- Staat men perplex, wil men eerst wat thee, of direct
op visite in een bastion (erg vitaal detail: is er
concreet theatraal of eerder absurd, abstract
amusement)? (Hilverd Reker)
- Shy Pakistani chaps' wan kin always aim to put bananas away and
wink at (or chat up) hepatitic llama mamas at Jesus' pita chip snack
shack by a quay in China. (Jon Azose)