This page is obsolete. Please refer to the new page: http://valeriobasile.github.io/twita


New! The TWITA collection is now available for download.

About

TWITA is a collection of tweets identified as being written in the Italian language.

This collection of tweets was harvested using two-pass language identification, aiming for general Italian. As a first step, we used cURL to download tweets from the Twitter Streaming API, searching for a list of representative words:

vita Roma forza alla quanto amore Milano Italia fare grazie 
della anche periodo bene scuola dopo tutto ancora tutti fatto
The list consists of the most frequent lemmas in the ItWaC corpus; all words that are also frequent in other languages (English, Spanish and Portuguese) were filtered out (e.g. come). As a second step, the tweets are passed to the language identification software langid.py to detect Italian.
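
The two passes described above can be sketched roughly as follows. This is only an illustration, not the code used to build TWITA: the function names `matches_keyword` and `two_pass_filter` are hypothetical, the keyword match is done locally on already-downloaded text (the original first pass was a search on the Twitter Streaming API via cURL), and the classifier is passed in as a parameter so the sketch does not depend on any particular library.

```python
import re

# The representative words listed above, lower-cased for matching.
KEYWORDS = {
    "vita", "roma", "forza", "alla", "quanto", "amore", "milano",
    "italia", "fare", "grazie", "della", "anche", "periodo", "bene",
    "scuola", "dopo", "tutto", "ancora", "tutti", "fatto",
}


def matches_keyword(text, keywords=KEYWORDS):
    """First pass (assumed, local approximation): True if any token is a keyword."""
    tokens = re.findall(r"\w+", text.lower())
    return any(token in keywords for token in tokens)


def two_pass_filter(tweets, classify, keywords=KEYWORDS, target="it"):
    """Keep tweets that contain a keyword AND are classified as `target`.

    `classify` is any callable mapping a string to a (language, score)
    pair; this parameter is an assumption of the sketch.
    """
    kept = []
    for text in tweets:
        if not matches_keyword(text, keywords):
            continue  # rejected by the first pass
        lang, _score = classify(text)
        if lang == target:  # second pass: language identification
            kept.append(text)
    return kept
```

With langid.py installed, `langid.classify` returns a (language, score) pair and could be passed directly as the `classify` argument, e.g. `two_pass_filter(tweets, langid.classify)`.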

new! Statistics

Processing

Some frequency lists of hashtags found in TWITA are available for download on the downloads page.

Publications