
Instructions: Frequency list

This Web interface allows you to run interactive queries on an indexed version of the Dutch Twitter corpus (January 2011 to December 2013, approx. 2.5 billion tweets and 22 billion tokens) collected by the University of Groningen. With a frequency cut-off of 10, the database contains approx. 6.5M distinct unigrams, 50M bigrams, 124M trigrams, 136M 4-grams, and 118M 5-grams. Note that only n-gram frequencies are given. To view trends in word frequencies, locations, and the actual tweets, see Woordfrequenties op Twitter or twiqs.nl (hosted by the e-science center). The data collection has been described in a number of papers by Erik Tjong Kim Sang. The n-grams have been indexed in an SQLite database with a total size of 35 gigabytes, using software developed by Stefan Evert (paper).

If you want to rank matches by their association strength instead, click the Associations tab at the top of this page.

For any further questions or bug reports, please contact Gosse Bouma.

Search pattern

The search pattern consists of up to 5 terms, which represent the elements of an N-gram and must be separated by blanks. Our database engine supports five different types of search terms:

Push the Search button to execute your query, Help to display this help page, or Reset Form to start over from scratch. The CSV button returns a CSV table suitable for import into a spreadsheet program or database. The XML button returns the search results in an XML format, allowing this interface to be used as a Web service.
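As a rough illustration of post-processing a CSV export, the sketch below parses CSV text with Python's standard csv module. The column layout of the real export is not described on this page, so the header row, the column names ngram and frequency, and the sample text are all assumptions invented for the example; adapt them to whatever the CSV button actually returns.

```python
import csv
import io


def load_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into dicts keyed by the header row (a header is assumed)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Invented placeholder data, NOT real corpus output.
# The column names "ngram" and "frequency" are hypothetical.
sample = "ngram,frequency\nfoo bar,12\nfoo baz,10\n"

rows = load_rows(sample)
total = sum(int(row["frequency"]) for row in rows)
```

If the real export has no header row, csv.reader can be used instead and the columns addressed by position.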

Options

You can customise the display format of search results with the option menus below the search pattern:

Examples

The examples below include comments starting with //, which must not be entered in the search pattern field.

interessante *             // what are people most interested in?

* viool                  // '*' at the start of a query is much slower

sprak ? * [man,vrouw]       // use '?' to skip determiner etc.

 [houd, houdt, houden] van ? * // what do people enjoy? 
                              // notice the space at the start
                              // (use "collapsed" display)

%name ? * geweld       // use with "grouped" display

van * tot *               // a classic of Googleology

anti-establishment ?  // a trick to obtain unigram frequencies
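
When copying the examples above into a script, the // comments have to be removed first, and a pattern may have at most 5 terms. The following is a minimal sketch of such a check, not part of the interface itself. The function names are hypothetical, and treating a bracketed list such as [houd, houdt, houden] as a single term, even when it contains blanks, is an assumption based on the examples.

```python
import re

MAX_TERMS = 5  # the help text allows up to 5 terms per pattern


def strip_comment(line: str) -> str:
    """Drop a trailing '// ...' comment and trailing blanks.

    Leading whitespace is kept as-is. A pattern that itself contains
    '//' would be cut short by this simple approach.
    """
    return line.split("//", 1)[0].rstrip()


def split_terms(pattern: str) -> list[str]:
    """Split a pattern on blanks, keeping a [..] list as one term (assumption)."""
    return re.findall(r"\[[^\]]*\]|\S+", pattern)


def check_pattern(line: str) -> list[str]:
    """Return the terms of a commented example, or raise if the count is invalid."""
    terms = split_terms(strip_comment(line))
    if not 1 <= len(terms) <= MAX_TERMS:
        raise ValueError(f"expected 1 to {MAX_TERMS} terms, got {len(terms)}")
    return terms


terms = check_pattern("sprak ? * [man,vrouw]       // use '?' to skip determiner etc.")
```

Here terms holds the four elements sprak, ?, * and [man,vrouw]; a line with six or more terms, or with only a comment, raises ValueError.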