The Google Web 1T 5-Gram Database for European languages is a collection of frequent 5-grams of Web text for 10 European languages, collected by Google Research. This Web interface allows you to run interactive queries on an indexed version of the Dutch portion of the database (built from approximately 133 billion words of Dutch Web text) and displays the most frequent N-grams matching a specified search pattern. If you want to rank matches by their association strength instead, click the Associations tab at the top of this page. For the Web interface, case-folding and some additional normalization of the N-grams have been performed, so the frequency counts may occasionally differ from those found in the original Google data. The normalized N-grams have been indexed in several SQLite databases with a total size of 30 gigabytes, using software developed by Stefan Evert (paper, demo for English). For any further questions or bug reports, please contact Gosse Bouma.
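To see why case-folding can change the frequency counts, consider the following minimal sketch in Python. It assumes that normalization simply lowercases each N-gram and sums the frequencies of all variants that become identical. The additional normalization steps used for this interface are not described here and are not modelled, and the counts are made-up toy values, not real Google data.

```python
from collections import Counter

def fold_counts(raw_counts):
    """Merge the frequency counts of N-grams that differ only in case."""
    folded = Counter()
    for ngram, freq in raw_counts.items():
        # Assumption: case-folding is plain lowercasing.
        folded[ngram.lower()] += freq
    return folded

# Toy counts, invented for illustration only.
raw = {"Het huis": 120, "het huis": 300, "HET HUIS": 5, "het huisje": 40}
folded = fold_counts(raw)
print(folded["het huis"])  # 425
```

After folding, the three spellings of "het huis" share a single entry, so its count is higher than that of any individual entry in the unnormalized data.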
The search pattern consists of up to 5 terms, which represent the elements of an N-gram and must be separated by blanks. Unigram queries are currently not allowed, i.e. you have to specify at least 2 terms. Our database engine supports five different types of search terms:
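a literal word form (e.g. huis)
[huis,huisje] matches any of the word forms listed in brackets and separated by commas (→ huis, huisje)
% stands for an arbitrary substring within a word pattern
* matches an arbitrary word (usually the item of interest)
? indicates a skipped token, which will be ignored in the result set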
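The term types above can be illustrated with a small Python sketch that splits a search pattern into terms and labels each one. This is not the code used by the database engine. The function names `classify_term` and `parse_pattern` are hypothetical, and the sketch assumes that blanks inside a bracketed set (as in `[houd, houdt, houden]`) belong to the set rather than separating terms.

```python
import re

# A term is either a bracketed set (which may contain blanks) or a run of
# non-blank characters.
TOKEN_RE = re.compile(r"\[[^\]]*\]|\S+")

def classify_term(term):
    """Return a (type, value) pair for a single search term."""
    if term == "*":
        return ("wildcard", None)
    if term == "?":
        return ("skip", None)
    if term.startswith("[") and term.endswith("]"):
        words = [w.strip() for w in term[1:-1].split(",") if w.strip()]
        return ("set", words)
    if "%" in term:
        return ("pattern", term)
    return ("literal", term)

def parse_pattern(pattern):
    """Split a search pattern into terms and classify each of them."""
    terms = TOKEN_RE.findall(pattern)
    # Unigram queries are not allowed, and an N-gram has at most 5 elements.
    if not 2 <= len(terms) <= 5:
        raise ValueError("a search pattern needs between 2 and 5 terms")
    return [classify_term(t) for t in terms]

print(parse_pattern("sprak ? * [man,vrouw]"))
```

A pattern with a single term, such as `huis`, raises a `ValueError` in this sketch, mirroring the rule that at least 2 terms have to be specified.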
Push the Search button to execute your query, Help to display this help page, or Reset Form to start over from scratch. The CSV button returns a CSV table suitable for import into a spreadsheet program or database. The XML button returns the search results in an XML format, allowing this interface to be used as a Web service.
You can customise the display format of search results with the option menus below the search pattern:
The examples below include comments starting with //, which must not be entered in the search pattern field.
interessante *                  // what are people most interested in?
* viool                         // '*' at the start of a query is much slower
sprak ? * [man,vrouw]           // use '?' to skip determiner etc.
[houd, houdt, houden] van ? *   // what do people enjoy?
                                // notice the space at the start
                                // (use "collapsed" display)
%name ? * geweld                // use with "grouped" display
van * tot *                     // a classic of Googleology
anti-establishment ?            // a trick to obtain unigram frequencies
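If you keep such commented examples in a text file and want to submit them one by one, the comments have to be removed first. The following Python sketch shows one way to do this. The helper `strip_comment` is hypothetical, and the sketch assumes that the sequence // never occurs inside a search pattern itself.

```python
def strip_comment(line):
    """Return the query part of an example line, without its // comment."""
    query, _sep, _comment = line.partition("//")
    return query.strip()

examples = [
    "interessante *   // what are people most interested in?",
    "    // (use \"collapsed\" display)",
    "van * tot *      // a classic of Googleology",
]

# Lines that consist only of a comment yield an empty query and are dropped.
queries = [q for q in map(strip_comment, examples) if q]
print(queries)  # ['interessante *', 'van * tot *']
```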