Improving GrETEL's English
Liesbeth Augustinus and Vincent Vandeghinste
The GrETEL search engine (https://gretel.ccl.kuleuven.be) was originally developed to query Dutch treebanks. The extension to Poly-GrETEL (http://gretel.ccl.kuleuven.be/poly-gretel), a tool to query parallel treebanks, resulted in the inclusion of English data as well. The English EP Treebank consists of the English part of the Europarl 7 corpus (Koehn, 2005), parsed using the Stanford parser (Klein and Manning, 2003).
We have implemented a number of changes in order to improve and speed up the querying process of the English data.
To improve treebank search, we enriched the English EP Treebank with universal POS tags, which are also used in the Universal Dependencies (UD) project.
The original version of the treebank contains only the Penn Treebank POS tags, which encode both the general word class and more specific morphosyntactic information (e.g. 'NNS' for plural nouns). To search for a general word class, such as 'noun' or 'verb', one has to use an OR-statement or a regular expression in XPath, which is cumbersome, especially for non-technical users.
The inclusion of the universal POS tags facilitates the construction of generic queries. Furthermore, the enhancement is an intermediate step towards making GrETEL compatible with UD treebanks.
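The contrast between the two kinds of query can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the XML layout (nested `node` elements with `pos` and `word` attributes), the attribute name `upos`, and the partial tag mapping are hypothetical and are not taken from the actual treebank format or conversion table.

```python
import xml.etree.ElementTree as ET

# Partial, illustrative mapping from Penn Treebank tags to universal POS tags.
# Assumption: the conversion table actually used for the treebank may differ.
PENN_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "DT": "DET",
}


def add_universal_pos(root):
    """Add a (hypothetical) 'upos' attribute to every node with a known Penn tag."""
    for node in root.iter("node"):
        penn = node.get("pos")
        if penn in PENN_TO_UPOS:
            node.set("upos", PENN_TO_UPOS[penn])
    return root


def penn_or_query(upos):
    """Build the OR-statement XPath needed when only Penn tags are available."""
    tags = sorted(t for t, u in PENN_TO_UPOS.items() if u == upos)
    return "//node[" + " or ".join(f'@pos="{t}"' for t in tags) + "]"


# Hypothetical toy tree; the real treebank format is not shown in this abstract.
SAMPLE = """<node cat="S">
  <node cat="NP">
    <node pos="DT" word="The"/>
    <node pos="NNS" word="members"/>
  </node>
  <node cat="VP">
    <node pos="VBD" word="adopted"/>
    <node cat="NP">
      <node pos="DT" word="the"/>
      <node pos="NN" word="resolution"/>
    </node>
  </node>
</node>"""

tree = add_universal_pos(ET.fromstring(SAMPLE))

# With the enrichment, one attribute test covers the whole word class.
nouns = [n.get("word") for n in tree.findall(".//node[@upos='NOUN']")]

# Without it, the query has to enumerate every Penn tag of that class.
verbose_query = penn_or_query("NOUN")
```

In this sketch the generic query is a single attribute test, whereas the Penn-only query grows with the number of tags in the class (six alternatives for verbs in the mapping above).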
To speed up the search process, we have applied the GrInding indexing mechanism described in Vandeghinste & Augustinus (2014) to the English data. The speed gain allows us to include more data in GrETEL, which currently contains 2,268,204 English trees (compared to 177,824 trees in the English part of Poly-GrETEL).