How high can Frog leap?
Iris Hendrickx, Ko van der Sloot, Maarten van Gompel and Antal van den Bosch


Frog is a natural language processing pipeline for Dutch that enriches text with word and sentence boundaries, part-of-speech tags, lemmas, morphological analyses, syntactic information and named entities. Frog has been, and still is, used for the automatic linguistic enrichment of many Dutch corpora, for example the SoNaR corpus.

Most of the NLP modules in Frog use a k-nearest neighbour approach and are trained using TiMBL, the Tilburg memory-based learning software package. Many modules were already created in the 1990s by the ILK Research Group (Tilburg University, the Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium), but have been updated and retrained over the years. Frog is thus the result of many years of work; it is still actively supported and continues to be improved.

Because Frog has been under continuous development, we set out to evaluate each of its modules properly in order to estimate its current performance.

We aimed to use independently developed, unseen test sets for each of the modules. However, it was not always possible to find a new, unseen data set with manually validated annotations for every module. In such cases we ran 10-fold cross-validation experiments on the Frog training material to get an indication of Frog's performance.
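
For readers unfamiliar with the procedure, the following is a minimal Python sketch of 10-fold cross-validation in general. It is not the evaluation code used for Frog: the function names (k_fold_splits, cross_validate, train_majority) are hypothetical, the majority-label "model" is only a toy stand-in for a real tagger, and the way items are assigned to folds is an assumption made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_splits(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs that together cover range(n_items)."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    # Fold i takes every k-th item starting at i, so fold sizes differ by at most one.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # Everything outside the held-out fold is training material.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(
    features: Sequence,
    labels: Sequence,
    train_fn: Callable[[list, list], Callable],
    k: int = 10,
) -> float:
    """Train on k-1 folds, test on the held-out fold, and average the accuracies."""
    scores = []
    for train, test in k_fold_splits(len(labels), k):
        model = train_fn([features[i] for i in train], [labels[i] for i in train])
        correct = sum(model(features[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def train_majority(features: list, labels: list) -> Callable:
    """Toy stand-in for a real module: always predict the most frequent training label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda _x: majority


if __name__ == "__main__":
    toy_features = list(range(20))
    toy_labels = ["N"] * 15 + ["V"] * 5
    print(cross_validate(toy_features, toy_labels, train_majority))
```

Every item is used for testing exactly once and never appears in the training data of its own fold. For running text, folds are usually formed from whole sentences or documents rather than individual tokens, so that closely related material does not end up on both sides of a split.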

We present the outcome of the evaluation of the Frog modules, report on the speed of Frog and discuss our findings on how to further improve Frog.