

Evaluation and training

The treebank is primarily constructed to evaluate the performance of the parser. We do this by comparing the dependency structures that the parser generates to the dependency structures stored in the treebank. For this purpose we do not use the representation of dependency structures as trees, but the alternative notation as sets of dependency paths that we saw in the previous section. By comparing these sets, we can count the number of relations that are identical in the best parse generated by the system and in the stored structure. From these counts, precision, recall and F-score can be calculated. In practice, we primarily use the accuracy measure, which is defined as follows:
$\mathit{accuracy} = 1 - \frac{D_{f}}{\max(\vert D_{t}\vert, \vert D_{s}\vert)}$
$D_{s}$ is the set of dependency relations of the best parse generated by the system. $D_{t}$ is the set of dependency relations of the parse stored in the treebank. $D_{f}$ is the number of incorrect or missing relations in $D_{s}$.
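As an illustration, the sketch below computes these measures for two sets of dependency relations. It assumes that a relation can be represented as any hashable value (for instance a head/relation/dependent tuple) and reads $D_{f}$ as the number of incorrect relations plus the number of missing relations; both the representation and the helper name are ours, not Alpino's.

# Hedged sketch of the evaluation metrics described above. A dependency
# relation is assumed to be any hashable value, e.g. a (head, relation,
# dependent) tuple; this representation is illustrative, not Alpino's own.

def evaluate(d_s, d_t):
    """Compare system relations d_s to treebank relations d_t."""
    d_s, d_t = set(d_s), set(d_t)
    overlap = len(d_s & d_t)                      # relations identical in both

    precision = overlap / len(d_s) if d_s else 0.0
    recall = overlap / len(d_t) if d_t else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)

    # Assumption: D_f counts the incorrect relations in d_s plus the
    # relations of d_t that are missing from d_s.
    d_f = len(d_s - d_t) + len(d_t - d_s)
    accuracy = 1.0 - d_f / max(len(d_t), len(d_s)) if (d_s or d_t) else 1.0

    return precision, recall, f_score, accuracy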

The annotated corpus is also used to train the stochastic module of the Alpino grammar, which ranks the various parses of a sentence according to their probability. This ranking is done in two steps: first, we construct a model of what a "best parse" is. For this step, the annotated corpus is of crucial importance. Second, we evaluate parses of previously unseen sentences with this model and select as the most probable parse the one that best fits the model's notion of a "best parse".

The model for the probability of parses is based on the probabilities of features. These features should not be confused with the features in an HPSG feature structure. The features in this stochastic parsing model are chosen by the grammarian and can in principle be any characteristic of the parse that can be counted. Features that we use at present are grammar rules, dependency relations and unknown-word heuristics. We calculate the frequencies of the features in our corpus and assign them weights proportional to their probability. This is done in the first step, the training step. In the second step, the evaluation of a previously unseen parse, we count for each feature the number of times it occurs in the parse and multiply that count by the feature's weight. The sum of these weighted counts is a measure of the probability of the parse. We will now describe in more detail the Maximum Entropy model that we use for stochastic parsing (Johnson et al. 1999), first focusing on the training step and then turning to parse evaluation.
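The following sketch illustrates this scoring idea; the feature names and weight values are invented for illustration and do not come from the actual Alpino model.

# Illustrative sketch: a parse's score is the sum over its features of
# count(feature) * weight(feature). Feature names and weights are made up.
from collections import Counter

def score_parse(feature_counts, weights):
    """Weighted sum of feature counts for a single parse."""
    return sum(count * weights.get(feature, 0.0)
               for feature, count in feature_counts.items())

# A toy parse with two applications of a grammar rule, one dependency
# relation and one unknown-word heuristic (all hypothetical features).
parse_features = Counter({"rule:np_det_n": 2, "dep:su": 1, "unknown:guess_noun": 1})
weights = {"rule:np_det_n": 0.8, "dep:su": 1.2, "unknown:guess_noun": -0.5}
print(score_parse(parse_features, weights))   # 2*0.8 + 1*1.2 + 1*(-0.5), approx. 2.3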

The training step of the maximum entropy model consists of assigning weights to features. These weights are based on the probabilities of those features. To calculate these probabilities, we need a stochastic training set. We generate such a training set by first parsing each sentence in the corpus with the Alpino parser. The dependency structures of the parses generated by the parser (including the incorrect ones) are compared to the correct structure in the corpus and evaluated according to the evaluation method described above. Each parse is then assigned a frequency proportional to its evaluation score.
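A possible reading of this procedure is sketched below; parse_all, gold_relations, parse.relations and evaluate are hypothetical helpers standing in for the parser, the treebank lookup and the accuracy measure defined earlier, not part of Alpino's actual interface.

# Hedged sketch of building the stochastic training set. The helpers passed
# in are hypothetical: parse_all(sentence) yields all parses the parser
# finds, gold_relations(sentence) returns the stored dependency relations,
# and evaluate(parse_relations, gold) returns the accuracy score.

def build_training_set(corpus, parse_all, gold_relations, evaluate):
    training_set = []
    for sentence in corpus:
        gold = gold_relations(sentence)
        for parse in parse_all(sentence):          # also the incorrect parses
            accuracy = evaluate(parse.relations, gold)
            # The evaluation score is used as the parse's (pseudo-)frequency.
            training_set.append((sentence, parse, accuracy))
    return training_set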

Given the set of features (characteristics of parses) and the stochastic training set, we can calculate which features are likely to be included in a parse and which are not. This tendency is represented by assigning weights to the features. A large positive weight indicates that the model prefers parses containing the feature, whereas a negative weight indicates a dispreference. Various algorithms exist that are guaranteed to find the globally optimal settings for these weights, so that the probability distribution in the training set is represented as well as possible [8].
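As an illustration of what such an estimation algorithm optimizes, the sketch below implements plain gradient ascent on the conditional log-likelihood. This is only one of several possible algorithms and not necessarily the one used for Alpino; the data layout (one list of feature-count/empirical-probability pairs per sentence) is our own assumption.

# Minimal gradient-ascent sketch for maximum entropy weight estimation.
# data: for each sentence, a list of (feature_counts, empirical_prob) pairs,
# one per candidate parse, with the empirical probabilities summing to 1.
import math

def train_weights(data, features, iterations=100, learning_rate=0.1):
    weights = {f: 0.0 for f in features}
    for _ in range(iterations):
        gradient = {f: 0.0 for f in features}
        for parses in data:
            # Model distribution over this sentence's candidate parses.
            scores = [sum(weights.get(f, 0.0) * c for f, c in counts.items())
                      for counts, _ in parses]
            z = sum(math.exp(s) for s in scores)
            model_probs = [math.exp(s) / z for s in scores]
            for (counts, emp_prob), model_prob in zip(parses, model_probs):
                for f, c in counts.items():
                    if f in gradient:              # ignore features outside the model
                        # Empirical minus model expectation of the feature count.
                        gradient[f] += (emp_prob - model_prob) * c
        for f in features:
            weights[f] += learning_rate * gradient[f]
    return weights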

Once the weights for the features have been set, we can use them in the second step: parse evaluation. In this step we calculate the probability of a parse for a new, previously unseen sentence. In maximum entropy modeling, the probability of a parse $y$ given sentence $x$ is defined as

$p(y\vert x) = \frac{1}{Z(x)}\exp\left(\sum_{i}\lambda_{i}f_{i}(x,y)\right)$
The number of times feature $i$, with weight $\lambda_{i}$, occurs in a parse is denoted by $f_{i}(x,y)$. The normalization factor $Z(x)$ is the same for every parse of the sentence. Since we only want to determine which parse is the most likely one, and we do not need to know the precise probability of each parse, it suffices to maximize
$\sum_{i}\lambda_{i}f_{i}(x,y)$
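Concretely, disambiguation then reduces to picking the candidate parse with the highest weighted feature sum, as in the following sketch (again with a hypothetical feature-count representation):

# Sketch of the disambiguation step: Z(x) is identical for all parses of the
# same sentence, so the most probable parse is the one with the highest
# weighted feature sum. The feature-count dictionaries are hypothetical.

def best_parse(candidates, weights):
    """candidates: one feature-count dict per parse; returns the best index."""
    def score(counts):
        return sum(weights.get(f, 0.0) * c for f, c in counts.items())
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))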

The accuracy of this model depends primarily on two factors: the set of features that is used and the size of the training set (see, for instance, Mullen 2002). It is therefore important to expand the Alpino Dependency Treebank in order to improve the accuracy of the disambiguation model.

