Tony Mullen, Rob Malouf and Gertjan van Noord
Alfa-Informatica, University of Groningen
Recent years have seen considerable research in the field of maximum entropy-based ``log linear'' modeling for disambiguation [1,7,9]. This is largely because such models are better suited than many alternatives to modeling linguistic phenomena with internal dependencies, since the log linear framework allows weights to be assigned to features without assuming that the features are independent.
An important issue in any area of statistical NLP is the problem of insufficient amounts of training data. The maximum entropy framework of statistical modeling is theoretically sound, but like most other modeling methods its efficacy depends upon having a sufficient amount of informative training data. For parsing, this data is sometimes available in the form of hand-parsed corpora, or ``treebanks''. Such data is often difficult to come by. Several large treebanks exist for English, but the resources available for other languages are significantly less developed. For this reason, it is imperative to get the most out of what is available.
In the current experiments the goal is to maximally exploit the small but informative training data set we have by taking a more sophisticated view of the feature set itself. To this end we employ the method of feature merging, a means of constructing equivalence classes of statistical features based upon common elements within them. This technique allows the model to retain information which would otherwise be discarded in a simple frequency-based feature cutoff by producing new, generalized features which serve as a variant of backed-off features. We discuss work done in Alpino, a new language analysis system for Dutch .
Alpino is a wide-coverage computational analyzer of Dutch which aims at accurate full parsing of unrestricted text. The system is described in more detail in . The grammar produces dependency structures, thus providing a reasonably abstract and theory-neutral level of linguistic representation. The dependency relations encoded in the dependency structures are used to develop and evaluate disambiguation methods.
The Alpino grammar is an extension of the successful OVIS grammar [18,19], a lexicalized grammar in the tradition of Head-driven Phrase Structure Grammar . The grammar formalism is carefully designed to allow linguistically sophisticated analyses as well as efficient and robust processing.
In contrast to earlier work on HPSG, grammar rules in Alpino are relatively detailed. However, as pointed out in , by organizing rules in an inheritance hierarchy, the relevant linguistic generalizations can still be captured. The Alpino grammar currently contains over 250 rules, defined in terms of a few general rule structures and principles. The grammar covers the basic constructions of Dutch as well as a number of more idiosyncratic constructions (see  for details). The lexicon contains definitions for various nominal types (nouns with various complementation patterns, proper names, pronouns, temporal nouns, deverbalized nouns), various complementizer, determiner, and adverb types, adjectives, and about 150 verbal subcategorization types.
The lexicon contains about 100,000 entries. In addition, lexical analysis is extended by a number of heuristics to treat unknown words.
The construction of a dependency structure proceeds in two steps. In the first step a parse forest is constructed. The second step consists of the selection of the best parse from the parse forest.
Note that best-first parsing methods proposed for stochastic context-free grammars (e.g. ) cannot be used in this context. In attribute-value grammars, it might easily happen that the locally most promising sub-trees cannot be extended to global parses because of conflicting feature constraints.
In  some initial experiments with a variety of parse evaluation functions are described. A naive algorithm constructs all possible parse trees, assigns each one a score, and then selects the best one. Since it is too expensive to construct all parse trees, we have implemented an algorithm which computes parse trees from the parse forest as an approximate best-first search. This requires that the parse evaluation function is extended to partial parse trees. We implemented a variant of a best-first search algorithm in such a way that for each state in the search space, we maintain the b best candidates, where b is a small integer (the beam). If the beam is decreased, then we run a larger risk of missing the best parse (but the result will typically still be a relatively `good' parse); if the beam is increased, then the amount of computation increases too.
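The beam search just described can be sketched as follows. This is an illustrative reconstruction, not Alpino's actual code; the function names (`beam_search`, `expand`, `score`) and the representation of partial parses are assumptions.

```python
import heapq

def beam_search(initial, expand, score, b=4):
    """Approximate best-first search that maintains the b best
    candidates (the beam) at each step.  `expand` maps a partial
    parse to its possible extensions; a state with no extensions
    is taken to be a complete parse.  `score` evaluates partial
    parses (higher is better)."""
    beam = [initial]
    complete = []
    while beam:
        candidates = []
        for state in beam:
            extensions = expand(state)
            if not extensions:          # no extensions: a complete parse
                complete.append(state)
            else:
                candidates.extend(extensions)
        # keep only the b highest-scoring partial parses
        beam = heapq.nlargest(b, candidates, key=score)
    return max(complete, key=score) if complete else None
```

A smaller beam does less work but risks pruning the subtree that leads to the best complete parse, which is exactly the trade-off described above.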
The Alpino grammar produces dependency structures compatible with the CGN-guidelines. Within the CGN-project , guidelines have been developed for syntactic annotation of spoken Dutch , using dependency structures similar to those used for the German Negra corpus . Dependency structures make explicit the dependency relations between constituents in a sentence. Each non-terminal node in a dependency structure consists of a head-daughter and a list of non-head daughters, whose dependency relation to the head is marked. Control relations are encoded by means of co-indexing. Note that a dependency structure does not necessarily reflect (surface) syntactic constituency.
We have started to annotate various smaller fragments with dependency structures. The largest fragment consists of a subset of the cdbl (newspaper) part of the Eindhoven corpus . This treebank is used in the experiments described below. It contains 1,396 sentences (16,925 words): all sentences of twenty words or less from the first 2500 sentences of Eindhoven-cdbl.
The maximum entropy (maxent) technique is an approach to statistical modeling using log linear distributions. In this framework, events are considered as multiplicities of weighted features. In training, the features' weights are derived based upon the distribution of the features in the training data, using an iterative algorithm such as improved iterative scaling (IIS) . Weights are chosen to maximize the entropy of the model while minimizing the divergence between the model and the training data. In employing the model, events of a given context are evaluated by summing the weights of their representative features and normalizing over the context to obtain a probability distribution, as in equation 1, where p(y|x) represents the probability of event y given context x, and \lambda_i represents the weight for feature f_i. The value of each function f_i reflects the number of times the feature is active for a given event.

    p(y|x) = exp( \sum_i \lambda_i f_i(x,y) ) / Z(x)        (1)

and Z(x) is the normalization factor:

    Z(x) = \sum_{y'} exp( \sum_i \lambda_i f_i(x,y') )
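The computation in equation 1 can be illustrated with a short sketch. The representation of a parse as a dictionary of feature counts, and the feature names used in the example, are assumptions for illustration only.

```python
import math

def maxent_probs(parses, weights):
    """Conditional maxent distribution over the parses of one
    sentence (equation 1).  Each parse is a dict mapping feature
    names to their counts f_i(x, y); `weights` maps feature names
    to lambda_i.  Unknown features get weight 0."""
    scores = [sum(weights.get(f, 0.0) * n for f, n in parse.items())
              for parse in parses]
    z = sum(math.exp(s) for s in scores)   # Z(x), the normalization factor
    return [math.exp(s) / z for s in scores]

# Hypothetical example: two competing parses of one sentence.
weights = {'rule:vp': 0.5, 'dep:mens:modern': 1.0}
parses = [{'rule:vp': 1}, {'rule:vp': 1, 'dep:mens:modern': 1}]
probs = maxent_probs(parses, weights)
```

Since the normalization is constant within a context, comparing parses only requires the unnormalized weighted sums, which is what licenses the best-first search over subtrees described below.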
The most important aspect of the maxent modeling technique is that distributions of statistical features are modeled without requiring an assumption that the features be independent. This allows accurate modeling using feature sets in which the features' distributions are dependent upon each other. Exploiting this fact is an important consideration in constructing feature sets for maxent modeling.
For parse selection, we consider a context to be a sentence and the events within this context are the possible parses of the sentence. Each parse is characterized by a set of feature values, and may be compared on the basis of those features with other possible parses. Parsing is performed as described in section 2.2. Following , the best-first search proceeds on the basis of the unnormalized conditional probabilities derived from equation 1 for each possible subtree.
The model depends on the distribution of the features and their informativeness, thus it is important that the features used be germane to the task. In parsing, features should reflect the sort of information pertinent to making structural decisions.
In the present experiments, we employ several types of features corresponding to grammatical rules, valency frames, lexicalized dependency triples, and lexical features constituting surface forms, base forms, and lexical frames. Instances of each feature type were collected from the training data in advance to yield a feature set consisting of 82,371 distinct features.
Examples of these features may be seen below, where example 1 is a rule for creating a VP, 2 contains a valency frame for the noun mens, 3 describes a dependency triple between the noun mens and the adjective modern, and the direction of the modification, and finally example 4 contains lexical information about the word modern as it occurs in context.
The feature set used here exploits the maxent technique in that it relies on features which are overlapping and mutually dependent. The features represent varying degrees of linguistic generality and hence some occur much more frequently than others. Furthermore, the features may also represent information which is redundant in the sense that it is represented in multiple different features, in which case we say that the features ``overlap''. Features which share information in this way are necessarily dependent in their distributions.
The overlapping features allow for a variety of ``backing off'' in which features which share a structure but contain less specific information than others are used in the same model as features with more specific information.
It is desirable that the features be as informative as possible. The model should contain specific features to the extent that the features' distributions are accurately represented in the training data. There is a point, however, regardless of the size of the corpus, at which the specificity of features translates to sparseness in the data, causing noise and leading to deterioration in the model's performance.
A number of approaches have been taken to smoothing exponential models, including imposing a Gaussian prior over the distribution  and building the feature set up by a process of induction to ensure that only maximally representative features are admitted into the model . The most commonly employed and computationally inexpensive approach to reducing noise is a frequency-based feature cutoff , in which features which occur fewer times in the training data than some predetermined cutoff are eliminated. This has been shown to be an effective way to improve results, and because of its simplicity and effectiveness it is the approach we focus on improving in the present research.

Although a cutoff is an effective way to reduce noise in a model, there is a risk that information encoded in the discarded features may be useful: a feature may be rare due to some rare element within it, but otherwise useful. To avoid discarding such features, we experiment with a method of feature merging similar to that introduced in . This approach considers features as being composed of informative elements. Before any feature cutoff is applied, features which are identical except for particular rare elements are generalized by merging; that is, the features are unioned and treated as a single feature. Merges are performed on elements which occur fewer times than a pre-set threshold. The merging process eliminates the distinction between two features, and thus the information provided by the element which distinguishes them, while the rest of the information provided by the merged features remains intact in the model.
Individual unique features may be considered as sets of instantiations in the data. A feature which is the result of merging is thus the union of the features which were merged. The count of occurrences of the new feature is the sum of the counts of the merged features. If a cutoff is incorporated subsequently, the newly merged feature is more likely to survive in the model, as its frequency is greater than each of the features before merging. Thus information in features which otherwise might have been lost in a cutoff is retained in the form of a more general feature.
The first step is to determine how the features are composed and what elements make them up. Factors which contribute most to sparseness, such as lexical items and certain grammatical attributes, are good candidates. In the present work, lexical items, both surface forms and word stems, are considered as elements. Frequency counts are taken for all elements, and a threshold is determined using a held-out test set. Using this threshold, a new model is created as follows: in the representation of the original model's features, all instances of elements which occur fewer times than the threshold are replaced by a dummy element. Features which were identical aside from these infrequent elements are thus rendered completely identical. For example, let
    feature 1 = A:B    with count 2
    feature 2 = A:C    with count 4
where A, B, and C are elements. We may count the occurrences of each element in the training data and find that the count of A is 20, of B is 5, and of C is 7. We determine by use of held-out test data that an optimal cutoff is, e.g., 10. Since both B and C have counts lower than this, all instances of B and C are replaced by a dummy element X. Thus features 1 and 2 above are both in effect replaced by feature 3, below, whose count is now the sum of those of the features which have been merged.
    feature 3 = A:X    with count 6
Iterative scaling is performed on the new feature set to obtain the appropriate maximum entropy weights.
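The merging step performed before iterative scaling can be sketched as follows, under the assumption that features are represented as tuples of elements; the function name and representation are illustrative, not the actual implementation.

```python
from collections import Counter

def merge_features(feature_counts, elem_counts, threshold, dummy='X'):
    """Merge features that differ only in rare elements.
    `feature_counts` maps each feature (a tuple of elements) to its
    count in the training data; `elem_counts` holds the training-data
    counts of the individual elements.  Elements occurring fewer
    times than `threshold` are replaced by a dummy element, and the
    counts of features thereby made identical are summed."""
    merged = Counter()
    for feat, n in feature_counts.items():
        new_feat = tuple(e if elem_counts[e] >= threshold else dummy
                         for e in feat)
        merged[new_feat] += n
    return dict(merged)
```

On the worked example above (A with count 20, B with 5, C with 7, and a threshold of 10), features A:B and A:C both become A:X, with count 2 + 4 = 6.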
A quality of compositionality is necessary in features in order to perform the merging operation. That is, features must be composed of discrete elements for which frequency counts can be obtained from the data. The features described in section 4 may be viewed as being composed of words, base forms, POS tags, grammar attributes, and other discrete elements which occur together in a particular way. Merging proceeds by first establishing a merging threshold via experiments on held-out data. Frequencies of all elements are gathered from the training data. Finally, features containing elements whose counts fall below the threshold are merged, by replacing all instances of sub-threshold elements with a dummy element. For example, if the element modern had a count below the threshold, all features containing it would be altered: schematically, a feature such as mens:modern would be changed to mens:X. Likewise, if the element aardig occurred with a count below the threshold, the same would be done with the feature mens:aardig, so that both features are merged as the single feature mens:X, with a count equal to the sum of the two merged features.
It is well known that many models benefit from a frequency-based feature cutoff. Using feature merging, we seek to take a more sophisticated view of the features themselves, allowing the same degree of noise reduction as a feature cutoff, while simultaneously generalizing the features to obtain the benefits of backing off. By operating on sub-feature information sources, we hope to discard noisy information from the model with a greater degree of control, maintaining useful information contained by features which would otherwise be lost.
Experiments were performed using the features described above with a training set of 1,220 sentences, whose parses totaled 626,699 training events, initially with a held-out set of 131 sentences and subsequently with a test set of 566 unseen sentences. A merging threshold of 500 and a feature cutoff of 500 were determined by use of the held-out set. The number of active (non-zero weighted) features in the original model was 75,500, the number of active features in the model with cutoff alone was 11,639, and the number of active features in the model which had been merged prior to the cutoff was 11,890.
For each sentence in the test set, the dependency structure of the highest scoring parse was extracted and compared to the gold standard dependency structure in the treebank . For each sentence, we calculate the overlap (the number of relations on which the system agrees with the gold standard parse) and the number of errors:
A second, related metric is employed as well, which allows the models to be more broadly compared to others by incorporating not only the per-sentence concept accuracy of the model but also the baseline per-sentence concept accuracy (the per-sentence concept accuracy obtained by arbitrary selection of a parse) and the best possible per-sentence concept accuracy. The latter reflects the accuracy of the grammar, and provides an upper bound for the task of picking the best possible of all available parses. This measure, which we refer to as the phi measure, is shown in equation 3:
Preliminary results on held-out data to establish thresholds, shown in table 1, were promising, suggesting that incorporating merging with a merge threshold of 500 performed somewhat better than the best feature cutoff of 500 alone.
Unfortunately, these preliminary results were not supported by experiments on unseen data. As can be seen in table 2, the averaged results of four folds of ten-fold cross validation, representing a total of 566 sentences, show that the use of merging at the threshold determined by the held-out data does not appear to benefit the model.
Results so far are inconclusive. The effectiveness of the merging technique appears to depend greatly on qualities of the feature set itself. It is hoped that further experiments with different types of features will shed light on circumstances in which feature merging may be most effectively employed as a tool to optimize model performance.