Lexical Analysis.

The lexicon associates a word or a sequence of words with one or more tags. A tag contains information such as part of speech, inflection, and a subcategorization frame. For verbs, the lexicon typically hypothesizes many different tags, which differ mainly in the subcategorization frame. For sentence (1), the lexicon produces 83 tags.

Some of those tags are obviously wrong. For example, one of the tags for the word hebben is verb(hebben,pl,part_sbar_transitive(door)). This tag indicates a finite plural verb which requires the separable prefix door and which subcategorizes for an SBAR complement. Since door does not occur anywhere in sentence (1), this tag cannot be useful for this sentence. A filter consisting of a number of hand-written rules has been implemented which checks that such simple conditions hold. For sentence (1), the filter removes 56 tags.

After the filter has been applied, feature structures are associated with each of the remaining tags. Often, a single tag is mapped to multiple feature structures. The 27 tags that remain after filtering give rise to 89 feature structures.
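The filtering step can be illustrated with a minimal sketch. It assumes a simplified tag representation and a single rule, namely the separable-prefix check described above. The `Tag` class, the `required_particle` field, the example sentence, and the label `verb(hebben,pl,transitive)` are hypothetical and introduced only for illustration; the actual filter contains more rules and operates on the system's own tag terms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tag:
    word: str                                # word the lexicon proposed the tag for
    label: str                               # tag as printed by the lexicon
    required_particle: Optional[str] = None  # separable prefix the frame needs, if any


def particle_present(tag: Tag, sentence: List[str]) -> bool:
    """Rule: a required separable prefix must occur somewhere in the sentence."""
    return tag.required_particle is None or tag.required_particle in sentence


# Hypothetical rule list; only one hand-written rule is sketched here.
RULES = [particle_present]


def filter_tags(tags: List[Tag], sentence: List[str]) -> List[Tag]:
    """Keep only the tags that satisfy every rule for this sentence."""
    return [t for t in tags if all(rule(t, sentence) for rule in RULES)]


# Hypothetical input; this is not sentence (1) from the text.
sentence = ["wij", "hebben", "een", "boek"]
tags = [
    Tag("hebben", "verb(hebben,pl,transitive)"),
    Tag("hebben", "verb(hebben,pl,part_sbar_transitive(door))",
        required_particle="door"),
]
kept = filter_tags(tags, sentence)
```

Because door does not occur in the hypothetical sentence, only the first tag survives; if door were present, both tags would be kept and the decision would be left to the parser.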

An important aspect of lexical analysis is the treatment of unknown words. The system applies a number of heuristics for unknown words. Currently, these heuristics attempt to deal with numbers and number-like expressions, capitalized words, words with missing diacritics, words with `too many' diacritics, compounds, and proper names.

If these heuristics fail to provide an analysis, the system guesses a tag by inspecting the suffix of the word. For this purpose a list of suffixes is maintained, each of which predicts the tag of a word ending in it. If this does not provide an analysis either, the word is assumed to be a noun.
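The fallback order for unknown words (heuristics first, then suffix inspection, then the noun default) can be sketched as follows. The suffix table, the tag names, and the two stand-in heuristics are assumptions made for illustration; the text does not list the actual suffixes or the details of the heuristics.

```python
import re
from typing import Optional

# Hypothetical suffix table, longest suffix first; not the system's actual list.
SUFFIX_TAGS = [("heid", "noun"), ("lijk", "adjective"), ("en", "verb")]


def guess_by_heuristics(word: str) -> Optional[str]:
    """Stand-in for the special-case heuristics; only two are sketched."""
    if re.fullmatch(r"\d+([.,]\d+)*", word):
        return "number"        # numbers and number-like expressions
    if word[:1].isupper():
        return "proper_name"   # capitalized words
    return None


def guess_by_suffix(word: str) -> Optional[str]:
    """Return the tag predicted by the first matching suffix, if any."""
    for suffix, tag in SUFFIX_TAGS:
        if len(word) > len(suffix) and word.endswith(suffix):
            return tag
    return None


def guess_tag(word: str) -> str:
    """Try the heuristics, then the suffix list, then fall back to noun."""
    return guess_by_heuristics(word) or guess_by_suffix(word) or "noun"
```

With this hypothetical table, `guess_tag("1.250")` is handled by the number heuristic, `guess_tag("fietsen")` by the suffix list, and a word matching neither, such as `guess_tag("blorp")`, receives the default noun tag.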

In addition to the treatment of unknown words, the robustness of the system is enhanced by the ability to skip tokens of the input. Currently this option is used only for certain punctuation marks. Even though punctuation is treated both in the lexicon and in the grammar, the syntax of punctuation is irregular enough to warrant the option of ignoring it. For instance, quotation marks may appear almost anywhere in the input. The corpus contains, for example:

    De z.g. '' speelstraat , die hier en daar al bestaat ?
    The so-called '' play-street , that here and there already exists ?

Apparently, the author intended to place speelstraat within quotes, but the closing quote is missing. During lexical analysis, tags are optionally extended to include neighbouring words which are classified as `skippable'.
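One way to picture this optional extension is as a set of alternative spans for the same tag: the original span plus every span that also absorbs adjacent skippable tokens. The sketch below is an assumption about how this could be done, not a description of the actual implementation; the `SKIPPABLE` set and the function `extended_spans` are hypothetical, and the text does not say whether tags are extended to the left, to the right, or in both directions.

```python
from typing import List, Tuple

# Hypothetical set of skippable tokens; the text only says that certain
# punctuation marks qualify.
SKIPPABLE = {"''", "``", '"'}


def extended_spans(tokens: List[str], start: int, end: int) -> List[Tuple[int, int]]:
    """Return the span tokens[start:end] plus every variant that also
    covers adjacent skippable tokens on the left and/or right."""
    lefts = [start]
    left = start
    while left > 0 and tokens[left - 1] in SKIPPABLE:
        left -= 1
        lefts.append(left)

    rights = [end]
    right = end
    while right < len(tokens) and tokens[right] in SKIPPABLE:
        right += 1
        rights.append(right)

    return [(l, r) for l in lefts for r in rights]


tokens = ["De", "z.g.", "''", "speelstraat", ",", "die", "hier",
          "en", "daar", "al", "bestaat", "?"]
# The tag for speelstraat originally covers positions 3..4.
spans = extended_spans(tokens, 3, 4)
```

For the corpus example, the tag for speelstraat is available both with its original span and with a span that also covers the unmatched opening quote, so the parser can build an analysis in which the stray quote is ignored.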

Noord G.J.M. van
2001-05-15