Algorithms for Linguistic Processing is title of a PIONIER project recently awarded by NWO to Gertjan van Noord (Alfa-informatica, BCN, University of Groningen). It is a project in the area of computational linguistics. The 5 year project is planned for about 2.5 million Dutch guilders, and will employ a number of postdocs and PhD students. Interested parties are advised to contact Gertjan van Noord (firstname.lastname@example.org). The project is expected to start at the end of 1999. The homepage of the project is located at www.let.rug.nl/~vannoord/alp/. The homepage provides a link to the project proposal.
The proposal Algorithms for Linguistic Processing focuses on two crucial problem areas in computational linguistics: problems of processing efficiency and ambiguity. For the problem of efficiency grammar approximation techniques will be investigated, whereas a number of grammar specialization techniques are proposed for the ambiguity problem.
The first problem, efficiency, is posed by the fact that the grammars which are typically hypothesised by linguists are unattractive from the point of view of computation. For a given grammatical formalism, computational efficiency is expressed by stating the (maximum) number of steps required to compute the analysis of a sentence of length n, where n is the number of words. In the simple case, the number of steps will increase proportionally with n. In such cases the complexity is said to be linear. For more complex grammatical formalisms, the number of steps will increase faster than n (e.g. it increases proportionally with n^3 for so-called `context-free grammars'). For certain even more powerful grammatical formalisms it can be shown that no upper-bound to the number of steps required to find an analysis can be given. The human language user, however, seems to process in linear time; humans understand longer sentences with no noticeable delay. This implies that neither context-free grammars nor more powerful grammatical formalisms are likely models for human language processing. An important issue therefore is how the linearity of processing by humans can be accounted for. For this purpose `grammar approximation' techniques will be explored.
Grammar approximation techniques approximate a given (powerful) grammar by means of devices of a much simpler form: finite-state devices. Finite-state devices are a type of formalism that has been studied extensively in mathematics and computer science. They are known to allow very efficient processing (in linear time). It is also known that they are incapable of treating certain linguistic constructions in full generality. But interestingly, at least some of the constructions that cannot be treated with finite-state devices are also difficult for humans.
A further problem is posed by ambiguity. Many words are ambiguous: a dictionary lists various meanings for a single word. For instance, the Dutch word `zagen' can be the past tense form of the verb `zien' (to see) or the present tense form of the verb `zagen' (to saw). For this reason, a computer program analysing the sentence (1) will need to decide which reading of `zagen' was intended, in order to be able to come up with the appropriate meaning. Ambiguities also arise in certain structural configurations. In (2), for instance, the prepositional phrase `in Vietnam' could be attached both to `de oorlog' or to `luisteren'. Both examples are taken from the Eindhoven corpus.
In many contexts the unintended readings are unlikely or even ridiculous, but it is very difficult to spell out how information concerning the discourse, context, situation and knowledge of the world enforce a particular reading. Therefore, a computer program will come up with such ridiculous readings nonetheless. The phenomenon occurs very frequently for even the most innocent examples. For longer sentences, hundreds or even thousands of readings arise. An important problem therefore is disambiguation: how to rule out quickly these unintended readings? To tackle this disambiguation problem a variety of grammar specialisation techniques will be implemented and compared. These techniques optimise a given grammar on the basis of corpora of representative linguistic behaviour. Such corpora will contain implicit knowledge concerning situations and contexts which can be extracted by means of statistical techniques in order to be helpful in disambiguation. Such specialisation techniques should be able to conclude that certain readings of lexical entries, and certain combinations of readings, are unlikely in certain contexts. In the example (1) above, each reading could be the right one. However, in the absence of any further information, it seems that our best `bet' is to guess that the second interpretation holds, since fathers are not sawed very often. In this perspective, the use of statistics is an aid to reasoning in circumstances where not enough information is available to infer which reading must have been the correct one.
- Mijn vader zagen we niet meer
My father saw we not anymore
We don't saw our father anymore /
We didn't see our father anymore
- Zij luisteren naar de bezwaren tegen de oorlog in Vietnam
They listen to the arguments against the war in Vietnam
They listen to the arguments against the Vietnamese war /
They listen in Vietnam to the arguments against the war
The project aims at general answers to the above-mentioned questions of efficiency and disambiguation. In order to achieve such answers, concrete proposals must be developed and compared. In order to be able to apply and evaluate such concrete proposals, a a linguistically motivated computational grammar for Dutch, comparable in coverage to systems available for English will be developed. At the moment, there is no such grammar available for Dutch, although a non-trivial grammar fragment has been developed within the NWO Priority Programme Language and Speech Technology (TST). Even if certain aspects of the grammar have been tuned to the domain of application in TST, it is fair to say that the basic architecture of the grammar is fully general. Moreover, the grammar was successfully applied in a formal evaluation on the TST task: 95% concept accuracy on (previously unseen) user utterances (full details available from xxx.lanl.gov/abs/cs.CL/9906014).
The TST grammar is taken as a starting point for a more general grammar.
Some of the innovative techniques will be applied in a linguistic research tool for searching bare text-corpora (called `lgrep'). This application is capable of searching text corpora (including arbitrary Dutch texts on the Internet) on the basis of linguistic criteria. It extends existing search tools with the possibility to specify search patterns including linguistic criteria such as part-of-speech labels (such as noun, verb, preposition, etc.), major syntactic category (noun phrase, verb phrase, subordinate sentence, etc.), and grammatical relation (subject, direct-object, specifier, etc.).
A successful implementation of Algorithms for Linguistic Processing will not only provide new insights concerning the way in which natural language is processed, but it will also provide new techniques which are crucial for human language technology, in particular for Dutch.