The proposal Algorithms for Linguistic Processing focuses on two crucial problem areas in computational linguistics: problems of processing efficiency and ambiguity. For the problem of efficiency we propose to investigate grammar approximation techniques, whereas a number of grammar specialization techniques are proposed for the ambiguity problem.
The first problem, efficiency, is posed by the fact that the grammars which are typically hypothesised by linguists are unattractive from the point of view of computation. For a given grammatical formalism, computational efficiency is expressed by stating the (maximum) number of steps required to compute the analysis of a sentence of length n, where n is the number of words. In the simple case, the number of steps will increase proportionally with n. In such cases the complexity is said to be linear. For more complex grammatical formalisms, the number of steps will increase faster than n (e.g. it increases proportionally with n3 for so-called `context-free grammars'). For certain even more powerful grammatical formalisms it can be shown that no upper-bound to the number of steps required to find an analysis can be given. The human language user, however, seems to process in linear time; humans understand longer sentences with no noticeable delay. This implies that neither context-free grammars nor more powerful grammatical formalisms are likely models for human language processing. An important issue therefore is how the linearity of processing by humans can be accounted for. For this purpose we will explore grammar approximation techniques.
Grammar approximation techniques approximate a given (powerful) grammar by means of devices of a much simpler form: finite-state devices. Finite-state devices are a type of formalism that has been studied extensively in mathematics and computer science. They are known to allow very efficient processing (in linear time). It is also known that they are incapable of treating certain linguistic constructions in full generality. But interestingly, at least some of the constructions that cannot be treated with finite-state devices are also difficult for humans.
For example, constructions involving center-embedding are very hard to process for humans, but are regarded as grammatical by linguists. In English particle verb constructions, the particle can either precede or follow the direct object ( put the book down / put down the book). If the direct object contains a relative clause, and the particle follows the direct object, then the examples become very hard to understand. For finite-state devices it is equally impossible to process these sentences. This suggests that finite-state devices could offer language models adequately accounting for the efficiency of human language processing.
A further problem is posed by ambiguity. Many words are ambiguous: a dictionary lists various meanings for a single word. For instance, the Dutch word zagen can be the past tense form of the verb zien (to see) or the present tense form of the verb zagen (to saw). For this reason, a computer program analysing the sentence in (4-a) will need to decide which reading of zagen was intended, in order to be able to come up with the appropriate meaning. Ambiguities also arise in certain structural configurations. In (4-b), for instance, the prepositional phrase in Vietnam could be attached both to de oorlog or to luisteren. Both examples are taken from the Eindhoven corpus .
In many contexts the unintended readings are unlikely or even ridiculous, but it is very difficult to spell out how information concerning the discourse, context, situation and knowledge of the world enforce a particular reading. Therefore, a computer program will come up with such ridiculous readings nonetheless. The phenomenon occurs very frequently for even the most innocent examples. For longer sentences, hundreds or even thousands of readings arise. An important problem therefore is disambiguation: how can we quickly rule out the unintended readings?
We propose to tackle this disambiguation problem by investigating a variety of grammar specialisation techniques. These techniques optimise a given grammar on the basis of corpora of representative linguistic behaviour. Such corpora will contain implicit knowledge concerning situations and contexts which can be extracted by means of statistical techniques in order to be helpful in disambiguation. Such specialisation techniques should be able to conclude that certain readings of lexical entries, and certain combinations of readings, are unlikely in certain contexts.
In the example (4-a) above, each reading could be the right one. However, in the absence of any further information, it seems that our best `bet' is to guess that the second interpretation holds, since fathers are not sawed very often. In this perspective, the use of statistics is an aid to reasoning in circumstances where not enough information is available to infer which reading must have been the correct one.