Prolog has been used for many purposes, but its inventor, Alain Colmerauer, was interested in computational linguistics, and this remains a classic application for the language. Moreover, Prolog offers a number of tools which make life easier for computational linguists, and we are now going to start learning about one of the most useful of these: definite clause grammars, or DCGs as they are usually called.
DCGs are a special notation for defining grammars. So, before we go any further, we’d better learn what a grammar is. We shall do so by discussing context free grammars (or CFGs). The basic idea of context free grammars is simple to understand, but don’t be fooled into thinking that CFGs are toys. They’re not. While CFGs aren’t powerful enough to cope with the syntactic structure of all natural languages (that is, the kind of languages that human beings use), they can certainly handle most aspects of the syntax of many natural languages (for example, English and French) in a reasonably natural way.
So what is a context free grammar? In essence, a finite collection of rules which tell us that certain sentences are grammatical (that is, syntactically correct) and what their grammatical structure actually is. Here’s a simple context free grammar for a small fragment of English:
|s -> np vp|
|np -> det n|
|vp -> v np|
|vp -> v|
|det -> a|
|det -> the|
|n -> woman|
|n -> man|
|v -> shoots|
What are the ingredients of this little grammar? Well, first note that it contains three types of symbol. There’s -> , which is used to define the rules. Then there are the symbols written like this: s , np , vp , det , n , v . These symbols are called non-terminal symbols; we’ll soon learn why. Each of these symbols has a traditional meaning in linguistics: s is short for sentence, np is short for noun phrase, vp is short for verb phrase, and det is short for determiner. That is, each of these symbols is shorthand for a grammatical category. Finally there are the symbols in italics: a, the, woman, man , and shoots . These are terminal symbols, though a computer scientist might call them the alphabet, and linguists might call them lexical items. We’ll usually just call them words.
This grammar contains nine context free rules. A context free rule consists of a single non-terminal symbol, followed by -> , followed by a finite sequence made up of terminal and/or non-terminal symbols. All nine items listed above have this form, so they are all legitimate context free rules. What do these rules mean? They tell us how different grammatical categories can be built up. Read -> as can consist of , or can be built out of . For example, the first rule tells us that a sentence can consist of a noun phrase followed by a verb phrase. The third rule tells us that a verb phrase can consist of a verb followed by a noun phrase, while the fourth rule tells us that there is another way to build a verb phrase: simply use a verb. The last five rules tell us that a and the are determiners, that man and woman are nouns, and that shoots is a verb.
Now consider the string of words a woman shoots a man . Is this grammatical according to our little grammar? And if it is, what structure does it have? The following tree answers both questions:
s np det a n woman vp v shoots np det a n man
Right at the top we have a node marked s . This node has two daughters, one marked np , and one marked vp . Note that this part of the diagram agrees with the first rule of the grammar, which says that an s can be built out of an np and a vp . (A linguist would say that this part of the tree is licensed by the first rule.) In fact, as you can see, every part of the tree is licensed by one of our rules. For example, the two nodes marked np are licensed by the rule that says that an np can consist of a det followed by an n . And, right at the bottom of the diagram, all the words in a woman shoots a man are licensed by a rule. Incidentally, note that the terminal symbols only decorate the nodes right at the bottom of the tree (the terminal nodes) while non-terminal symbols only decorate nodes that are higher up in the tree (the non-terminal nodes).
Such a tree is called a parse tree. Parse trees are important because they give us two kinds of information. Firstly, they give us information about strings. Secondly, they give us information about structure. This is an important distinction to grasp, so let’s have a closer look, and learn some important terminology while we are doing so.
First, if we are given a string of words, and a grammar, and it turns out that we can build a parse tree like the one above (that is, a tree that has s at the top node, and every node in the tree is licensed by the grammar, and the string of words we were given is listed in the correct order along the terminal nodes) then we say that the string is grammatical (according to the given grammar). For example, the string a woman shoots a man is grammatical according to our little grammar (and indeed, any reasonable grammar of English would classify it as grammatical). On the other hand, if there isn’t any such tree, the string is ungrammatical (according to the given grammar). For example, the string woman a woman man a shoots is ungrammatical according to our little grammar (and any reasonable grammar of English would classify it as ungrammatical). The language generated by a grammar consists of all the strings that the grammar classifies as grammatical. For example, a woman shoots a man also belongs to the language generated by our little grammar, and so does a man shoots the woman . A context free recogniser is a program which correctly tells us whether or not a string belongs to the language generated by a context free grammar. To put it another way, a recogniser is a program that correctly classifies strings as grammatical or ungrammatical (relative to some grammar).
But often, in both linguistics and computer science, we are not merely interested in whether a string is grammatical or not, we also want to know why it is grammatical. More precisely, we often want to know what its structure is, and this is exactly the information a parse tree gives us. For example, the above parse tree shows us how the words in a woman shoots a man fit together, piece by piece, to form the sentence. This kind of information would be important if we were using this sentence in some application and needed to say what it actually meant (that is, if we wanted to do semantics). A context free parser is a program which correctly decides whether a string belongs to the language generated by a context free grammar and also tells us what its structure is . That is, whereas a recogniser merely says “Yes, grammatical” or “No, ungrammatical” to each string, a parser actually builds the associated parse tree and gives it to us.
It remains to explain one final concept, namely what a context free language is. (Don’t get confused: we’ve told you what a context free grammar is, but not what a context free language is.) Quite simply, a context free language is a language that can be generated by a context free grammar. Some languages are context free, and some are not. For example, it seems plausible that English is a context free language. That is, it is probably possible to write a context free grammar that generates all (and only) the sentences that native speakers find acceptable. On the other hand, some dialects of Swiss-German are not context free. It can be proved mathematically that no context free grammar can generate all (and only) the sentences that native speakers of Swiss-German find acceptable. 1 So if you wanted to write a grammar for such dialects, you would have to employ additional grammatical mechanisms, not merely context free rules.
That’s the theory, but how do we work with context free grammars in Prolog? To make things concrete: suppose we are given a context free grammar. How can we write a recogniser for it? And how can we write a parser for it? In this chapter we’ll look at the first question in detail. We’ll first show how (rather naive) recognisers can be written in Prolog, and then show how more sophisticated recognisers can be written with the help of difference lists. This discussion will lead us to definite clause grammars, Prolog’s built-in grammar tool. In the following chapter we’ll look at definite clause grammars in more detail, and learn (among other things) how to use them to define parsers.
So: given a context free grammar, how do we define a recogniser in Prolog? In fact, Prolog offers a very direct answer to this question: we can simply write down Prolog clauses that correspond, in an obvious way, to the grammar rules. That is, we can simply turn the grammar into Prolog.
Here’s a simple (though as we shall learn, inefficient) way of doing this. We shall use lists to represent strings. For example, we shall use the list [a,woman,shoots,a,man] to represent the string a woman shoots a man . Now, we have already said that the -> symbol used in context free grammars means can consist of , or can be built out of , and this idea is easily modelled using lists. For example, the rule s -> np vp can be thought of as saying: a list of words is an s list if it is the result of concatenating an np list with a vp list. As we know how to concatenate lists in Prolog (we can use append/3 ), it should be easy to turn these kinds of rules into Prolog. And what about the rules that tell us about individual words? Even easier: we can simply view n -> woman as saying that the list [woman] is an n list.
If we turn these ideas into Prolog, this is what we get:
np(Z):- det(X), n(Y), append(X,Y,Z).
vp(Z):- v(X), np(Y), append(X,Y,Z).
The correspondence between the CFG rules and the Prolog code should be clear. And to use this program as a recogniser, we simply pose the obvious queries. For example:
In fact, because this is a simple declarative Prolog program, we can do more than this: we can also generate all the sentences this grammar produces. Our little grammar generates 20 sentences. Here are the first five:
X = [the,woman,shoots,the,woman] ;
X = [the,woman,shoots,the,man] ;
X = [the,woman,shoots,a,woman] ;
X = [the,woman,shoots,a,man] ;
X = [the,woman,shoots]
Moreover, we’re not restricted to posing questions about sentences: we can ask about other grammatical categories. For example:
And we can generate noun phrases with the following query.
Now this is rather nice. We have a simple, easy to understand program which corresponds with our CFG in an obvious way. Moreover, if we added more rules to our CFG, it would be easy to alter the program to cope with the new rules.
But there is a problem: the program doesn’t use the input sentence to guide the search. Make a trace for the query s([a,man,shoots]) and you will see that the program chooses noun phrases and verb phrases and only afterwards checks whether these can be combined to form the sentence [a,man,shoots] . For example, Prolog will find that [the,woman] is a noun phrase and [shoots,the,woman] a verb phrase and only then will it check whether concatenating these lists happens to yield [a,man,shoots] , which of course it won’t. So, Prolog starts to backtrack, and the next thing it will try is whether concatenating the noun phrase [the,woman] and the verb phrase [shoots,the,man] happens to yield [a,man,shoots] , another non-starter. It will go on like this until it (finally) produces the noun phrase [a,man] and the verb phrase [shoots] . The problem is that the goals np(X) and vp(Y) are called with uninstantiated variables as arguments.
So, how about changing the rules in such a way that append becomes the first goal:
np(Z):- append(X,Y,Z), det(X), n(Y).
vp(Z):- append(X,Y,Z), v(X), np(Y).
Here we first use append/3 to split up the input list. This instantiates the variables X and Y , so that the other goals are all called with instantiated arguments. However, this program is still not very appealing: it uses append/3 a lot and, even worse, it uses append/3 with uninstantiated variables in the first two arguments. We saw in the previous chapter that this is a source of inefficiency. And indeed, the performance of this recogniser is very bad. It is revealing to trace through what actually happens when this program analyses a sentence such as a woman shoots a man . As you will see, relatively few of the steps are devoted to the real task of recognising the sentences: most are devoted to using append/3 to decompose lists. This isn’t much of a problem for our little grammar, but it certainly would be if we were working with a more realistic grammar capable of generating a large number of sentences. We need to do something about this.
A more efficient implementation can be obtained by making use of difference lists . This is a sophisticated (and, once you’ve grasped it, beautiful) Prolog technique that can be used for a variety of purposes.
The key idea underlying difference lists is to represent the information about grammatical categories not as a single list, but as the difference between two lists. For example, instead of representing a woman shoots a man as [a,woman,shoots,a,man] we can represent it as the pair of lists
Think of the first list as what needs to be consumed (or if you prefer: the input list ), and the second list as what we should leave behind (or: the output list ). Viewed from this (rather procedural) perspective the difference list
represents the sentence a woman shoots a man because it says: If I consume all the symbols on the left, and leave behind the symbols on the right, then I have the sentence I am interested in. That is, the sentence we are interested in is the difference between the contents of these two lists.
That’s all we need to know about difference lists to rewrite our recogniser. If we simply bear in mind the idea of consuming something, and leaving something behind in mind, we obtain the following recogniser:
np(X,Z):- det(X,Y), n(Y,Z).
vp(X,Z):- v(X,Y), np(Y,Z).
Consider these rules carefully. For example, the s rule says: I know that the pair of lists X and Z represents a sentence if (1) I can consume X and leave behind a Y , and the pair X and Y represents a noun phrase, and (2) I can then go on to consume Y leaving Z behind , and the pair Y Z represents a verb phrase . The np rule and the second of the vp rules work similarly.
Moreover, the same idea underlies the way this grammar handles the words. For example
means we are handling man as the difference between [man|W] and W . After all, the difference between what is consumed and what is left behind is precisely the word man .
Now, at first this code may be harder to grasp than our previous recogniser. But note that we have gained something important: we haven’t used append/3 . In the difference list based recogniser, it simply isn’t needed, and this makes a big difference.
How do we use this recogniser? Well, here’s how to recognise sentences:
This asks whether we can get an s by consuming the symbols in [a,woman,shoots,a,man] , leaving nothing behind. Similarly, to generate all the sentences in the grammar, we ask
This asks: what values can you give to X , such that we get an s by consuming the symbols in X , leaving nothing behind?
The queries for other grammatical categories also work the same way. For example, to find out if a woman is a noun phrase we ask:
And we generate all the noun phrases in the grammar as follows:
You should trace what happens when this program analyses a sentence such as a woman shoots a man . As you will see, it is a lot more efficient than our append/3 based program. Moreover, as no use is made of append/3 , the trace is a lot easier to grasp. So we have taken a big step forward.
On the other hand, it has to be admitted that the second recogniser is not as easy to understand, at least at first, and it’s a pain having to keep track of all those difference list variables. If only it were possible to have a recogniser as simple as the first and as efficient as the second. And in fact, it is possible: this is where DCGs come in.