Exercise 4: Chart Parsing

Deadline: Friday June, 13

The goal of this exercise is to make a tool which allows users to search for linguistic patterns in text labelled with 'Part of SPeech' tags. (Gsearch is an example of such a tool). The corpus used as example is (part of) the London Oslo Bergen (LOB) corpus. Originally, the text looks like this:
    stop_VV0  electing_VBG  life_NN  peers_NNS    .  
  by_IO  Trevor_NP  Williams_NP  .  
  a_AT1  move_NN  to_TO  stop_VV0  \0Mr_NNSB1  Gaitskell_NP  from_IO  nominating_VBG  any_DD  more_DA  labour_NN  life_NN  peers_NNS  is_
VBZ  to_TO  be_VB0  made_VBN  at_IO  a_AT1  meeting_NN  of_IO  labour_NN  \0MPs_NNSB2  tomorrow_NN1  .  
  \0Mr_NNSB1  Michael_NP  Foot_NP  has_VHZ  put_VBN  down_RP  a_AT1  resolution_NN  on_IO  the_AT1  subject_NN  and_CC  he_PPHO1  is_VBZ 
 to_TO  be_VB0  backed_VBN  by_IO  \0Mr_NNSB1  Will_NP  Griffiths_NP  ,_,  \0MP_NNSB1  for_IO  Manchester_NP  Exchange_NP  .  
  though_CS  they_PPHS2  may_VM  gather_VV0  some_DD  left-wing_JB  support_NN  ,_,  a_AT1  large_JJ  majority_NN  of_IO  labour_NN  \0MP
s_NNSB2  are_VBR  likely_JJ  to_TO  turn_VV0  down_RP  the_AT1  Foot-Griffiths_NP  resolution_NN  .  
    abolish_VV0  Lords_NNSB2    .  
  \0Mr_NNSB1  Foot's_PPIO2  line_NN  will_VM  be_VB0  that_CS  as_CS  labour_VV0  \0MPs_NNSB2  opposed_VVD  the_AT1  government_NN  bill_
NN  which_DDQ  brought_VVD  life_NN  peers_NNS  into_IO  existence_NN  ,_,  they_PPHS2  should_VM  not_XX  now_RL  put_VV0  forward_RR  n
ominees_NNS  .  
Say, a user wants to find all NPs in this text. You could write a grammar for NPs based on POS tags (an NP is, for instance, an AT1 followed by an NN). A chart parser can now be used to find all constituents of category NP in the input. You can use the items in the chart to mark all NPs in the input with brackets, and present this to the user.

The corpus

The corpus has been converted to Prolog. Sentences are coded as sentence(Nr,ListofTags), where ListofTags is a list of Prolog terms of the form tag(Word,Tag).
sentence(4,[tag('though','CS'),tag('they','PPHS2'),tag('may','VM'),tag('gather','VV0'),....
Words and tags are quoted to make sure Prolog treats them as atoms. The corpus (1000 sentences) can be found in lob.pl

The grammar

The chart-parser assumes a grammar where rules are of the form rule(Mother,ListofDaughters), e.g.
rule(np,   [det,n]). 

As the input has been tagged already, it is not necessary to make a real dictionary. Rather, one has to define the relationship between POS tags and categories used in the grammar, e.g.  

lex(det,tag(_,'AT1')).

A determiner is a word whose POS tag is AT1.

The Parser

The parser is a shift-reduce parser which builds a chart using assert. Items in the chart have the form constituent(Begin,End,Category), e.g.
constituent(0, 2, np).
The parser is called as:
 :- bf_shift_reduce(Input)
where Input is a list of Tags.
You can get an overview of the parse result (i.e. the chart) by calling
:- listing(constituent). 
after parsing a sentence.

The parser can be found in parser.pl

  

The exercise

  1. Add rules and `lexical' entries for constituents of category NP and PP to the small fragment already provided in parser.pl
  2. Add a predicate  find(Cat,Nr) to the parser, which writes to the screen the begin and end posistions of all constituents in sentence Nr of category Cat.
  3. Add a predicate match(Cat,Nr) which writes sentence Nr with brackets around the parts of the string which match category Cat.
    1. N.b. if at position N there are 2 or more matching constituents, only the longest matching string needs to be shown ((longest match). 
  4. Add a predicate search(Cat) which inspects every sentence in the corpus and presents those sentences on the screen (with bracketing) which contain a matching Cat. Also write the sentence number.
    1. N.b. Sentences without a match should not be shown.
  5. Test your program by adding rules and lexical entries for PPs containing a singular NP without a determiner. ( from home, etc.) and search for these

Tips

  1. Use write(Cat), write(' '), etc, and nl, to write to the screen.
  2. Negation in Prolog is \+ (e.g.  \+ ( constituent(_,_,np)  ).

Hand in

Hand in the modified parser code, add comments if needed. Give examples of the output you get for exercise 5.

Deadline: Friday, June, 13