Exercise 4: Chart Parsing
Deadline: Friday June, 13
The goal of this exercise is to make a tool which allows users to search for linguistic
patterns in text labelled with 'Part of SPeech' tags.
(Gsearch is an example of such a tool).
The corpus used as example is (part of) the London Oslo Bergen (LOB) corpus.
Originally, the text looks like this:
stop_VV0 electing_VBG life_NN peers_NNS .
by_IO Trevor_NP Williams_NP .
a_AT1 move_NN to_TO stop_VV0 \0Mr_NNSB1 Gaitskell_NP from_IO nominating_VBG any_DD more_DA labour_NN life_NN peers_NNS is_
VBZ to_TO be_VB0 made_VBN at_IO a_AT1 meeting_NN of_IO labour_NN \0MPs_NNSB2 tomorrow_NN1 .
\0Mr_NNSB1 Michael_NP Foot_NP has_VHZ put_VBN down_RP a_AT1 resolution_NN on_IO the_AT1 subject_NN and_CC he_PPHO1 is_VBZ
to_TO be_VB0 backed_VBN by_IO \0Mr_NNSB1 Will_NP Griffiths_NP ,_, \0MP_NNSB1 for_IO Manchester_NP Exchange_NP .
though_CS they_PPHS2 may_VM gather_VV0 some_DD left-wing_JB support_NN ,_, a_AT1 large_JJ majority_NN of_IO labour_NN \0MP
s_NNSB2 are_VBR likely_JJ to_TO turn_VV0 down_RP the_AT1 Foot-Griffiths_NP resolution_NN .
abolish_VV0 Lords_NNSB2 .
\0Mr_NNSB1 Foot's_PPIO2 line_NN will_VM be_VB0 that_CS as_CS labour_VV0 \0MPs_NNSB2 opposed_VVD the_AT1 government_NN bill_
NN which_DDQ brought_VVD life_NN peers_NNS into_IO existence_NN ,_, they_PPHS2 should_VM not_XX now_RL put_VV0 forward_RR n
ominees_NNS .
Say, a user wants to find all NPs in this text. You could write a grammar for
NPs based on POS tags (an NP is, for instance, an AT1 followed by an NN).
A chart parser can now be used to find all constituents of category NP in the input.
You can use the items in the chart to mark all NPs in the input with brackets,
and present this to the user.
The corpus
The corpus has been converted to Prolog. Sentences are coded as
sentence(Nr,ListofTags), where ListofTags is a list of Prolog terms of the form tag(Word,Tag).
sentence(4,[tag('though','CS'),tag('they','PPHS2'),tag('may','VM'),tag('gather','VV0'),....
Words and tags are quoted to make sure Prolog treats them as atoms.
The corpus (1000 sentences) can be found in lob.pl
The grammar
The chart-parser assumes a grammar where rules are of the form
rule(Mother,ListofDaughters), e.g.
rule(np, [det,n]).
As the input has been tagged already, it is not necessary to make a real
dictionary. Rather, one has to define the relationship between POS tags and
categories used in the grammar, e.g.
lex(det,tag(_,'AT1')).
A determiner is a word whose POS tag is AT1.
The Parser
The parser is a shift-reduce parser which builds a chart using assert.
Items in the chart have the form constituent(Begin,End,Category), e.g.
constituent(0, 2, np).
The parser is called as:
:- bf_shift_reduce(Input)
where Input is a list of Tags.
You can get an overview of the parse result (i.e. the chart) by calling
:- listing(constituent).
after parsing a sentence.
The parser can be found in parser.pl
The exercise
- Add rules and `lexical' entries for constituents of
category NP and PP to the small fragment already provided in parser.pl
- Add a predicate find(Cat,Nr) to the parser, which writes to the screen the begin and end posistions
of all constituents in sentence Nr of category Cat.
- Add a predicate match(Cat,Nr) which writes sentence Nr with brackets around
the parts of the string which match category Cat.
- N.b. if at position N there are 2 or more matching constituents, only the longest
matching string needs to be shown ((longest match).
- Add a predicate search(Cat) which inspects every sentence in the corpus
and presents those sentences on the screen (with bracketing) which contain a matching Cat.
Also write the sentence number.
- N.b. Sentences without a match should not be shown.
- Test your program by adding rules and lexical entries for PPs containing a singular NP without a determiner.
( from home, etc.) and search for these
Tips
- Use write(Cat), write(' '), etc, and nl, to write to the screen.
- Negation in Prolog is \+ (e.g. \+ ( constituent(_,_,np) ).
Hand in
Hand in the modified parser code, add comments if needed.
Give examples of the output you get for exercise 5.
Deadline: Friday, June, 13