What is lgrep?


To start with a simple example, such a tool could be used to search text corpora for a particular reading of a given word. For instance, the Dutch word bar is ambiguous: it can be a noun (in which case it means the same as in English), or it can be a degree adverb, as in:

    a. bar slecht
       'quite bad'
    b. bar vervelend
       'quite boring'

In the latter case, bar is a negative polarity item. To collect example sentences of such negative polarity items (for instance, to investigate the various contexts in which they can occur), a linguist now typically uses a tool that searches for a given word form. The linguist must then check the resulting set of sentences by hand to filter out all the unwanted sentences in which bar is used as a noun. Because in this particular case the unwanted examples are much more frequent than the useful ones, this is a time-consuming task. If the tool were to possess linguistic knowledge, as we propose here, it could discard the unwanted examples itself.
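
The difference between the two kinds of search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy sentences, the tag names (N, ADV and so on) and the functions find_word and find_reading are hypothetical, and the hand-assigned tags stand in for whatever analysis the proposed tool would compute on unannotated text.

```python
# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
# The sentences and tag names are invented for illustration; in the proposed
# tool the analysis would be computed, not read from an annotated corpus.
CORPUS = [
    [("de", "DET"), ("bar", "N"), ("is", "V"), ("open", "ADJ")],
    [("dat", "PRON"), ("is", "V"), ("bar", "ADV"), ("slecht", "ADJ")],
    [("hij", "PRON"), ("vond", "V"), ("het", "PRON"),
     ("bar", "ADV"), ("vervelend", "ADJ")],
]


def as_text(sentence):
    """Join the words of a tagged sentence into a plain string."""
    return " ".join(word for word, _ in sentence)


def find_word(corpus, word):
    """Word-form search: every sentence containing `word`, whatever its reading."""
    return [as_text(s) for s in corpus
            if any(w.lower() == word for w, _ in s)]


def find_reading(corpus, word, tag):
    """Reading-aware search: only sentences where `word` carries `tag`."""
    return [as_text(s) for s in corpus
            if any(w.lower() == word and t == tag for w, t in s)]


if __name__ == "__main__":
    print(find_word(CORPUS, "bar"))            # includes the noun reading
    print(find_reading(CORPUS, "bar", "ADV"))  # degree-adverb reading only
```

With a plain word-form search the noun sentence has to be removed by hand afterwards; the reading-aware search leaves it out from the start.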

As a much more complicated example, one could ask (in some appropriate format) for sentences in which a prepositional phrase argument has been extraposed to the right of the verbal group. Note that, to find appropriate examples, the tool must not only be capable of recognising syntactic phrases such as root sentences and prepositional phrases; its analysis must also be deep enough to distinguish prepositional phrases which function as adjuncts from those which function as arguments. The tool would then return a set of examples such as:

    a. Zou Allende de parlementaire weg verlaten, dan kan hi... ... to accept her authority in order to count on her sympathy.

Another use of the tool could be to identify examples of verb raising constructions in which an adjunct takes narrow scope, i.e. the adjunct modifies one of the verbs embedded in the verb cluster (cf. [113]):

    a. Dit vertelde onlangs Mamie Eisenhouwer, de weduwe va... ...er real Wieringa, who doesn't want to be pushed in a corner?

It should be clear, however, that this tool has only limited knowledge of syntactic constructions (otherwise, creating the tool would presuppose the very knowledge that its use seeks to discover). We envisage that the tool provides an extension of regular expressions capable of recognising matching syntactic brackets, major syntactic categories, and grammatical functions such as subject, (in)direct object and modifier.
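
To make the idea of matching syntactic brackets, categories and grammatical functions more concrete, here is a minimal Python sketch. It is not the query language of the tool: the bracket notation "[cat:func ... ]", the labels (s, np, pp, su, obj), the example sentence and the functions phrases and find are all assumptions made for illustration. Since ordinary regular expressions cannot match nested brackets, the sketch uses a regular expression only to tokenise and a stack to pair the brackets.

```python
import re

# Hypothetical notation: "[cat:func" opens a phrase with a category and an
# optional grammatical function, "]" closes it, anything else is a word.
TOKEN = re.compile(r"\[(\w+)(?::(\w+))?|\]|[^\s\[\]]+")

# Invented example analysis, hand-written for illustration.
EXAMPLE = "[s [np:su Jan ] [v wacht ] [pp:obj op [np de trein ] ] ]"


def phrases(annotated):
    """Return (category, function, text) for every bracketed phrase."""
    stack, found = [], []
    for match in TOKEN.finditer(annotated):
        token = match.group(0)
        if token.startswith("["):
            # Open a phrase: remember its labels and collect its words.
            stack.append((match.group(1), match.group(2), []))
        elif token == "]":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            cat, func, words = stack.pop()
            found.append((cat, func, " ".join(words)))
            if stack:
                # The words of a phrase also belong to the enclosing phrase.
                stack[-1][2].extend(words)
        elif stack:
            stack[-1][2].append(token)
    if stack:
        raise ValueError("unbalanced opening bracket")
    return found


def find(annotated, cat, func=None):
    """Texts of phrases with category `cat` and grammatical function `func`."""
    return [text for c, f, text in phrases(annotated)
            if c == cat and f == func]


if __name__ == "__main__":
    print(find(EXAMPLE, "pp", "obj"))  # prepositional phrases used as arguments
    print(find(EXAMPLE, "np", "su"))   # subject noun phrases
```

Under these assumptions, a query for prepositional phrases that function as arguments rather than adjuncts is just a query for a category together with a function label.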

The novel feature of this application (in contrast with tools such as tgrep) will be that it can search text corpora which need not be syntactically annotated. This has the obvious advantage that much more corpus material is available (especially now that large amounts of text are available through the Internet). A further possible advantage is that it might be easier to change linguistic analyses in a grammar than in an annotated corpus. Of course, the challenge is to make this application fast enough to be of practical use. Moreover, we believe that even if only a small fraction of the described functionality can be achieved, the result could be a useful tool for linguists working with large text corpora.


2000-07-10