The construction of lgrep

The search tool we propose is reminiscent of the UNIX tool grep, which searches text files for lines matching a given regular expression. lgrep differs from grep in four important respects:

lgrep searches for sentences, rather than lines
lgrep provides for a much richer regular expression language
lgrep provides for an extendible regular expression language
lgrep has facilities to integrate it with Internet search engines

The tool will thus tokenize a given text file into a series of sentences rather than lines. The technology for this tokenization task exists (for instance [40]), even if it is not perfect. Because it is not perfect, we propose an architecture in which a useful default tokenization scheme can be augmented with user-specified alternatives.
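The idea of a default tokenization scheme with user-specified alternatives can be sketched in Python as follows. This is only an illustration under stated assumptions: the names default_tokenizer and sentence_grep are hypothetical, and the boundary rule (sentence-final punctuation followed by whitespace and an upper-case letter) is a deliberately naive stand-in for a real sentence tokenizer such as the one cited above.

```python
import re
from typing import Callable, List

# A sentence tokenizer maps raw text to a list of sentences.
Tokenizer = Callable[[str], List[str]]

# Naive default boundary: ., ! or ? followed by whitespace and an upper-case letter.
# This is an assumption for illustration; it will mis-split abbreviations.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def default_tokenizer(text: str) -> List[str]:
    # Collapse all whitespace, so that a sentence may span several lines.
    flat = " ".join(text.split())
    return [s for s in _BOUNDARY.split(flat) if s]


def sentence_grep(pattern: str, text: str,
                  tokenizer: Tokenizer = default_tokenizer) -> List[str]:
    """Return the sentences (not lines) of text that match pattern."""
    regex = re.compile(pattern)
    return [s for s in tokenizer(text) if regex.search(s)]


text = "The cat sat on\nthe mat. A dog barked. It rained!"
print(sentence_grep(r"cat.*mat", text))

# A user-specified alternative: treat blank-line separated blocks as units.
by_paragraph = lambda t: [p.strip() for p in t.split("\n\n") if p.strip()]
print(sentence_grep("dog", "a dog\n\nno match", tokenizer=by_paragraph))
```

Note that the first query matches across the line break in the input, which a line-based search would miss. Passing the tokenizer as a parameter is one simple way to let users replace the default scheme.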

The regular expression language that lgrep should support will be much more powerful than the regular expression languages typically found in tools such as grep and Perl. In particular, we foresee an interface with a part-of-speech tagger that makes the part-of-speech labels available as nullary regular expression operators. For instance, the operator noun will denote all words with part-of-speech noun.

On top of this, the regular expression language might define more complex linguistic categories as regular expression operators such as np. As a very first approximation, np could be defined as [det^,adj*, noun+] (in the FSA Utilities notation; this expression denotes an optional determiner followed by any number (including zero) of adjectives, followed by one or more nouns). Ultimately, we hope to be able to extract the definitions of such operators by means of finite-state approximation techniques from a general grammar of Dutch.
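To make the first approximation of np concrete, here is a minimal Python sketch. It assumes that the sentence has already been tagged (the tagger itself is not shown, and the example tags are supplied by hand). Each part-of-speech label is encoded as a single character, so that [det^, adj*, noun+] can be written as the ordinary regular expression D?A*N+ over the tag string. The names TAG_CODE and find_nps are hypothetical, and this encoding is merely one convenient way to emulate the operators; it is not how the FSA Utilities work.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech label)

# One character per label, so a tagged sentence becomes a plain string.
TAG_CODE = {"det": "D", "adj": "A", "noun": "N"}
OTHER = "x"  # stands for any label not listed above

# [det^, adj*, noun+]: an optional determiner, any number of adjectives,
# and one or more nouns.
NP = re.compile(r"D?A*N+")


def find_nps(sentence: List[Token]) -> List[str]:
    """Return the word sequences whose tag sequence matches NP."""
    codes = "".join(TAG_CODE.get(tag, OTHER) for _, tag in sentence)
    return [
        " ".join(word for word, _ in sentence[m.start():m.end()])
        for m in NP.finditer(codes)
    ]


tagged = [("the", "det"), ("old", "adj"), ("grey", "adj"), ("cat", "noun"),
          ("chased", "verb"), ("mice", "noun")]
print(find_nps(tagged))  # ['the old grey cat', 'mice']
```

Because every token maps to exactly one character, match offsets in the tag string are also token offsets, which is what allows the matched words to be recovered directly.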

The operators provided by the regular expression language should be considered default implementations. For the tool to be easily adaptable to different linguistic insights and different needs, it is essential that these operators can be redefined, and that new operators can be defined (typically in terms of existing regular expression operators).
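One way to picture redefinable operators is a table of named definitions that is expanded only when an operator is used, so that redefining one operator automatically affects every operator defined in terms of it. The Python sketch below is an assumption-laden illustration: the class OperatorTable, the {name} reference syntax, and the single-character tag codes (D, A, N, M, P) are all hypothetical and chosen only to keep the example short.

```python
import re
from typing import Dict, Tuple


class OperatorTable:
    """Named operators; a definition may refer to another operator as {name}."""

    # Names must start with a letter or underscore, so regex counts such as
    # {2} are not mistaken for operator references.
    _REF = re.compile(r"\{([A-Za-z_]\w*)\}")

    def __init__(self, defaults: Dict[str, str]) -> None:
        self._defs = dict(defaults)

    def define(self, name: str, definition: str) -> None:
        # Adds a new operator or overrides an existing (default) one.
        self._defs[name] = definition

    def expand(self, name: str, _seen: Tuple[str, ...] = ()) -> str:
        """Expand name into a plain regular expression over tag codes."""
        if name in _seen:
            raise ValueError(f"cyclic definition of {name!r}")
        body = self._defs[name]
        return self._REF.sub(
            lambda m: "(?:" + self.expand(m.group(1), _seen + (name,)) + ")",
            body,
        )


# Default operators (hypothetical tag codes: D = det, A = adj, N = noun).
table = OperatorTable({
    "det": "D", "adj": "A", "noun": "N",
    "np": "{det}?{adj}*{noun}+",
})
print(table.expand("np"))  # (?:D)?(?:A)*(?:N)+

# Redefine an existing operator: also accept proper nouns (code M) as nouns.
table.define("noun", "[NM]")
# Define a new operator in terms of an existing one (P = preposition).
table.define("pp", "P{np}")
print(bool(re.fullmatch(table.expand("pp"), "PDAM")))  # True
```

Expanding references at use time rather than at definition time is the design choice that makes the defaults overridable: pp picks up the new definition of noun without being redefined itself.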

Finally, lgrep will be integrated with Internet search engines, so that the Internet can be regarded as a large text corpus. We have already implemented a small tool which is capable of searching for sentences containing some given word on arbitrary web sites. This works fully automatically as follows. First, a search query is sent to a search engine. The resulting pages are automatically scanned for relevant links. These links are visited and the corresponding pages are tokenized into sentences. The sentences are then scanned for occurrences of the search query. A similar idea is described in [116]. We plan to integrate lgrep in a similar fashion. As a more ambitious task, we propose to investigate the possibilities of a search engine which is capable of searching for linguistic patterns directly.
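The steps of this procedure can be sketched in Python using only the standard library. This is not the tool mentioned above, whose implementation is not described here. It is a minimal sketch under several assumptions: page retrieval is passed in as a function fetch (so no particular search engine or network API is presumed), results_url is a hypothetical URL that already encodes the query, every link on the results page is followed (no relevance filtering), and sentence splitting is deliberately naive. The names parse_page, split_sentences and web_sentences are illustrative.

```python
import re
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects link targets and text content from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self.text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Simplification: script and style content is not filtered out.
        self.text.append(data)


def parse_page(html: str) -> _PageParser:
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return parser


def split_sentences(text: str) -> List[str]:
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def web_sentences(word: str, results_url: str,
                  fetch: Callable[[str], str]) -> List[str]:
    """Return sentences containing word on pages linked from results_url."""
    # Step 1: retrieve the result page for the query.
    results = parse_page(fetch(results_url))
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits: List[str] = []
    # Steps 2 and 3: scan the result page for links and visit each of them.
    for href in results.links:
        page = parse_page(fetch(urljoin(results_url, href)))
        # Steps 4 and 5: tokenize into sentences and scan for the query word.
        for sentence in split_sentences(" ".join(page.text)):
            if pattern.search(sentence):
                hits.append(sentence)
    return hits


# Offline demonstration with canned pages instead of real network access.
pages = {
    "http://search.example/?q=fiets":
        '<a href="http://a.example/1">one</a> <a href="/2">two</a>',
    "http://a.example/1": "<p>Ik heb een fiets. Het regent.</p>",
    "http://search.example/2": "<p>Geen treffer hier.</p>",
}
print(web_sentences("fiets", "http://search.example/?q=fiets",
                    pages.__getitem__))
```

Injecting fetch keeps the sketch testable without network access; in a real setting it could be replaced by a function that downloads the page, for example one built on urllib.request.urlopen.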

2000-07-10