
Notes and References

This document gives some supplementary information and references on the material presented in Query Enhancement and Search Engines. When studying this material, it is recommended that you have a PC handy and execute the example searches to see the results for yourself. The details of searching in Altavista, which is used as an example, can be reviewed at the Altavista search site.

Most of the topics covered here are explored in more detail in [VR79,SM83]. Other references are provided where applicable.

Slide 1

You might want to recap the definitions of precision and recall here. Check the material from IR and the WWW for definitions.
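As a quick reminder, here is a minimal sketch of the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the document ids are made up for illustration; the course material remains the reference for the definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document ids, for illustration only.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(p, r)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.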

Slide 2

The concepts introduced here will be explained during the course of the lecture.

Slide 3

Very popular terms can have such a high weight that they dwarf any weight or influence that the other terms in a query might have. In particular, highly ambiguous terms occur more frequently by their very nature and are therefore considered more important! Making the other terms mandatory has a filtering effect on the retrieved document list. How this is done is not relevant here, but it is usually implemented in the same manner as normal filtering (see below).

Slide 4

Wild-card matching may seem obvious but is rather tricky to implement. See [VR79] for details.
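To see why it looks obvious, here is a naive sketch that expands a wild-card pattern by scanning the whole vocabulary with Python's `fnmatch` module. This is only an illustration under our own assumptions (the vocabulary and the function name are invented) and is not the approach described in [VR79]; a linear scan like this becomes expensive for a large index, which is part of what makes the real problem tricky.

```python
import fnmatch

def expand_wildcard(pattern, vocabulary):
    """Return every vocabulary term matching a shell-style wild-card pattern."""
    # fnmatchcase compares case-sensitively on every platform.
    return [term for term in vocabulary if fnmatch.fnmatchcase(term, pattern)]

# Hypothetical index vocabulary, for illustration only.
vocabulary = ["recall", "relevance", "retrieval", "retrieve", "retrieved"]
matches = expand_wildcard("retriev*", vocabulary)
print(matches)
```

Each matching term would then be added to the query as an alternative.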

Slide 5

Co-occurrence analysis is a standard IR technique for recognising and extracting phrases from a document collection. It treats the entire document collection as a corpus and analyses all pairs of co-occurring words. Co-occurring can mean anything from being directly beside one another in the text to appearing in the same 250-word text window.

This technique originally comes from statistical MT. [Dai94] compares and contrasts some common formulae for phrase recognition. A 200-word window is the usual size chosen.
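The counting step can be sketched as follows. This is a minimal illustration that collects raw pair counts only; the formulae compared in [Dai94] are not reproduced here, and the function name, the tiny window and the sample documents are assumptions chosen to keep the example readable.

```python
from collections import Counter

def cooccurrence_counts(documents, window=1):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for text in documents:
        tokens = text.lower().split()
        for i, left in enumerate(tokens):
            # Look only at the next `window` tokens to the right.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts

# Hypothetical mini-corpus, for illustration only.
documents = [
    "information retrieval systems",
    "information retrieval on the web",
    "web search engines",
]
counts = cooccurrence_counts(documents, window=1)
print(counts.most_common(1))
```

With `window=1` only directly adjacent words are paired; a larger value widens the text window. Pairs that co-occur far more often than chance would suggest are the phrase candidates.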

The use of NLP and parsing techniques for phrase extraction has been extensively investigated but has not been found to be any more effective than statistical analysis (and is much slower) [SQ96].

Altavista will attempt to recognise phrases from the order of terms in a query. Check the bottom of the results screen to see if it has done this. It is important to note that it does not submit the individual terms of a phrase to the search in this case. So if you want this done, you must also add the terms to the query in an order they will not be recognised in. If you explicitly attempt to search on a phrase that is not recognised as one by Altavista, it treats the constituents of that phrase as separate keywords in the usual fashion.

Slide 6

Ambiguous keywords also tend to be the most highly weighted, as multiple senses result in more documents containing them being retrieved. Altavista also promotes documents that receive a great many hits to the top of the list (this is why searches on very rare keywords often have sites that have nothing to do with the keywords in the results).

Ambiguity is a huge problem in Natural Language Processing (NLP). Word Sense Disambiguation (WSD) usually requires some contextual clues (absent in IR) or a rich resource that can be used to provide the necessary contextual leverage. [MS99] reviews different statistical WSD methods. [Kwo98] describes a WSD method using Wordnet [Fel98]. Fung et al [FY98] investigated using the retrieval context to disambiguate on-the-fly.

Filtering is a crude way of implementing some very simple WSD on search results. It's also good if you want to focus on something less important in the search results,

e.g. +Nixon +china -Watergate

Typically the retrieved document list is built first, then pruned using the filtering criteria. Filtering is important for services like personalised news wires, where a user's preferences are stored in a profile and used to filter the stream of incoming news stories.
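The build-then-prune idea can be sketched like this. It is a simplified illustration, not a description of how Altavista does it: the function names, the whitespace tokenisation and the sample documents are all assumptions.

```python
def parse_query(query):
    """Split a query into (required, excluded, optional) sets of lowercase terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional

def filter_results(retrieved, query):
    """Prune an already-built result list using the +/- criteria of the query."""
    required, excluded, _ = parse_query(query)
    kept = []
    for doc_id, text in retrieved:
        words = set(text.lower().split())
        # Keep a document only if it has every + term and no - term.
        if required <= words and not (excluded & words):
            kept.append(doc_id)
    return kept

# Hypothetical retrieved list, for illustration only.
retrieved = [
    ("d1", "Nixon visits China in 1972"),
    ("d2", "Nixon resigns over Watergate"),
    ("d3", "Nixon China policy after Watergate"),
]
kept = filter_results(retrieved, "+Nixon +china -Watergate")
print(kept)
```

Only the first document survives: the second lacks a mandatory term and the third contains the excluded one.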

Slide 7

Relevance feedback is not often implemented in commercial systems. This is because when it works, it works brilliantly, but when it doesn't, the results really stink. In addition, the user needs to select at least 5 relevant documents per iteration and most people don't want to have to do this. Pseudo-feedback (assuming the top N documents are relevant) tries to get around this but is not nearly as effective.

There is a good introduction to relevance feedback in [VR79]. The people at UMass [HC93] have implemented and compared numerous term selection and weighting mechanisms. You cannot do relevance feedback with Altavista but you can crudely simulate it yourself when searching.

Look at the relevant documents you retrieved and see what other terms they contain. Add some to your query and issue it again.
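If you want to do this a little more systematically, the manual procedure above can be sketched as a short script. This is a crude illustration only: it ranks candidate terms by how many of your chosen relevant documents contain them, which is not one of the term selection and weighting mechanisms compared in [HC93]. The stopword list, function name and sample documents are assumptions.

```python
from collections import Counter

# A tiny, assumed stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "for", "on"}

def suggest_terms(query, relevant_docs, k=3):
    """Suggest up to k new terms, ranked by how many relevant documents contain them."""
    query_terms = set(query.lower().split())
    doc_freq = Counter()
    for text in relevant_docs:
        doc_freq.update(set(text.lower().split()) - query_terms - STOPWORDS)
    # Sort by descending document count, then alphabetically for stable output.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical documents the user judged relevant, for illustration only.
relevant_docs = [
    "jaguar habitat in the amazon rainforest",
    "the jaguar is a big cat of the amazon",
    "big cat conservation in the rainforest",
]
suggested = suggest_terms("jaguar", relevant_docs, k=3)
expanded_query = "jaguar " + " ".join(suggested)
print(expanded_query)
```

The expanded query is then issued again, and the cycle can be repeated.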

Slide 8

With query expansion, once again we run into the knowledge engineering bottleneck. If you have all the necessary linguistic resources to hand it works fine but usually they are not available in a suitable form. It is also not clear how existing resources can be adapted to perform optimally for IR use. For example, dictionary companies are notoriously reluctant to hand over electronic copies of their products and usually apply severely restrictive contracts to what you can do with them.

Slide 9

Boolean querying is very good if you are looking for something in particular in an enormous document set. Note that the results are not ranked at all, as a document either matches the query or does not. Significant user training is often required to get the most out of a Boolean querying system. However, once the user has worked out how to use the system, results are good.
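A minimal sketch of why Boolean results are unranked: each term maps to a set of documents, and the operators are plain set operations, so a document is either in the answer set or not. The index layout, function name and sample documents are assumptions for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical document collection, for illustration only.
docs = {
    "d1": "query expansion with a thesaurus",
    "d2": "boolean query languages",
    "d3": "thesaurus construction",
}
index = build_index(docs)
all_ids = set(docs)

both = index["query"] & index["thesaurus"]            # query AND thesaurus
either = index["query"] | index["thesaurus"]          # query OR thesaurus
without = index["query"] & (all_ids - index["boolean"])  # query AND NOT boolean
print(sorted(both), sorted(either), sorted(without))
```

Nothing in the answer sets says which matching document is better than another, which is exactly the contrast with ranked keyword search.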

For idle web browsing, ranked keyword search is preferred.

Slide 10

This helps to deal with some aspects of the ambiguity problem.

Slide 11

This may seem amazing but extensive research has proved that this is indeed the case. The trend is now towards using NL resources to augment standard, statistical IR methods [SQ96].

Slide 12

This is also a good way of locking out sites in the US which tend to swamp any attempts to find European sites. Altavista also allows domain restriction, so you can restrict searches to .uk, .nl etc. Unfortunately there is no way to distinguish the .com and .net hierarchy sites from one another.
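Domain restriction amounts to a check on the host name of each result. The sketch below shows the idea on a list of URLs; it is an illustration under our own assumptions (invented function name and example URLs), not Altavista's implementation.

```python
from urllib.parse import urlparse

def restrict_to_domains(urls, suffixes):
    """Keep only the URLs whose host name ends with one of the given suffixes."""
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if any(host.endswith(suffix) for suffix in suffixes):
            kept.append(url)
    return kept

# Hypothetical result URLs, for illustration only.
urls = [
    "http://www.example.co.uk/ir.html",
    "http://www.example.nl/zoeken.html",
    "http://www.example.com/search.html",
]
european = restrict_to_domains(urls, (".uk", ".nl"))
print(european)
```

A .com host says nothing about where the site is based, which is why country-code domains are the only ones that help here.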

SYSTRAN is a transfer-based MT system that has been around for a very long time. The EC uses it extensively.

Cross-Language Information Retrieval - this is a field in its own right, as many of the observed characteristics of standard IR do not hold in the multilingual environment. [Gre98] is a good introduction to current experimental approaches to CLIR. If you can't get the book, most of the authors have made their publications available on-line.


Nerbonne J.
1999-09-27