- Slide 1
- You might want to recap the definitions of precision and recall
here. Check the material from IR and the WWW for definitions.
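If it helps to make the recap concrete, here is a minimal sketch of the two
measures over sets of document ids (plain Python; the numbers are made up
purely for illustration):

    def precision_recall(retrieved, relevant):
        # retrieved: set of document ids returned by the system
        # relevant:  set of document ids judged relevant
        hits = retrieved & relevant   # relevant documents actually retrieved
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # 10 documents retrieved, 4 of them relevant, 8 relevant documents in total
    p, r = precision_recall(set(range(10)), {1, 3, 5, 7, 12, 14, 16, 18})
    print(p, r)   # 0.4 0.5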
- Slide 2
- The concepts introduced here will be explained during the course of
the lecture.
- Slide 3
- Very popular terms can have such a high weight that they dwarf any
weight or influence that the other terms in a query might have. In
particular, highly ambiguous terms occur more frequently by their very
nature and so are treated as more important! Making the other terms
mandatory has a filtering effect on the retrieved document list; how
this is done is not relevant here, but it is usually implemented in the
same manner as normal filtering (see below).
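To see the effect in miniature, here is a toy sketch (illustrative Python,
raw term-frequency weighting with no dampening) of one very common term
dominating the ranking, and of mandatory terms pruning the retrieved list:

    from collections import Counter

    # Toy collection: raw term-frequency weighting, no idf-style dampening.
    docs = {
        "d1": "web web web web web search",
        "d2": "search engine evaluation precision recall",
        "d3": "web pages about gardening",
    }
    query = ["web", "evaluation"]
    mandatory = {"evaluation"}   # terms the user has marked with '+'

    def score(text):
        tf = Counter(text.split())
        return sum(tf[t] for t in query)   # the frequent term dominates this sum

    ranked = sorted(docs, key=lambda d: score(docs[d]), reverse=True)
    print(ranked)     # d1 comes first purely on the weight of 'web'

    # Filtering effect: keep only documents containing every mandatory term.
    filtered = [d for d in ranked if mandatory <= set(docs[d].split())]
    print(filtered)   # only d2 survives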
- Slide 4
- Wild-card matching may seem obvious but is rather tricky to
implement. See [VR79] for details.
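For the simplest case, a trailing wildcard such as comput*, one common
approach is a binary search over the sorted term dictionary; the sketch below
(illustrative Python, made-up dictionary) handles only that case. Leading or
internal wildcards need extra machinery such as permuterm or k-gram indexes,
which is part of what makes them tricky.

    import bisect

    # Sorted term dictionary from the index; terms are illustrative.
    terms = sorted(["compute", "computer", "computers", "computing",
                    "comic", "con", "contract"])

    def trailing_wildcard(prefix):
        # Terms matching 'prefix*' via binary search on the sorted dictionary.
        lo = bisect.bisect_left(terms, prefix)
        hi = bisect.bisect_left(terms, prefix + "\uffff")  # just past the prefix range
        return terms[lo:hi]

    print(trailing_wildcard("comput"))
    # ['compute', 'computer', 'computers', 'computing']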
- Slide 5
- Co-occurrence analysis is a standard IR technique for recognising and
extracting phrases from a document collection. It treats the entire
document collection as a corpus and analyses all pairs of co-occurring
words. Co-occurring can mean anything from being directly beside one
another in the text to appearing in the same 250-word text window.
This technique originally comes from statistical MT. [Dai94] compares and contrasts some common
formulae for phrase recognition. A 200-word window is the usual size chosen.
The use of NLP and parsing techniques for phrase extraction has been
extensively investigated, but they have not been found to be any more
effective than statistical analysis (and they are much slower) [SQ96].
A small sketch of window-based pair counting appears at the end of this note.
Altavista will attempt to recognise phrases from the order of terms in
a query. Check the bottom of the results screen to see if it has done
this. It is important to note that in this case it does not also submit
the individual terms of the phrase to the search; if you want that, add
the terms to the query again in an order that will not be recognised as
a phrase. If you explicitly attempt to search on a phrase
that is not recognised as one by Altavista, it treats the constituents
of that phrase as separate keywords in the usual fashion.
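The window-based counting mentioned above can be sketched as follows (toy
Python; a tiny window and raw pair counts stand in for the large windows and
the association measures compared in [Dai94]):

    from collections import Counter

    # Toy corpus; a real system would stream the whole collection.
    docs = [
        "information retrieval systems rank documents",
        "modern information retrieval uses statistical methods",
        "the documents are ranked by the retrieval system",
    ]
    WINDOW = 5   # co-occurrence window in tokens (real systems use far larger windows)

    pair_counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + WINDOW]:
                pair_counts[tuple(sorted((w, v)))] += 1

    # Frequent pairs are candidate phrases; a real system would score them
    # with a proper association measure rather than raw counts.
    print(pair_counts.most_common(3))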
- Slide 6
- Ambiguous keywords also tend to be the most highly weighted, as their
multiple senses result in more documents containing them being retrieved. Altavista
also promotes documents that receive a great many hits to the top of
the list (this is why searches on very rare keywords often include sites
that have nothing to do with the keywords in the results).
Ambiguity is a huge problem in Natural Language Processing (NLP). Word
Sense Disambiguation (WSD) usually requires some contextual clues
(absent in IR) or a rich resource that can be used to provide the
necessary contextual leverage. [MS99] reviews
different statistical WSD methods. [Kwo98] describes a
WSD method using Wordnet [Fel98]. Fung et al [FY98]
investigated using the retrieval context to disambiguate on-the-fly.
Filtering is a crude way of implementing some very simple WSD on
search results. It's also useful if you want to focus on a less
prominent aspect of a topic in the search results,
e.g. +Nixon +china -Watergate
Typically the retrieved document list is built first, then pruned using the
filtering criteria (a small sketch appears at the end of this note).
Filtering is important for services like
personalised news wires, where a user's preferences are stored in a
profile and used to filter the stream of incoming news stories.
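A minimal sketch of this build-then-prune filtering, with hypothetical
documents and the +Nixon +china -Watergate criteria above:

    # Hypothetical retrieved list: (document id, text) pairs, already ranked.
    results = [
        ("d1", "Nixon and the Watergate scandal"),
        ("d2", "Nixon visits China in 1972"),
        ("d3", "trade between the US and China"),
    ]
    required = {"nixon", "china"}      # '+' terms
    excluded = {"watergate"}           # '-' terms

    def keep(text):
        words = set(text.lower().split())
        return required <= words and not (excluded & words)

    # Build the ranked list first, then prune it with the filtering criteria.
    print([d for d, text in results if keep(text)])   # ['d2']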
- Slide 7
- Relevance feedback is not often implemented in commercial
systems. This is because when it works, it works brilliantly, but when
it doesn't, the results really stink. In addition, the user needs to
select at least 5 relevant documents per iteration and most people
don't want to have to do this. Pseudo-feedback (assuming the top N
documents are relevant) tries to get around this but is not nearly as
effective.
There is a good introduction to relevance feedback in
[VR79]. The people at UMass [HC93] have
implemented and compared numerous term selection and weighting
mechanisms. You cannot do relevance feedback with Altavista but you
can crudely simulate it yourself when searching.
Look at the relevant documents
you retrieved and see what other terms they contain. Add some to your
query and issue it again.
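One classical way of folding selected terms back into the query is a
Rocchio-style reweighting; the sketch below is illustrative only
(term-frequency vectors as dictionaries, untuned alpha/beta/gamma weights)
and is not any of the specific mechanisms compared in [HC93]:

    from collections import Counter

    ALPHA, BETA, GAMMA = 1.0, 0.75, 0.15   # illustrative weights

    def rocchio(query, relevant, nonrelevant):
        new = Counter({t: ALPHA * w for t, w in query.items()})
        for doc in relevant:
            for t, w in doc.items():
                new[t] += BETA * w / len(relevant)
        for doc in nonrelevant:
            for t, w in doc.items():
                new[t] -= GAMMA * w / len(nonrelevant)
        # Negative weights are usually dropped before re-issuing the query.
        return Counter({t: w for t, w in new.items() if w > 0})

    q = Counter({"jaguar": 1.0})
    rel = [Counter({"jaguar": 2, "car": 3}), Counter({"jaguar": 1, "engine": 2})]
    nonrel = [Counter({"jaguar": 1, "cat": 4})]
    # 'car' and 'engine' pick up positive weight; 'cat' is pushed out.
    print(rocchio(q, rel, nonrel).most_common())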
- Slide 8
- With query expansion, once again we run into the knowledge engineering bottleneck. If you
have all the necessary linguistic resources to hand, it works fine, but
usually they are not available in a suitable form. It is
also not clear how existing resources can be adapted to perform
optimally for IR use. For example, dictionary companies are
notoriously reluctant to hand over electronic copies of their
products and usually apply severely restrictive contracts to what you
can do with them.
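Mechanically, the expansion step itself is simple once a resource exists;
the sketch below uses a tiny hand-built synonym table purely as a stand-in
for the kind of linguistic resource discussed above:

    # Toy thesaurus; entries are illustrative only.
    thesaurus = {
        "car": ["automobile", "vehicle"],
        "fast": ["quick", "rapid"],
    }

    def expand(query_terms):
        expanded = list(query_terms)
        for t in query_terms:
            expanded.extend(thesaurus.get(t, []))
        return expanded

    print(expand(["fast", "car", "rental"]))
    # ['fast', 'car', 'rental', 'quick', 'rapid', 'automobile', 'vehicle']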
- Slide 9
- Boolean querying is very good if you are looking for something in
particular in an enormous document set. Note that the results are not
ranked at all, as a document either matches the query or does
not. Significant user training is often required to get the
most out of a Boolean querying system. However, once the user has
worked out how to use the system, results are good.
For idle web browsing, ranked keyword search is preferred.
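The matching itself reduces to set operations over an inverted index, which
is also why no ranking falls out of it; a minimal sketch with made-up
postings:

    # Minimal inverted index (term -> set of document ids); values are illustrative.
    index = {
        "retrieval": {1, 2, 4},
        "boolean":   {2, 3},
        "ranking":   {1, 4},
    }

    # (retrieval AND boolean) NOT ranking -- a document either matches or it
    # does not, so the result is an unranked set.
    result = (index["retrieval"] & index["boolean"]) - index["ranking"]
    print(result)   # {2}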
- Slide 10
- This helps to deal with some aspects of the ambiguity problem.
- Slide 11
- This may seem amazing but extensive research has proved that this is indeed the case. The trend is now
towards using NL resources to augment standard, statistical IR methods [SQ96].
- Slide 12
- This is also a good way of locking out sites in the US, which tend to
swamp any attempts to find European sites. Altavista also allows
domain restriction, so you can restrict searches to .uk, .nl
etc. Unfortunately there is no way to distinguish the .com and .net
hierarchy sites from one another. (A tiny post-filtering sketch of domain
restriction appears at the end of this note.)
SYSTRAN is a transfer-based MT
system that has been around for a very
long time. The EC uses it extensively.
Cross-Language Information Retrieval - this is a field in its own
right, as many of the observed characteristics of standard IR do not
hold in the multilingual environment. [Gre98] is a
good introduction to current experimental approaches to CLIR. If you
can't get the book, most of the authors have made their publications
available on-line.
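As an aside on the domain-restriction point above, post-filtering results by
top-level domain is a simple illustration of the idea (hypothetical URLs):

    from urllib.parse import urlparse

    # Hypothetical result URLs; keep only hosts under a chosen national domain.
    urls = [
        "http://www.example.co.uk/page",
        "http://www.example.com/page",
        "http://www.voorbeeld.nl/pagina",
    ]

    def in_domain(url, tld):
        host = urlparse(url).hostname or ""
        return host.endswith("." + tld)

    print([u for u in urls if in_domain(u, "uk")])
    # ['http://www.example.co.uk/page']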