Search Engines LIX019B05

This course focuses on methods and techniques used in information systems for unstructured text, such as search engines. Text preparation and indexing, boolean and vector models for search systems, user feedback, interfaces, the PageRank algorithm, and evaluation of search systems will be discussed. To support the theory, we experiment with the various techniques using implementations in Python.

Details

Plan

  1. Chapter 1: introduction; boolean retrieval; posting lists; intersection in Python; Python code
  2. Chapter 2: decoding, tokenization and normalization; sub-linear posting list intersection; phrase queries
  3. Chapter 3: dictionaries; wildcard queries; spell correction; (stdin/stdout/stderr: w3.pdf)
  4. Chapter 6: scoring and term weighting; term and document frequency weighting; vector space model
  5. Chapter 8: evaluation
  6. Chapter 21: link analysis: PageRank
  7. Chapter 9: relevance feedback and query expansion

Exercises

Weekly exercises are to be handed in on Nestor. You must submit your solutions to the weekly exercises by the evening before the next lecture, unless stated otherwise.

For programming exercises, you must submit everything I need in order to run your program. You may assume that I have a Linux workstation, just like LWP, and that I use the same version of Python3 as the current version on LWP. Other versions of Python are not accepted.

For textual exercises, the only accepted formats are ordinary plain text files or PDF. Other formats such as .doc, .docx, .rtf are not accepted.

The files you submit as solutions in Nestor should use very simple file names: short, no spaces, no funny characters. There is no need for student-number or name in the file name. Simply use names such as 1a.py, 1b.py, 3a.py.

If you need to submit several files, you must collect these files together in a .zip file. That way, the original file names are preserved.

You may only submit your own work.

Of course, it is fine to help each other. But note that the assignments are individual assignments, not group assignments. Identical solutions will obtain the lowest possible grade.

If you submit solutions to exercises more than once, then I will assume the latest submission is your final submission. I will ignore the other submissions, and I will grade only the final submission.

In some cases, exercises build on top of exercises from previous weeks. It is assumed that you take any feedback on previous exercises seriously, so that previous mistakes are corrected. If not (i.e., if I see the same error again), the impact on your grade is doubled.

Week 1

Submit the book exercises as a PDF document. As for the Python exercises, only submit the final exercise, with the required pickle files. Do *not* submit the twitter data.
  1. Exercises from the book:
    1. 1.1
    2. 1.2
    3. 1.3
    4. 1.4
    5. 1.7
    6. 1.13
  2. Exercises Python:
    1. The following UNIX pipe provides tweets in a tabular format, where each line contains a single tweet. Each line consists of four fields, separated by TAB. The fields represent: identifier of the tweet, user of the tweet, text of the tweet, and tokenized text of the tweet.
      zcat /net/corpora/twitter2/Tweets/2016/08/20160828:10.out.gz | 
            /net/corpora/twitter2/tools/tweet2tab id user text words
      
      Write a Python program which builds a database on the basis of this input. You must make a dictionary in which the identifier of the tweet is used as the key. The value is a tuple (user,text,words). You can save the dictionary by means of the "pickle" module. This module can be used to save a Python data-structure on disc, to be used later. For instance, if your dictionary is called "docs", then you can save it in the file "docs.pickle" as follows:
      import pickle
      
      with open('docs.pickle','wb') as f:
          pickle.dump(docs,f)
      
    2. Write a Python program which reads the dictionary back into memory:
      import pickle
      
      docs = pickle.load(open('docs.pickle','rb'))
      
      Your program then reads lines from standard input. Each line is supposed to contain an identifier of a tweet. You print the words of the corresponding tweet to standard output. Please make sure that you know what standard input and standard output are.
    3. Write a Python program - a variant of the first program - which in addition builds a "posting list". This implies that you must build a dictionary where the key is a word, and the value is the list/set/.. of tweet identifiers in which that word occurs. Before you start programming, please consult the next exercises, to come up with an appropriate representation of the list/set/.. of tweet identifiers. The posting list dictionary must be saved to disk as another pickle file.
    4. Write a Python program which takes the pickle files from exercises 1 and 3. The program reads lines from standard input. Each line is supposed to contain a word. The program prints all tweets which contain that word.
    5. Extend the Python program as follows. In case standard input contains two words (rather than one), your program should print all tweets in which *both* words occur. Of course, you are expected to take into account the material of the first lecture. Please think of a good algorithm and good data-structures.
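For the final exercise above, the key observation from the first lecture is that posting lists can be kept sorted, so two lists can be intersected in a single merge-style pass. The following is only a rough sketch of that idea, not a model solution; the function and variable names are my own:

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted posting lists
    (lists of tweet identifiers), as in Chapter 1 of the book."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the pointer at the smaller identifier
        else:
            j += 1
    return answer

print(intersect([1, 3, 5, 8], [2, 3, 8, 9]))  # [3, 8]
```

Because each step advances one of the two pointers, intersecting lists of length m and n takes O(m + n) time, rather than the O(m * n) of a naive nested loop.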
My solutions can be found here: files

Week 2

You hand in (in Nestor) the book exercise, and the final programming exercise (including all required auxiliary files).
  1. Exercises from the book:
    1. 2.1
    2. 2.2
    3. 2.3
    4. 2.4
    5. 2.6
    6. 2.8
    7. 2.9
  2. Programming exercise. Write a Python3 program which reads tweets from standard input (same format as last week). Your program prints, for each tweet and each word, the sorted position list of that word in that tweet. Example:
    $ zcat /net/corpora/twitter2/Tweets/2015/08/20150828:15.out.gz |
           /net/corpora/twitter2/tools/tweet2tab id user text words | head -n2 | python3 ex2.1.py
    het [8]
    En [0]
    bij [4]
    de [5]
    #welweerhandighe [13]
    verstap [1]
    #Jumbo [6]
    been [12]
    schiet [7]
    in [10]
    je [2, 3, 9, 11]
    RT [0]
    ! [16]
    geschonken [14]
    eigen [12]
    worden [15]
    de [11]
    kan [5]
    voor [10]
    @NotarisZelhem [1]
    een [8]
    ton [9]
    woning [13]
    https://t.co/yeWn3MiQPS [17]
    weer [7]
    2017 [4]
    : [2]
    er [6]
    Vanaf [3]
    
  3. Programming exercise. Write a Python3 function bigram(l1,l2) which takes two sorted lists of positions as argument. The function returns "True" if there is a position p in the first list such that there is a position p+1 in the second list. In all other cases the function returns "False". Your solution should exploit the fact that both lists are ordered. Make sure you can replicate the following interaction:
    python3
    >>> from bigram import bigram
    >>> bigram([1,6,8],[4,7,10])
    True
    >>> bigram([1,16,18],[4,7,10,20])
    False
    
    Hint: it may be helpful to use explicit iterators. Read up on the documentation for the built-in functions iter() and next().
  4. Programming exercise. We use the same Twitter data as last week. Write a Python3 program which builds a database recording, for each term, in which tweets the term occurs, and in addition the ordered position list of the term in each tweet. The database is saved with the pickle module.
  5. Programming exercise. Based on the previous exercise, build a program which loads the database and then takes queries from standard input. Each query contains two terms (separated by space or tab), q1 and q2. Your program returns all tweets where q1 directly precedes q2.
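One possible shape for the bigram(l1,l2) function from exercise 3, using the explicit iterators suggested in the hint. This is only a sketch of the idea, not the required solution; since both lists are sorted, each list is traversed at most once:

```python
def bigram(l1, l2):
    """Return True iff some position p in sorted list l1
    has position p+1 in sorted list l2."""
    it1, it2 = iter(l1), iter(l2)
    try:
        p = next(it1)
        q = next(it2)
        while True:
            if q == p + 1:
                return True
            elif q < p + 1:
                q = next(it2)   # q is too small: advance in l2
            else:
                p = next(it1)   # q is too large: advance in l1
    except StopIteration:       # one of the lists is exhausted
        return False

print(bigram([1, 6, 8], [4, 7, 10]))        # True  (6 and 7 are adjacent)
print(bigram([1, 16, 18], [4, 7, 10, 20]))  # False
```

Note how StopIteration doubles as the termination condition: as soon as either iterator runs out, no adjacent pair can exist anymore.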
My solutions can be found here: files

Week 3

  1. Exercises from the book:
    1. 3.4
    2. 3.8
  2. Programming exercise Levenshtein.
    1. Standard Levenshtein. Implement a Python function which returns the Levenshtein distance (Figure 3.5 in the book) for two given words. Add a main function to the program, so that the program reads lines from standard input. Each line is supposed to contain two words. Those two words, together with their Levenshtein distance, are printed to standard output. Insertion, deletion and substitution all have a cost of 1. For instance:
      $ echo "de	die" | python3 lev.py
      de	die	1
      
      You need not submit your solution to this exercise.
    2. Variant of Levenshtein. In order to make the exercise a bit more interesting, you will now implement a variant of Levenshtein distance. In this variant, insertion and deletion of a vowel only costs 0.5. Insertion and deletion of all other characters costs 1. Substitution has a cost of 1 as well. Note that it is very easy to make a mistake in this case, therefore test your program carefully! You *do* need to submit the solution to this exercise. Deadline as usual.
  3. Twitter Search with spell correction. Building on programming exercise 4 of week 1, adapt the Twitter query system as follows. If there are no hits for the given query term, adapt the query so that it uses a query term that is very close to the original query term (e.g., a query term which does have hits and which is closest in terms of Levenshtein distance). NB: use the standard version of Levenshtein distance for this exercise. Some suggestions:
    • Compare the query term with all terms in your database, and choose the term with the lowest Levenshtein distance. This is the baseline method. It is inefficient, but should provide decent results. A perfect implementation of this method will get a grade of 6.
    • If there are multiple terms with the same minimal Levenshtein distance, do an 'OR' query for each of these terms. +0.5
    • If there are multiple terms with the same minimal Levenshtein distance, choose the term with the highest number of hits. +0.5
    • Construct a trigram index (in pre-processing). For all trigrams in the given term, find alternative terms in the trigram index. Compute Levenshtein distance only for those terms. +2
    • Determine the "Jaccard score" of each term, and then compute Levenshtein distance only for terms with maximal Jaccard score. +2
    • Variant: if a variant of the query term is much more frequent than the given query term, also provide hits for the frequent variant. Perhaps add a special notation for the user to enable or disable this functionality. +3
    • Other strategies are allowed, if you motivate them.
    Please submit all files that I need to run your program (I do not keep copies of your files of previous weeks). Add a file README which contains a very short description of the technique you applied. Note that you have *two weeks* to finish this exercise.
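For the standard Levenshtein distance (the version you need not submit), the dynamic-programming recurrence of Figure 3.5 can be sketched as below; the vowel variant then only requires changing the insertion and deletion costs. Names here are illustrative, not prescribed:

```python
def levenshtein(a, b):
    """Standard Levenshtein distance by dynamic programming
    (cf. Figure 3.5). Insertion, deletion and substitution cost 1."""
    m, n = len(a), len(b)
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                 # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

print(levenshtein('de', 'die'))  # 1
```

For the graded variant, the `+ 1` on the deletion and insertion branches would become 0.5 when the deleted or inserted character is a vowel; test that change carefully, as the exercise warns.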

Week 4

  1. Exercises from the book:
    1. 6.9
    2. 6.10
    3. 6.11
    4. 6.15
    5. 6.17
  2. Programming exercise: take the Twitter-search-engine from the last few weeks as a starting point. Implement a variant which takes two search terms and then returns all tweets which match at least one of the search terms. Add the tf-idf score of each document. Order the returned documents by tf-idf score.
    • creation of data-structure: you need to maintain, for each term, in how many documents that term occurs. It is ok to use the length of the posting list for this purpose.
    • you also need to maintain the overall number of tweets
    • you need to maintain how often a word occurs in a tweet
  3. Question: can you come up with a query, such that the highest scoring document only contains one of the search terms?
  4. Question: come up with a query for which the best document obtains the highest possible tf-idf score in this dataset. 1 bonus point if your query is the highest scoring of all students. Bonus points are to be divided if there is a tie.
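The three quantities listed under exercise 2 (document frequency via posting-list length, total number of tweets, and per-tweet term counts) are exactly what one common tf-idf weighting from Chapter 6 needs: log-scaled term frequency times inverse document frequency. A hedged sketch, assuming that particular scheme (the book discusses several variants, and the names below are my own):

```python
import math

def tfidf(tf, df, n_docs):
    """tf: occurrences of the term in the document,
    df: number of documents containing the term,
    n_docs: total number of documents."""
    if tf == 0 or df == 0:
        return 0.0
    wtf = 1 + math.log10(tf)       # log-scaled term frequency
    idf = math.log10(n_docs / df)  # inverse document frequency
    return wtf * idf

def score(doc_tfs, dfs, n_docs):
    """Score a document for a query: sum the tf-idf weights
    of the query terms (doc_tfs and dfs are aligned per term)."""
    return sum(tfidf(tf, df, n_docs) for tf, df in zip(doc_tfs, dfs))
```

Note that a term occurring in every document gets idf = 0 and contributes nothing, which is one way to start thinking about question 3 above.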