Search Engines LIX019B05

This course focuses on methods and techniques used in information systems for unstructured text, such as search engines. Text preparation and indexing, Boolean and vector models for search systems, user feedback, interfaces, the PageRank algorithm, and the evaluation of search systems will be discussed. To support the theory, we experiment with the various techniques using implementations in Python.

Details

Plan

  1. Chapter 1: introduction; Boolean retrieval; posting lists; Python code
  2. Chapter 2: decoding, tokenization and normalization; sub-linear posting list intersection; phrase queries
  3. Chapter 3: dictionaries; wildcard queries; spell correction; (stdin/stdout/stderr: w3.pdf)
  4. Chapter 6: scoring and term weighting; term and document frequency weighting; vector space model
  5. Chapter 8: evaluation
  6. Chapter 21: link analysis: PageRank
  7. Chapter 9: Q&A; summary; evaluation

Exercises

Weekly exercises are to be handed in on Nestor. You must submit your solutions to the weekly exercises by the evening before the next lecture, unless stated otherwise.

For programming exercises, you must submit everything I need in order to run your program. You may assume that I have a Linux workstation, just like LWP, and that I use the same version of Python3 as the current version on LWP. Other versions of Python are not accepted.

For textual exercises, the only accepted formats are ordinary plain-text files or PDF. Other formats such as .doc, .docx, and .rtf are not accepted.

The files you submit as solutions in Nestor should use very simple file names: short, no spaces, no funny characters. There is no need for student-number or name in the file name. Simply use names such as 1a.py, 1b.py, 3a.py.

If you need to submit several files, you must collect these files together in a .zip file. That way, the original file names are preserved.

You may only submit your own work.

Of course, it is OK to help each other. But note that the assignments are individual assignments, not group assignments. Identical solutions will obtain the lowest possible grade.

If you submit solutions to exercises more than once, then I will assume the latest submission is your final submission. I will ignore the other submissions, and I will grade only the final submission.

In some cases, exercises build on top of exercises from previous weeks. It is assumed that any feedback on previous exercises is taken seriously, so that previous mistakes are corrected. If not (i.e., if I see the same error again), its impact on the grade is doubled.

Week 1

Submit the book exercises as a PDF document. As for the Python exercises, only submit the final exercise, together with the required pickle files. Do *not* submit the Twitter data.
  1. Exercises from the book:
    1. 1.1
    2. 1.2
    3. 1.3
    4. 1.4
    5. 1.7
    6. 1.13
  2. Exercises Python:
    1. The following UNIX pipe provides tweets in a tabular format, where each line contains a single tweet. Each line consists of four fields, separated by TAB. The fields are: the identifier of the tweet, the user of the tweet, the text of the tweet, and the tokenized text of the tweet.
      zcat /net/corpora/twitter2/Tweets/2017/08/20170828:10.out.gz | 
            /net/corpora/twitter2/tools/tweet2tab id user text words
      
      Write a Python program which builds a database on the basis of this input. You must make a dictionary in which the identifier of the tweet is used as the key. The value is a tuple (user,text,words). You can save the dictionary by means of the "pickle" module. This module can be used to save a Python data-structure on disk, to be used later. For instance, if your dictionary is called "docs", then you can save it in the file "docs.pickle" as follows:
      import pickle
      
      with open('docs.pickle','wb') as f:
          pickle.dump(docs,f)
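
      A minimal sketch of the reading loop, assuming the four-field TAB format from the pipe above arrives on standard input (the pickle.dump step is then the one shown above):

      import sys

      docs = {}
      for line in sys.stdin:
          # each line: identifier TAB user TAB text TAB words
          identifier, user, text, words = line.rstrip('\n').split('\t')
          docs[identifier] = (user, text, words)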
      
    2. Write a Python program which reads the dictionary back into memory:
      import pickle

      with open('docs.pickle','rb') as f:
          docs = pickle.load(f)
      
      Your program then reads lines from standard input. Each line is supposed to contain the identifier of a tweet. You print the words of the corresponding tweet to standard output. Please make sure that you know what standard input and standard output are.
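
      A minimal sketch of the lookup loop, assuming the docs.pickle file from the previous exercise:

      import sys
      import pickle

      with open('docs.pickle','rb') as f:
          docs = pickle.load(f)

      for line in sys.stdin:
          identifier = line.strip()
          if identifier in docs:
              user, text, words = docs[identifier]
              print(words)   # the tokenized text of the tweet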
    3. Write a Python program - a variant of the first program - which in addition builds a "posting list". This implies that you must build a dictionary where the key is a word, and the value is the "posting list", i.e., the tweet identifiers in which that word occurs. Before you start programming, please consult the next exercises, to come up with an appropriate representation of the posting list of tweet identifiers. Even if it is called a "posting list", a Python list may not be the best choice for it. The posting list dictionary must be saved to disk as another pickle file (and as this is an extension of exercise 1, you must also still save the Twitter index dictionary).
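
      For instance, a Python set of tweet identifiers supports duplicate-free insertion and fast membership tests. A minimal sketch of the construction, assuming the docs dictionary from exercise 1 is already in memory (the same loop can of course run directly over standard input; splitting the tokenized words field on whitespace is an assumption):

      from collections import defaultdict

      postings = defaultdict(set)
      for identifier, (user, text, words) in docs.items():
          for word in words.split():
              # a set avoids duplicate identifiers and supports fast intersection
              postings[word].add(identifier)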
    4. Write a Python program which uses both pickle files from exercise 3. The program reads lines from standard input. Each line is supposed to contain a single word. The program prints all tweets which contain that word.
    5. Extend the Python program as follows. In case standard input contains two words (rather than one), your program should print all tweets in which *both* words occur. Of course, you are expected to take into account the material of the first lecture. Please think of a good algorithm and good data-structures.
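
      As a starting point for the two-word case, a minimal sketch of the sorted-list intersection algorithm from the first lecture (it assumes the two posting lists have been sorted first):

      def intersect(p1, p2):
          # p1, p2: sorted lists of tweet identifiers
          answer = []
          i = j = 0
          while i < len(p1) and j < len(p2):
              if p1[i] == p2[j]:
                  answer.append(p1[i])
                  i += 1
                  j += 1
              elif p1[i] < p2[j]:
                  i += 1
              else:
                  j += 1
          return answer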

Week 2

You hand in (in Nestor) the book exercise, and the final programming exercise (including all required auxiliary files).
  1. Exercises from the book:
    1. 2.1
    2. 2.2
    3. 2.3
    4. 2.4
    5. 2.6
    6. 2.8
    7. 2.9
  2. Programming exercise. Write a Python3 program which reads tweets from standard input (same format as last week). For each tweet, your program prints, for each word, the sorted list of positions at which that word occurs in that tweet. Example:
    $ zcat /net/corpora/twitter2/Tweets/2015/08/20150828:15.out.gz |
           /net/corpora/twitter2/tools/tweet2tab id user text words | head -n2 | python3 ex2.1.py
    het [8]
    En [0]
    bij [4]
    de [5]
    #welweerhandighe [13]
    verstap [1]
    #Jumbo [6]
    been [12]
    schiet [7]
    in [10]
    je [2, 3, 9, 11]
    RT [0]
    ! [16]
    geschonken [14]
    eigen [12]
    worden [15]
    de [11]
    kan [5]
    voor [10]
    @NotarisZelhem [1]
    een [8]
    ton [9]
    woning [13]
    https://t.co/yeWn3MiQPS [17]
    weer [7]
    2017 [4]
    : [2]
    er [6]
    Vanaf [3]
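
    A minimal sketch of the bookkeeping per tweet; enumerate yields positions in increasing order, so the lists are sorted by construction (printing the word followed by its position list matches the example output above):

      import sys
      from collections import defaultdict

      for line in sys.stdin:
          identifier, user, text, words = line.rstrip('\n').split('\t')
          positions = defaultdict(list)
          for position, word in enumerate(words.split()):
              positions[word].append(position)
          for word, poslist in positions.items():
              print(word, poslist)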
    
  3. Programming exercise. Write a Python3 function bigram(l1,l2) which takes two sorted lists of positions as arguments. The function returns "True" if there is a position p in the first list such that the position p+1 is in the second list. In all other cases the function returns "False". Your solution should exploit the fact that both lists are ordered. Make sure you can replicate the following interaction:
    python3
    >>> from bigram import bigram
    >>> bigram([1,6,8],[4,7,10])
    True
    >>> bigram([1,16,18],[4,7,10,20])
    False
    
    Hint: adapt the intersection algorithm for ordered lists presented during the first lecture.
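
    A minimal sketch along those lines, saved as bigram.py so that the import above works; one pointer walks through each list:

      def bigram(l1, l2):
          i = j = 0
          while i < len(l1) and j < len(l2):
              if l1[i] + 1 == l2[j]:
                  return True          # found p in l1 with p+1 in l2
              elif l1[i] + 1 < l2[j]:
                  i += 1               # the l1 position is too far behind
              else:
                  j += 1               # the l2 position is too far behind
          return False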
  4. Programming exercise. We use the same Twitter data as last week. Write a Python3 program which builds a database where you record for each term in which tweets the term occurs, and in addition record the ordered position list of the term in each tweet. The database is saved with the pickle module. Hint: take into account the next exercise to come up with a good data-structure. Hint: in Python, elements of a set must be immutable. This implies that certain otherwise reasonable choices must be avoided.
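
    One structure that satisfies both hints maps each term to an inner dictionary from tweet identifier to a tuple of positions; tuples are immutable, unlike lists, so they can also be members of sets. A minimal sketch, with illustrative names:

      import sys
      from collections import defaultdict

      index = defaultdict(dict)
      for line in sys.stdin:
          identifier, user, text, words = line.rstrip('\n').split('\t')
          for position, word in enumerate(words.split()):
              # collect the positions of this term in this tweet, in order
              index[word].setdefault(identifier, []).append(position)

      # freeze each position list into an immutable tuple before pickling
      for word in index:
          for identifier in index[word]:
              index[word][identifier] = tuple(index[word][identifier])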
  5. Programming exercise. Based on the previous exercise, build a program which loads the database and then takes queries from standard input. Each query contains two terms (separated by space), q1 and q2. Your program returns all tweets where q1 directly precedes q2.

Week 3

  1. Exercises from the book:
    1. 3.4
    2. 3.8
  2. Programming exercise Levenshtein.
    1. Standard Levenshtein. Implement a Python function which returns the Levenshtein distance (figure 3.5 in the book) for two given words. Add a main function to the program, so that the program reads lines from standard input. Each line is supposed to contain two words, separated by a TAB. The two words, together with their Levenshtein distance, are printed to standard output, again separated by TAB. Insertion, deletion and substitution all have a cost of 1. For instance:
      $ echo "de	die" | python3 lev.py
      de	die	1
      
      You need not submit your solution to this exercise.
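
      A minimal sketch of the dynamic-programming table of figure 3.5, with cost 1 for insertion, deletion and substitution:

      def levenshtein(s, t):
          m, n = len(s), len(t)
          # table[i][j]: distance between s[:i] and t[:j]
          table = [[0] * (n + 1) for _ in range(m + 1)]
          for i in range(m + 1):
              table[i][0] = i
          for j in range(n + 1):
              table[0][j] = j
          for i in range(1, m + 1):
              for j in range(1, n + 1):
                  cost = 0 if s[i-1] == t[j-1] else 1
                  table[i][j] = min(table[i-1][j] + 1,       # deletion
                                    table[i][j-1] + 1,       # insertion
                                    table[i-1][j-1] + cost)  # substitution
          return table[m][n]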
    2. Variant of Levenshtein. In order to make the exercise a bit more interesting, you will now implement a variant of the Levenshtein distance. In this variant, insertion and deletion of a vowel only cost 0.5. Insertion and deletion of all other characters cost 1. Substitution has a cost of 1 as well, except when a vowel is substituted by another vowel, which costs 0.5. For the purpose of this exercise, the vowels simply are the characters a, e, i, o and u. Here are some examples: file lev_vowel.in. The following pipe should produce output that is identical to lev_vowel.out.
      cat lev_vowel.in | python3 week3.py
      
      Note that it is very easy to make a mistake in this case, therefore test your program carefully! You *do* need to submit the solution to this exercise. Deadline as usual.
  3. Twitter Search with spell correction. This exercise builds on programming exercise 4 of week 1. Adapt the Twitter query system as follows. If there are no hits for the given query term, adapt the query so that it uses a query term that is very close to the original query term (e.g., a query term which does have hits and which is closest in terms of Levenshtein distance). NB. Use the standard version of the Levenshtein distance for this exercise (without the alternative score for vowels). Some suggestions:
    • Compare the query term with all terms in your database, and choose the term with the lowest Levenshtein distance. This is the baseline method. It is inefficient, but should provide decent results. A perfect implementation of this method will get a grade of 6.
    • If there are multiple terms with the same minimal Levenshtein distance, do an 'OR' query for each of these terms. +0.5
    • If there are multiple terms with the same minimal Levenshtein distance, choose the term with the largest number of hits. +0.5
    • Construct a trigram index (in pre-processing). For all trigrams in the given term, find alternative terms in the trigram index. Compute the Levenshtein distance only for those terms (see the sketch below). +2
    • Determine the "Jaccard score" of each term, and then compute Levenshtein distance only for terms with maximal Jaccard score. +2
    • Variant: if a variant of the query term is much more frequent than the given query term, also provide hits for the frequent variant. Perhaps add a special notation for the user to enable or disable this functionality. +3
    • Other strategies are allowed, if you motivate them.
    Please submit all files that I need to run your program (I do not keep copies of your files of previous weeks). Add a file README which contains a very short description of the technique you applied. Note that you have *two weeks* to finish this exercise.
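
    For the trigram-index suggestion, a minimal sketch of the pre-processing and the candidate look-up; the '$' padding marks word boundaries as in the book, and the variable postings stands for the term dictionary built in week 1:

      from collections import defaultdict

      def trigrams(term):
          padded = '$' + term + '$'
          return {padded[i:i+3] for i in range(len(padded) - 2)}

      # pre-processing: map every trigram to the set of terms containing it
      trigram_index = defaultdict(set)
      for term in postings:
          for tri in trigrams(term):
              trigram_index[tri].add(term)

      # query time: only terms sharing a trigram are Levenshtein candidates
      def candidates(term):
          result = set()
          for tri in trigrams(term):
              result |= trigram_index.get(tri, set())
          return result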

Week 4

  1. Exercises from the book:
    1. 6.9
    2. 6.10
    3. 6.11
    4. 6.15
    5. 6.17
  2. Programming exercise: take the Twitter search engine from the last few weeks as a starting point. Implement a variant which takes two search terms and then returns all tweets which match at least one of the search terms. Add the tf-idf score of each document, and order the returned documents by tf-idf score (a scoring sketch follows the list below).
    • creation of data-structure: you need to maintain, for each term, in how many documents that term occurs. It is ok to use the length of the posting list for this purpose.
    • you also need to maintain the overall number of tweets
    • you need to maintain how often a word occurs in a tweet
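
    A minimal sketch of the scoring step, using the tf-idf weight tf × log10(N/df) from chapter 6. The names below refer to the data-structures of the bullets above and are illustrative: tf_counts maps a term to a dictionary from tweet identifier to its within-tweet count, postings is the posting-list dictionary, and N is the total number of tweets.

      import math

      def tfidf(term, tweet_id):
          tf = tf_counts[term][tweet_id]   # occurrences of term in this tweet
          df = len(postings[term])         # number of tweets containing term
          return tf * math.log10(N / df)

      def score(query_terms, tweet_id):
          # a tweet matching only one of the terms simply gets that term's weight
          return sum(tfidf(term, tweet_id) for term in query_terms
                     if tweet_id in postings.get(term, ()))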
  3. Question: can you come up with a query such that the highest-scoring document contains only one of the search terms?
  4. Question: come up with a query for which the best document obtains the highest possible tf-idf score in this dataset.

Week 5

  1. Book:
    1. 8.1
    2. 8.2
    3. 8.4
    4. 8.8
    5. 8.10
  2. Programming exercise. Define a Python function which computes the value of kappa for two given lists of judgements. Ensure your program can accept any judgement values (not just 0 or 1, but also a choice from, say, green, blue or red). Your program should read a file with judgements, and write out the kappa value. The format of the file is the same as in the book, an example is provided here. My implementation prints a kappa value of 0.5926, where P(A) equals 0.81 and P(E) is 0.5336. Another example input file is here. I get a kappa value of 0.4483 for this input. NB1. The numbers above were changed on Sunday, October 8, 19:06. NB2. Please use the definition of Kappa from the sheets. That definition is slightly different from the version in the book, but correct.
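
    As a sanity check, with kappa = (P(A) - P(E)) / (1 - P(E)) and the numbers quoted above: (0.81 - 0.5336) / (1 - 0.5336) ≈ 0.5926, which matches the first example. A minimal sketch of that final step; computing P(E) must follow the definition from the sheets, which is not reproduced here:

      def kappa(p_a, p_e):
          # kappa = (P(A) - P(E)) / (1 - P(E))
          return (p_a - p_e) / (1 - p_e)

      def p_agreement(judgements1, judgements2):
          # P(A): the fraction of items on which the two judges agree
          agree = sum(a == b for a, b in zip(judgements1, judgements2))
          return agree / len(judgements1)

      print(round(kappa(0.81, 0.5336), 4))   # prints 0.5926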

Week 6

  1. on paper
    1. 21.5
    2. 21.6
    3. Consider figure 21.3: the sequence of vectors of probability. Repeat this example, but now start with the initial vector x_0 = [0.5, 0.3, 0.2]
    4. Similar, but with a vector where all probabilities are equal: [1/3, 1/3, 1/3]
  2. Write a Python program which computes the PageRank vector for a given probability matrix P. The program computes the vectors x_1, x_2, x_3, ... until no further changes happen. The program prints all intermediate vectors. Your program should assume that the input probability matrix already accounts for the probability of teleports. Your program reads the probability matrix from standard input. The matrix is provided in a simple text format where each row is a line, which consists of the values separated by TAB. An example input that your program should be able to work with is given as pagerank-input.txt. For this particular input, the resulting PageRank vector should be close to:
    0.3635 0.1141 0.1949 0.2028 0.1247
    
    Tip: ensure (for the examples that you test your program with) that the probabilities in each row sum to 1. Check what happens if they do not:
    0.33	0.33	0.33
    0.33	0.33	0.33
    0.33	0.33	0.33
    
    In such cases, it is better to adapt the probabilities, e.g.:
    0.34	0.33	0.33
    0.33	0.34	0.33
    0.33	0.33	0.34
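
    A minimal sketch of the iteration itself; reading the TAB-separated matrix from standard input is left out, the start vector is assumed to be uniform, and "no further changes" is approximated with a small tolerance:

      def pagerank(P, eps=1e-10):
          n = len(P)
          x = [1.0 / n] * n      # start from the uniform probability vector
          while True:
              # x_{k+1} = x_k P : multiply the row vector x by the matrix P
              nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
              print(nxt)         # show every intermediate vector
              if max(abs(a - b) for a, b in zip(x, nxt)) < eps:
                  return nxt
              x = nxt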
    
  3. Hacking PageRank:
    0.25	0.25	0.25	0.25
    0.25	0.25	0.25	0.25
    0.25	0.25	0.25	0.25
    0.25	0.25	0.25	0.25
    
    Not much is happening in this case. All sites are equally "good". Suppose that we want to improve the PageRank value of site number four. We build four additional "fake" sites. Each fake site links only to site number four. Site number four links to each of the fake sites. What is the resulting probability matrix in this case (without teleports)? Does this adaptation lead to an increase in the PageRank of the original site?

Bonus exercise

Some students have asked to be allowed to re-submit some of the earlier exercises. This is not possible. Instead, you can submit a solution to the following bonus exercise for an improved grade for the exercises part. The grade for the bonus exercise will be added to the total for all practical exercises. The final sum is still divided by 7 (the number of obligatory exercises). For example, if you have 35 points after 7 exercises, and you get the maximum grade for the bonus exercise, your final grade for the exercises part is 45/7. Bonus exercises are graded in a different way than obligatory exercises. The grades range only from 6 to 10. If your solution is insufficient, no grade is assigned at all.

Apply the PageRank algorithm to a data-set of head-dependent frequency pairs. The frequency data can be found in the directory /net/shared/vannoord/pairs. There are three files there: pairs5, pairs4 and pairs3. The first one is the smallest. If your solution works for pairs5, you may want to consider the larger data set from file pairs4. After that, pairs3 could be used... The files can also be downloaded as a zip archive as pairs.zip. The files contain lines of three fields, separated by TAB. The fields are HEADword DEPword FREQ. This indicates that the head-word ("drink") selects the DEPword ("beer") FREQ times in a very large corpus. The frequencies should be used as the initial weights of the graph. After applying the PageRank algorithm, the program should display the words ordered by the resulting PageRank value.

For this exercise, you submit your code as well as a short PDF report indicating the difficulties you solved and the results you got.
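
A minimal sketch of how the pairs file could be read in, turning the raw frequencies into row-normalized transition probabilities; the normalization choice is an assumption, not a prescribed method:

    from collections import defaultdict

    # weights[head][dep]: corpus frequency of the edge head -> dep
    weights = defaultdict(dict)
    with open('pairs5') as f:
        for line in f:
            head, dep, freq = line.rstrip('\n').split('\t')
            weights[head][dep] = weights[head].get(dep, 0) + int(freq)

    # normalize each row so the outgoing weights sum to 1
    probs = {}
    for head, deps in weights.items():
        total = sum(deps.values())
        probs[head] = {dep: freq / total for dep, freq in deps.items()}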

Learning outcomes

At the end of the course, students will be able to critically understand and compare search engines and their key components. They will be able to experiment with important algorithms and compute with relevant formulas, using real data and concrete but small-scale programs written in Python.

Key algorithms and formulas that students will be able to understand, implement, and use with real data are: