Search Engines LIX019B05

This course focuses on methods and techniques used in information systems for unstructured text, such as search engines. Text preparation and indexing, boolean and vector models for search systems, user feedback, interfaces, the PageRank algorithm, and evaluation of search systems will be discussed. To support the theory, we experiment with the various techniques using implementations in Python.

Details

Plan

  1. Chapter 1: introduction; boolean retrieval; posting lists; intersection in Python; Python code
  2. Chapter 2: decoding, tokenization and normalization; sub-linear posting list intersection; phrase queries
  3. Chapter 3: dictionaries; wildcard queries; spell correction; (stdin/stdout/stderr: w3.pdf)
  4. Chapter 6: scoring and term weighting; term and document frequency weighting; vector space model
  5. Chapter 8: evaluation
  6. Chapter 21: link analysis: PageRank
  7. Chapter 9: Q&A; summary; evaluation

Exercises

Weekly exercises are to be handed in on Nestor. Unless stated otherwise, you must submit your solutions to the weekly exercises by the evening before the next lecture.

For programming exercises, you must submit everything I need in order to run your program. You may assume that I have a Linux workstation, just like LWP, and that I use the same version of Python3 as the current version on LWP. Other versions of Python are not accepted.

For textual exercises, the only accepted formats are ordinary plain text files and PDF. Other formats such as .doc, .docx, and .rtf are not accepted.

The files you submit as solutions in Nestor should use very simple file names: short, no spaces, no funny characters. There is no need for student-number or name in the file name. Simply use names such as 1a.py, 1b.py, 3a.py.

If you need to submit several files, you must collect these files together in a .zip file. That way, the original file names are preserved.

You may only submit your own work.

Of course, it is OK to help each other. But note that the assignments are individual assignments, not group assignments. Identical solutions will obtain the lowest possible grade.

If you submit solutions to exercises more than once, then I will assume the latest submission is your final submission. I will ignore the other submissions, and I will grade only the final submission.

In some cases, exercises build on top of exercises from previous weeks. It is assumed that any feedback on previous exercises is taken seriously, so that previous mistakes are corrected. If not (i.e., if I see the same error again), the impact on your grade is doubled.

Week 1

Submit the book exercises as a PDF document. As for the Python exercises, only submit the final exercise, with the required pickle files. Do *not* submit the Twitter data.
  1. Exercises from the book:
    1. 1.1
    2. 1.2
    3. 1.3
    4. 1.4
    5. 1.7
    6. 1.13
  2. Exercises Python:
    1. The following UNIX pipe provides tweets in a tabular format, where each line contains a single tweet. Each line consists of four fields, separated by TAB. The fields represent: the identifier of the tweet, the user of the tweet, the text of the tweet, and the tokenized text of the tweet.
      zcat /net/corpora/twitter2/Tweets/2016/08/20160828:10.out.gz | 
            /net/corpora/twitter2/tools/tweet2tab id user text words
      
      Write a Python program which builds a database on the basis of this input. You must make a dictionary in which the identifier of the tweet is used as the key. The value is a tuple (user,text,words). You can save the dictionary by means of the "pickle" module. This module can be used to save a Python data structure on disk, to be used later. For instance, if your dictionary is called "docs", then you can save it in the file "docs.pickle" as follows:
      import pickle
      
      with open('docs.pickle','wb') as f:
          pickle.dump(docs,f)
      
    2. Write a Python program which reads the dictionary back into memory:
      import pickle
      
      with open('docs.pickle','rb') as f:
          docs = pickle.load(f)
      
      Your program then reads lines from standard input. Each line is supposed to contain an identifier of a tweet. You print the words of the corresponding tweet to standard output. Please make sure that you know what standard input and standard output are.
    3. Write a Python program - a variant of the first program - which in addition builds a "posting list". This implies that you must build a dictionary where the key is a word, and the value is the list/set/... of tweet identifiers in which that word occurs. Before you start programming, please consult the next exercises, so that you come up with an appropriate representation of the list/set/... of tweet identifiers. The posting list dictionary must be saved to disk as another pickle file.
    4. Write a Python program which takes the pickle files from exercises 1 and 3. The program reads lines from standard input. Each line is supposed to contain a word. The program prints all tweets which contain that word.
    5. Extend the Python program as follows. In case standard input contains two words (rather than one), your program should print all tweets in which *both* words occur. Of course, you are expected to take into account the material of the first lecture. Please think of a good algorithm and good data structures; a sketch of the core idea is given after this list.
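
The central technique for the last exercise is the linear-time intersection of two sorted posting lists, as discussed in the first lecture. Below is a minimal sketch, under the assumption that the posting lists are kept as sorted lists of tweet identifiers; the exact representation is up to you.

    def intersect(p1, p2):
        # Walk both sorted lists in parallel: advance the pointer at the
        # smaller value, and record a hit whenever the values match.
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

This runs in time linear in the combined length of the two lists, rather than the quadratic time of a naive nested loop; that is the reason for keeping the posting lists sorted.
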
My solutions are here: files

Week 2

You hand in (in Nestor) the book exercise, and the final programming exercise (including all required auxiliary files).
  1. Exercises from the book:
    1. 2.1
    2. 2.2
    3. 2.3
    4. 2.4
    5. 2.6
    6. 2.8
    7. 2.9
  2. Programming exercise. Write a Python3 program which reads tweets from standard input (same format as last week). For each tweet, your program prints, for each word, the sorted list of positions at which that word occurs in that tweet. Example:
    $ zcat /net/corpora/twitter2/Tweets/2015/08/20150828:15.out.gz |
           /net/corpora/twitter2/tools/tweet2tab id user text words | head -n2 | python3 ex2.1.py
    het [8]
    En [0]
    bij [4]
    de [5]
    #welweerhandighe [13]
    verstap [1]
    #Jumbo [6]
    been [12]
    schiet [7]
    in [10]
    je [2, 3, 9, 11]
    RT [0]
    ! [16]
    geschonken [14]
    eigen [12]
    worden [15]
    de [11]
    kan [5]
    voor [10]
    @NotarisZelhem [1]
    een [8]
    ton [9]
    woning [13]
    https://t.co/yeWn3MiQPS [17]
    weer [7]
    2017 [4]
    : [2]
    er [6]
    Vanaf [3]
    
  3. Programming exercise. Write a Python3 function bigram(l1,l2) which takes two sorted lists of positions as arguments. The function returns "True" if there is a position p in the first list such that there is a position p+1 in the second list. In all other cases the function returns "False". Your solution should exploit the fact that both lists are ordered; a sketch is given after this list. Make sure you can replicate the following interaction:
    python3
    >>> from bigram import bigram
    >>> bigram([1,6,8],[4,7,10])
    True
    >>> bigram([1,16,18],[4,7,10,20])
    False
    
    Hint: it may be helpful to use explicit iterators. Read up on the documentation for the built-in functions iter() and next().
  4. Programming exercise. We use the same Twitter data as last week. Write a Python3 program which builds a database recording, for each term, the tweets in which that term occurs, and in addition the ordered position list of that term in each of those tweets. The database is saved with the pickle module.
  5. Programming exercise. Based on the previous exercise, build a program which loads the database and then takes queries from standard input. Each query contains two terms (separated by space or tab), q1 and q2. Your program returns all tweets where q1 directly precedes q2.
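
For exercise 3, here is a minimal sketch using explicit iterators, stored in a file bigram.py so that the interaction shown above can be replicated. It is one possible approach, not the only acceptable one.

    # bigram.py
    def bigram(l1, l2):
        # Both lists are sorted, so each list is traversed at most once:
        # we look for a position p in l1 such that p+1 occurs in l2.
        it1, it2 = iter(l1), iter(l2)
        try:
            p = next(it1)
            q = next(it2)
            while True:
                if q == p + 1:
                    return True
                elif q < p + 1:
                    q = next(it2)   # q too small: advance the second list
                else:
                    p = next(it1)   # q too large: advance the first list
        except StopIteration:
            return False            # one of the lists is exhausted
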
You will find my solutions here: files

Week 3

  1. Exercises from the book:
    1. 3.4
    2. 3.8
  2. Programming exercise Levenshtein.
    1. Standard Levenshtein. Implement a Python function which returns the Levenshtein distance (figure 3.5 in the book) for two given words; a sketch is given after this list. Add a main function to the program, so that the program reads lines from standard input. Each line is supposed to contain two words. Those two words, with the Levenshtein distance, are printed to standard output. Insertion, deletion and substitution all have a cost of 1. For instance:
      $ echo "de	die" | python3 lev.py
      de	die	1
      
      You need not submit your solution to this exercise.
    2. Variant of Levenshtein. In order to make the exercise a bit more interesting, you will now implement a variant of Levenshtein distance. In this variant, insertion and deletion of a vowel only costs 0.5. Insertion and deletion of all other characters costs 1. Substitution has a cost of 1 as well. Note that it is very easy to make a mistake in this case, so test your program carefully! You *do* need to submit the solution to this exercise. Deadline as usual.
  3. Twitter Search with spell correction. Building on programming exercise 4 of week 1, adapt the Twitter query system as follows. If there are no hits for the given query term, adapt the query so that it uses a query term that is very close to the original query term (e.g., a query term which does have hits and which is closest in terms of Levenshtein distance). NB. Use the standard version of Levenshtein distance for this exercise. Some suggestions:
    • Compare the query term with all terms in your database, and choose the term with the lowest Levenshtein distance. This is the baseline method. It is inefficient, but should provide decent results. A perfect implementation of this method will get a grade of 6.
    • If there are multiple terms with the same minimal Levenshtein distance, do an 'OR' query for each of these terms. +0.5
    • If there are multiple terms with the same minimal Levenshtein distance, choose the term with the highest number of hits. +0.5
    • Construct a trigram index (in pre-processing). For all trigrams in the given term, find alternative terms in the trigram index. Compute Levenshtein distance only for those terms. +2
    • Determine the "Jaccard score" of each term (e.g., the Jaccard coefficient between the trigram sets of the query term and the candidate term), and then compute Levenshtein distance only for terms with maximal Jaccard score. +2
    • Variant: if a variant of the query term is much more frequent than the given query term, also provide hits for the frequent variant. Perhaps add a special notation for the user to enable or disable this functionality. +3
    • Other strategies are allowed, if you motivate them.
    Please submit all files that I need to run your program (I do not keep copies of your files of previous weeks). Add a file README which contains a very short description of the technique you applied. Note that you have *two weeks* to finish this exercise.
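
Since the solution to the standard version need not be submitted, here is a minimal sketch of that version, stored as lev.py to match the example above; the vowel variant and the integration with the search program are left to you.

    # lev.py
    import sys

    def levenshtein(w1, w2):
        # m[i][j] = distance between the first i characters of w1
        # and the first j characters of w2 (cf. figure 3.5 in the book)
        m = [[0] * (len(w2) + 1) for _ in range(len(w1) + 1)]
        for i in range(len(w1) + 1):
            m[i][0] = i                                # i deletions
        for j in range(len(w2) + 1):
            m[0][j] = j                                # j insertions
        for i in range(1, len(w1) + 1):
            for j in range(1, len(w2) + 1):
                cost = 0 if w1[i-1] == w2[j-1] else 1  # substitution cost
                m[i][j] = min(m[i-1][j] + 1,           # deletion
                              m[i][j-1] + 1,           # insertion
                              m[i-1][j-1] + cost)      # substitution/match
        return m[len(w1)][len(w2)]

    if __name__ == '__main__':
        for line in sys.stdin:
            w1, w2 = line.split()
            print(w1, w2, levenshtein(w1, w2), sep='\t')

For the variant, only the costs change: the insertion and deletion cost of 1 becomes 0.5 whenever the character involved is a vowel.
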
You will find my solution here: files

Week 4

  1. Exercises from the book:
    1. 6.9
    2. 6.10
    3. 6.11
    4. 6.15
    5. 6.17
  2. Programming exercise: take the Twitter search engine from the last few weeks as a starting point. Implement a variant which takes two search terms and then returns all tweets which match at least one of the search terms. Add the tf-idf score of each document, and order the returned documents by tf-idf score; a sketch of the scoring step is given after this list.
    • creation of the data structure: you need to maintain, for each term, in how many documents that term occurs. It is ok to use the length of the posting list for this purpose.
    • you also need to maintain the overall number of tweets
    • you need to maintain how often a word occurs in a tweet
  3. Question: can you come up with a query such that the highest-scoring document contains only one of the search terms?
  4. Question: come up with a query for which the best document obtains the highest possible tf-idf score in this dataset. 1 bonus point if your query is the highest scoring of all students. Bonus points are to be divided if there is a tie.
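
A minimal sketch of the scoring step for the programming exercise, assuming a positional database as in week 2 (term -> {tweet identifier -> position list}); the function name and the plain tf times idf weighting shown here are illustrative, and other weighting variants from Chapter 6 are equally defensible.

    import math

    def tfidf_search(q1, q2, postings, n_docs):
        # Score every tweet that matches q1 OR q2 by summed tf-idf.
        # postings: term -> {tweet_id: [positions]}; n_docs: total number of tweets.
        scores = {}
        for term in (q1, q2):
            docs = postings.get(term, {})
            if not docs:
                continue
            idf = math.log10(n_docs / len(docs))   # df = length of the posting list
            for doc_id, positions in docs.items():
                tf = len(positions)                # occurrences of term in this tweet
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
        # best-scoring tweets first
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

With this formula in hand, question 3 becomes easier to think about: a term that occurs in (nearly) every tweet has an idf of (nearly) zero, and so contributes (almost) nothing to the score.
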

Week 5

  1. Book:
    1. 8.1
    2. 8.2
    3. 8.4
    4. 8.8
    5. 8.10
  2. Programming exercise. Define a Python function which computes the value of kappa for two given lists of judgements; a sketch is given below. Ensure your program can accept any judgement values (not just 0 or 1). Your program should read a file with judgements, and write out the kappa value. The format of the file is the same as in the book; an example is provided here. My implementation prints a kappa value of 0.5796, where P(A) equals 0.81 and P(E) is 0.54805. Another example input file is here. I get a kappa value of 0.43 for this input.
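
A sketch of the kappa computation itself, following the book: P(A) is the observed proportion of agreement, P(E) is the agreement expected by chance on the basis of the pooled marginals, and kappa = (P(A) - P(E)) / (1 - P(E)). Reading the judgement file is omitted here; the two lists are assumed to be equally long.

    from collections import Counter

    def kappa(j1, j2):
        # Kappa for two equally long lists of judgements;
        # works for arbitrary judgement values, not just 0 and 1.
        n = len(j1)
        # P(A): proportion of items on which the two judges agree
        p_a = sum(a == b for a, b in zip(j1, j2)) / n
        # P(E): chance agreement, from the pooled marginal distribution
        pooled = Counter(j1) + Counter(j2)
        p_e = sum((c / (2 * n)) ** 2 for c in pooled.values())
        return (p_a - p_e) / (1 - p_e)

As a check: with P(A) = 0.81 and P(E) = 0.54805, this formula indeed yields (0.81 - 0.54805) / (1 - 0.54805) ≈ 0.5796.
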

Week 6

  1. On paper:
    1. 21.5
    2. 21.6
    3. Consider figure 21.3: the sequence of vectors of probability. Repeat this example, but now start with the initial vector x_0 = [0.5, 0.3, 0.2]
    4. Similar, but with a vector where all probabilities are equal: [1/3, 1/3, 1/3]
  2. Write a Python program which computes the PageRank vector for a given probability matrix P. The program computes the vectors x_1, x_2, x_3, ... until no further changes happen, and prints all intermediate vectors; a sketch is given after this list. Your program should assume that the input probability matrix already incorporates the probability of teleports. Your program reads the probability matrix from standard input. The matrix is provided in a simple text format where each row is a line, which consists of the values separated by TAB. An example input that your program should be able to work with is given as pagerank-input.txt. For this particular input, the resulting PageRank vector should be close to:
    0.3635 0.1141 0.1949 0.2028 0.1247
    
    Tip: ensure (for the examples that you test your program with) that the probabilities in each row sum to 1. Check what happens if they do not:
    0.33	0.33	0.33
    0.33	0.33	0.33
    0.33	0.33	0.33
    
    In such cases, it is better to adapt the probabilities, e.g.,:
    0.34	0.33	0.33
    0.33	0.34	0.33
    0.33	0.33	0.34
    
  3. Another example:
    0.25	0.25	0.25	0.25
    0.25	0.25	0.25	0.25
    0.25	0.25	0.25	0.25
    0.25	0.25	0.25	0.25
    
    Not much is happening in this case. All sites are equally "good". Suppose that we want to improve the PageRank value of site number four. We build four additional "fake" sites. Each fake site links only to site number four. Site number four links to each of the fake sites. What is the resulting probability matrix in this case (without teleports)? Does this adaptation lead to an increase in the PageRank of the original site?
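
A minimal sketch of the power iteration for exercise 2; the uniform start vector and the convergence threshold of 1e-6 are my own choices, and any reasonable alternatives are fine.

    import sys

    def step(x, p):
        # one iteration of the power method: x_{k+1} = x_k P
        return [sum(x[i] * p[i][j] for i in range(len(x)))
                for j in range(len(p))]

    # read the probability matrix: one row per line, values separated by TAB
    p = [[float(v) for v in line.split('\t')]
         for line in sys.stdin if line.strip()]

    x = [1.0 / len(p)] * len(p)   # start from the uniform distribution
    while True:
        nxt = step(x, p)
        print('\t'.join('%.4f' % v for v in nxt))
        if all(abs(a - b) < 1e-6 for a, b in zip(x, nxt)):
            break                 # converged: no further changes
        x = nxt
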

Bonus

  1. Bonus assignment. Some students have asked to be allowed to re-submit some of the earlier exercises. This is not possible. Instead, you can submit a solution to the following bonus exercise for an improved grade for the exercises part.

    In tweets, you can refer to other users using a "mention": @NAME. All mentions taken together can be seen to define a graph of user names which point to each other. Each node in the graph is a user name. Each edge in the graph represents a mention. We can apply the PageRank algorithm to this graph in order to get an idea which users are the most "important". For this bonus exercise, you are asked to extract the mentions from our tweet collection. On the basis of these mentions, you create a probability matrix (do use some teleport rate), and then apply the PageRank algorithm; a sketch of the graph construction is given below. The more data your solution can use, the better. Remember that you can extract data using the following pipe. For other years and months, you can adapt the file names:

    zcat /net/corpora/twitter2/Tweets/2015/08/*.out.gz |
          /net/corpora/twitter2/tools/tweet2tab user words |
         grep '@'
    

    For this exercise, you do not submit code, but a written report (pdf) of about two pages in which you briefly explain how you did it, how much data you used, and what results you got. In promising cases, I will make an appointment for a demonstration of your program. For this bonus exercise, you can get (at most) two full points added to the average grade of your exercises so far. There is no point in submitting bonus exercises that are not fully functioning.
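
As a starting point, here is a sketch of how the mention graph and the probability matrix could be built from the pipe above. The teleport rate of 0.15 and the dense matrix representation are illustrative assumptions; for larger amounts of data, a sparse representation will be needed.

    import sys
    from collections import defaultdict

    links = defaultdict(set)                  # user -> set of mentioned users
    for line in sys.stdin:                    # input lines: user TAB words
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 2:
            continue                          # skip malformed lines
        user, words = fields
        for token in words.split():
            if token.startswith('@') and len(token) > 1:
                links[user].add(token[1:])    # drop the '@'

    # every name that occurs as source or target becomes a node
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    alpha = 0.15                              # teleport rate (assumed)

    # row-stochastic matrix: teleport with probability alpha,
    # otherwise follow one of the outgoing mention links
    matrix = [[alpha / n] * n for _ in range(n)]
    for i, name in enumerate(nodes):
        targets = links.get(name, set())
        if targets:
            for t in targets:
                matrix[i][index[t]] += (1 - alpha) / len(targets)
        else:
            for j in range(n):                # dangling node: teleport uniformly
                matrix[i][j] += (1 - alpha) / n

The resulting matrix can then be handed to the PageRank program of week 6.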