InformationRetrieval

Contents

Set-up
1. On volker and similar machines
2. On LWP machines
Examples
Reference
Query gotchas
Other sites

Set-up

On volker and similar machines

PATH=$PATH:/net/aps/64/bin     # put this in your .bashrc

On LWP machines

PATH=$PATH:/net/aps/64/bin     # put this in your .bashrc or similar
easy_install3 --user pymongo   # do this once

If that does not work, try this instead:

PATH=$PATH:/net/aps/64/bin                                  # put this in your .bashrc or similar
export PYTHONPATH=$HOME/.local/lib/python3.1/site-packages  # put this in your .bashrc or similar
mkdir -p $PYTHONPATH                                        # do this once
easy_install3 --prefix=$HOME/.local pymongo                 # do this once

Examples

Important: In all these examples you have to change the port number to one that nobody else is using.
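
One way to check whether a port is already taken is to try connecting to it. The following is a minimal sketch using only the standard library; the helper name port_in_use is made up for this page, and it only tells you whether something is listening on the machine you run it on.

```python
import socket


def port_in_use(port, host="localhost"):
    """Return True if something accepts TCP connections on host:port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return s.connect_ex((host, port)) == 0
    finally:
        s.close()


if __name__ == "__main__":
    port = 9123  # the example port used in the scripts on this page
    print(port, "is in use" if port_in_use(port) else "looks free")
```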

Download

Download all examples

Starting the server

Important: Don't forget to stop the server when you're done. Otherwise you won't be able to start the server the next time you work on a different machine.

sh server_start.sh

Stopping the server

sh server_stop.sh

Building the database

Instead of termIDs, we always use the terms themselves. This simplifies things, but is inefficient for large, real applications.

The textual content of the documents is included in the database. This makes the database easier to experiment with. A real application would store only the location of the original document, plus some metadata such as document title, date, and summary.
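
To make the layout concrete, the sketch below builds the record for one small text without touching the database. It follows the field names used in build.py (name, text, lang, terms, pos, size). The function name make_record and the sample text are made up for illustration, and tokenizing is reduced to whitespace splitting and lowercasing.

```python
def escape(key):
    # "$" and "." are not allowed in keys, so swap in full-width look-alikes.
    return key.replace("$", "\uFF04").replace(".", "\uFF0E")


def make_record(name, text, language=""):
    """Build the dictionary that would be stored for one document."""
    terms = set()   # the terms themselves, not termIDs
    pos = {}        # escaped term -> list of positions
    p = 0
    for token in text.split():
        term = token.lower()
        terms.add(term)
        pos.setdefault(escape(term), []).append(p)
        p += 1
    return {"name": name, "text": text, "lang": language,
            "terms": sorted(terms), "pos": pos, "size": p}


record = make_record("example.txt", "The cat saw the other cat")
print(record["terms"])       # ['cat', 'other', 'saw', 'the']
print(record["pos"]["the"])  # [0, 3]
print(record["pos"]["cat"])  # [1, 5]
print(record["size"])        # 6
```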

Build the main database:

python3 build.py document...

#!/usr/bin/env python3

import sys
# Connection is the PyMongo 2.x client class; later releases replace it with MongoClient.
from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123           # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"


# Get the plain text out of the document, for instance, the body
# of an html document, with all tags removed.
# Also get the language the text is written in, either from metadata
# in the document, or by using a language guesser on the extracted text.
# The language is needed for correct tokenization and stemming.
def gettext(doc):
    text = doc.decode("utf-8", "replace")
    return text, ""

# Split the text into tokens.
# Separate punctuation from words.
# Fixed word groups should be a single token, e.g.: New York
def tokenize(text, language):
    tokens = text.split()
    return tokens

# Transform a token into a list of zero, one, or more terms.
# Transform inflected words into their root form, e.g.: houses -> house
# There can be more than one term, e.g. synonyms, lemma with/without stemming
# Some tokens, like punctuation, don't represent a term.
# IMPORTANT: There must be NO DUPLICATES in the result.
def getterm(token, language):
    terms = [token.lower()]
    return terms

# Escape characters that are not allowed in keys.
def escape(key):
    return key.replace("$", "\uFF04").replace(".", "\uFF0E")


connection = Connection(Host, Port)
db = connection[DBName]

# Start with an empty collection.
# Remove this line if you want to add documents to an existing collection.
db.drop_collection(Collection)

for filename in sys.argv[1:]:

    with open(filename, "rb") as fp:
        doc = fp.read()

    text, language = gettext(doc)

    tokens = tokenize(text, language)

    terms = set()  # Set of terms in current text.
    pos = {}       # Maps each escaped term to its list of positions.
    p = 0          # Term positions start at 0.
    for token in tokens:
        term = getterm(token, language)
        if len(term) > 0:
            for t in term:
                terms.add(t)
                t = escape(t)
                if t not in pos:
                    pos[t] = []
                pos[t].append(p)
            # Tokens that don't represent terms don't increment the position.
            p += 1

    db[Collection].insert({"name": filename,
                           "text": text,
                           "lang": language,
                           "terms": sorted(terms),
                           "pos": pos,
                           "size": p})

# Make the indexes.
# Do this after all documents have been added and before you start querying the database.
db[Collection].create_index([("name", ASCENDING)], unique=True, dropDups=True)
db[Collection].create_index([("terms", ASCENDING)])

connection.close()

build.py
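
The tokenize and getterm functions in build.py are placeholders: the comments describe more than the code does. As a rough sketch of the next step up, the versions below separate punctuation from words and return no term for punctuation. They still ignore the language argument, do no stemming, and do not join fixed word groups such as New York. The regular expression is just one possible choice.

```python
import re

# A token is either a run of word characters or a single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text, language):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(text)


def getterm(token, language):
    """Return the lowercased token as its only term, or [] for punctuation."""
    if not any(ch.isalnum() for ch in token):
        return []
    return [token.lower()]


tokens = tokenize("Hello, world!", "")
print(tokens)                           # ['Hello', ',', 'world', '!']
print([getterm(t, "") for t in tokens]) # [['hello'], [], ['world'], []]
```

Because build.py only increments the position for tokens that yield at least one term, "hello" and "world" end up at positions 0 and 1 with this tokenizer.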

Build an index of trigrams:

python3 build_tri.py

#!/usr/bin/env python3

from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123           # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"

connection = Connection(Host, Port)
db = connection[DBName]

# All distinct terms that occur in any document.
terms = db[Collection].distinct("terms")

# Map each trigram to the set of terms that contain it.
# "$" marks the start and the end of a term.
trigrams = {}
for term in terms:
    trm = "$" + term + "$"
    for i in range(len(trm) - 2):
        tri = trm[i:i+3]
        if tri not in trigrams:
            trigrams[tri] = set()
        trigrams[tri].add(term)

# Rebuild the trigram collection from scratch.
db.drop_collection("tri")
for tri in trigrams:
    db["tri"].insert({"tri": tri, "terms": sorted(trigrams[tri])})
db["tri"].create_index([("tri", ASCENDING)], unique=True)

build_tri.py
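
The trigram extraction in build_tri.py can be tried out on its own. The sketch below isolates that step in a helper (the function name trigrams is made up here) and shows what it produces for a few terms.

```python
def trigrams(term):
    """Return the trigrams of a term, padded with "$" at both ends."""
    padded = "$" + term + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(trigrams("house"))  # ['$ho', 'hou', 'ous', 'use', 'se$']
print(trigrams("a"))      # ['$a$']
print(trigrams(""))       # []
```

The padding means that "$ho" only matches terms that start with "ho", and "se$" only matches terms that end in "se".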

An alternative is to include a set of trigrams directly within each document object and have Mongo build an index on those embedded trigrams. Then you could search for documents by trigrams directly. Which alternative is faster or more flexible or more efficient?
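
One difference between the two alternatives can be seen without a database. With a separate trigram index, the trigrams are intersected per term. With trigrams embedded in each document (say in a field called tri, a hypothetical name), a query such as {"tri": {"$all": [...]}} would match a document even if the trigrams come from different terms. The sketch below emulates both in plain Python with made-up sample terms.

```python
def trigrams(term):
    """Return the set of trigrams of a term, padded with "$"."""
    padded = "$" + term + "$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


doc_terms = ["house", "mouth"]   # terms of one sample document
wanted = {"hou", "uth"}          # trigrams we search for

# Separate index: which single terms contain all wanted trigrams?
terms_with_all = [t for t in doc_terms if wanted <= trigrams(t)]

# Embedded alternative: one trigram set for the whole document.
doc_trigrams = set()
for t in doc_terms:
    doc_trigrams |= trigrams(t)
doc_matches = wanted <= doc_trigrams

print(terms_with_all)  # []
print(doc_matches)     # True
```

Neither "house" nor "mouth" contains both trigrams, yet the document as a whole does. Whether that is acceptable depends on what the trigram search is used for.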

Querying the database

Search for documents that contain any of the specified terms:

python3 query_any.py term...

#!/usr/bin/env python3

import sys
from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123           # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"


# Escape characters that are not allowed in keys.
def escape(key):
    return key.replace("$", "\uFF04").replace(".", "\uFF0E")


connection = Connection(Host, Port)
db = connection[DBName]
col = db[Collection]

for doc in col.find(spec={"terms": {"$in": sys.argv[1:]}},
                    fields=["name", "text", "terms.$", "pos", "size"],
                    sort=[("name", ASCENDING)]):
    sys.stdout.write(doc["name"] + "\n\n")
    sys.stdout.write(doc["text"])
    sys.stdout.write("\nsize: " + str(doc["size"]) + "\n\n")
    sys.stdout.write("found: " + doc["terms"][0] + "\n")  # "terms.$" returns only one term
    for term in sys.argv[1:]:
        t = escape(term)
        if t in doc["pos"]:
            sys.stdout.write(term + ": " + str(doc["pos"][t]) + "\n")
    sys.stdout.write("=" * 72 + "\n")

connection.close()

query_any.py

Search for documents that contain all of the specified terms:

python3 query_all.py term...

#!/usr/bin/env python3

import sys
from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123           # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"


# Escape characters that are not allowed in keys.
def escape(key):
    return key.replace("$", "\uFF04").replace(".", "\uFF0E")


connection = Connection(Host, Port)
db = connection[DBName]
col = db[Collection]

for doc in col.find(spec={"terms": {"$all": sys.argv[1:]}},
                    fields=["name", "text", "pos", "size"],
                    sort=[("name", ASCENDING)]):
    sys.stdout.write(doc["name"] + "\n\n")
    sys.stdout.write(doc["text"])
    sys.stdout.write("\nsize: " + str(doc["size"]) + "\n\n")
    for term in sys.argv[1:]:
        sys.stdout.write(term + ": " + str(doc["pos"][escape(term)]) + "\n")
    sys.stdout.write("=" * 72 + "\n")

connection.close()

query_all.py
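
The query scripts only print the stored positions. One thing the positions make possible is a phrase check on the documents that query_all.py returns. The following is a minimal sketch of that idea, not something the scripts on this page do; the function name phrase_positions and the sample data are made up. It expects a dictionary shaped like the pos field of a stored document.

```python
def phrase_positions(pos, terms):
    """Return start positions where terms occur at consecutive positions.

    pos maps each (escaped) term to its list of positions, like the
    "pos" field of a stored document.
    """
    if not terms or any(t not in pos for t in terms):
        return []
    starts = []
    for start in pos[terms[0]]:
        # The i-th term of the phrase must occur at position start + i.
        if all(start + i in pos[t] for i, t in enumerate(terms)):
            starts.append(start)
    return starts


# Positions for the text "the cat saw the other cat".
pos = {"the": [0, 3], "cat": [1, 5], "saw": [2], "other": [4]}
print(phrase_positions(pos, ["the", "cat"]))    # [0]
print(phrase_positions(pos, ["other", "cat"]))  # [4]
print(phrase_positions(pos, ["cat", "the"]))    # []
```

Since tokens without terms do not increment the position in build.py, a phrase found this way may span punctuation in the original text.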

Search for terms that contain all of the specified trigrams:

python3 query_tri.py trigram...

#!/usr/bin/env python3

import sys
from pymongo import Connection

Host = "localhost"
Port = 9123           # <-- CHANGE THIS
DBName = "temp"

connection = Connection(Host, Port)
db = connection[DBName]

wanted = set(sys.argv[1:])  # Distinct trigrams from the command line.

termset = set()
found = 0                   # Number of wanted trigrams present in the index.
for tri in db["tri"].find(spec={"tri": {"$in": sorted(wanted)}},
                          fields=["terms"]):
    if found == 0:
        termset.update(tri["terms"])
    else:
        termset.intersection_update(tri["terms"])
    found += 1

# A trigram that is missing from the index occurs in no term at all,
# so no term can contain all of the specified trigrams.
if found < len(wanted):
    termset.clear()

for term in sorted(termset):
    sys.stdout.write(term + "\n")

connection.close()

query_tri.py

Search for documents that contain any term that contains all of the specified trigrams:

python3 query_any.py `python3 query_tri.py trigram...`
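
This page does not say where the trigrams on the command line come from. One common use of a trigram index is wildcard search, and under that assumption the sketch below turns a pattern with "*" into the trigrams that every matching term must contain. The function name wildcard_trigrams is made up for illustration.

```python
def wildcard_trigrams(pattern):
    """Return the trigrams that every term matching pattern must contain.

    pattern may contain "*" for any run of characters, e.g. "hou*".
    """
    padded = "$" + pattern + "$"   # same padding as in build_tri.py
    result = []
    for part in padded.split("*"):
        for i in range(len(part) - 2):
            tri = part[i:i + 3]
            if tri not in result:
                result.append(tri)
    return result


print(wildcard_trigrams("hou*"))   # ['$ho', 'hou']
print(wildcard_trigrams("*ing"))   # ['ing', 'ng$']
print(wildcard_trigrams("ho*se"))  # ['$ho', 'se$']
```

The trigrams for "hou*" would then be passed as python3 query_tri.py '$ho' hou, with quotes around any trigram that contains "$" so that the shell does not treat it as a variable. Containing all the trigrams is necessary but not always sufficient for a match, so the returned terms are candidates that may still need to be checked against the pattern.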

Reference

http://docs.mongodb.org/manual/reference/
http://api.mongodb.org/python/current/
pydoc3 pymongo.collection

Query gotchas

A query for a > 2 AND b < 3 :

{"a": {"$gt": 2}, "b": {"$lt": 3}}

This is NOT a query for a > 2 AND a < 3 :

{"a": {"$gt": 2}, "a": {"$lt": 3}}

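Why? Because the structure has two keys with the same name, "a". A dictionary can hold each key only once, so in Python the second entry silently replaces the first, and the query that is actually sent is just a < 3.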

This is the correct query for a > 2 AND a < 3 :

{"$and": [{"a": {"$gt": 2}}, {"a": {"$lt": 3}}]}
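
The effect of the duplicate key can be checked in Python without a database. The sketch below also shows a shorter form that puts both operators into one sub-document, which MongoDB accepts as well; the variable names are only for illustration.

```python
# The second "a" replaces the first; no error is raised.
wrong = {"a": {"$gt": 2}, "a": {"$lt": 3}}
print(wrong)       # {'a': {'$lt': 3}}
print(len(wrong))  # 1

# Two ways to express a > 2 AND a < 3 with distinct keys.
with_and = {"$and": [{"a": {"$gt": 2}}, {"a": {"$lt": 3}}]}
combined = {"a": {"$gt": 2, "$lt": 3}}
print(len(with_and["$and"]))  # 2
print(len(combined["a"]))     # 2
```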

Other sites

Ferret: A Resourceful Substring Search Engine In Go
