Contents
Set-up
On volker and similar machines
PATH=$PATH:/net/aps/64/bin # put this in your .bashrc
On LWP machines
PATH=$PATH:/net/aps/64/bin   # put this in your .bashrc or similar
easy_install3 --user pymongo # do this once
If that didn't work, try this:
PATH=$PATH:/net/aps/64/bin                                  # put this in your .bashrc or similar
export PYTHONPATH=$HOME/.local/lib/python3.1/site-packages  # put this in your .bashrc or similar
mkdir -p $PYTHONPATH                                        # do this once
easy_install3 --prefix=$HOME/.local pymongo                 # do this once
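A quick way to check that the set-up worked is to import pymongo from python3 and print its version (the version attribute is part of the pymongo package):

import pymongo
print(pymongo.version)   # should print the installed pymongo version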
Examples
Important: In all these examples you have to change the port number to one that nobody else is using.
Download
Download all examples
Starting the server
Important: Don't forget to stop the server when you're done. Otherwise, the next time you work on a different machine, you won't be able to start the server.
sh server_start.sh
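If you are not sure whether the server came up, you can try to connect to it from Python. The Host and Port values below are assumptions; they must match the port you chose for your server:

#!/usr/bin/env python3

from pymongo import Connection

Host = "localhost"
Port = 9123  # <-- CHANGE THIS to the port your server uses

# Connection raises an exception if no server is listening on Host:Port.
connection = Connection(Host, Port)
print(connection.server_info()["version"])  # MongoDB server version
connection.close()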
Stopping the server
sh server_stop.sh
Building the database
Instead of termIDs, we always use the terms themselves. This simplifies things, but is inefficient for large, real applications.
The textual content of the documents is included in the database. This makes it easier to experiment with. In a real application, only the location of the original document would be stored, together with some metadata such as the document title, date, summary, etc.
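As an illustration, with the simple tokenizer and term extractor in build.py below, a file doc1.txt containing only the line "the cat sat on the mat" would be stored roughly as the following document (doc1.txt is just an example name; MongoDB also adds an _id field):

{"name": "doc1.txt",
 "text": "the cat sat on the mat\n",
 "lang": "",
 "terms": ["cat", "mat", "on", "sat", "the"],
 "pos": {"the": [0, 4], "cat": [1], "sat": [2], "on": [3], "mat": [5]},
 "size": 6}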
Build the main database:
python3 build.py document...
#!/usr/bin/env python3

import sys
from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123  # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"


# Get the plain text out of the document, for instance, the body
# of an html document, with all tags removed.
# Also get the language the text is written in, either from metadata
# in the document, or by using a language guesser on the extracted text.
# The language is needed for correct tokenization and stemming.
def gettext(doc):
    text = doc.decode("utf-8", "replace")
    return text, ""

# Split the text into tokens.
# Separate punctuation from words.
# Fixed word groups should be a single token, e.g.: New York
def tokenize(text, language):
    tokens = text.split()
    return tokens

# Transform a token into a list of zero, one, or more terms.
# Transform inflected words into their root form, e.g.: houses -> house
# There can be more than one term, e.g. synonyms, lemma with/without stemming.
# Some tokens, like punctuation, don't represent a term.
# IMPORTANT: There must be NO DUPLICATES in the result.
def getterm(token, language):
    terms = [token.lower()]
    return terms

# Escape characters that are not allowed in keys.
def escape(key):
    return key.replace("$", "\uFF04").replace(".", "\uFF0E")


connection = Connection(Host, Port)
db = connection[DBName]

# Start with an empty collection.
# Remove this line if you want to add documents to an existing collection.
db.drop_collection(Collection)

for filename in sys.argv[1:]:

    fp = open(filename, "rb")
    doc = fp.read()
    fp.close()

    text, language = gettext(doc)

    tokens = tokenize(text, language)

    terms = set()  # Set of terms in the current text.
    pos = {}       # List of positions for each term.
    p = 0          # Term positions start at 0.
    for token in tokens:
        term = getterm(token, language)
        if len(term) > 0:
            for t in term:
                terms.add(t)
                t = escape(t)
                if t not in pos:
                    pos[t] = []
                pos[t].append(p)
            # Tokens that don't represent terms don't increment the position.
            p += 1

    db[Collection].insert({"name": filename,
                           "text": text,
                           "lang": language,
                           "terms": sorted(terms),
                           "pos": pos,
                           "size": p})

# Make the indexes.
# Only do this if you are not going to add more documents,
# but before you start querying the database.
db[Collection].create_index([("name", ASCENDING)], unique=True, dropDups=True)
db[Collection].create_index([("terms", ASCENDING)])

connection.close()
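After building, it can be useful to check what ended up in the collection. This snippet only reads from the database and reuses the same settings as build.py:

#!/usr/bin/env python3

from pymongo import Connection

Host = "localhost"
Port = 9123  # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"

connection = Connection(Host, Port)
db = connection[DBName]

# Number of indexed documents.
print(db[Collection].count())

# One stored document; print a few fields rather than the full text.
doc = db[Collection].find_one()
if doc is not None:
    print(doc["name"], doc["size"])
    print(doc["terms"])

connection.close()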
Build an index of trigrams:
python3 build_tri.py
#!/usr/bin/env python3

import sys
from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123  # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"

connection = Connection(Host, Port)
db = connection[DBName]

terms = db[Collection].distinct("terms")

trigrams = {}
for term in terms:
    trm = "$" + term + "$"
    for i in range(len(trm) - 2):
        tri = trm[i:i+3]
        if tri not in trigrams:
            trigrams[tri] = set()
        trigrams[tri].add(term)

db.drop_collection("tri")
for tri in trigrams:
    db["tri"].insert({"tri": tri, "terms": sorted(trigrams[tri])})
db["tri"].create_index([("tri", ASCENDING)], unique=True)
An alternative is to store the set of trigrams directly inside each document object and have MongoDB build an index on those embedded trigrams. Then you could search for documents by trigram directly. Which alternative is faster, more flexible, or more efficient?
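A minimal sketch of that embedded alternative, assuming the docs collection built above: compute the trigrams of each document's terms, store them in an extra tris field on the document itself, and build an index on that field. The field name tris and the trigrams in the example query are made up for the illustration:

#!/usr/bin/env python3

from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123  # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"

connection = Connection(Host, Port)
db = connection[DBName]

for doc in db[Collection].find(fields=["terms"]):
    tris = set()
    for term in doc["terms"]:
        trm = "$" + term + "$"
        for i in range(len(trm) - 2):
            tris.add(trm[i:i+3])
    # Store the trigram set inside the document itself.
    db[Collection].update({"_id": doc["_id"]}, {"$set": {"tris": sorted(tris)}})

db[Collection].create_index([("tris", ASCENDING)])

# Documents whose terms contain all of these (example) trigrams:
for doc in db[Collection].find(spec={"tris": {"$all": ["$ca", "cat"]}},
                               fields=["name"]):
    print(doc["name"])

connection.close()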
Querying the database
Search documents with any of the specified terms:
python3 query_any.py term...
#!/usr/bin/env python3

import sys
from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123  # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"


# Escape characters that are not allowed in keys.
def escape(key):
    return key.replace("$", "\uFF04").replace(".", "\uFF0E")


connection = Connection(Host, Port)
db = connection[DBName]
col = db[Collection]

for doc in col.find(spec={"terms": {"$in": sys.argv[1:]}},
                    fields=["name", "text", "terms.$", "pos", "size"],
                    sort=[("name", ASCENDING)]):
    sys.stdout.write(doc["name"] + "\n\n")
    sys.stdout.write(doc["text"])
    sys.stdout.write("\nsize: " + str(doc["size"]) + "\n\n")
    sys.stdout.write("found: " + doc["terms"][0] + "\n")  # "terms.$" returns only one term
    for term in sys.argv[1:]:
        t = escape(term)
        if t in doc["pos"]:
            sys.stdout.write(term + ": " + str(doc["pos"][t]) + "\n")
    sys.stdout.write("=" * 72 + "\n")

connection.close()
Search documents with all of the specified terms:
python3 query_all.py term...
#!/usr/bin/env python3

import sys
from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123  # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"


# Escape characters that are not allowed in keys.
def escape(key):
    return key.replace("$", "\uFF04").replace(".", "\uFF0E")


connection = Connection(Host, Port)
db = connection[DBName]
col = db[Collection]

for doc in col.find(spec={"terms": {"$all": sys.argv[1:]}},
                    fields=["name", "text", "pos", "size"],
                    sort=[("name", ASCENDING)]):
    sys.stdout.write(doc["name"] + "\n\n")
    sys.stdout.write(doc["text"])
    sys.stdout.write("\nsize: " + str(doc["size"]) + "\n\n")
    for term in sys.argv[1:]:
        sys.stdout.write(term + ": " + str(doc["pos"][escape(term)]) + "\n")
    sys.stdout.write("=" * 72 + "\n")

connection.close()
Search terms that contain all of the specified trigrams:
python3 query_tri.py trigram...
#!/usr/bin/env python3

import sys
from pymongo import Connection

Host = "localhost"
Port = 9123  # <-- CHANGE THIS
DBName = "temp"

connection = Connection(Host, Port)
db = connection[DBName]

termset = set()
first = True
for tri in db["tri"].find(spec={"tri": {"$in": sys.argv[1:]}},
                          fields=["terms"]):
    if first:
        termset.update(tri["terms"])
        first = False
    else:
        termset.intersection_update(tri["terms"])

for term in sorted(termset):
    sys.stdout.write(term + "\n")
Search documents that contain any of the terms that in turn contain all of the specified trigrams:
python3 query_any.py `python3 query_tri.py trigram...`
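The same combination can also be done in a single script: first collect the terms whose trigrams all match (as in query_tri.py), then use those terms in a terms query (as in query_any.py). Here only the document names are printed:

#!/usr/bin/env python3

import sys
from pymongo import Connection, ASCENDING

Host = "localhost"
Port = 9123  # <-- CHANGE THIS
DBName = "temp"
Collection = "docs"

connection = Connection(Host, Port)
db = connection[DBName]

# Terms that contain all of the specified trigrams.
termset = set()
first = True
for tri in db["tri"].find(spec={"tri": {"$in": sys.argv[1:]}},
                          fields=["terms"]):
    if first:
        termset.update(tri["terms"])
        first = False
    else:
        termset.intersection_update(tri["terms"])

# Documents that contain any of those terms.
for doc in db[Collection].find(spec={"terms": {"$in": sorted(termset)}},
                               fields=["name"],
                               sort=[("name", ASCENDING)]):
    sys.stdout.write(doc["name"] + "\n")

connection.close()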
Reference
pydoc3 pymongo.collection
Query gotchas
A query for a > 2 AND b < 3 :
{"a": {"$gt": 2}, "b": {"$lt": 3}}
This is NOT a query for a > 2 AND a < 3 :
{"a": {"$gt": 2}, "a": {"$lt": 3}}
Why? Because the dictionary has two keys with the same name, "a". In Python, the second value silently overwrites the first, so only the {"a": {"$lt": 3}} part is actually sent to the server.
This is the correct query for a > 2 AND a < 3 :
{"$and": [{"a": {"$gt": 2}}, {"a": {"$lt": 3}}]}
Other sites
Ferret: A Resourceful Substring Search Engine In Go