Search speed: in-memory state or a database?

I have a bunch of word lists on my server, and I planned to create a simple open-source JSON API that returns whether a password is in the list [1], as a verification method. I'm doing this in Python with Flask, and literally just returning whether the input is present.

One small problem: the word list is about 150 million entries and 1.1 GB of text.

My minimal API is below. Is it more efficient to store every line in MongoDB and query it repeatedly, or to hold the whole list in memory in a singleton, filling it at startup before I call app.run? Or is the difference negligible?
Also, is this good practice at all? I suspect lookups will start to strain the server if I open it to the public. Someone also suggested a trie for an efficient search.

Update: I've experimented a bit, and document lookups are very slow with this many entries. Is a database with a suitable index the right fit for a single column of data that needs to be searched efficiently?

 from flask import Flask, redirect, request
 from flask.views import MethodView
 from flask_pymongo import PyMongo
 import json

 app = Flask(__name__)
 mongo = PyMongo(app)

 class HashCheck(MethodView):

     def post(self):
         # find_one() returns None when the password is absent; negate for the bool.
         return json.dumps({'result': not mongo.db.passwords.find_one(
             {'pass': request.form["password"]})})
         # Error-handling + test cases to come.

     def get(self):
         return redirect('/')

 if __name__ == "__main__":
     app.add_url_rule('/api/', view_func=HashCheck.as_view('api'))
     app.run(host="0.0.0.0", debug=True)

1: I'm a security nut. I use this in my registration forms to reject common input. One of the word lists is UNIQPASS.

+4
4 answers

I would suggest a hybrid approach. For each request, two checks are performed: the first against a local in-memory cache, the second against the MongoDB store. If the first misses but the second finds a match, add the word to the in-memory cache (see the sketch after the list below). Over time, the application will become primed with the most common "bad passwords" / entries.

This has two advantages:
1) Common words are rejected quickly, straight from memory.
2) The startup cost is close to zero and is amortized over many requests.
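
A minimal sketch of that idea, assuming the same mongo handle as in your code and the _id layout described below; local_cache and is_known_bad are made-up names:

 # In-process cache sitting in front of MongoDB (hypothetical names).
 local_cache = set()

 def is_known_bad(word):
     # First check: the in-memory cache of previously seen bad passwords.
     if word in local_cache:
         return True
     # Second check: the MongoDB store (assumes words are stored in _id, as below).
     if mongo.db.passwords.find_one({'_id': word}) is not None:
         local_cache.add(word)  # promote the hit so future checks skip Mongo
         return True
     return False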

When storing the word list in MongoDB, I would make the _id field hold each word. By default you get an ObjectId, which is a complete waste in this case, and using _id also gives you its automatic index for free. I suspect the poor performance you saw was due to there being no index on the 'pass' field. Alternatively, you can add one to the 'pass' field:

 mongo.db.passwords.create_index("pass") 

To follow the _id approach, insert words like this:

 mongo.db.passwords.insert( { "_id" : "password" } ); 

The queries look like this:

 mongo.db.passwords.find( { "_id" : request.form["password"] } ) 

As @Madarco mentioned, you can also shave off a bit more query time by ensuring results are returned straight from the index (a covered query), by restricting the returned fields to just _id ( { "_id" : 1 } ):

 mongo.db.passwords.find( { "_id" : request.form["password"] }, { "_id" : 1} ) 

HTH - Rob

PS I am not a Python/PyMongo expert, so the syntax may not be 100% correct. Hope it's still helpful.

+4

Given that your list is completely static and fits in memory, I see no good reason to use a database.

I agree that a trie would be efficient for your purpose. A hash table would also work.
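
For the hash-table route, a minimal sketch, assuming one word per line in a file (wordlist.txt is a made-up name):

 # Load the word list once at startup into an immutable set.
 with open('wordlist.txt') as f:
     BAD_PASSWORDS = frozenset(line.strip() for line in f)

 def is_bad(password):
     # Average-case O(1) membership test.
     return password in BAD_PASSWORDS

Keep in mind that CPython's per-string overhead will make the resident size considerably larger than the 1.1 GB of raw text.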

PS: too bad about Python's Global Interpreter Lock, though. In a language with real multithreading, you could use an immutable data structure and serve requests from several cores with shared memory.

+4

I would suggest checking out redis as an option. It's fast, very fast, and has good Python bindings. I would build a redis set from the word list, and then use SISMEMBER to check whether a word is in the set. SISMEMBER is an O(1) operation, so it should be faster than a Mongo query.

That does mean keeping the whole list in redis memory, of course, and being willing to give up on Mongo.

Here's more info about redis SISMEMBER, and the Python bindings for redis.
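
A minimal sketch with the redis-py client; the key name 'passwords' and the file name are assumptions, and the load only needs to run once:

 import redis

 r = redis.Redis(host='localhost', port=6379)

 # One-time load: pipeline the SADDs in batches to cut round trips.
 with open('wordlist.txt') as f, r.pipeline(transaction=False) as pipe:
     for i, line in enumerate(f):
         pipe.sadd('passwords', line.strip())
         if i % 10000 == 0:
             pipe.execute()
     pipe.execute()

 def is_bad(password):
     # SISMEMBER is O(1) on the server side.
     return r.sismember('passwords', password)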

+2

I would recommend kyotocabinet; it is very fast. I used it in similar circumstances:

 import sys
 import json

 import kyotocabinet as kyc
 from flask import Flask, redirect, request
 from flask.views import MethodView

 app = Flask(__name__)

 dbTree = kyc.DB()
 if not dbTree.open('./passwords.kct', kyc.DB.OREADER):
     print("open error: " + str(dbTree.error()), file=sys.stderr)
     raise SystemExit

 class HashCheck(MethodView):

     def post(self):
         # check() returns the size of the stored value, or -1 if the key is absent.
         return json.dumps({'result': dbTree.check(request.form["password"]) >= 0})
         # Error-handling + test cases to come.

     def get(self):
         return redirect('/')

 if __name__ == "__main__":
     app.add_url_rule('/api/', view_func=HashCheck.as_view('api'))
     app.run(host="0.0.0.0", debug=True)
0
