Python tf-idf: how can I make it faster?

I am implementing the tf-idf algorithm in a web application using Python, but it is very slow. Here is what I basically do:

1) Create 2 dictionaries:

  • First dictionary: key (document identifier), value (list of all words found in the document, including repetitions)
  • Second dictionary: key (document identifier), value (set of the document's unique words)

Now a user requests the tf-idf results for a document d. What I do:

2) Iterate over the unique words of the second dictionary for document d, and for each unique word w compute:

2.1) the tf score (how many times w appears in d: a loop over the document's word list in the first dictionary)

2.2) the df score (how many documents contain w: a loop over the word sets of all documents (second dictionary), checking each set for the presence of w). I use sets because membership tests seem faster on a set than on a list.
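To illustrate the fix for the bottleneck in step 2.2: instead of scanning every document's word set once per query word, both tf and df can be precomputed in a single pass when the corpus is built. The sketch below is a minimal example with a hypothetical three-document corpus (the names `docs`, `unique_words`, `df`, `tf` are illustrative, not from the question):

```python
from collections import Counter

# Hypothetical example corpus: doc_id -> list of words (with repetitions),
# mirroring the question's first dictionary.
docs = {
    "d1": ["apple", "apple", "banana"],
    "d2": ["banana", "cherry"],
    "d3": ["apple", "cherry", "cherry"],
}

# Second dictionary: doc_id -> set of unique words.
unique_words = {doc_id: set(words) for doc_id, words in docs.items()}

# Precompute document frequency ONCE: each word counts once per document
# that contains it, so no per-query scan over all sets is needed.
df = Counter(word for words in unique_words.values() for word in words)

# Per-document term frequencies, also computed once.
tf = {doc_id: Counter(words) for doc_id, words in docs.items()}

print(df["apple"])        # 2  (appears in d1 and d3)
print(tf["d1"]["apple"])  # 2  (occurs twice in d1)
```

With this, answering a query for document d is just dictionary lookups, which are O(1) per word.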

Step 2.2 is terribly slow. For example, with 1000 documents, a document with 2313 unique words takes about 5 minutes to display its results.

Is there any other way to make step 2.2 faster? Are dictionaries slow to iterate?

2 answers

Well, you need to rethink and redesign the way you store your data; in other words, implement an "orthodox" version of an "inverted index".

Your bottleneck is the on-the-fly calculation of document frequency (DF) for terms. It would be a good idea to make this incremental: every time you update your corpus (collection of documents), do some processing and update the DF of every term in the new document (and, of course, persist the results, e.g. in a database, etc.).

The only structure you need is a nested dictionary like this:

{ "term1" : { "DF" : x, "some_doc_id" : tf, "some_other_doc_id" : tf, ... }, "term2" : ... }

properly updated every time you "feed" your corpus.
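The update rule above can be sketched as follows. This is a minimal illustration, not the answerer's actual code; the class and method names (`TfIdfIndex`, `add_document`, `tfidf`) and the toy documents are assumptions:

```python
import math

class TfIdfIndex:
    """Inverted index: term -> {"DF": df, doc_id: tf, ...}."""

    def __init__(self):
        self.index = {}
        self.num_docs = 0

    def add_document(self, doc_id, words):
        """Update DF and per-document tf every time a document is fed in."""
        self.num_docs += 1
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        for term, tf in counts.items():
            entry = self.index.setdefault(term, {"DF": 0})
            entry["DF"] += 1      # each document contributes once to DF
            entry[doc_id] = tf

    def tfidf(self, term, doc_id):
        """tf-idf of a term in a document; 0.0 if the term is absent."""
        entry = self.index.get(term)
        if entry is None or doc_id not in entry:
            return 0.0
        return entry[doc_id] * math.log(self.num_docs / entry["DF"])

idx = TfIdfIndex()
idx.add_document("d1", ["apple", "apple", "banana"])
idx.add_document("d2", ["banana", "cherry"])
print(idx.index["apple"]["DF"])            # 1  (apple occurs in one document)
print(round(idx.tfidf("apple", "d1"), 3))  # 1.386, i.e. 2 * ln(2)
```

At query time, each tf-idf score is two dictionary lookups plus a logarithm, so the per-document cost no longer grows with the number of documents in the corpus.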

And, of course, persist your corpus somewhere...

As a hobby and part of my work, I am implementing a small Python search engine backed by redis. You might get some other ideas from it. Take a look here.


Is this academic work, or are you doing it for production? If it's for production, why not use something already available (e.g. http://code.google.com/p/tfidf/ )? On the other hand, if you are doing it as an academic exercise, you could still take a gander at an existing implementation to see what it does differently (if anything).

I also suggest profiling your code with cProfile to find out where the cost actually is.
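For example, profiling could look like the sketch below. The function `compute_tfidf_for_document` is a hypothetical stand-in for whatever computes the results for one document; only the `cProfile`/`pstats` usage is the point:

```python
import cProfile
import pstats

def compute_tfidf_for_document(doc_id):
    # Stand-in workload; replace with the real tf-idf computation.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
compute_tfidf_for_document("d")
profiler.disable()

# Print the ten entries with the highest cumulative time.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(10)
```

The output will show which function (and which line of step 2.2) dominates the five minutes, so the fix targets the real hot spot rather than a guess.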


Source: https://habr.com/ru/post/1369296/
