Data structure options for efficiently storing sets of integer pairs on disk?

I have a bunch of code that does document clustering. One step involves computing the similarity (for some unimportant definition of "similar") of every document to every other document in the corpus and storing those similarities for later use. The similarities are bucketed, and for the purposes of my analysis I don't care what the exact similarity is, only which bucket it falls into. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5, 0.6) bucket. Documents sometimes get added to or removed from the corpus after the initial analysis, so the corresponding pairs get added to or removed from the buckets as needed.

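Concretely, the bucketing works roughly like this (a simplified Python sketch, not the real code; the function names and the 0.1-wide buckets are just for illustration):

    # Simplified illustration of the bucketing scheme (not the real code):
    # similarities fall into 0.1-wide buckets, and each pair is stored in
    # canonical (smaller ID, larger ID) order.

    def bucket_for(similarity):
        """Map a similarity in [0, 1] to a bucket label like (0.5, 0.6)."""
        low = int(similarity * 10) / 10.0
        return (low, round(low + 0.1, 1))

    def canonical_pair(doc_a, doc_b):
        """Store pairs with the smaller ID first so (a, b) == (b, a)."""
        return (doc_a, doc_b) if doc_a < doc_b else (doc_b, doc_a)

    # e.g. documents 15378 and 3278 at 52% similarity:
    # bucket_for(0.52) -> (0.5, 0.6), pair stored as (3278, 15378)
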
I am looking for strategies for storing these lists of ID pairs. Using the SQL database where most of our other data for this project lives turned out to be too slow and to take too much disk space for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now lz4 for speed). Things I like about this:

  • Reading and writing are both pretty fast.
  • Subsequent additions to the corpus are fairly easy to append (a little less so for lz4 than for zlib, because lz4 doesn't have a built-in framing mechanism, but it's doable).
  • Both writing and reading can be done as a stream, so the data never has to sit in memory all at once, which would be prohibitive given the size of our corpora (there's a rough sketch right after this list).
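To make the streaming point concrete, the write/read path looks roughly like this (a simplified sketch of the zlib variant; the lz4 path is analogous but needs its own framing):

    # Rough sketch of the streamed, compressed bucket format: each pair is
    # packed as two little-endian uint32s and fed through a streaming
    # compressor, so a whole bucket never has to be held in memory.
    import struct
    import zlib

    def write_bucket(path, pairs):
        comp = zlib.compressobj()
        with open(path, "wb") as f:
            for a, b in pairs:                      # pairs can be any iterable
                chunk = comp.compress(struct.pack("<II", a, b))
                if chunk:
                    f.write(chunk)
            f.write(comp.flush())

    def read_bucket(path, chunk_size=1 << 16):
        decomp = zlib.decompressobj()
        buf = b""
        with open(path, "rb") as f:
            while True:
                raw = f.read(chunk_size)
                if not raw:
                    break
                buf += decomp.decompress(raw)
                n = len(buf) // 8
                for a, b in struct.iter_unpack("<II", buf[:n * 8]):
                    yield a, b
                buf = buf[n * 8:]
        buf += decomp.flush()
        for a, b in struct.iter_unpack("<II", buf):
            yield a, b
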

Things that kind of suck:

  • Deletions are a huge pain, basically involving streaming through every bucket and writing out a new one that omits any pairs containing the ID of the deleted document (roughly the routine sketched after this list).
  • I suspect I could still do better, both on speed and on compactness, with a more purpose-built data structure and/or compression strategy.
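For reference, the deletion path currently amounts to something like this (simplified; it reuses the write_bucket/read_bucket helpers sketched above):

    # Simplified version of what a deletion currently costs: stream every
    # bucket, drop any pair touching the deleted document, rewrite the file.
    import os

    def delete_document(doc_id, bucket_paths):
        for path in bucket_paths:
            tmp = path + ".tmp"
            surviving = ((a, b) for a, b in read_bucket(path)
                         if a != doc_id and b != doc_id)
            write_bucket(tmp, surviving)
            os.replace(tmp, path)   # swap the new bucket in for the old one
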

So: what data structures should I be looking at? I suspect the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, in case it matters: all document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.

+6

3 answers

How about using one hash table or B-tree per bucket?

On-disk hash tables are standard fare. Maybe the BerkeleyDB libraries (there are Python bindings) would work for you; but be aware that since they come with transactions, they can be slow and may need some tuning. There are several alternatives you should try as well: gdbm, tdb. Just make sure you check the APIs and initialize them with an appropriate size. Some of them won't resize automatically, and if you feed them too much data their performance simply drops off.

In any case, you may want to use something even lower-level, without transactions, if you have a lot of changes.

A pair of ints fits in a long, and most databases should accept a long as a key; in fact, many will accept arbitrary byte sequences as keys.
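For instance, a sketch with Python's generic dbm module (just to show the shape of it — BerkeleyDB, gdbm and tdb all expose a similar byte-key/byte-value API, and which backend dbm.open actually picks depends on what's installed):

    # Sketch: one on-disk key/value table per bucket, keyed by the pair
    # packed into 8 bytes (smaller ID first), with an empty value.
    import dbm
    import struct

    def pair_key(a, b):
        lo, hi = (a, b) if a < b else (b, a)
        return struct.pack("<II", lo, hi)

    with dbm.open("bucket_0.5_0.6", "c") as db:
        db[pair_key(3278, 15378)] = b""          # add a pair
        present = pair_key(3278, 15378) in db    # membership check
        del db[pair_key(3278, 15378)]            # delete a pair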

+1

Why not just keep a table of the things that have been deleted since the last rewrite?

This table could have the same structure as your main buckets, perhaps with a Bloom filter for quick membership checks.

You can rewrite a main bucket's data without the deleted elements either when you were going to rewrite it anyway for some other modification, or when the ratio of deleted elements to bucket size exceeds some threshold.


This scheme can work either by storing the deleted pairs alongside each bucket, or by keeping one separate table of all deleted documents: I'm not sure which suits your requirements better.

With a single table, it's hard to know when you can drop an entry from it unless you know how many buckets it affects, short of rewriting all the buckets whenever the deletion table gets too big. That can work, but it's a bit stop-the-world.

You also have to do two checks for each pair you stream through (i.e., for (3278, 15378) you check whether either 3278 or 15378 has been deleted, rather than just checking whether the pair (3278, 15378) itself has been deleted).

Conversely, keeping a per-bucket table of every deleted pair takes longer to record a deletion, but it's a bit faster to check and easier to collapse when the bucket gets rewritten.
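A rough sketch of the single-table bookkeeping, with a plain Python set standing in for whatever on-disk table and/or Bloom filter you end up using (the names are made up):

    # Deferred-deletion bookkeeping, single-table variant: record deleted
    # document IDs, filter pairs against them on read, and only rewrite a
    # bucket once enough of it is dead. A Bloom filter could sit in front
    # of deleted_docs to make the membership test cheaper.

    class DeletionLog:
        def __init__(self, rewrite_threshold=0.2):
            self.deleted_docs = set()          # stand-in for the on-disk table
            self.rewrite_threshold = rewrite_threshold

        def mark_deleted(self, doc_id):
            self.deleted_docs.add(doc_id)

        def pair_is_dead(self, a, b):
            # Note the two checks per pair mentioned above:
            return a in self.deleted_docs or b in self.deleted_docs

        def live_pairs(self, pairs):
            return (p for p in pairs if not self.pair_is_dead(*p))

        def should_rewrite(self, bucket_pair_count, dead_pair_count):
            return dead_pair_count / max(bucket_pair_count, 1) >= self.rewrite_threshold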

+1

You are trying to reinvent something that already exists in the new generation of NoSQL data stores. There are two very good candidates for your requirements:

  • Redis
  • MongoDB

Both support data structures such as dictionaries, lists and queues. Operations like adding, modifying and deleting are available in both and are very fast.

The performance of both depends on how much of the data can be kept in RAM. Since most of your data is integer-based, that shouldn't be a problem.

My personal suggestion is to go with Redis with a good persistence configuration (i.e. the data is periodically saved from RAM to disk).

Here is a short introduction to the Redis data types: http://redis.io/topics/data-types-intro

Redis is easy to install, and a Python client is available.
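For example, with the redis-py client you could keep one Redis set per bucket; this is just a sketch, and the key names and member encoding are only for illustration:

    # Sketch: one Redis set per bucket, members are "smallerID:largerID"
    # strings. Adds, deletes and membership checks are all cheap.
    import redis

    r = redis.Redis()                      # assumes a local Redis server

    def member(a, b):
        lo, hi = sorted((a, b))
        return f"{lo}:{hi}"

    bucket = "pairs:0.5-0.6"
    r.sadd(bucket, member(3278, 15378))    # add a pair
    r.sismember(bucket, member(3278, 15378))
    r.srem(bucket, member(3278, 15378))    # delete a pair

Combined with RDB snapshots or AOF, that gives you the periodic RAM-to-disk persistence mentioned above.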

-1
