Indexing files (using binary trees?) in Python

Background

I have many (thousands!) of data files in a standard field-based format (think tab-delimited, with the same fields on every line, in every file). I'm debating various ways of making this data available / searchable. (Some options include an RDBMS, NoSQL stores, using grep/awk and friends, etc.)

Proposal

In particular, one idea that appeals to me is to “index” the files in some way. Since these files are read-only (and static), I was imagining persistent files containing binary trees (one per indexed field, just like in other data stores). I'm open to ideas about how to do this, or to hearing that it's simply insane. Mostly, my favorite search engine hasn't yielded any pre-rolled solutions for this.

I realize this is a little ill-formed, and suggestions are welcome.
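For illustration, here is a minimal sketch of the idea of one persistent index file per field. It uses a plain dict pickled to disk rather than an actual binary tree, so it only supports exact-match lookups (a sorted structure would be needed for range queries), and names like build_field_index are made up:

```python
import pickle
from pathlib import Path

def build_field_index(data_files, field_no, index_path, delimiter=b"\t"):
    """Map each value of one field to the (file, byte offset) pairs where it occurs."""
    index = {}
    for path in data_files:
        with open(path, "rb") as f:
            offset = 0
            for raw in f:
                fields = raw.rstrip(b"\r\n").split(delimiter)
                if field_no < len(fields):
                    index.setdefault(fields[field_no], []).append((str(path), offset))
                offset += len(raw)
    with open(index_path, "wb") as out:
        pickle.dump(index, out)

def lookup(index_path, value):
    with open(index_path, "rb") as f:
        index = pickle.load(f)
    return index.get(value.encode(), [])

# One index file per field you want to query on, e.g.:
# build_field_index(Path(".").glob("*.tsv"), field_no=2, index_path="field2.idx")
# hits = lookup("field2.idx", "some_value")   # -> [(filename, byte_offset), ...]
```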

Additional Information

  • files are long, not wide
    • millions of lines per hour, spread over 100+ files per hour
    • tab-separated, not many columns (~10)
    • short fields (e.g., ~50 characters per field)
  • queries are on fields, combinations of fields, and may be historical.

Disadvantages of various solutions:

(All of these are based on my own observations and testing, but I'm open to correction.)

BDB

  • has trouble scaling to large file sizes (in my experience, once a file reaches 2 GB or so, performance can be terrible)
  • single writer (if you can get around this, I want to see the code!)
  • hard to do multiple indexing, i.e., indexing on different fields at once (sure, you can do it by copying the data over and over)
  • since it only stores strings, there is a serialize / deserialize step (see the sketch below)
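A minimal sketch of that serialize/deserialize step, assuming the third-party bsddb3 package (which exposes the old bsddb-style interface); the record layout and file names are made up:

```python
import pickle
import bsddb3  # third-party package exposing the old stdlib bsddb interface

# B-tree-backed file; keys and values must be bytes, hence the pickle step.
db = bsddb3.btopen("field2.bdb", "c")

record = {"timestamp": "2011-01-01T00:00:00", "field2": "some_value", "line": 12345}
db[record["field2"].encode()] = pickle.dumps(record)   # serialize on write
db.sync()

try:
    print(pickle.loads(db[b"some_value"]))              # deserialize on read
except KeyError:
    print("not found")
db.close()
```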

RDBMSes

Wins:

  • flat table model great for querying, indexing

Losses:

  • In my experience, the problem is with indexing. From what I've seen (and please correct me if I'm wrong), the RDBMSes I know (sqlite, postgres) support either batch loading (after which building the indexes at the end is slow) or row-by-row loading (which is slow). Maybe I need more performance tuning; a batch-load sketch follows below.
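For reference, a minimal sketch of the batch-load-then-index pattern with the stdlib sqlite3 module. The table, column, and file names are made up (a real table would have ~10 columns), and the PRAGMA settings are optional speed-for-durability trade-offs:

```python
import csv
import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("PRAGMA journal_mode = OFF")   # optional: trade durability for load speed
conn.execute("PRAGMA synchronous = OFF")
conn.execute("CREATE TABLE IF NOT EXISTS records (f1 TEXT, f2 TEXT, f3 TEXT)")

def load_file(path):
    with open(path, newline="") as f:
        rows = csv.reader(f, delimiter="\t")
        # executemany in one transaction is much faster than committing row by row
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()

for path in ["batch1.tsv", "batch2.tsv"]:   # hypothetical file names
    load_file(path)

# Build indexes only after the bulk load, so inserts don't pay index-maintenance costs.
conn.execute("CREATE INDEX IF NOT EXISTS idx_f1 ON records (f1)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_f2_f3 ON records (f2, f3)")
conn.commit()
```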
+4

5 answers

Why reinvent the wheel? By all means index the files, but use Whoosh, or Lucene, etc.

Edit: you hadn't stated your performance requirements at the time I posted this answer. You won't be able to index “millions of lines per hour” with off-the-shelf software.
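A minimal sketch of the Whoosh approach, assuming Whoosh is installed; the schema fields, directory, and file names are made up:

```python
import os
from whoosh import index
from whoosh.fields import Schema, ID, STORED
from whoosh.qparser import QueryParser

# One indexed field per column you want to query on; names here are illustrative.
schema = Schema(field1=ID(stored=True), field2=ID(stored=True), source=STORED)

os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

writer = ix.writer()
with open("sample.tsv") as f:               # hypothetical input file
    for lineno, line in enumerate(f):
        cols = line.rstrip("\n").split("\t")
        writer.add_document(field1=cols[0], field2=cols[1],
                            source=f"sample.tsv:{lineno}")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("field1", ix.schema).parse("some_value")
    for hit in searcher.search(query, limit=20):
        print(hit["source"])
```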

+5

Physical storage access time will dominate anything you do. When you profile, you'll find that read() is where you spend most of your time.

To reduce the time spent waiting for I/O, compression is your best bet.

Create one huge ZIP archive of all your files: one open(), fewer reads. You'll spend more CPU time, but I/O time will dominate your processing, so reduce I/O time by zipping everything.
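A minimal sketch with the stdlib zipfile module; the archive name and the filter condition are made up:

```python
import io
import zipfile

# One archive, one open(); each member is decompressed and scanned in turn.
with zipfile.ZipFile("all_data.zip") as zf:        # hypothetical archive name
    for name in zf.namelist():
        with zf.open(name) as member:
            for line in io.TextIOWrapper(member, encoding="utf-8"):
                fields = line.rstrip("\n").split("\t")
                if fields[1] == "some_value":      # whatever filter the query needs
                    print(name, fields)
```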

+1

If the data is already organized into fields, this doesn't sound like a text searching / indexing problem. It sounds like tabular data that would be well served by a database.

Script the file data into a database, index as you see fit, and query the data in whatever complex way the database supports.

That is, unless you're looking for a cool learning project. Then, by all means, come up with an interesting file-indexing scheme.

+1

There are a number of file-based DBMSes designed for the sort of data you describe. One of the most widely used is Berkeley DB. It's open source, now owned by Oracle, and comes in C and Java editions (with Python and other language bindings for the C version).

Since your data is read-only, though, there are other options. One that comes to mind is CDB. It also has Python interfaces, as well as a pure-Python reimplementation.

Or, if you really like implementing things yourself, you could always implement SSTables from Google's Bigtable paper. ;)
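If you go that route, here is a toy sketch of the SSTable idea (records sorted by key on disk, plus a small offset index searched with bisect). It is nowhere near the real Bigtable format, and the function names are made up:

```python
import bisect
import pickle

def write_sstable(pairs, data_path, index_path):
    """Write records sorted by key; remember each key's byte offset for lookups."""
    keys, offsets = [], []
    with open(data_path, "wb") as data:
        for key, value in sorted(pairs):
            keys.append(key)
            offsets.append(data.tell())
            pickle.dump((key, value), data)
    with open(index_path, "wb") as idx:
        pickle.dump((keys, offsets), idx)

def sstable_get(key, data_path, index_path):
    with open(index_path, "rb") as idx:
        keys, offsets = pickle.load(idx)
    pos = bisect.bisect_left(keys, key)        # binary search over the sorted keys
    if pos == len(keys) or keys[pos] != key:
        return None
    with open(data_path, "rb") as data:
        data.seek(offsets[pos])
        return pickle.load(data)[1]

# write_sstable([("b", "2"), ("a", "1")], "t.sst", "t.idx")
# sstable_get("a", "t.sst", "t.idx")   # -> "1"
```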

+1

sqlite3 is fast, small, part of Python (so there's nothing to install), and provides indexing of columns. It writes to files, so you wouldn't need to install a database system.
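For example, querying on a single field and on a combination of fields, reusing the hypothetical table and indexes from the sqlite3 sketch in the question above:

```python
import sqlite3

conn = sqlite3.connect("data.db")   # hypothetical database from the earlier sketch

one_field = conn.execute(
    "SELECT * FROM records WHERE f1 = ?", ("some_value",)).fetchall()

combined = conn.execute(
    "SELECT COUNT(*) FROM records WHERE f2 = ? AND f3 = ?",
    ("a", "b")).fetchone()[0]

print(len(one_field), combined)
```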

+1

Source: https://habr.com/ru/post/1299014/