Background
I have many (thousands!) of data files in a standard field-based format (think tab-delimited, the same fields on every line, in every file). I'm weighing various ways to make this data searchable (the options include an RDBMS, various NoSQL stores, grep/awk and friends, etc.).
Proposal
In particular, one idea that appeals to me is to "index" the files in some way. Since these files are read-only (and static), I have been imagining persistent files containing binary trees (one per indexed field, just like other data stores use), roughly along the lines of the sketch below. I'm open to ideas about how to do this, or to hearing that it's simply insane. Mostly, my favorite search engine hasn't turned up any pre-rolled solutions for this.
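To make the idea concrete, here is a minimal sketch (Python, with made-up file and function names) of one persistent index per field. It uses a sorted, pickled list searched with bisect rather than a literal on-disk binary tree, but the access pattern is the same: each index maps a field value to (filename, byte offset) pairs.

```python
# Hypothetical sketch, not a finished design: one persistent index per field,
# stored as a sorted, pickled list of (value, filename, byte_offset) tuples
# and searched with bisect. All names here are made up for illustration.
import bisect
import pickle

def build_index(filenames, field_no, index_path):
    """Index one tab-delimited column (0-based) across many static files."""
    entries = []
    for name in filenames:
        offset = 0
        with open(name, "rb") as f:
            for line in f:
                fields = line.rstrip(b"\n").split(b"\t")
                if field_no < len(fields):
                    entries.append((fields[field_no], name, offset))
                offset += len(line)
    entries.sort()  # files are read-only, so this is done exactly once
    with open(index_path, "wb") as out:
        pickle.dump(entries, out)

def lookup(index_path, value):
    """Return (filename, offset) pairs for lines whose field equals value (bytes)."""
    with open(index_path, "rb") as f:
        entries = pickle.load(f)
    keys = [e[0] for e in entries]
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return [(name, off) for _, name, off in entries[lo:hi]]
```

Retrieving a matching line is then just a seek and a readline on the returned file/offset. (This sketch loads the whole index into memory on every lookup; a real version would keep it open, mmap it, or use a proper on-disk tree, which is exactly the part I'm unsure about.)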
I realize this is all a little ill-formed, and solutions/suggestions are welcome.
Additional Information
- files are long, not wide
- millions of lines per hour, more than 100 files per hour
- tab delimited, not many columns (~10)
- short fields (e.g. 50 characters per field)
- queries refer to fields, combinations of fields, and may be historical.
Disadvantages of various solutions:
(All of this is based on my own observations and tests, but I'm open to correction.)
BDB
- has trouble scaling to large file sizes (in my experience, once they reach 2 GB or so, performance can be terrible)
- single writer (if you can get around this, I want to see the code!)
- hard to do multiple indexing, that is, indexing on different fields at the same time (sure, you can do it by copying the data over and over; a sketch of what that looks like follows this list)
- since it only stores strings, there is a serialize / deserialize step
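For what it's worth, this is roughly what "copying the data over and over" looks like with a BDB-style key/value store. This is a sketch only; Python's dbm module stands in for BDB here, and it also shows the manual serialize/deserialize step.

```python
# Illustration only: one key/value index file per field, values serialized
# by hand. Python's dbm module stands in for BDB; names are made up.
import dbm
import json

def index_field(data_file, field_no, index_path):
    with dbm.open(index_path, "c") as db, open(data_file, "rb") as f:
        offset = 0
        for line in f:
            key = line.rstrip(b"\n").split(b"\t")[field_no]
            # append this line's location to whatever is already stored
            locations = json.loads(db[key]) if key in db else []
            locations.append([data_file, offset])
            db[key] = json.dumps(locations).encode()  # serialize step
            offset += len(line)

# Indexing two fields means two full passes and two copies of the
# location data:
# index_field("data.tsv", 0, "field0.idx")
# index_field("data.tsv", 3, "field3.idx")
```

The repeated deserialize/append/serialize on every line is part of what makes this approach painful, which is exactly the overhead mentioned in the last bullet above.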
RDBMSes
Wins:
- flat table model great for querying, indexing
Losses:
- In my experience, the problem is indexing. From what I've seen (and please correct me if I'm wrong), the RDBMSes I know (sqlite, postgres) support either batch loading (and then building the indexes at the end is slow) or row-by-row loading (which is slow). Maybe I just need more performance tuning; the batch-load path I've been testing is sketched below.
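For concreteness, the "bulk load first, index afterwards" path looks roughly like this sqlite3 sketch. The table layout, column names, and PRAGMA choices are placeholders, not recommendations; the real files have ~10 short text columns.

```python
# Sketch of "bulk load first, index afterwards" with sqlite3. Table layout,
# column names, and PRAGMA choices are placeholders for illustration.
import sqlite3

def load_file(db_path, data_file):
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode = OFF")   # bulk-load settings:
    conn.execute("PRAGMA synchronous = OFF")    # fast but not crash-safe
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (f0 TEXT, f1 TEXT, f2 TEXT)"
    )
    with open(data_file, "r", encoding="utf-8") as f:
        rows = (line.rstrip("\n").split("\t") for line in f)
        with conn:  # one big transaction for the whole file
            conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    # the indexing step -- this is the part that gets slow at the end
    conn.execute("CREATE INDEX IF NOT EXISTS idx_f0 ON records (f0)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_f1 ON records (f1)")
    conn.close()
```

The alternative is inserting row by row with the indexes already in place, which in my tests is even slower.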