A database or other way to store and dynamically access HUGE binary objects

I have some large (typically around 200 GB) flat data files that I would like to store in some kind of database, so that the data can be accessed quickly and in a way that reflects how it is logically organized. Think of them as large sets of very long audio recordings, where every record has the same length (the same number of samples) and can be thought of as one long string. One of these files typically holds about 100,000 records, each 2,000,000 samples long.

It would be simple enough to store these records as BLOBs in a relational database, but there are many cases where I want to load only a certain range of columns of the whole data set into memory (say, samples 1,000-2,000 of every record). What is the most memory- and time-efficient way to do this?

Please feel free to ask if you need me to clarify any details before making a recommendation.

EDIT: To clarify the size of the data: one file consists of 100,000 rows (records) by 2,000,000 columns (samples). Most of the relational databases I've researched will allow from a few hundred to several thousand rows in a table. Again, I know little about object-oriented databases, so I'm wondering whether something like that could help here. Of course, any good solution is welcome. Thanks.

EDIT: To clarify the use of the data: the data will be accessed only by a desktop/distributed-server application that I will write myself. There is metadata (collection date, filters, sample rate, owner, etc.) for each data set (which I have been calling a 200 GB file so far). There is also metadata associated with each record, which I was hoping would be a row in a table so that I could just add a column for each piece of per-record metadata. All metadata is consistent; that is, if a particular piece of metadata exists for one record, it exists for all records in that file. The samples themselves have no metadata: each sample is simply 8 bits of plain old binary data.
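To make the access pattern concrete, here is a rough sketch (Python/NumPy; the file name and the assumption that records sit back to back with no header are mine, not part of the actual format) of the kind of slice I want to pull out of one of these files:

    import numpy as np

    NUM_RECORDS = 100_000     # records per file
    RECORD_LEN = 2_000_000    # samples per record, 1 byte each

    # View the flat file as a (records x samples) matrix of uint8 and pull
    # out only samples 1,000-1,999 of every record; memory-mapping means
    # only the touched pages actually get read from disk.
    data = np.memmap("dataset.bin", dtype=np.uint8, mode="r",
                     shape=(NUM_RECORDS, RECORD_LEN))
    window = np.array(data[:, 1_000:2_000])   # ~100,000 x 1,000 bytes in RAM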

+4
4 answers

DB storage might not be ideal for files this large. Yes, it can be done. Yes, it might work. But what about database backups? The file contents will probably not change often; once added, they will stay the same.

My recommendation would be to store the files on disk, but build an index in the database. Most file systems become slow or unwieldy once you have more than about 10k files in a single folder. Your application can generate a file name, store the metadata in the database, and organize the files on disk by the generated name. The downside is that the file contents are not directly apparent from the name. However, you can easily back up changed files without specialized DB backup plugins or complicated partitioning and incremental backup schemes. In addition, seeking within a file becomes much simpler (skip ahead, rewind, etc.). Such operations are generally better supported by a file system than by a database.
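A minimal sketch of that layout (SQLite and the table/column names are just placeholders for whatever store and metadata you actually use): metadata and a generated name go in the database, the bytes stay on disk, and a partial read is a plain seek.

    import os
    import shutil
    import sqlite3
    import uuid

    DATA_DIR = "blobs"                     # root folder for the raw data files
    db = sqlite3.connect("index.db")
    db.execute("""CREATE TABLE IF NOT EXISTS datasets (
                      id          TEXT PRIMARY KEY,   -- generated name
                      collected   TEXT,
                      sample_rate INTEGER,
                      owner       TEXT,
                      path        TEXT)""")

    def add_dataset(src_path, collected, sample_rate, owner):
        """Register a raw data file: metadata goes into the DB, bytes stay on disk."""
        name = uuid.uuid4().hex
        dest = os.path.join(DATA_DIR, name[:2], name)   # fan out to keep folders small
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(src_path, dest)
        db.execute("INSERT INTO datasets VALUES (?, ?, ?, ?, ?)",
                   (name, collected, sample_rate, owner, dest))
        db.commit()
        return name

    def read_slice(name, offset, length):
        """Partial read: look the file up in the DB index, then just seek."""
        (path,) = db.execute("SELECT path FROM datasets WHERE id = ?",
                             (name,)).fetchone()
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)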

+2

Out of interest, what makes you think an RDBMS would be limited to only thousands of rows? There is no reason why that would be so.

In addition, at least some databases (Oracle, for example) allow direct access to portions of LOB data without loading the full LOB, as long as you know the offset and length you want. So you could have a table with some searchable metadata, plus a LOB column and, if necessary, an additional metadata table describing the contents of the large object, so that you end up with a kind of keyword → (offset, length) mapping for partial loading of the large objects.
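To illustrate, with Oracle a partial fetch could look roughly like this from client code (the table and column names are invented, and I'm assuming the cx_Oracle driver; check the docs for whatever stack you actually use):

    import cx_Oracle

    conn = cx_Oracle.connect("user/password@host/service")
    cur = conn.cursor()

    def read_samples(record_id, first, count):
        """Fetch `count` bytes starting at 0-based sample `first` of one record,
        without pulling the whole BLOB across the wire."""
        cur.execute("SELECT samples FROM recordings WHERE record_id = :rid",
                    rid=record_id)
        (lob,) = cur.fetchone()
        # Oracle LOB offsets are 1-based.
        return lob.read(first + 1, count)

    # e.g. chunk = read_samples(42, 1_000, 1_000)   # samples 1,000-1,999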

Departing a bit from another post here: incremental backups (which you may well want) are not really practical with databases in this scenario (well, possible, but at least in my experience they tend to come with a hefty price tag).

+1

How big is each sample, and how big is each record? Are you saying that each record is 2,000,000 samples, or that each file is? (It can be read either way.)

If it is 2 million samples in 200 GB, then each sample is ~100 KB and each record is ~2 MB (so, to get 100,000 records per file, that's 20 samples per record)?

Those seem like very reasonable sizes to fit in a row in the database rather than in a file on disk.

As for loading only a certain range into memory: if you index the sample identifiers, you can very cheaply query just the subset you need and load only that range into memory from the database query result.
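As a sketch of that idea (SQLite syntax and invented names; any RDBMS with an index on the sample id behaves the same way):

    import sqlite3

    db = sqlite3.connect("samples.db")
    db.executescript("""
        CREATE TABLE IF NOT EXISTS samples (
            record_id INTEGER,
            sample_id INTEGER,
            data      BLOB);
        CREATE INDEX IF NOT EXISTS idx_sample_id ON samples (sample_id);
    """)

    # The index on sample_id means only the requested range is read back.
    rows = db.execute(
        "SELECT record_id, sample_id, data FROM samples "
        "WHERE sample_id BETWEEN ? AND ?",
        (1_000, 2_000),
    ).fetchall()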

0

I think Microsoft SQL Server does what you need with the varbinary(MAX) field type when used in conjunction with FILESTREAM storage.

Read up on TechNet for more depth: http://technet.microsoft.com/en-us/library/bb933993.aspx

Basically, you enter any descriptive fields into your database as normal, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
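For example, a partial read of such a column might look like this from client code (pyodbc, and the table/column names are placeholders, not anything from the linked article):

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
        "DATABASE=mydb;Trusted_Connection=yes")
    cur = conn.cursor()

    def read_samples(record_id, first, count):
        """Fetch `count` bytes starting at 0-based sample `first` from the
        varbinary(MAX) column; SUBSTRING offsets are 1-based in T-SQL."""
        cur.execute(
            "SELECT SUBSTRING(samples, ?, ?) FROM recordings WHERE record_id = ?",
            first + 1, count, record_id)
        return cur.fetchone()[0]

    # e.g. chunk = read_samples(42, 1_000, 1_000)   # samples 1,000-1,999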

I hope this helps - I know it opened up all kinds of possibilities for me. ;-)

0

Source: https://habr.com/ru/post/1388488/
