C++ / Quick random access reads in a large file

I have large files, each containing a small number of large data sets. Each data set starts with its name and its size in bytes, which lets you skip ahead to the next data set.

I want to quickly build an index of the data set names. An example file is about 21 MB and contains 88 data sets. Reading the 88 names using std::ifstream and seekg() to skip between data sets takes about 1300 ms, which I would like to reduce.

So reading 88 fragments of about 30 bytes each, at known positions in a 21 MB file, takes 1300 ms.
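For reference, a minimal sketch of that scan. The record layout here (a length-prefixed name followed by a 4-byte payload size) is an assumption for illustration; the real file format may differ.

```cpp
#include <cstdint>
#include <fstream>
#include <map>
#include <string>

// Hypothetical record layout: [uint32 nameLen][name bytes][uint32 dataSize][data bytes].
// Scans the whole file once, recording the offset of each data set by name.
std::map<std::string, std::streamoff> buildIndex(const std::string& filename) {
    std::map<std::string, std::streamoff> index;
    std::ifstream in(filename, std::ios::binary);
    while (in) {
        std::streamoff pos = in.tellg();
        std::uint32_t nameLen = 0;
        if (!in.read(reinterpret_cast<char*>(&nameLen), sizeof nameLen)) break;
        std::string name(nameLen, '\0');
        in.read(&name[0], nameLen);
        std::uint32_t dataSize = 0;
        in.read(reinterpret_cast<char*>(&dataSize), sizeof dataSize);
        index[name] = pos;
        in.seekg(dataSize, std::ios::cur); // skip the payload, jump to next record
    }
    return index;
}
```

Each iteration costs one or two small reads plus a seek, which is why 88 of them over 21 MB can add up on a cold file cache.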

Is there a way to improve this, or is it a limitation of the OS and file system? I am running a test under the 64-bit version of Windows 7.

I know that having a full index at the beginning of a file would be better, but the file format does not have this, and we cannot change it.

+4
3 answers

You can scan the file once and build your own index of names and offsets in a separate file. Depending on your use case, you can do this at program startup, or whenever the data file changes. Before accessing the big data, a lookup in the smaller file gives you the offset you need.
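A sketch of that idea, assuming the index maps data set names to file offsets. The function names, the side-file extension, and the space-separated text format are all arbitrary choices, not part of the original answer.

```cpp
#include <fstream>
#include <map>
#include <string>

// Persist a name -> offset index to a small side file,
// one "offset name" pair per line.
void saveIndex(const std::map<std::string, std::streamoff>& index,
               const std::string& idxFile) {
    std::ofstream out(idxFile);
    for (const auto& e : index)
        out << e.second << ' ' << e.first << '\n';
}

// Load it back on later runs; reading this tiny file is far cheaper
// than seeking through the 21 MB data file again.
std::map<std::string, std::streamoff> loadIndex(const std::string& idxFile) {
    std::map<std::string, std::streamoff> index;
    std::streamoff off;
    std::string name;
    std::ifstream in(idxFile);
    while (in >> off >> name)
        index[name] = off;
    return index;
}
```

Note that a name containing whitespace would break this text format; a length-prefixed binary layout avoids that. You would also want to invalidate the side file when the data file's timestamp changes.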

+2

You can use a memory-mapped file interface (I recommend the Boost implementation).


+5

Another option is to read the file in large buffered chunks and fill batches of data sets on a worker thread while the main thread keeps reading. A batch could look something like this:

struct Batch {
    std::string name; // Name of Dataset
    unsigned size;    // Size of Dataset
    unsigned indexOffset;  // Index to next read location
    bool empty = true;     // Flag to tell if this batch is full or empty
    std::vector<DataType> dataset; // Container of data; DataType is a placeholder for whatever the file stores
}; 

std::vector<Batch> finishedBatches;

// Independent of the data set size; this just controls how much of the file is read at a time
const unsigned bufferSize = 4u * 1024u * 1024u; // set to your preference, e.g. 1 MB - 4 MB

void loadDataFromFile( const std::string& filename, unsigned bufferSize, std::vector<Batch>& batches ) {

    // Set the ifstream buffer size.

    // Open the file for reading and read up to bufferSize bytes.

    // Spawn a separate thread to populate a batch; while that batch is being
    // filled, read the next buffer's worth of data. You will need a couple of
    // local batches to work with, so that when one batch is complete and you
    // have reached the next index location in the file, you can start filling
    // another batch.

    // When a single batch is complete, push it into the vector that stores
    // finished batches. Change its flag and clear its vector, and you can then
    // reuse that empty batch.

    // Continue until you reach end of file.

}

This is a two-thread system: the main thread opens, reads from, and seeks in the file, while a worker thread fills the batches, pushes finished batches into the container, and reuses emptied ones for the next lot.
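Independent of the threading, simply giving the stream a larger buffer can reduce the number of OS read calls. A sketch, where openBuffered is a hypothetical helper; note that pubsetbuf must be called before open() for the buffer to take effect on common implementations.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Open a file with a caller-supplied stream buffer (e.g. 1-4 MB), so each
// underlying OS read fetches a large chunk instead of the small default.
// The buffer must outlive the returned stream.
std::ifstream openBuffered(const std::string& path, std::vector<char>& buf) {
    std::ifstream in;
    // Must happen before open() on common standard-library implementations.
    in.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    in.open(path, std::ios::binary);
    return in;
}
```

Whether this helps depends on the implementation: pubsetbuf behavior is implementation-defined, so measure on the actual target before relying on it.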

0

Source: https://habr.com/ru/post/1664489/

