A quick way to read the nth line from a file

Introduction

I have a C++ process called MyProcess, which I call nbLines times, where nbLines is the number of lines of a large file called InputDataFile.txt in which the input data are to be found. For example, the call

 ./MyProcess InputDataFile.txt 142 

tells MyProcess that the input data are to be found on line 142 of the InputDataFile.txt file.

Question

The problem is that InputDataFile.txt is so large (~150 GB) that the time spent searching for the correct line is not negligible. Inspired by this post, here is my (possibly not optimal) code:

    int line = 142;
    int N = line - 1;
    std::ifstream inputDataFile(filename.c_str());
    std::string inputData;
    for (int i = 0; i < N; ++i)
        std::getline(inputDataFile, inputData);
    std::getline(inputDataFile, inputData);

Goal

My goal is to speed up the search for inputData for MyProcess .

Possible Solution

It would be convenient to map the line number to the index of the first character of each line in bash. Thus, instead of passing 142 to MyProcess, I could directly pass the index of the first character of interest. MyProcess could then jump straight to this position without searching for and counting "\n" characters, and read the data until the next "\n" is encountered. Is something like this possible? How can it be implemented?
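A minimal sketch of the "jump straight to the position" part, assuming the byte offset of the first character of the line is already known and counted from 0. The helper name `readLineAt` is hypothetical and not part of the original code:

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the line that starts at byte `offset` of `path`.
// Assumes `offset` is the 0-based byte position of the first character of the line.
std::string readLineAt(const std::string& path, std::uint64_t offset) {
    // Binary mode, so that offsets are raw byte positions with no newline translation.
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // Jump directly to the start of the line; nothing before it is read.
    in.seekg(static_cast<std::streamoff>(offset), std::ios::beg);
    std::string line;
    // getline reads up to the next '\n' and drops the '\n' itself.
    if (!std::getline(in, line)) {
        throw std::runtime_error("offset is past the end of " + path);
    }
    return line;
}
```

With this, MyProcess would call something like `readLineAt(argv[1], std::stoull(argv[2]))`; the cost of the call no longer depends on how far into the file the line is.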

Of course, I welcome any other solution that will reduce the overall computational time for importing this input.

3 answers

As stated in other answers, it might be a good idea to build a map of the file. The way I would do it (in pseudocode) is:

    let offset be an unsigned 64-bit int = 0
    for each line in the file:
        read the line
        write offset to a binary file (as 8 bytes rather than as chars)
        offset += length of line in bytes (including its newline)

Now you have a "map" file, which is a list of 64-bit ints (one for each line in the file). To read the map, you simply calculate where the entry for the desired line is located in the map:

    offset  = desired_line_number * 8        // where line number starts at 0
    offset2 = (desired_line_number + 1) * 8
    data_position1 = load bytes [offset through offset + 7] as a 64-bit int from map
    data_position2 = load bytes [offset2 through offset2 + 7] as a 64-bit int from map
    data = load bytes [data_position1 through data_position2 - 1] as a string from data

The idea is that you read the data file once, record the byte offset at which each line begins, and store those offsets in sequence in a binary file using an integer type of a fixed size. The map file will then have a size of number_of_lines * sizeof(integer_type_used) (plus one final entry if you also record where the last line ends). After that you only need to seek into the map file: calculate the position where the offset for your line number is stored, then read that offset as well as the offset of the next line. These two numbers give you the byte range in which your data is located.

Example:

Data:

    hello\n
    world\n
    (\n is the newline at the end of each line)

Create a map.

Map: each grouping [number] will represent a length of 8 bytes in a file

    [0][6][12]
    // or in binary
    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110
    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001100

Now say I need line 2 (counting lines from 1 here):

    line offset = (2 - 1) * 8 // offset is 8

Since offsets count from 0, offset 8 is the 9th byte in the map file. Thus, the number consists of bytes 9-16 (counting from 1; 8-15 counting from 0), which are:

    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110 // or as decimal 6

So now we know that line 2 starts at offset 6 in our data file (counting from 0; it is the 7th byte of the file).

Then we perform the same process to get the starting offset of the next line, which is 12. In this example that last entry is simply the end of the file.

Finally, we read the byte range 6-11 (counting from 0; 7-12 counting from 1), save it as a string, and get world\n .

C++ implementation:

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    int main(int argc, const char * argv[]) {
        const std::string filename = "path/to/input.txt";
        const std::string mapFilename = "path/to/map/file.bin";

        std::ifstream inputFile(filename.c_str(), std::ios::binary);
        std::ofstream outfile(mapFilename.c_str(), std::ios::binary | std::ios::trunc);
        if (!inputFile.is_open() || !outfile.is_open()) {
            // use better error handling than this
            throw std::runtime_error("Error opening files");
        }

        std::string inputData;
        std::uint64_t offset = 0; // fixed 8-byte type, as described above
        while (std::getline(inputFile, inputData)) {
            // write the start offset of this line as 8 raw bytes
            outfile.write(reinterpret_cast<const char*>(&offset), sizeof(offset));
            // move to the start of the next line: getline strips the '\n', so add one for it
            // (assumes '\n' line endings and a newline at the end of the file)
            offset += inputData.length() + 1;
        }
        // final entry: the offset just past the last line, so its end can be looked up too
        outfile.write(reinterpret_cast<const char*>(&offset), sizeof(offset));
        outfile.close();

        // from here on we are reading the map
        offset = 0;
        std::ifstream inmap(mapFilename.c_str(), std::ios::binary);
        std::size_t line = 2; // your chosen line number (counted from 1)
        std::size_t idx = (line - 1) * sizeof(offset); // position of that line's entry in the map
        // seek into the map
        inmap.seekg(idx);
        // read the binary value at that location
        inmap.read(reinterpret_cast<char*>(&offset), sizeof(offset));
        std::cout << offset << std::endl;
        // from here you just need to look up the line in the data file in the same manner
        return 0;
    }
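The remaining lookup step could look roughly like the sketch below. It assumes a map file in which entry i (counted from 0) holds the 0-based byte offset where line i + 1 starts, stored as raw 64-bit integers in the machine's native byte order, plus one final entry holding the offset just past the last line. The names `readMapEntry` and `readLineViaMap` are hypothetical:

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads 64-bit entry number `index` (counted from 0) from the map.
static std::uint64_t readMapEntry(std::ifstream& map, std::uint64_t index) {
    std::uint64_t value = 0;
    map.seekg(static_cast<std::streamoff>(index * sizeof(value)), std::ios::beg);
    if (!map.read(reinterpret_cast<char*>(&value), sizeof(value))) {
        throw std::runtime_error("line number is outside the map");
    }
    return value;
}

// Returns line `line` (counted from 1) of `dataPath`, without its trailing '\n'.
// Assumes the map layout described in the lead-in, including the final end-of-file entry.
std::string readLineViaMap(const std::string& dataPath,
                           const std::string& mapPath,
                           std::uint64_t line) {
    if (line == 0) {
        throw std::invalid_argument("line numbers start at 1");
    }
    std::ifstream map(mapPath, std::ios::binary);
    std::ifstream data(dataPath, std::ios::binary);
    if (!map || !data) {
        throw std::runtime_error("Error opening files");
    }
    const std::uint64_t begin = readMapEntry(map, line - 1); // start of the wanted line
    const std::uint64_t end = readMapEntry(map, line);       // start of the next line (or end of file)
    if (end <= begin) {
        throw std::runtime_error("map entries are not increasing");
    }
    // Read exactly the bytes in [begin, end) from the data file.
    std::string text(static_cast<std::size_t>(end - begin), '\0');
    data.seekg(static_cast<std::streamoff>(begin), std::ios::beg);
    if (!data.read(&text[0], static_cast<std::streamsize>(text.size()))) {
        throw std::runtime_error("data file is shorter than the map says");
    }
    if (text.back() == '\n') {
        text.pop_back(); // drop the newline that terminates the line
    }
    return text;
}
```

Each lookup then costs two small reads from the map and one read from the data file, regardless of the line number. The map has to be rebuilt whenever the data file changes.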

There is no "quick" method for reading the Nth text line of a file.

Text files contain variable-length records. Each record ends with a newline. By its nature, the text must be read until a newline is found. A line can be 1 character long or 245 characters long; there is no standard size.

It is common practice to read each line and ignore the line until you reach the desired line.
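As a sketch of that read-and-skip approach, the lines before the wanted one can be discarded with `std::istream::ignore` instead of being copied into a string. The helper name `readNthLine` is hypothetical, and the file is still read from the beginning, so this only trims per-line overhead; any gain should be measured:

```cpp
#include <cstddef>
#include <fstream>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns line `line` (counted from 1) by skipping the lines before it.
std::string readNthLine(const std::string& path, std::size_t line) {
    if (line == 0) {
        throw std::invalid_argument("line numbers start at 1");
    }
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    for (std::size_t i = 1; i < line; ++i) {
        // Discard everything up to and including the next '\n' without storing it.
        in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    std::string result;
    if (!std::getline(in, result)) {
        throw std::runtime_error("file has fewer lines than requested");
    }
    return result;
}
```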

If you often need to go to a specific line in a file, you can save a map of line numbers and their file locations.

Otherwise, you can try reading chunks or blocks into a buffer and scanning the buffer for newlines. This may speed up your program, but you will have to handle a text line that crosses a buffer boundary. Remember that input is most efficient when it is kept streaming (think of a river of data).
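A rough sketch of the block-scanning idea, including the case where the wanted line straddles two blocks. The helper name `readNthLineBuffered` and the 1 MiB default block size are assumptions made for illustration; since `std::ifstream` already buffers internally, whether this beats `std::getline` depends on the platform and should be measured:

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: finds line `line` (counted from 1) by scanning fixed-size blocks.
std::string readNthLineBuffered(const std::string& path, std::size_t line,
                                std::size_t bufferSize = 1 << 20) {
    if (line == 0 || bufferSize == 0) {
        throw std::invalid_argument("line and bufferSize must be positive");
    }
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<char> buffer(bufferSize);
    std::size_t current = 1; // number of the line the scan position is currently in
    std::string result;      // collects the wanted line, possibly across several blocks

    // The last block is usually short: read() then fails, but gcount() is still positive.
    while (in.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) || in.gcount() > 0) {
        const char* p = buffer.data();
        const char* end = p + in.gcount();
        while (p < end) {
            const char* nl = static_cast<const char*>(
                std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
            if (current == line) {
                // Inside the wanted line: keep bytes up to the newline or the end of the block.
                const char* stop = nl ? nl : end;
                result.append(p, static_cast<std::size_t>(stop - p));
                if (nl) {
                    return result; // the line is complete
                }
            }
            if (!nl) {
                break; // the current line continues in the next block
            }
            ++current;
            p = nl + 1;
        }
    }
    // End of file: accept a last line that has no trailing newline.
    if (current == line && !result.empty()) {
        return result;
    }
    throw std::runtime_error("file has fewer lines than requested");
}
```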


Since the question is tagged bash, here is a simple function using sed.

Definition:

 getline() { sed "${2}q;d" "$1"; } 

Usage:

 getline InputData.txt 142 

Source: https://habr.com/ru/post/1265884/

