C++ Text File Reading Performance

I am trying to port a C# program to C++. The C# program reads a text file of 1 to 5 GB line by line and performs some analysis on each line. The C# code is as follows.

    using (var f = File.OpenRead(fname))
    using (var reader = new StreamReader(f))
        while (!reader.EndOfStream) {
            var line = reader.ReadLine();
            // do some analysis
        }

For a 1.6 GB file with 7 million lines, this code takes about 18 seconds.

The first C++ code I wrote for the port is shown below.

    ifstream f(fname);
    string line;
    while (getline(f, line)) {
        // do some analysis
    }

This C++ code takes about 420 seconds. The second C++ version I wrote is as follows.

    ifstream f(fname);
    char line[2000];
    while (f.getline(line, 2000)) {
        // do some analysis
    }

The C++ code above takes about 85 seconds.

The last version I tried is plain C, as shown below.

    FILE *file = fopen(fname, "r");
    char line[2000];
    while (fgets(line, 2000, file) != NULL) {
        // do some analysis
    }
    fclose(file);

The C code above takes about 33 seconds.

Both of the last two versions read each line into a char[] instead of a string, and they need about 30 more seconds to convert the char[] to a string.

Is there a way to improve the performance of C/C++ code that reads a text file line by line so that it matches the C# performance? (Added: I am using 64-bit Windows 7 with VC++ 10.0, x64.)

+6
3 answers

One of the best ways to improve file reading performance is to use memory-mapped files (mmap() on Unix, CreateFileMapping() and related calls on Windows). The file then appears in memory as one flat range of bytes, and you can read it much faster than with buffered I/O.

For a file larger than a gigabyte or so, you will want to use a 64-bit OS (with a 64-bit process). I did this to process a 30 gigabyte file in Python with excellent results.

+9

I suggest two things:

Use f.rdbuf()->pubsetbuf(...) to set a larger read buffer. I noticed some significant fstream performance improvements when using large buffer sizes.

Instead of getline(...) use read(...) to read large blocks of data and parse them manually.
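Both suggestions can be combined in one rough sketch. The helper name for_each_line_blocks, the callback signature, and the 1 MB default block size are assumptions for illustration, not measured values. Note that whether pubsetbuf() is honored is implementation-defined, and on some standard libraries it only takes effect when called before the file is opened, which is why the sketch calls it before open().

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the file in large blocks with read() and splits
// lines manually. Calls on_line(const char* begin, std::size_t len) per line.
template <typename Fn>
std::size_t for_each_line_blocks(const char* path, Fn on_line,
                                 std::size_t block_size = 1 << 20) {
    std::vector<char> stream_buf(block_size);  // must outlive the stream
    std::ifstream f;
    // Request a larger stream buffer before opening the file.
    f.rdbuf()->pubsetbuf(stream_buf.data(),
                         static_cast<std::streamsize>(stream_buf.size()));
    f.open(path, std::ios::binary);
    if (!f) return 0;

    std::vector<char> block(block_size);
    std::string carry;  // partial line left over from the previous block
    std::size_t lines = 0;
    while (true) {
        f.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::streamsize got = f.gcount();
        if (got <= 0) break;

        const char* cur = block.data();
        const char* end = cur + got;
        while (cur < end) {
            const char* nl = static_cast<const char*>(
                std::memchr(cur, '\n', static_cast<std::size_t>(end - cur)));
            if (!nl) {  // line continues in the next block
                carry.append(cur, end);
                break;
            }
            if (carry.empty()) {
                on_line(cur, static_cast<std::size_t>(nl - cur));
            } else {  // finish a line that started in an earlier block
                carry.append(cur, nl);
                on_line(carry.data(), carry.size());
                carry.clear();
            }
            ++lines;
            cur = nl + 1;
        }
    }
    if (!carry.empty()) {  // last line without a trailing newline
        on_line(carry.data(), carry.size());
        ++lines;
    }
    return lines;
}
```

The file is opened in binary mode, so on Windows a line that ends in "\r\n" is passed to the callback with the trailing '\r' still attached; the analysis code would need to strip it.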

0

Compile with optimizations (a Release build, for example /O2 in Visual C++). C++ has quite a lot of theoretical overhead that the optimizer will remove. For instance, many simple string methods will be inlined. This is probably why your char[2000] version is faster.

0

Source: https://habr.com/ru/post/895652/

