Why is this C code faster than this C++ code? (finding the longest line in a file)

I have two versions of a program that do basically the same thing: find the length of the longest line in a file. The file has about 8,000 lines. My C code is a little more primitive (of course!) than my C++ code. The C program takes about 2 seconds to run, while the C++ program takes 10 seconds (testing with the same file in both cases). But why? I expected them to take about the same time, or the C++ one to be a little slower, but not 8 seconds slower!

My code in C:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #if _DEBUG
    #define DEBUG_PATH "../Debug/"
    #else
    #define DEBUG_PATH ""
    #endif

    const char FILE_NAME[] = DEBUG_PATH "data.noun";

    int main()
    {
        int sPos = 0;
        int maxCount = 0;
        int cPos = 0;
        int ch;
        FILE *in_file;

        in_file = fopen(FILE_NAME, "r");
        if (in_file == NULL)
        {
            printf("Cannot open %s\n", FILE_NAME);
            exit(8);
        }

        while (1)
        {
            ch = fgetc(in_file);
            if (ch == 0x0A || ch == EOF) /* '\n' or end of file */
            {
                if ((cPos - sPos) > maxCount)
                    maxCount = (cPos - sPos);
                if (ch == EOF)
                    break;
                sPos = cPos;
            }
            else
                cPos++;
        }

        fclose(in_file);

        printf("Max line length: %i\n", maxCount);
        getchar(); /* was getch(), which is non-standard and needs <conio.h> */
        return 0;
    }

My code in C++:

    #include <iostream>
    #include <fstream>
    #include <stdio.h>
    #include <string>
    using namespace std;

    #ifdef _DEBUG
    #define FILE_PATH "../Debug/data.noun"
    #else
    #define FILE_PATH "data.noun"
    #endif

    int main()
    {
        string fileName = FILE_PATH;
        string s = "";
        ifstream file;
        int size = 0;

        file.open(fileName.c_str());
        if (!file)
        {
            printf("could not open file!");
            return 0;
        }

        while (getline(file, s))
            size = (s.length() > size) ? s.length() : size;

        file.close();

        printf("biggest line in file: %i", size);
        getchar();
        return 0;
    }
+29
c++ performance c count lines
Jan 13 '12
8 answers

The C++ version constantly allocates and frees std::string instances. Memory allocation is an expensive operation, and on top of that the constructors and destructors run for every line.

The C version, by contrast, uses a constant amount of memory and does only what is strictly necessary: it reads single characters and, at each newline, updates the line-length counter if the current line is longer. No per-line allocation happens at all.

+14
Jan 13 '12

I assume this is a problem with the compiler options you are using, with the compiler itself, or with the file system. I just compiled both versions (with optimization on) and ran them against a text file of 92,000 lines:

    c++ version: 113 ms
    c version:   179 ms

And I suspect the reason the C++ version was faster for me is that fgetc is most likely the slower approach. fgetc does use buffered I/O, but it still pays for a function call per character. I have tested this before: fgetc is not as fast as reading an entire line in one call (for example, with fgets).

+75
Jan 13 '12

So, in a few comments I echoed other people's answers saying that the problem is most likely the extra copy made by your C++ version, where it copies the lines into memory as strings. But I wanted to test that.

First I implemented the fgetc and getline versions and timed them. I confirmed that in debug mode the getline version is slower: about 130 μs versus 60 μs for the fgetc version. This is unsurprising given the conventional wisdom that iostreams are slower than stdio. However, in my past experience iostreams speed up significantly with optimization, and that was confirmed when I compared the release times: about 20 μs with getline and 48 μs with fgetc.

The fact that getline with iostreams is faster than fgetc, at least in release mode, contradicts the reasoning that copying all that data must be slower than not copying it, so I'm not sure exactly what the optimizer manages to avoid. I didn't go looking for an explanation, but it would be interesting to understand what is being optimized away. edit: when I looked at the programs with a profiler it was hard to compare their performance, because the different methods looked so different from each other.

Anyway, I wanted to see if I could get a faster version by avoiding the copying, using the get() method on the fstream object and doing exactly what the C version does. When I did, I was quite surprised to find that fstream::get() was slower than both the fgetc and getline methods, in both debug and release: about 230 μs in debug and 80 μs in release.

To narrow down the slowdown I made another version, this time using the streambuf attached to the fstream object and its snextc() method. This version is by far the fastest: 25 μs in debug and 6 μs in release.

My guess as to what makes the fstream::get() method so slow is that it constructs a sentry object for every call. Though I haven't tested this, I can't see that get() does much more than fetch the next character from the streambuf, apart from those sentry objects.

Anyway, the moral of this story is that if you want fast I/O, the high-level iostream functions may well beat stdio, but for really fast input go down to the underlying streambuf. edit: actually, this moral may only apply to MSVC; see the update below for results from a different toolchain.

For reference:

I used VS2010 and chrono from boost 1.47 for timing. I built 32-bit binaries (boost chrono seemed to require that, since it could not find a 64-bit version of the library). I did not tweak the compilation options, but they may not be completely standard either, since I did this in a scratch project I had lying around.

The file I tested with was a 1.1 MB, 20,000-line plain-text version of Oeuvres Complètes de Frédéric Bastiat, vol. 1, from Project Gutenberg: http://www.gutenberg.org/ebooks/35390

Release times:

    fgetc time is:   48150 microseconds
    snextc time is:   6019 microseconds
    get time is:     79600 microseconds
    getline time is: 19881 microseconds

Debug times:

    fgetc time is:    59593 microseconds
    snextc time is:   24915 microseconds
    get time is:     228643 microseconds
    getline time is: 130807 microseconds

Here is my fgetc() version:

    {
        auto begin = boost::chrono::high_resolution_clock::now();

        FILE *cin = fopen("D:/bames/automata/pg35390.txt", "rb");
        assert(cin);

        unsigned maxLength = 0;
        unsigned i = 0;
        int ch;
        while (1) {
            ch = fgetc(cin);
            if (ch == 0x0A || ch == EOF) {
                maxLength = std::max(i, maxLength);
                i = 0;
                if (ch == EOF)
                    break;
            } else {
                ++i;
            }
        }
        fclose(cin);

        auto end = boost::chrono::high_resolution_clock::now();
        std::cout << "max line is: " << maxLength << '\n';
        std::cout << "fgetc time is: "
                  << boost::chrono::duration_cast<boost::chrono::microseconds>(end - begin)
                  << '\n';
    }

Here is my getline() version:

    {
        auto begin = boost::chrono::high_resolution_clock::now();

        std::ifstream fin("D:/bames/automata/pg35390.txt", std::ios::binary);

        std::string::size_type maxLength = 0; // size_type avoids a signed/unsigned
                                              // mismatch with line.size() on 64-bit
        std::string line;
        while (std::getline(fin, line)) {
            maxLength = std::max(line.size(), maxLength);
        }

        auto end = boost::chrono::high_resolution_clock::now();
        std::cout << "max line is: " << maxLength << '\n';
        std::cout << "getline time is: "
                  << boost::chrono::duration_cast<boost::chrono::microseconds>(end - begin)
                  << '\n';
    }

The fstream::get() version:

    {
        auto begin = boost::chrono::high_resolution_clock::now();

        std::ifstream fin("D:/bames/automata/pg35390.txt", std::ios::binary);

        unsigned maxLength = 0;
        unsigned i = 0;
        while (1) {
            int ch = fin.get();
            if ((fin.good() && ch == 0x0A) || fin.eof()) {
                maxLength = std::max(i, maxLength);
                i = 0;
                if (fin.eof())
                    break;
            } else {
                ++i;
            }
        }

        auto end = boost::chrono::high_resolution_clock::now();
        std::cout << "max line is: " << maxLength << '\n';
        std::cout << "get time is: "
                  << boost::chrono::duration_cast<boost::chrono::microseconds>(end - begin)
                  << '\n';
    }

and the snextc() version:

    {
        auto begin = boost::chrono::high_resolution_clock::now();

        std::ifstream fin("D:/bames/automata/pg35390.txt", std::ios::binary);
        std::filebuf &buf = *fin.rdbuf();

        unsigned maxLength = 0;
        unsigned i = 0;
        while (1) {
            int ch = buf.snextc();
            if (ch == 0x0A || ch == std::char_traits<char>::eof()) {
                maxLength = std::max(i, maxLength);
                i = 0;
                if (ch == std::char_traits<char>::eof())
                    break;
            } else {
                ++i;
            }
        }

        auto end = boost::chrono::high_resolution_clock::now();
        std::cout << "max line is: " << maxLength << '\n';
        std::cout << "snextc time is: "
                  << boost::chrono::duration_cast<boost::chrono::microseconds>(end - begin)
                  << '\n';
    }



Update:

I repeated the tests using clang (trunk) on OS X with libc++. The iostream-based implementations kept the same relative order (with optimization enabled): fstream::get() is much slower than std::getline(), which is much slower than filebuf::snextc(). But fgetc()'s performance improved relative to the getline() implementation and became faster than it. Perhaps the copying done by getline() becomes an issue on this toolchain while it isn't with MSVC? Or perhaps Microsoft's CRT implementation of fgetc() is bad, or something like that?

Anyway, here are the times (I used a much larger file this time, 5.3 MB):

using -Os

    fgetc time is:    39004 microseconds
    snextc time is:   19374 microseconds
    get time is:     145233 microseconds
    getline time is:  67316 microseconds

using -O0

    fgetc time is:    44061 microseconds
    snextc time is:   92894 microseconds
    get time is:     184967 microseconds
    getline time is: 209529 microseconds

-O2

    fgetc time is:    39356 microseconds
    snextc time is:   21324 microseconds
    get time is:     149048 microseconds
    getline time is:  63983 microseconds

-O3

    fgetc time is:    37527 microseconds
    snextc time is:   22863 microseconds
    get time is:     145176 microseconds
    getline time is:  67899 microseconds
+29
Jan 13 '12

You are not comparing apples to apples. Your C program does not copy data from the FILE* buffer into your program's memory; it also operates on the raw file.

Your C++ program has to traverse each line several times: once in the stream code to find the end of the line it returns to you, once in the std::string constructor when the data is copied, and once in your code when you call s.length().

You could perhaps improve the performance of your C program, for example by using getc_unlocked if it is available to you. But the biggest win is not having to copy your data at all.

EDIT: edited in response to a comment from bames53

+11
Jan 13 '12

2 seconds for just 8,000 lines? I don't know how long your lines are, but chances are you are doing something very wrong.

This trivial Python program runs almost instantly with El Quijote downloaded from Project Gutenberg (40006 lines, 2.2 MB):

    import sys
    print max(len(s) for s in sys.stdin)

Time:

    ~/test$ time python maxlen.py < pg996.txt
    76

    real    0m0.034s
    user    0m0.020s
    sys     0m0.010s

You can improve your C code by buffering the input instead of reading it character by character.

As for why the C++ version is slower than the C one: it should be related to constructing the string objects and then calling the length method. In C, you just count the characters as you go.

+5
Jan 13 '12 at 15:33

I tried compiling and running your programs against 40K lines of C++ source, and both finished in about 25 ms. I can only conclude that your input file has extremely long lines, perhaps 10K-100K characters per line. In that case the C version pays no performance penalty for the long line length, while the C++ version has to keep growing the string and copying the old data into the new buffer. If it had to grow enough times, that could explain the excessive difference in performance.

The key point here is that the two programs don't do the same thing, so you can't compare their results. If you could provide the input file, we could give more information.

You could use tellg and ignore to make it faster in C ++.

+5
Jan 13 '12 at 15:52

The C++ program builds std::string objects, while the C program just reads characters and looks at them.

EDIT:

Thanks for the upvotes, but after the discussion I now think this answer is incorrect. It was a reasonable first guess, but in this case it seems the different (and very slow) run times are caused by other things.

+3
Jan 13 '12

I'm fine with the theoretical answers. But let's get empirical.

I created a text file with 13 million lines to work with:

 ~$ for i in {0..1000}; do cat /etc/* | strings; done &> huge.txt 

The original code, edited to read from stdin (which should not affect performance much), took almost 2 minutes to chew through it.

C++ code:

    #include <iostream>
    #include <stdio.h>
    #include <string>   /* for std::string and std::getline */
    using namespace std;

    int main(void)
    {
        string s = "";
        int size = 0;
        while (cin) {
            getline(cin, s);
            size = (s.length() > size) ? s.length() : size;
        }
        printf("Biggest line in file: %i\n", size);
        return 0;
    }

C++ time:

    ~$ time ./cplusplus < huge.txt

    real    1m53.122s
    user    1m29.254s
    sys     0m0.544s

The C version:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char *line = NULL;
        size_t len = 0;   /* was int, but getline() needs a size_t * */
        ssize_t read;     /* was int; getline() returns ssize_t */
        int max = 0;

        while ((read = getline(&line, &len, stdin)) != -1)
            if (max < read)
                max = read - 1;

        printf("Biggest line in file %d\n", max);
        free(line);
        return 0;
    }

C time:

    ~$ time ./ansic < huge.txt

    real    0m4.015s
    user    0m3.432s
    sys     0m0.328s

Do your own math ...

-1
Jun 27


