Recently, I was instructed to perform several speed checks, so I can tell if it is faster to use php / php-cli or C ++ to insert a certain number of rows into the database.
Before starting, let me tell you a few details so that everything is clear:
- Part of php runs through Apache, directly requested in the browser.
- Hard drive tests run on an SSD. I think that on regular disks everything will be slower. The car itself is nothing special, six years or so.
- All inserts are performed using prepared instructions. We use mysqli on php and mysqlcppconn (the mysql C ++ connector provided by Oracle).
- All inserts are performed by write input. I know that we can stack them, but well, we are testing here.
- Time is displayed through microtime in php and through the header in C ++.
- Of course, the code itself is not equivalent. More on this later.
- All text is in UTF-8. There are Russian, Chinese, Arabic, Spanish, English and all kinds of crazy things. The mysql table is in utf8_4mb.
- The numbers for C ++ code are the results of using std :: vector and -O2 compilation with g ++ (vectors outperform maps, unordered_maps and std :: arrays).
So this is the process:
- Connect to the database.
- Open a text file with N lines.
- Read the line of the file.
- Separate the string with a separator character.
- Use specific parts of the dividing line to get insertion values (for example, 0th, 1st, and 3rd indices).
- Send these parts to a trained operator to insert them.
- Repeat until the file is fully read.
Both codes work exactly as expected. Here are the resulting numbers:
php:
- 5000 records: 1.42 - 1.27 s.
- 20,000 entries: 5.53 - 6.18 s.
- 50,000 records: 14.43 - 15.69 s.
C ++:
- 5000 records: 1.78 - 1.81 s.
- 20,000 entries: 7.19 - 7.22 s.
- 50,000 records: 18.52 - 18.84 s.
php surpasses C ++ as the lines in the file increase ... At first I suspected the line splitting function: splitting in php is done with an "explosion". The algorithm is as naive as it is for C ++ ... The container is passed through the link, and its contents change on the fly. The container moves only once. I made sure that the reserve () container has all the necessary space (remember, I finally select the vectors), which is fixed. The container is created by the main function, and then passed by reference through the code. It never clears or does not change: only its contents change.
template<typename container> void explode(const std::string& p_string, const char p_delimiter, container& p_result) { auto it=p_result.begin(); std::string::const_iterator beg=p_string.begin(), end=p_string.end(); std::string temp; while(beg < end) { if( (*beg)==p_delimiter) { *(it)=temp; ++it; temp=""; } else { temp+=*beg; } ++beg; } *(it)=temp; }
As mentioned above, the task being performed is equivalent, but the code generating it is not. C ++ - the code has the usual try-catch blocks for managing mysql interactions. As for the rest, the main loop runs until EOF is reached, and each iteration checks to see if the insert has entered (in both C ++ and php).
I saw that C ++ is significantly superior to php when working with files and their contents, so I expected this to apply here as well. Somehow I suspect the separation algorithm, but maybe it's just that the database connector is slower (still, when I turned off the interaction with php databases, it is still processing faster), or my code is entity ...
As for profiling, gprof laid it out from C ++ code:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ns/call ns/call name 60.00 0.03 0.03 50000 600.00 600.00 void anc_str::explotar_cadena<std::vector<std::string, std::allocator<std::string> > >(std::string const&, char, std::vector<std::string, std::allocator<std::string> >&) 40.00 0.05 0.02 insertar(sql::PreparedStatement*, std::string const&, std::vector<std::string, std::allocator<std::string> >&) 0.00 0.05 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN7anc_str21obtener_linea_archivoERSt14basic_ifstreamIcSt11char_traitsIcEE
If "explotar_cadena" "explodes" and "insertar" - "splits this line and sets the prepared statement up". As you can see, 60% of the time is spent there (no wonder ... it works 50,000 times and does this crazy thing). "obtener_linea_archivo" is simply "please upload the next line to a line".
Without interacting with mysql (just upload the file, read the lines and separate them) I get these measurements:
Php
- 5000 records: 0.019 - 0.036 sec.
- 20,000 entries: 0.09 - 0.10 s.
- 50,000 records: 0.14 - 0.17 s.
C ++
- 5000 records: 0.07 - 0.10 seconds.
- 20,000 entries: 0.25 - 0.26 s.
- 50,000 records: 0.49 - 0.55 s.
Well, both times are good and barely noticeable for real living conditions, I am surprised ... So, the question is: should I expect this ?. Anyone with previous experience wanting to lend a hand ?.
Thanks in advance.
Edit: Here is a quick link to a stripped-down version containing input files, C ++ code and php code [ http://www.datafilehost.com/d/d31034d6 ]. Note that there is no sql interaction: only opening a file, splitting lines and measuring time. Please forgive the code with the spear, as well as comments on half of the Spanish language and variable names, as this was done in a hurry. Also, take a look at the gprof results above: I'm not an expert, but I think we are trying to find a better way to split the lines.