The fastest way to use words

What is the fastest way to head a word (std :: string) using C ++?

On Debian Linux using g ++ 4.6.3 with the -O3 flag, this function using boost::to_lower will extract 81,450,625 words in about 24 seconds in a single thread of execution on an AMD Phenom (tm) II X6 1090T (3200 MHz) processor.

 void Capitalize( std::string& word ) { boost::to_lower( word ); word[0] = toupper( word[0] ); } 

This function with std::transform does the same in about 10 seconds. I clean the virtual machine between testing, so I don't think this difference is an accident:

sync && echo 3 > /proc/sys/vm/drop_caches

 void Capitalize( std::string& word ) { std::transform(word.begin(), word.end(), word.begin(), ::tolower); word[0] = toupper( word[0] ); } 

Are there faster ways? I would not want to lose portability for speed, but if there are faster ways to do this that work in std C ++ or std C ++ with boost, I would like to try them.

Thanks.

+6
source share
6 answers

A few ways to do it faster:
1. Do not use to_lower , it is slow. Do not use if , or create a table with 256 entries that map to a lowercase and another capitalized table.
2. Do not use transform , take the pointer to the first character and loop to the zero terminator.
3. If memory is not a problem, use a table that displays 2 sequences of characters. In this case, you will need another table that handles the completion.
4. If you can do this in an assembly, it will be much faster.

+4
source

Had this exact question when it comes to DNA sequences where input is not guaranteed to be uppercase and where boost::to_upper is the neck of the bottle in code. Go to the next:

 template<typename T_it> void SequenceToUpperCase( T_it begin, T_it end ) { // Convert to upper: clear the '32' bit, 0x20 in hex. And with the // inverted bit string (~). for ( auto it = begin; it != end; ++it ) *it &= ~0x20; } 

led to a huge increase in speed. I am sure that it is possible to further optimize, for example, by reversing 8 bytes at the same time, but with the above code, the upper case is close to instant for us. For lowercase: do:

  *it |= 0x20; 
+3
source

If capitalization is really a bottleneck, then write your own capitalization implementation with a manual loop and built-in toupper / tolower functions. Use ASM if necessary.

+1
source

I have an implementation, I found it faster than std :: transform, compiled in g ++ -03 Fedora 18.

  performance time in seconds:
 transform took: 11 s
 my implementation took: 2 s
 Test data size = 26 * 15 * 9999999 chars
 inline void tolowerPtr(char *p) ; inline void tolowerStr(std::string& s) {char* c=const_cast<char*>(s.c_str()); size_t l = s.size(); for(char* c2=c;c2<c+l;c2++)tolowerPtr(c2); }; inline void tolowerPtr(char *p) { switch(*p) { case 'A':*p='a'; return; case 'B':*p='b'; return; case 'C':*p='c'; return; case 'D':*p='d'; return; case 'E':*p='e'; return; case 'F':*p='f'; return; case 'G':*p='g'; return; case 'H':*p='h'; return; case 'I':*p='i'; return; case 'J':*p='j'; return; case 'K':*p='k'; return; case 'L':*p='l'; return; case 'M':*p='m'; return; case 'N':*p='n'; return; case 'O':*p='o'; return; case 'P':*p='p'; return; case 'Q':*p='q'; return; case 'R':*p='r'; return; case 'S':*p='s'; return; case 'T':*p='t'; return; case 'U':*p='u'; return; case 'V':*p='v'; return; case 'W':*p='w'; return; case 'X':*p='x'; return; case 'Y':*p='y'; return; case 'Z':*p='z'; return; }; return ; } void testtransform( std::string& word ) { std::string word2=word; time_t t; time_t t2; time(&t); std::cout << "testtransform: start " << "\n"; int i=0; for(;i<9999999;i++) { word2=word; std::transform(word2.begin(), word2.end(), word2.begin(), ::tolower); } time(&t2); std::cout << word2 << "\n"; std::cout << "testtransform: end " << i << ":"<< t2-t << "\n"; } void testmytolower( std::string& word ) { std::string word2=word; time_t t; time_t t2; time(&t); std::cout << "testmytolower: start " << "\n"; int i=0; for(;i<9999999;i++) { word2=word; cstralgo::tolowerStr(word2); } time(&t2); std::cout << word2 << "\n"; std::cout << "testmytolower: end " << i << ":"<< t2-t << "\n"; } int main(int argc, char* argv[]) { std::string word ="ABCDEFGHIJKLMNOPQRSTUVWXYZ"; word =word+word+word+word+word+word+word+word+word+word+word+word+word+word+word; testtransform( word); testmytolower( word); return 0; } 

I will be happy to know if performance is improved.

+1
source

I would use a for loop to traverse the entire line, character by character, parse them into a function to convert to uppercase. In order to hopefully speed up the work, I will write the capitalization function in the assembly. The component of a C ++ application will be like this:

 #include <iostream> #include <string> extern "C" char asm_capt(char x); using namespace std; int main(void) { string str; cin >> str; string tmp = str; for(int i=0; i<str.length(); i++){ tmp.at(i) = asm_capt(str.at(i)); } } 

Now is the assembly; Suppose you are using Windows and the MASM compiler. The code should be saved in the source .asm file and included in the project with the MASM build settings:

 .model flat .code _asm_capt proc mov rax, rcx cmp rax, 61h jl already_capt sub rax, 20h ret already_capt: ret _asm_capt endp end 

Essentially, it checks to see if the character is smaller (in hexadecimal format) less than 0x61, which implies if you use only letters that it is already capitalized. otherwise, the value decreases by 0x20, which transfers it to lowercase. (See ASCII Table.)


Note: By default, the returned parameter is stored in the RAX register; the first parameter passed to C ++ for assembly is stored in RCX.

0
source

The required pseudocode will be:



for(int i=0 to given_str.length)
{ upper[i]=(char*)given_str[i]-32; // in ascii tbl, any lower case char - upper case char=32
}
return upper;

-1
source

Source: https://habr.com/ru/post/916282/


All Articles