Why does my state machine take so long?

I am working on a state machine that should retrieve calls to form functions

/* I am a comment */ //I am a comment pref("this.is.a.string.which\"can have QUOTES\"", 123456); 

where the extracted data will be pref("this.is.a.string.which\"can have QUOTES\"", 123456); from file. Currently, this process takes about one and a half minutes to process a 41KB file. Is there something that I seriously misunderstand here about this final machine?

 #include <boost/algorithm/string.hpp> std::vector<std::string> Foo() { std::string fileData; //Fill filedata with the contents of a file std::vector<std::string> results; std::string::iterator begin = fileData.begin(); std::string::iterator end = fileData.end(); std::string::iterator stateZeroFoundLocation = fileData.begin(); std::size_t state = 0; for(; begin < end; begin++) { switch (state) { case 0: if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) { stateZeroFoundLocation = begin; begin += 4; state = 2; } else if (*begin == '/') state = 1; break; case 1: state = 0; switch (*begin) { case '*': begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end(); break; case '/': begin = std::find(begin, end, L'\n'); } break; case 2: if (*begin == '"') state = 3; break; case 3: switch(*begin) { case '\\': state = 4; break; case '"': state = 5; } break; case 4: state = 3; break; case 5: if (*begin == ',') state = 6; break; case 6: if (*begin != ' ') state = 7; break; case 7: switch(*begin) { case '"': state = 8; break; default: state = 10; break; } break; case 8: switch(*begin) { case '\\': state = 9; break; case '"': state = 10; } break; case 9: state = 8; break; case 10: if (*begin == ')') state = 11; break; case 11: if (*begin == ';') state = 12; break; case 12: state = 0; results.push_back(std::string(stateZeroFoundLocation, begin)); }; } return results; } 

Billy3

EDIT: Well, this is one of the strangest things I've ever seen. I just rebuilt the project and it works again intelligently. Odd

+4
source share
5 answers

If your 41 kb file is mainly comments or prefixes, it will spend most of its time in state 0. And for each character in state 0 you make at least two function calls.

 if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) { 

You can speed it up by pre-testing to see if the current character is β€œp”

 if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) { 

If the character is not a "p", then there is no need to make any function calls. In particular, an iterator is not created, which is probably located where time is wasted.

+3
source

I don't know if this is part of the problem, but you have a typo in case 0, "perf" is mistakenly written as "pref".

+1
source

It's hard to say just by looking at it ... but I guess find will do it. Why are you looking for FSM? By definition, you should give them one character at a time ... Add states. Also try making a list of results, not a vector. A lot of copying happens with

 vector<string> 

But basically: Read it!

+1
source

State machines are a workable solution, but for text processing it is best to use a highly optimized state machine generator. In this case, the regular expression. Here it is like a Perl regex:

 # first clean the comments $source =~ s|//.*$||; # replace "// till end of line" with nothing $source =~ s|/\*.*?\*/||s; # replace "/* any text until */" with nothing # depending on your data, you may need a few other # rules here to avoid blanking data, you could replace # the comments with a unique identifier, and then # expand any identifiers that the regex below returns # then find your data while ($source =~ /perf\(\s*"(.+?)",\s*(\d+)\s*\);/g) { # matches your function signature and moves along source # do something with the captured groups, in this case $1 and $2 } 

Since most regular expression libraries are compatible with Perl, you should not convert the syntax. If your search becomes more complex, then the parser is fine.

0
source

If you are parsing, why not use the parser library.

Usually I am Boost.Spirit.Qi .

  • You express your grammar with expressions like EBNF, which certainly makes maintenance easier.
  • This is a header-only library, so you have no problem adding a whole binary to the mix.

Although I can appreciate the minimalist approach, I am afraid that your idea of ​​coding a finite state machine in itself is not so wise. It works well with the toy example, but as the requirements evolve, you will have one monstrous switch , and it becomes increasingly difficult to understand what is happening.

And please do not tell me that you know that this will not develop: I do not believe in oracles;)

0
source

Source: https://habr.com/ru/post/1304533/


All Articles