Why does my state machine take so long?

Question

Why does my state machine take so long?

I am working on a state machine that should retrieve calls to form functions

/* I am a comment */ //I am a comment pref("this.is.a.string.which\"can have QUOTES\"", 123456);

where the extracted data will be pref("this.is.a.string.which\"can have QUOTES\"", 123456); from file. Currently, this process takes about one and a half minutes to process a 41KB file. Is there something that I seriously misunderstand here about this final machine?

 #include <boost/algorithm/string.hpp> std::vector<std::string> Foo() { std::string fileData; //Fill filedata with the contents of a file std::vector<std::string> results; std::string::iterator begin = fileData.begin(); std::string::iterator end = fileData.end(); std::string::iterator stateZeroFoundLocation = fileData.begin(); std::size_t state = 0; for(; begin < end; begin++) { switch (state) { case 0: if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) { stateZeroFoundLocation = begin; begin += 4; state = 2; } else if (*begin == '/') state = 1; break; case 1: state = 0; switch (*begin) { case '*': begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end(); break; case '/': begin = std::find(begin, end, L'\n'); } break; case 2: if (*begin == '"') state = 3; break; case 3: switch(*begin) { case '\\': state = 4; break; case '"': state = 5; } break; case 4: state = 3; break; case 5: if (*begin == ',') state = 6; break; case 6: if (*begin != ' ') state = 7; break; case 7: switch(*begin) { case '"': state = 8; break; default: state = 10; break; } break; case 8: switch(*begin) { case '\\': state = 9; break; case '"': state = 10; } break; case 9: state = 8; break; case 10: if (*begin == ')') state = 11; break; case 11: if (*begin == ';') state = 12; break; case 12: state = 0; results.push_back(std::string(stateZeroFoundLocation, begin)); }; } return results; }

Billy3

EDIT: Well, this is one of the strangest things I've ever seen. I just rebuilt the project and it works again intelligently. Odd

+4

c ++ performance finite-automata

Billy oneal Mar 19 '10 at 2:00

source share

5 answers

I don't know if this is part of the problem, but you have a typo in case 0, "perf" is mistakenly written as "pref".

+1

Ferruccio Mar 19 '10 at 2:03

source share

It's hard to say just by looking at it ... but I guess find will do it. Why are you looking for FSM? By definition, you should give them one character at a time ... Add states. Also try making a list of results, not a vector. A lot of copying happens with

 vector<string>

But basically: Read it!

+1

Chris h Mar 19 '10 at 2:07

source share

State machines are a workable solution, but for text processing it is best to use a highly optimized state machine generator. In this case, the regular expression. Here it is like a Perl regex:

 # first clean the comments $source =~ s|//.*$||; # replace "// till end of line" with nothing $source =~ s|/\*.*?\*/||s; # replace "/* any text until */" with nothing # depending on your data, you may need a few other # rules here to avoid blanking data, you could replace # the comments with a unique identifier, and then # expand any identifiers that the regex below returns # then find your data while ($source =~ /perf\(\s*"(.+?)",\s*(\d+)\s*\);/g) { # matches your function signature and moves along source # do something with the captured groups, in this case $1 and $2 }

Since most regular expression libraries are compatible with Perl, you should not convert the syntax. If your search becomes more complex, then the parser is fine.

0

Eric Strom Mar 19 '10 at 2:19

source share

If you are parsing, why not use the parser library.

Usually I am Boost.Spirit.Qi .

You express your grammar with expressions like EBNF, which certainly makes maintenance easier.
This is a header-only library, so you have no problem adding a whole binary to the mix.

Although I can appreciate the minimalist approach, I am afraid that your idea of coding a finite state machine in itself is not so wise. It works well with the toy example, but as the requirements evolve, you will have one monstrous switch , and it becomes increasingly difficult to understand what is happening.

And please do not tell me that you know that this will not develop: I do not believe in oracles;)

0

Matthieu M. Mar 19 '10 at 7:56

source share

John knoeller · Accepted Answer · 2010-03-19T02:30:22+0000

If your 41 kb file is mainly comments or prefixes, it will spend most of its time in state 0. And for each character in state 0 you make at least two function calls.

 if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

You can speed it up by pre-testing to see if the current character is “p”

 if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

If the character is not a "p", then there is no need to make any function calls. In particular, an iterator is not created, which is probably located where time is wasted.

Why does my state machine take so long?

More articles: