Edit: Re Update 2
Here is a really simple explanation that I just came up with for the problem I'm trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
This starts to look like a job for:
- Tokenizing the "garbage text / names": you could build a symbol table on the fly and use it to match patterns (Spirit Lex and Qi's symbol table, qi::symbols, could make this easier, but I feel you could write it in any number of ways).
- Conversely, using regular expressions, as suggested earlier (below, and at least twice on the mailing list).
Here is a simple idea:
(\d+) ([a-z]+).*?(\d+) \2
- (\d+) matches a sequence of digits and captures it as a subexpression (NUM1)
- ([a-z]+) matches the name (I just chose a simple definition of "name")
- .*? skips garbage of any length, but as little as possible before the rest of the pattern can match
- (\d+) matches another number (a sequence of digits) (NUM2)
- \2 requires that number to be followed by the same name (a backreference)
You can see how this narrows the list of matches down to "potential" hits. You would then only need to post-validate that NUM2 == NUM1 + 2.
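The match-then-post-validate idea can be sketched with std::regex, whose default ECMAScript grammar supports both the lazy .*? and the \2 backreference. This is only a minimal sketch: the Hit struct and find_hits function are hypothetical names, not part of the original code, and it uses [a-z]+ as the same simplistic definition of "name".

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record for one validated hit (not from the original post).
struct Hit {
    int num1;
    std::string name;
    int num2;
};

// Collect regex candidates, then keep only those with NUM2 == NUM1 + 2.
std::vector<Hit> find_hits(const std::string& text) {
    static const std::regex re(R"((\d+) ([a-z]+).*?(\d+) \2)");
    std::vector<Hit> hits;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        const int num1 = std::stoi(m[1].str());
        const int num2 = std::stoi(m[3].str());
        if (num2 == num1 + 2) {  // the post-validation step
            hits.push_back({num1, m[2].str(), num2});
        }
    }
    return hits;
}
```

Note that with this simple definition of "name" the backreference can also match a prefix of a longer word; adding word boundaries (\b) around the name is one possible way to tighten that.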
Two notes:
- Add (...)+ around the tail to allow the pattern to match repeatedly:
(\d+) ([a-z]+)(.*?(\d+) \2)+
- You might want to make the garbage skip (.*?) aware of separators (using a negative zero-width assertion) so that it cannot skip past more than two delimiters (for example, with s\d+=" as the delimiter pattern). I'm leaving that out for clarity, but here's the gist:
((?!s\d+=").)*?
Beware of potential performance degradation.
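To illustrate the separator-aware skip, here is a small sketch using std::regex (ECMAScript grammar, which supports negative lookahead). It is an assumption-laden simplification: the function name is hypothetical, the skip is made non-capturing so that the group numbers stay the same, and it forbids crossing any delimiter at all rather than allowing up to two.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if "NUM name ... NUM name" occurs without the
// skipped garbage running over a delimiter of the form s<digits>=".
bool has_hit_without_crossing_delimiter(const std::string& text) {
    // (?:(?!s\d+=").)*? consumes one character at a time, but only at
    // positions where the delimiter pattern does NOT start.
    static const std::regex re(R"re((\d+) ([a-z]+)(?:(?!s\d+=").)*?(\d+) \2)re");
    return std::regex_search(text, re);
}
```

With this pattern, s1="junk 1 bob junk 3 bob" still matches, whereas s1="junk 1 bob", s2="junk 3 bob" does not, because the skip stops in front of s2=".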
Alec, the following is a showcase of how to do a wide range of things in Boost Spirit, in the context of answering your question.
I had to make assumptions about the required input structure; I assumed that
- whitespace is strict (spaces exactly as shown, no newlines)
- sequence numbers must be in ascending order
- sequence numbers must be repeated exactly in the text values
- the keywords "apple" and "cheese" strictly alternate
- whether the keyword appears before or after the sequence number in a text value also strictly alternates
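As a quick illustration of the alternation assumptions, the text label expected for a given sequence number could be written down like this. It is a sketch only: expected_label is a hypothetical helper, it produces just the canonical spelling, and the Spirit parser below is more lenient (for example about leading zeros).

```cpp
#include <string>

// Hypothetical helper: canonical label for sequence number n, assuming
// odd numbers use "apple N" and even numbers use "N cheese".
std::string expected_label(unsigned n) {
    const std::string num = std::to_string(n);
    return (n % 2 == 1) ? "apple " + num : num + " cheese";
}
```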
Note: in the implementation below, there are about a dozen places where significantly simpler choices could have been made. For example, I could have hard-coded the entire pattern (in effect, a regular expression), assuming that exactly 4 items are always expected in the input.
However, this solution provides more flexibility:
- keywords are not hard-coded, and you can, for example, easily make the parser accept both keywords for any sequence number
- a comment in the code shows how to raise your own parsing exception when the sequence number is out of sync (not the expected number)
- various notations for the sequence numbers are currently accepted (i.e. s01="apple 001" is OK; see Unsigned Integer Parsers for information on how to configure this behavior)
The output structure is either a vector<std::pair<int, std::string> > or a vector of struct:
struct Entry { int sequence; std::string text; };
You can switch between the two versions by changing the single #if 1/0 line.
The sample uses Boost Spirit Qi for parsing. Conversely, Boost Spirit Karma is used to display the result of the parse:
format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)
The output for the sample input given in the question:
parsed: s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
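To make the Karma expression more concrete, here is roughly what it generates, written as plain C++ without Boost. This is only an equivalent sketch for illustration: render is a hypothetical helper, and it assumes the struct version of Entry.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct Entry {
    int sequence;
    std::string text;
};

// Hypothetical helper: emits s<sequence>="<text>" for each entry,
// separated by ", " (the role of the % ", " list operator in Karma).
std::string render(const std::vector<Entry>& parsed) {
    std::ostringstream out;
    for (std::size_t i = 0; i < parsed.size(); ++i) {
        if (i != 0) out << ", ";
        out << 's' << parsed[i].sequence << "=\"" << parsed[i].text << '"';
    }
    return out.str();
}
```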
Here is the code:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>

#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

namespace qi    = boost::spirit::qi;
namespace karma = boost::spirit::karma;
namespace phx   = boost::phoenix;

#if 1 // using fusion adapted struct
    #include <boost/fusion/adapted/struct.hpp>
    struct Entry
    {
        int sequence;
        std::string text;
    };
    BOOST_FUSION_ADAPT_STRUCT(Entry, (int, sequence)(std::string, text));
#else // using boring std::pair
    #include <boost/fusion/adapted/std_pair.hpp> // for karma output generation
    typedef std::pair<int, std::string> Entry;
#endif

int main()
{
    std::string input =
        "s1=\"lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu"
        "j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx\", s2=\"xzljlkxvc"
        "jlkjxzvl jxcvljzx lvjlkj wre 2 cheese\", s3=\"apple 3\", s4=\"kxclvj"
        "xcvjlxk jcvljxlck jxcvl 4 cheese\"";

    using namespace qi;

    typedef std::string::const_iterator It;
    It f(input.begin()), l(input.end());

    int next = 1; // the sequence number expected next

    qi::rule<It, std::string(int)> label;
    qi::rule<It, std::string(int)> value;
    qi::rule<It, int()> number;
    qi::rule<It, Entry(), qi::locals<int> > assign;

    // "apple N" (tried only for odd N), otherwise "N cheese";
    // raw[] exposes the matched source text as the attribute
    label %= qi::raw [
            ( eps(qi::_r1 % 2) >> qi::string("apple ") > qi::uint_(qi::_r1) )
            | qi::uint_(qi::_r1) > qi::string(" cheese")
        ];

    // quoted value: skip garbage up to the label, keep the label,
    // then skip the remaining garbage up to the closing quote
    value %= '"'
        >> qi::omit[ *(~qi::char_('"') - label(qi::_r1)) ]
        >> label(qi::_r1)
        >> qi::omit[ *(~qi::char_('"')) ]
        >> '"';

    // must equal the expected sequence number (next is post-incremented)
    number %= qi::uint_(phx::ref(next)++)
        /*| eps [ phx::throw_(std::runtime_error("Sequence number out of sync")) ] */;

    // s<number>="<value>", passing the number on to the value rule
    assign %= 's' > number[ qi::_a = _1 ] > '=' > value(qi::_a);

    std::vector<Entry> parsed;

    bool ok = false;
    try
    {
        ok = parse(f, l, assign % ", ", parsed);
        if (ok)
        {
            using namespace karma;
            std::cout << "parsed:\t"
                      << format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)
                      << std::endl;
        }
    }
    catch (const qi::expectation_failure<It>& e)
    {
        std::cerr << "Expectation failed: " << e.what()
                  << " '" << std::string(e.first, e.last) << "'" << std::endl;
    }
    catch (const std::exception& e)
    {
        std::cerr << e.what() << std::endl;
    }

    // report any failure or unparsed remainder
    if (!ok || (f != l))
        std::cerr << "problem at: '" << std::string(f, l) << "'" << std::endl;
}