How to do this pattern matching

Question

Migrated from the [Spirit-General list]

Good morning,

I am trying to parse a relatively simple pattern across 4 std::strings, extracting whatever part matches the pattern into a separate std::string.

In an abstract sense, this is what I want:

 s1=<string1><consecutive number>, s2=<consecutive number><string2>, s3=<string1><consecutive number>, s4=<consecutive number><string2>

Less abstract:

 s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"

Actual content:

 s1="lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx", s2="xzljlkxvc jlkjxzvl jxcvljzx lvjlkj wre 2 cheese", s3="apple 3", s4="kxclvj xcvjlxk jcvljxlck jxcvl 4 cheese"

How would I go about implementing this pattern?

Thanks for all the suggestions,

Alec Taylor

Update 2

Here is a really simple explanation that I just realized to explain the problem I'm trying to solve:
 std::string s1=garbagetext1+number1+name1+garbagetext4;
 std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
 std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;

Edit for context:

Feel free to add it to stackoverflow (I had problems posting there).
I can't give you what I have done so far, because I wasn't sure it was within the scope of the boost::spirit libraries to do what I am trying to do.

c++ boost pattern-matching parsing boost-spirit

sehe Nov 10 '11 at 1:51

3 answers

sehe · Answer 1 · 2011-11-10T02:07:52+0000

Edit: Re Update 2

Here is a really simple explanation that I just realized to explain the problem I'm trying to solve:
 std::string s1=garbagetext1+number1+name1+garbagetext4;
 std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
 std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;

This is starting to look like a job for:

tokenizing the "garbage text/names": you could build a symbol table on the fly and use it to match patterns (Spirit Lex and Qi's symbol table, qi::symbols, could facilitate this, but I feel you could write it in any number of ways)
alternatively, using regular expressions, as suggested before (below, and at least twice on the mailing list).

Here is a simple idea:

  (\d+) ([a-z]+).*?(\d+) \2

(\d+) matches a sequence of digits into a "(subexpression)" (NUM1)
([a-z]+) matches a name (I just picked a simple definition of "name")
.*? skips garbage of any length, but as little as possible before the subsequent match starts
(\d+) matches another number (a sequence of digits) (NUM2)
\2 must be followed by the same name (a backreference)

You can see how this narrows the list of matches down to "potential" hits. You then only need to post-validate that NUM2 == NUM1 + 2.
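To make the match-then-validate idea concrete, here is a minimal sketch using std::regex (C++11) as a stand-in for whichever regex engine you prefer. The helper name has_consecutive_pair is made up for illustration, and the sketch assumes the numbers fit in an int and that looking at non-overlapping candidates only is good enough.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the original answer): returns true when
// `text` contains "<NUM1> <name> ...garbage... <NUM2> <same name>"
// with NUM2 == NUM1 + 2.
bool has_consecutive_pair(const std::string& text) {
    // Same expression as above; \2 is the backreference to the captured name.
    static const std::regex re(R"((\d+) ([a-z]+).*?(\d+) \2)");

    // Walk over all non-overlapping candidate hits.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        // Post-validation: the regex alone cannot express NUM2 == NUM1 + 2.
        // Assumption: both numbers are small enough for std::stoi.
        if (std::stoi(m[3].str()) == std::stoi(m[1].str()) + 2)
            return true;
    }
    return false;
}
```

Calling has_consecutive_pair on the concatenated input should then tell you whether any candidate hit survives the arithmetic check; collecting the hits instead of returning a bool would be a small change.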

Two notes:

Add (...)+ around the tail to allow the pattern to match repeatedly:
```
  (\d+) ([a-z]+)(.*?(\d+) \2)+ 
```
You might want to make the garbage skips ( .*? ) aware of the separators (by using negative zero-width assertions), so that a skip cannot run across more delimiters than intended (for example, with s\d+=" as the delimiting pattern). I have left that out of scope for clarity; here is the gist:
```
 ((?!s\d+=").)*? -- beware of potential performance degradation 
```
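As a rough illustration of the delimiter-aware skip, the sketch below plugs the negative lookahead into the earlier expression, again using std::regex as a stand-in. The helper name pair_within_one_field is hypothetical, and I used a non-capturing group (?:...) for the skip so the group numbers of the original expression stay the same; that is my choice, not something the gist prescribes.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: like the plain pattern, but the garbage skip is not
// allowed to run across a field delimiter of the form s<digits>=".
bool pair_within_one_field(const std::string& text) {
    static const std::regex re(
        // (?:(?!s\d+=").)*?  consumes one character at a time, lazily,
        // and refuses to step onto the start of a delimiter.
        R"re((\d+) ([a-z]+)(?:(?!s\d+=").)*?(\d+) \2)re");
    return std::regex_search(text, re);
}
```

With this, an input where both numbers sit inside the same quoted field still matches, while one where the second number only appears after the next s2=" delimiter does not (the plain .*? version would happily skip across it). The lookahead is evaluated at every skipped character, which is where the performance warning comes from.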

Alec, the following is a demonstration of how to do a wide range of things in Boost Spirit, in the context of answering your question.

I had to make assumptions about the required input structure; I assumed that

whitespace is strict (spaces exactly as shown, no newlines)
the sequence numbers are in increasing order
the sequence numbers recur exactly in the text values
the keywords "apple" and "cheese" are in strict alternation
whether the keyword comes before or after the sequence number in the text value is also in strict alternation
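To spell out what these assumptions mean for a single field, here is a small sketch in plain C++ without Spirit. The helper names expected_label and find_label are made up for illustration, and the substring search is a simplification: unlike a real number parser, it would also accept "apple 1" inside "apple 10".

```cpp
#include <string>

// Hypothetical helper mirroring the assumed structure: odd sequence numbers
// carry "apple <n>", even ones carry "<n> cheese".
std::string expected_label(int n) {
    const std::string num = std::to_string(n);
    return (n % 2 != 0) ? "apple " + num : num + " cheese";
}

// Returns the label if `value` contains it, otherwise an empty string.
// Simplification: plain substring search, no check of what follows the label.
std::string find_label(int n, const std::string& value) {
    const std::string label = expected_label(n);
    return value.find(label) != std::string::npos ? label : std::string();
}
```

For field s4 of the question, find_label(4, ...) would return "4 cheese"; a field that lacks its expected label yields an empty string.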

Note: In the implementation below there are about a dozen places where significantly less complex choices could have been made. For example, I could have hard-coded the whole pattern (as a de facto regex?), assuming that 4 items are always expected in the input. However, I wanted to

not make more assumptions than necessary
learn from the experience. In particular, qi::locals<> and inherited attributes have been on my agenda for a while.

However, this solution provides more flexibility:

the keywords are not hard-coded, and you could, for example, easily make the parser accept both keywords at any sequence number
a comment shows how to generate a custom parsing exception when the sequence number is out of sync (not the expected number)
different spellings of the sequence numbers are currently accepted (i.e. s01="apple 001" is ok; see Unsigned Integer Parsers for information on how to tune this behaviour)
the output structure is either a vector<std::pair<int, std::string> > or a vector of struct:
```
 struct Entry { int sequence; std::string text; }; 
```
you can switch between the two versions with the single #if 1/0 line

The sample uses Boost Spirit Qi for parsing. In turn, Boost Spirit Karma is used to display the result of the parse:

 format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)

The output for the actual content given in the question:

 parsed: s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
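If the Karma expression looks cryptic, the following plain C++ sketch produces the same shape of output by hand for the std::pair variant of Entry. The function name format_entries is hypothetical; it is only meant to show what the generator expression expands to, not how Karma works internally.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Karma expression:
// emits s<sequence>="<text>" for each entry, joined by ", ".
std::string format_entries(const std::vector<std::pair<int, std::string>>& entries) {
    std::ostringstream out;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        if (i != 0)
            out << ", ";  // plays the role of `% ", "` (list separator)
        // 's' << auto_ << "=\"" << auto_ << "\""
        out << 's' << entries[i].first << "=\"" << entries[i].second << '"';
    }
    return out.str();
}
```

The point of Karma is that you get this from a one-line declarative expression, with auto_ deducing how to print each member.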

The code follows.

 #include <boost/spirit/include/qi.hpp>
 #include <boost/spirit/include/karma.hpp>
 #include <boost/spirit/include/phoenix.hpp>
 #include <boost/spirit/include/phoenix_operator.hpp>
 #include <iostream>
 #include <stdexcept>
 #include <string>
 #include <vector>

 namespace qi    = boost::spirit::qi;
 namespace karma = boost::spirit::karma;
 namespace phx   = boost::phoenix;

 #if 1 // using fusion adapted struct
 #include <boost/fusion/adapted/struct.hpp>
 struct Entry { int sequence; std::string text; };
 BOOST_FUSION_ADAPT_STRUCT(Entry, (int, sequence)(std::string, text));
 #else // using boring std::pair
 #include <boost/fusion/adapted/std_pair.hpp> // for karma output generation
 typedef std::pair<int, std::string> Entry;
 #endif

 int main()
 {
     std::string input =
         "s1=\"lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu"
         "j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx\", s2=\"xzljlkxvc"
         "jlkjxzvl jxcvljzx lvjlkj wre 2 cheese\", s3=\"apple 3\", s4=\"kxclvj"
         "xcvjlxk jcvljxlck jxcvl 4 cheese\"";

     using namespace qi;
     typedef std::string::const_iterator It;
     It f(input.begin()), l(input.end());

     int next = 1;
     qi::rule<It, std::string(int)> label;
     qi::rule<It, std::string(int)> value;
     qi::rule<It, int()> number;
     qi::rule<It, Entry(), qi::locals<int> > assign;

     label %= qi::raw [
             ( eps(qi::_r1 % 2) >> qi::string("apple ") > qi::uint_(qi::_r1) )
           | qi::uint_(qi::_r1) > qi::string(" cheese")
         ];

     value %= '"'
         >> qi::omit[ *(~qi::char_('"') - label(qi::_r1)) ]
         >> label(qi::_r1)
         >> qi::omit[ *(~qi::char_('"')) ]
         >> '"';

     number %= qi::uint_(phx::ref(next)++)
         /*| eps [ phx::throw_(std::runtime_error("Sequence number out of sync")) ] */;

     assign %= 's' > number[ qi::_a = _1 ] > '=' > value(qi::_a);

     std::vector<Entry> parsed;
     bool ok = false;

     try {
         ok = parse(f, l, assign % ", ", parsed);
         if (ok) {
             using namespace karma;
             std::cout << "parsed:\t"
                       << format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)
                       << std::endl;
         }
     } catch (qi::expectation_failure<It>& e) {
         std::cerr << "Expectation failed: " << e.what() << " '"
                   << std::string(e.first, e.last) << "'" << std::endl;
     } catch (const std::exception& e) {
         std::cerr << e.what() << std::endl;
     }

     if (!ok || (f != l))
         std::cerr << "problem at: '" << std::string(f, l) << "'" << std::endl;
 }

Gene Bushuyev · Answer 2 · 2011-11-11T03:40:21+0000

If you can use a C++11 compiler, parsing these patterns is pretty simple using AXE†:

 #include <axe.h>
 #include <iostream>
 #include <string>

 template<class I>
 void num_value(I i1, I i2)
 {
     unsigned n;
     unsigned next = 1;

     // rule to match unsigned decimal number and compare it with another number
     auto num = axe::r_udecimal(n) & axe::r_bool([&](...){ return n == next; });
     // rule to match a single word
     auto word = axe::r_alphastr();
     // rule to match space characters
     auto space = axe::r_any(" \t\n");
     // semantic action - print to cout and increment next
     auto e_cout = axe::e_ref([&](I i1, I i2)
     {
         std::cout << std::string(i1, i2) << '\n';
         ++next;
     });
     // there are only two patterns in this example
     auto pattern1 = (word & +space & num) >> e_cout;
     auto pattern2 = (num & +space & word) >> e_cout;
     auto s1 = axe::r_find(pattern1);
     auto s2 = axe::r_find(pattern2);
     auto text = s1 & s2 & s1 & s2 & axe::r_end();
     text(i1, i2);
 }

To parse the text, just call num_value(text.begin(), text.end()); No changes needed to parse strings in Unicode.

† I have not tested it.

moshbear · Answer 3 · 2011-11-10T02:11:08+0000

Take a look at Boost.Regex. I have seen an almost identical question on boost-users, and the solution there was to use regular expressions for some of the matching work.
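A minimal sketch of that approach, using std::regex from C++11 as a stand-in (Boost.Regex offers a very similar interface, but this code only uses the standard library). The helper name extract_label is made up, and the keywords are hard-coded here purely for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first "apple <n>" or "<n> cheese"
// found in `value`, or an empty string if neither occurs.
std::string extract_label(const std::string& value) {
    static const std::regex re(R"(apple \d+|\d+ cheese)");
    std::smatch m;
    return std::regex_search(value, m, re) ? m.str() : std::string();
}
```

You would call this once per input string (s1 to s4) and then check the extracted numbers against the expected sequence yourself.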
