How to do this pattern matching

Migrated from the [Spirit-General list]

Good morning,

I am trying to parse a relatively simple pattern across 4 std::strings, extracting whatever part matches the pattern into a separate std::string.

In an abstract sense, this is what I want:

 s1=<string1><consecutive number>, s2=<consecutive number><string2>, s3=<string1><consecutive number>, s4=<consecutive number><string2> 

Less abstract:

 s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese" 

Actual content:

 s1="lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx", s2="xzljlkxvc jlkjxzvl jxcvljzx lvjlkj wre 2 cheese", s3="apple 3", s4="kxclvj xcvjlxk jcvljxlck jxcvl 4 cheese" 

How would I go about matching this pattern?

Thanks for all the suggestions,

Alec Taylor

Update 2

Here is a really simple explanation, which I just thought of, for the problem I am trying to solve:

    std::string s1=garbagetext1+number1+name1+garbagetext4;
    std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
    std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;

Edit for context:

Feel free to add this to Stack Overflow (I have been having trouble posting there).

I can't show you what I have done so far, because I wasn't sure whether what I am trying to do is within the capabilities of the boost::spirit libraries.

3 answers

Edit: Re: Update 2

Here is a really simple explanation, which I just thought of, for the problem I am trying to solve:

    std::string s1=garbagetext1+number1+name1+garbagetext4;
    std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
    std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;

This is starting to look like a job for:

  • Tokenizing the "garbage text / names": you could build a symbol table of sorts on the fly and use it to match patterns (Spirit Lex and Qi's symbol table ( qi::symbols ) could make this easier, but I feel you could write it in any number of ways)
  • alternatively, using regular expressions, as suggested before ( below , and at least twice on the mailing list).
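As a rough illustration of the first option, the sketch below splits the text on whitespace and keeps a small name-to-number table as it goes. It uses only the standard library rather than Spirit Lex or qi::symbols, and the names `SymbolHit`, `is_small_number` and `find_symbol_hits` are made up for this example. It assumes that numbers and names are whitespace-separated and appear in the number-then-name order of Update 2.

```cpp
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: `name` was seen with `first` and later with `first + 2`.
struct SymbolHit {
    std::string name;
    int first;
};

// True for tokens of 1 to 9 digits, so std::stoi cannot overflow a 32-bit int.
bool is_small_number(const std::string& tok) {
    if (tok.empty() || tok.size() > 9) return false;
    for (unsigned char c : tok)
        if (!std::isdigit(c)) return false;
    return true;
}

std::vector<SymbolHit> find_symbol_hits(const std::string& text) {
    std::istringstream in(text);
    std::map<std::string, int> last_number;  // the on-the-fly symbol table
    std::vector<SymbolHit> hits;
    std::string prev, tok;
    while (in >> tok) {
        // A number token directly followed by a non-number token is a candidate.
        if (is_small_number(prev) && !is_small_number(tok)) {
            const int n = std::stoi(prev);
            const auto it = last_number.find(tok);
            if (it != last_number.end() && n == it->second + 2)
                hits.push_back(SymbolHit{tok, it->second});
            last_number[tok] = n;  // remember the latest number for this name
        }
        prev = tok;
    }
    return hits;
}
```

Calling `find_symbol_hits` on the concatenation of the input strings returns one entry for each place where a name reappears with its previous number plus two; garbage words that happen to follow a number are recorded too, but they only produce a hit if they recur in the same way.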

Here is a simple idea:

  (\d+) ([a-z]+).*?(\d+) \2 
  • (\d+) matches a sequence of digits and captures it in a "(subexpression)" ( NUM1 )
  • ([a-z]+) matches a name (I just picked a simple definition of "name")
  • .*? skips garbage of any length, but as little as possible before the subsequent match starts
  • (\d+) matches another number (sequence of digits) ( NUM2 )
  • \2 must be followed by the same name ( backreference )

You can see how this narrows the list of matches you need to inspect down to "potential" hits. You would only need to post-validate that NUM2 == NUM1 + 2.
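To make the idea concrete, here is a minimal sketch using std::regex from the standard library (its default ECMAScript grammar supports both the lazy `.*?` and the `\2` backreference). `PairHit` and `find_pair` are hypothetical names for this example, and only the first candidate match is checked.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for one candidate match.
struct PairHit {
    int num1;
    std::string name;
    int num2;
};

// Finds the first "NUM1 name ... NUM2 name" candidate and post-validates it.
// Note: std::stoi throws std::out_of_range for numbers that do not fit in int.
bool find_pair(const std::string& text, PairHit& out) {
    static const std::regex re(R"re((\d+) ([a-z]+).*?(\d+) \2)re");
    std::smatch m;
    if (!std::regex_search(text, m, re)) return false;
    out.num1 = std::stoi(m.str(1));
    out.name = m.str(2);
    out.num2 = std::stoi(m.str(3));
    return out.num2 == out.num1 + 2;  // the post-validation step
}
```

For example, `find_pair("junk 1 apple junk, junk 3 apple junk", hit)` succeeds with NUM1 = 1, name "apple" and NUM2 = 3, whereas "1 apple x 4 apple" matches the regular expression but is rejected by the post-validation.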

Two notes:

  • Add (...)+ around the tail to allow the pattern to match repeatedly:

      (\d+) ([a-z]+)(.*?(\d+) \2)+ 
  • You might want to make the garbage skips ( .*? ) aware of separators (using negative zero-width assertions ) so that a match cannot span more than two separators (for example, with s\d+=" as the delimiting pattern). I leave that out of scope here for clarity ; this is the gist:

     ((?!s\d+=").)*? -- beware of potential performance degradation 
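A small sketch of the separator-aware skip, again with std::regex as a stand-in. It is deliberately stricter than the note above: the skip refuses to cross even a single `s\d+="` separator, so both numbers must sit inside the same quoted value. The function name `pair_within_one_value` is made up for this example.

```cpp
#include <regex>
#include <string>

// True if "NUM1 name ... NUM2 name" occurs without the garbage skip
// crossing a separator of the form s<digits>=" .
bool pair_within_one_value(const std::string& text) {
    // (?:(?!s\d+=").)*? consumes one character at a time, but only at
    // positions where a separator does not start.
    static const std::regex re(
        R"re((\d+) ([a-z]+)(?:(?!s\d+=").)*?(\d+) \2)re");
    return std::regex_search(text, re);
}
```

With this pattern, `1 apple junk 3 apple` still matches, while `1 apple x", s2="3 apple` does not, because the skip stops at `s2="`.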

Alec, the following is a showcase of how to do a wide range of things in Boost Spirit, in the context of answering your question.

I had to make assumptions about the required input structure; I assumed that

  • whitespace is strict (spaces as shown, no line breaks)
  • sequence numbers are in increasing order
  • sequence numbers recur exactly in the text values
  • the keywords "apple" and "cheese" are in strict alternation
  • whether the keyword comes before or after the sequence number in a text value is also in strict alternation
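Read together, these assumptions pin down exactly which label each value must contain. A plain C++ sketch of that rule, independent of Spirit, could look like this; `expected_label` and `value_has_label` are hypothetical helper names, and the odd/even split is inferred from the sample data (s1 and s3 carry "apple", s2 and s4 carry "cheese").

```cpp
#include <string>

// Odd sequence numbers expect "apple <n>", even ones expect "<n> cheese".
std::string expected_label(int n) {
    const std::string num = std::to_string(n);
    return (n % 2 != 0) ? "apple " + num : num + " cheese";
}

// Plain substring test, so "apple 1" is also found inside "apple 10".
bool value_has_label(const std::string& value, int n) {
    return value.find(expected_label(n)) != std::string::npos;
}
```

For instance, `expected_label(3)` yields "apple 3" and `expected_label(4)` yields "4 cheese".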

Note: In the implementation below, there are about a dozen places where significantly less complex choices could have been made. For example, I could have hard-coded the entire pattern (as a de facto regular expression?), assuming that 4 items are always expected in the input. However, I wanted

However, this solution provides more flexibility:

  • the keywords are not hard-coded, and you could, for example, easily make the parser accept both keywords for any sequence number
  • a comment shows how to raise your own parse exception when a sequence number is out of sync (not the expected number)
  • different spellings of the sequence numbers are currently accepted (i.e. s01="apple 001" is fine; see Unsigned Integer Parsers for information on how to tune this behaviour)
  • the output structure is either a vector<std::pair<int, std::string> > or a vector of struct:

     struct Entry { int sequence; std::string text; }; 

    you can switch between the two versions by changing a single #if 1/0 line

The sample uses Boost Spirit Qi for parsing. Conversely, Boost Spirit Karma is used to display the result of the parse:

 format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed) 

The output for the actual content given in the question is:

 parsed: s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese" 
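For readers unfamiliar with Karma, the format expression above amounts to joining the entries with ", " and printing each one as s, the sequence number, then the quoted text. A hand-written equivalent for a non-empty vector might look like the sketch below; `format_entries` is a made-up helper, and `Entry` mirrors the struct from the answer.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct Entry {
    int sequence;
    std::string text;
};

// Renders entries as: s1="apple 1", s2="2 cheese", ...
std::string format_entries(const std::vector<Entry>& entries) {
    std::ostringstream out;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        if (i != 0) out << ", ";  // separator between entries only
        out << 's' << entries[i].sequence << "=\"" << entries[i].text << '"';
    }
    return out.str();
}
```

The Karma version has the advantage that the same expression keeps working when `Entry` is switched to `std::pair`, since `auto_` picks the generator from the attribute type.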

Here is the code:

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/karma.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <boost/spirit/include/phoenix_operator.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi    = boost::spirit::qi;
    namespace karma = boost::spirit::karma;
    namespace phx   = boost::phoenix;

    #if 1 // using fusion adapted struct
        #include <boost/fusion/adapted/struct.hpp>
        struct Entry
        {
            int sequence;
            std::string text;
        };
        BOOST_FUSION_ADAPT_STRUCT(Entry, (int, sequence)(std::string, text));
    #else // using boring std::pair
        #include <boost/fusion/adapted/std_pair.hpp> // for karma output generation
        typedef std::pair<int, std::string> Entry;
    #endif

    int main()
    {
        // the "actual content" from the question, as one string
        std::string input =
            "s1=\"lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu "
            "j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx\", s2=\"xzljlkxvc "
            "jlkjxzvl jxcvljzx lvjlkj wre 2 cheese\", s3=\"apple 3\", s4=\"kxclvj "
            "xcvjlxk jcvljxlck jxcvl 4 cheese\"";

        using namespace qi;

        typedef std::string::const_iterator It;
        It f(input.begin()), l(input.end());

        int next = 1; // the sequence number expected next
        qi::rule<It, std::string(int)> label;
        qi::rule<It, std::string(int)> value;
        qi::rule<It, int()> number;
        qi::rule<It, Entry(), qi::locals<int> > assign;

        // odd sequence numbers: "apple <n>"; even ones: "<n> cheese"
        label %= qi::raw [
                ( eps(qi::_r1 % 2) >> qi::string("apple ") > qi::uint_(qi::_r1) )
              | qi::uint_(qi::_r1) > qi::string(" cheese")
            ];

        // a quoted value: garbage, the label, more garbage
        value %= '"'
            >> qi::omit[ *(~qi::char_('"') - label(qi::_r1)) ]
            >> label(qi::_r1)
            >> qi::omit[ *(~qi::char_('"')) ]
            >> '"';

        number %= qi::uint_(phx::ref(next)++) /*| eps [ phx::throw_(std::runtime_error("Sequence number out of sync")) ] */;
        assign %= 's' > number[ qi::_a = _1 ] > '=' > value(qi::_a);

        std::vector<Entry> parsed;

        bool ok = false;

        try
        {
            ok = parse(f, l, assign % ", ", parsed);
            if (ok)
            {
                using namespace karma;
                std::cout << "parsed:\t" << format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed) << std::endl;
            }
        }
        catch(qi::expectation_failure<It>& e)
        {
            std::cerr << "Expectation failed: " << e.what() << " '" << std::string(e.first, e.last) << "'" << std::endl;
        }
        catch(const std::exception& e)
        {
            std::cerr << e.what() << std::endl;
        }

        if (!ok || (f!=l))
            std::cerr << "problem at: '" << std::string(f,l) << "'" << std::endl;
    }

If you can use a C++11 compiler, parsing these patterns is pretty simple using AXE†:

    #include <axe.h>
    #include <iostream>
    #include <string>

    template<class I>
    void num_value(I i1, I i2)
    {
        unsigned n;
        unsigned next = 1;

        // rule to match unsigned decimal number and compare it with another number
        auto num = axe::r_udecimal(n) & axe::r_bool([&](...){ return n == next; });

        // rule to match a single word
        auto word = axe::r_alphastr();

        // rule to match space characters
        auto space = axe::r_any(" \t\n");

        // semantic action - print to cout and increment next
        auto e_cout = axe::e_ref([&](I i1, I i2)
        {
            std::cout << std::string(i1, i2) << '\n';
            ++next;
        });

        // there are only two patterns in this example
        auto pattern1 = (word & +space & num) >> e_cout;
        auto pattern2 = (num & +space & word) >> e_cout;
        auto s1 = axe::r_find(pattern1);
        auto s2 = axe::r_find(pattern2);
        auto text = s1 & s2 & s1 & s2 & axe::r_end();
        text(i1, i2);
    }

To parse the text, just call num_value(text.begin(), text.end()); no changes are needed to parse Unicode strings.

† I have not tested it.


Take a look at Boost.Regex. I saw an almost identical posting on boost-users, and the solution there was to use regular expressions for some of the matching work.
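As a minimal sketch of that approach, the function below pulls the label out of a single string. It uses std::regex so that it builds without Boost; Boost.Regex offers boost::regex, boost::smatch and boost::regex_search with a closely matching interface. The name `extract_label` and the hard-coded keywords are assumptions made for this example.

```cpp
#include <regex>
#include <string>

// Returns the first "apple <n>" or "<n> cheese" found in text,
// or an empty string if neither occurs.
std::string extract_label(const std::string& text) {
    static const std::regex re(R"re(\bapple \d+\b|\b\d+ cheese\b)re");
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

Applied to each of the four strings in turn, this yields the labels; checking that the numbers are consecutive is left to the caller.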


Source: https://habr.com/ru/post/1380499/

