Using Boost Spirit for a stack-based language

I need to parse a fairly simple stack-based language, for example:

1 2 add 3 1 sub 

and I came across two options:

  • Write my own lexer to tokenize the input, and then parse it.
  • Use Boost Spirit

I have never used Boost Spirit, but from what I have read (documentation and examples) I still can't decide whether it would be overkill to use Boost Spirit to lex and parse this simple language, or whether it would be wiser to use it rather than rolling my own lexer and parser (which I suppose shouldn't be too complicated).

Would you use Boost Spirit for a simple stack-based language like the one described above (keeping in mind that I would have to learn it first before I can use it)?

+5
3 answers

In the "thorough treatment" category, let me add a couple of "interpret on the fly" stack machines using Spirit Qi (v2.x) and X3.

Note that the AST-ful approach (two-phase parse, then execute) is shown in the other answer.

Spirit Qi

Here, the semantic actions have to be "composed" using Phoenix actors:

Live on coliru

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <boost/spirit/repository/include/qi_distinct.hpp>
    #include <iostream>
    #include <deque>

    namespace qi = boost::spirit::qi;
    namespace px = boost::phoenix;
    namespace qr = boost::spirit::repository::qi;

    using Stack = std::deque<int>;

    namespace actors {
        struct pop {
            Stack& s_;
            Stack::value_type operator()() const {
                Stack::value_type v = s_.back();
                s_.pop_back();
                return v;
            }
        };
        struct push {
            Stack& s_;
            template <typename V> void operator()(V const& v) const { s_.push_back(v); }
        };
        struct dump {
            Stack& s_;
            void operator()() const {
                std::copy(s_.begin(), s_.end(), std::ostream_iterator<Stack::value_type>(std::cout, " "));
                std::cout << "\n";
            }
        };
    }

    int main() {
        Stack stack_;

        boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

        bool ok;
        {
            using namespace qi;
            px::function<actors::pop>  pop_  = actors::pop{ stack_ };
            px::function<actors::push> push_ = actors::push{ stack_ };
            px::function<actors::dump> dump_ = actors::dump{ stack_ };

            ok = phrase_parse(f, l,
                *(
                    eps [ dump_() ] >>
                    (
                        lexeme [ qr::distinct(graph) [
                              lit("add") [ push_(pop_() + pop_()) ]
                            | lit("sub") [ push_(- pop_() + pop_()) ] // bit hackish
                            | lit("mul") [ push_(pop_() * pop_()) ]
                            | lit("div") [ push_(pop_() / pop_()) ] // TODO fix order
                            | lit("pop") [ pop_() ]
                        ] ]
                      | int_ [ push_(_1) ]
                    )
                ),
                space);
        }

        if (!ok)
            std::cout << "Parse failed\n";
        if (f != l)
            std::cout << "Unparsed program data: '" << std::string(f,l) << "'\n";
    }

Prints

    1
    1 2
    3
    3 3
    3 3 1
    3 2
    6


Spirit X3

The idea is the same, but here we can use proper functional composition with lambdas.

We even use a helper to generate a parser expression together with the matching binop on the fly:

Live on coliru

    #include <boost/spirit/home/x3.hpp>
    #include <boost/spirit/include/support_istream_iterator.hpp>
    #include <iostream>
    #include <deque>
    #include <cassert>

    int main() {
        std::deque<int> stack_;

        boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

        bool ok;
        {
            using namespace boost::spirit::x3;
            struct stack_tag {};

            auto binop = [](auto id, auto f) {
                auto apply = [=](auto& ctx) {
                    auto& s = get<stack_tag>(ctx);
                    assert(s.size() >= 2);
                    auto rhs = s.back(); s.pop_back();
                    auto lhs = s.back(); s.pop_back();
                    s.push_back(f(lhs, rhs));
                };
                return lexeme[as_parser(id) >> !graph] [apply];
            };

            auto push = [](auto& ctx) {
                auto& s = get<stack_tag>(ctx);
                s.push_back(_attr(ctx));
            };

            auto dump = [](auto& ctx) {
                auto& s = get<stack_tag>(ctx);
                std::copy(s.begin(), s.end(), std::ostream_iterator<int>(std::cout, " "));
                std::cout << "\n";
            };

            auto instr
                = binop("add", [](auto a, auto b) { return a + b; })
                | binop("sub", [](auto a, auto b) { return a - b; })
                | binop("mul", [](auto a, auto b) { return a * b; })
                | binop("div", [](auto a, auto b) { return a / b; })
                | int_ [ push ]
                ;

            auto parser  = skip(space) [ *(eps [ dump ] >> instr) >> eps/*post-skip*/ ];
            auto machine = with<stack_tag>(stack_) [parser];

            ok = parse(f, l, machine);
        }

        if (!ok)
            std::cout << "Parse failed\n";
        if (f != l)
            std::cout << "Unparsed program data: '" << std::string(f,l) << "'\n";
    }

Of course, it prints the same result.

  • it has none of the flaws of the Qi version
  • it compiles much faster (2.9s vs 9.2s!)
  • Note: X3 requires C++14
+4

The first approach can be dead simple in plain C++:

    int main() {
        Machine<int> machine;

        std::for_each(
            std::istream_iterator<std::string> { std::cin }, {},
            [&](auto& instr) { machine.process(instr); }
        );
    }

This exploits the fact that reading whitespace-separated words from the stream is good enough as a lexer (tokenizer).

Now we implement process in the simplest way possible:

    static const char* opcodes[] = { "add", "sub", "mul", "div", "pop" };
    auto op = find(begin(opcodes), end(opcodes), instr);

    enum { add, sub, mul, div, pop, other };

    switch (op - opcodes) {
        case add: execute(Add{}); break;
        case sub: execute(Sub{}); break;
        case mul: execute(Mul{}); break;
        case div: execute(Div{}); break;
        case pop: execute(Pop{}); break;
        case other: {
            istringstream iss(instr);
            value_type v;
            if (iss >> v)
                execute(v);
            else
                throw runtime_error("Invalid instruction '" + instr + "'");
        }
    }
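
For completeness, here is a self-contained sketch of what the Machine<T> used above might look like; the answer's real version is behind the Coliru link, so the private pop() helper, the trace() member and the place where the debug trace is emitted are my own assumptions. Dropping it into the same translation unit as the main() shown above yields a complete interpreter:

    // Sketch of Machine<T>: process() is the switch shown above, trace() produces the
    // "Executing ..." lines quoted below. Details beyond process/execute/value_type are guesses.
    #include <algorithm>
    #include <cassert>
    #include <deque>
    #include <iostream>
    #include <iterator>
    #include <sstream>
    #include <stdexcept>
    #include <string>

    struct Add {}; struct Sub {}; struct Mul {}; struct Div {}; struct Pop {};

    template <typename T>
    struct Machine {
        using value_type = T;

        void process(std::string const& instr) {
            using namespace std;
            static const char* opcodes[] = { "add", "sub", "mul", "div", "pop" };
            auto op = find(begin(opcodes), end(opcodes), instr);

            // enumerator renamed from pop to pop_op to avoid clashing with the pop() helper below
            enum { add, sub, mul, div, pop_op, other };
            switch (op - opcodes) {
                case add:    execute(Add{}); break;
                case sub:    execute(Sub{}); break;
                case mul:    execute(Mul{}); break;
                case div:    execute(Div{}); break;
                case pop_op: execute(Pop{}); break;
                case other: {
                    istringstream iss(instr);
                    value_type v;
                    if (iss >> v) execute(v);
                    else throw runtime_error("Invalid instruction '" + instr + "'");
                }
            }
            trace(instr); // debug trace, printed after each instruction
        }

      private:
        void execute(value_type v) { stack_.push_back(v); }
        void execute(Add) { auto b = pop(), a = pop(); stack_.push_back(a + b); }
        void execute(Sub) { auto b = pop(), a = pop(); stack_.push_back(a - b); }
        void execute(Mul) { auto b = pop(), a = pop(); stack_.push_back(a * b); }
        void execute(Div) { auto b = pop(), a = pop(); stack_.push_back(a / b); }
        void execute(Pop) { pop(); }

        value_type pop() {
            assert(!stack_.empty());
            auto v = stack_.back();
            stack_.pop_back();
            return v;
        }

        void trace(std::string const& instr) const {
            std::cout << "Executing " << instr << ": ";
            std::copy(stack_.begin(), stack_.end(),
                      std::ostream_iterator<value_type>(std::cout, " "));
            std::cout << "\n";
        }

        std::deque<value_type> stack_;
    };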

Adding a bit of debug tracing, we get the following output for the program "1 2 add 3 1 sub mul":

    Executing 1: 1
    Executing 2: 1 2
    Executing add: 3
    Executing 3: 3 3
    Executing 1: 3 3 1
    Executing sub: 3 2
    Executing mul: 6

Live on coliru

Using Boost Spirit

I added that as a separate answer.

+2

We have already seen a plain standard-library approach in the other answer.

It executed the instructions right away, on the fly.

This time, let's create a parser that builds an AST (abstract syntax tree). For our simple stack machine, that is just a list of instructions; let's call it Tape.

Using Boost Spirit

I would advise against using a lexer. Lexers are supported in Spirit V2 (though not, or not yet, in X3). In practice they tend to complicate things, and Spirit knows how to rewind the input on a mismatch, so you can simply attempt a production tentatively and try the next one if it turns out not to be the right "token".
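
To make that rewinding concrete, here is a tiny toy example of my own (not taken from the answer): the first alternative consumes "add" and then fails on int_, yet the second alternative still sees the input from the very beginning.

    #include <boost/spirit/home/x3.hpp>
    #include <cassert>
    #include <string>

    int main() {
        namespace x3 = boost::spirit::x3;

        std::string const input = "add all";
        auto f = input.begin();

        bool ok = x3::phrase_parse(f, input.end(),
              (x3::lit("add") >> x3::int_)        // matches "add", then fails on "all"
            | (x3::lit("add") >> x3::lit("all")), // input was rewound, so this succeeds
            x3::space);

        assert(ok && f == input.end());
    }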

Here is what using the Spirit grammar should look like:

    Tape program;

    boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

    if (parse(f, l, Parser::program, program)) {
        std::cout << "Parsed " << program.size() << " instructions\n";
    } else {
        std::cout << "Parse failed\n";
    }

Now the AST types:

    struct Add {};
    struct Sub {};
    struct Mul {};
    struct Div {};
    struct Pop {};

    using Value = int;
    using Instr = boost::variant<Add, Sub, Mul, Div, Pop, Value>;
    using Tape  = std::vector<Instr>;

Simple, correct.

Grammar

In X3, writing the grammar is pretty easy. Top down:

    auto instr   = opcode_ | int_;
    auto program = skip(space) [*instr];

Now all we need to do is teach it to recognize the opcodes. (Both opcode_ and int_ expose an attribute compatible with Instr, so *instr synthesizes a std::vector<Instr>, which is exactly our Tape; that is why the parse call above can fill program directly.) A first attempt would be:

    struct opcodes : symbols<Instr> {
        opcodes() { this->add("add", Add{})("sub", Sub{})("mul", Mul{})("div", Div{})("pop", Pop{}); }
    } opcode_;

An experienced Spirit guru will spot a problem here: opcode_ is not treated as a single token, and nothing guards against matching only part of a longer identifier. For instance, "additional" would still match, as Add followed by an unparsed "itional", and nothing makes the keyword a lexeme with respect to the surrounding skipper (think "a dd").

Fortunately, X3 makes it easy to create directives like this on the fly:

    auto opcode_ = [] {
        struct opcodes : symbols<Instr> {
            opcodes() { this->add("add", Add{})("sub", Sub{})("mul", Mul{})("div", Div{})("pop", Pop{}); }
        } codes_;

        return lexeme[codes_ >> !graph];
    }();

So now both holes are fixed.
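
As a quick sanity check (the test harness is mine; only the parser expression comes from above): the bare symbol table happily matches the "add" prefix of "additional", while the lexeme[codes_ >> !graph] version rejects the whole word.

    #include <boost/spirit/home/x3.hpp>
    #include <boost/variant.hpp>
    #include <iostream>
    #include <string>

    struct Add {}; struct Sub {};
    using Instr = boost::variant<Add, Sub>;

    int main() {
        namespace x3 = boost::spirit::x3;

        struct opcodes : x3::symbols<Instr> {
            opcodes() { this->add("add", Add{})("sub", Sub{}); }
        } codes_;

        std::string const input = "additional";
        Instr out;

        auto f = input.begin();
        bool bare = x3::phrase_parse(f, input.end(), codes_, x3::space, out);
        // bare == true, but f now points at "itional": the keyword leaked out of the word

        f = input.begin();
        bool guarded = x3::phrase_parse(f, input.end(),
                                        x3::lexeme[codes_ >> !x3::graph], x3::space, out);
        // guarded == false: "add" followed by another graphic character is not an opcode

        std::cout << std::boolalpha << bare << " " << guarded << "\n"; // prints: true false
    }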

Full demo

Live on coliru

    #include <iostream>
    #include <deque>
    #include <boost/spirit/home/x3.hpp>
    #include <boost/spirit/include/support_istream_iterator.hpp>

    struct Add {};
    struct Sub {};
    struct Mul {};
    struct Div {};
    struct Pop {};

    using Value = int;
    using Instr = boost::variant<Add, Sub, Mul, Div, Pop, Value>;

    struct Machine {
        using result_type = void;
        std::deque<Value> stack_;

        void operator()(Instr instr) { boost::apply_visitor(*this, instr); }

        void operator()(Add) {
            assert(stack_.size()>=2);
            auto op2 = stack_.back(); stack_.pop_back();
            auto op1 = stack_.back(); stack_.pop_back();
            stack_.push_back(op1 + op2);
        }
        void operator()(Sub) {
            assert(stack_.size()>=2);
            auto op2 = stack_.back(); stack_.pop_back();
            auto op1 = stack_.back(); stack_.pop_back();
            stack_.push_back(op1 - op2);
        }
        void operator()(Mul) {
            assert(stack_.size()>=2);
            auto op2 = stack_.back(); stack_.pop_back();
            auto op1 = stack_.back(); stack_.pop_back();
            stack_.push_back(op1 * op2);
        }
        void operator()(Div) {
            assert(stack_.size()>=2);
            auto op2 = stack_.back(); stack_.pop_back();
            auto op1 = stack_.back(); stack_.pop_back();
            assert(op2 != 0);
            stack_.push_back(op1 / op2);
        }
        void operator()(Value v) { stack_.push_back(v); }
        void operator()(Pop) {
            assert(stack_.size()>=1);
            stack_.pop_back();
        }

        void trace() const {
            using namespace std; // debug trace
            copy(stack_.begin(), stack_.end(), ostream_iterator<Value>(cout, " "));
            cout << "\n";
        }
    };

    using Tape = std::vector<Instr>;

    namespace Parser {
        using namespace boost::spirit::x3;

        auto opcode_ = [] {
            struct opcodes : symbols<Instr> {
                opcodes() { this->add("add", Add{})("sub", Sub{})("mul", Mul{})("div", Div{})("pop", Pop{}); }
            } codes_;
            return lexeme[codes_ >> !graph];
        }();

        auto instr   = opcode_ | int_; // TODO
        auto program = skip(space) [*instr];
    }

    int main() {
        Tape program;

        boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

        if (parse(f, l, Parser::program, program)) {
            std::cout << "Parsed " << program.size() << " instructions\n";
        } else {
            std::cout << "Parse failed\n";
        }

        if (f != l)
            std::cout << "Unparsed program data: '" << std::string(f,l) << "'\n";

        Machine machine;
        for (auto instr : program) {
            machine(instr);
            machine.trace();
        }
    }

Prints

    Parsed 7 instructions
    1
    1 2
    3
    3 3
    3 3 1
    3 2
    6

Summary

The main gains here:

  • We define our grammar declaratively. For a more complex grammar this can be a huge advantage.
  • We get backtracking for free, so there is no need to split off tokens ahead of time.

    Note: this is a PEG grammar. Backtracking only amounts to trying the next sibling alternative, or failing the current rule (so that the parent rule may in turn try its next sibling alternative).

    This is significantly different from backtracking as in regular expressions. You will notice the difference with, among other things, Kleene-* parser expressions: in PEG grammars they are always greedy and never give back even a single element (similar to "maximal munch" rules). See the sketch after this list for a concrete example.

  • We no longer have a messy switch. It is effectively hidden inside the variant visitor (see apply_visitor).

  • The instruction execution code is practically unmodified, but we renamed execute to operator() so that it models the Visitor concept.
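
Here is the sketch referred to above, illustrating the PEG greediness point (the parser expressions are my own examples, not part of the answer): the star consumes every int and never gives one back, so a trailing int_ has nothing left to match; a regex engine would backtrack here and succeed.

    #include <boost/spirit/home/x3.hpp>
    #include <iostream>
    #include <string>

    int main() {
        namespace x3 = boost::spirit::x3;

        std::string const input = "1 2 3";

        auto f = input.begin();
        bool plus_ok = x3::phrase_parse(f, input.end(), +x3::int_, x3::space)
                    && f == input.end();  // true: the list parser eats all three ints

        f = input.begin();
        bool star_then_one = x3::phrase_parse(f, input.end(),
                                              *x3::int_ >> x3::int_, x3::space);
        // false: the star already consumed "1 2 3", so the final int_ finds nothing

        std::cout << std::boolalpha << plus_ok << " " << star_then_one << "\n"; // true false
    }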

+2
source

Source: https://habr.com/ru/post/1236504/

