Using Boost Spirit for a stack-based language

I need to parse a fairly simple stack-based language, for example:

1 2 add 3 1 sub 

and I came across two options:

  • Write my own lexer to tokenize the input, and then parse it.
  • Use Boost Spirit

I have never used Boost Spirit, but from what I have read (documentation and examples) I still can't decide whether it would be overkill to use Boost Spirit to lex and parse this simple language, or whether it would be wiser to use it rather than rolling my own lexer and parser (which I suppose shouldn't be too complicated).

Would you use Boost Spirit for a simple stack-based language like the one described above (keeping in mind that I would have to learn it first before I can use it)?

+5
3 answers

In the "thorough treatment" category, let me add a couple of "interpret on the fly" stack machines using Spirit Qi (v2.x) and X3.

Note that the AST-ful approach (two-phase parse, then execute) is shown in the other answer.

Spirit Qi

Here, the semantic actions have to be "composed" using Phoenix actors:

Live on coliru

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <boost/spirit/repository/include/qi_distinct.hpp>
    #include <iostream>
    #include <deque>

    namespace qi = boost::spirit::qi;
    namespace px = boost::phoenix;
    namespace qr = boost::spirit::repository::qi;

    using Stack = std::deque<int>;

    namespace actors {
        struct pop {
            Stack& s_;
            Stack::value_type operator()() const {
                Stack::value_type v = s_.back();
                s_.pop_back();
                return v;
            }
        };
        struct push {
            Stack& s_;
            template <typename V> void operator()(V const& v) const { s_.push_back(v); }
        };
        struct dump {
            Stack& s_;
            void operator()() const {
                std::copy(s_.begin(), s_.end(), std::ostream_iterator<Stack::value_type>(std::cout, " "));
                std::cout << "\n";
            }
        };
    }

    int main() {
        Stack stack_;

        boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

        bool ok;
        {
            using namespace qi;
            px::function<actors::pop>  pop_  = actors::pop{ stack_ };
            px::function<actors::push> push_ = actors::push{ stack_ };
            px::function<actors::dump> dump_ = actors::dump{ stack_ };

            ok = phrase_parse(f, l,
                *(
                    eps [ dump_() ] >>
                    (
                        lexeme [ qr::distinct(graph) [
                              lit("add") [ push_(pop_() + pop_()) ]
                            | lit("sub") [ push_(- pop_() + pop_()) ] // bit hackish
                            | lit("mul") [ push_(pop_() * pop_()) ]
                            | lit("div") [ push_(pop_() / pop_()) ] // TODO fix order
                            | lit("pop") [ pop_() ]
                        ] ]
                      | int_ [ push_(_1) ]
                    )
                ),
                space);
        }

        if (!ok)
            std::cout << "Parse failed\n";
        if (f != l)
            std::cout << "Unparsed program data: '" << std::string(f,l) << "'\n";
    }

Prints

    1
    1 2
    3
    3 3
    3 3 1
    3 2
    6


Spirit X3

The idea is the same, but here we can use proper functional composition with lambdas.

We even use a helper to generate a parser expression together with the matching binop on the fly:

Live on coliru

    #include <boost/spirit/home/x3.hpp>
    #include <boost/spirit/include/support_istream_iterator.hpp>
    #include <iostream>
    #include <deque>
    #include <cassert>

    int main() {
        std::deque<int> stack_;

        boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

        bool ok;
        {
            using namespace boost::spirit::x3;
            struct stack_tag {};

            auto binop = [](auto id, auto f) {
                auto apply = [=](auto& ctx) {
                    auto& s = get<stack_tag>(ctx);
                    assert(s.size() >= 2);
                    auto rhs = s.back(); s.pop_back();
                    auto lhs = s.back(); s.pop_back();
                    s.push_back(f(lhs, rhs));
                };
                return lexeme[as_parser(id) >> !graph] [apply];
            };

            auto push = [](auto& ctx) {
                auto& s = get<stack_tag>(ctx);
                s.push_back(_attr(ctx));
            };

            auto dump = [](auto& ctx) {
                auto& s = get<stack_tag>(ctx);
                std::copy(s.begin(), s.end(), std::ostream_iterator<int>(std::cout, " "));
                std::cout << "\n";
            };

            auto instr
                = binop("add", [](auto a, auto b) { return a + b; })
                | binop("sub", [](auto a, auto b) { return a - b; })
                | binop("mul", [](auto a, auto b) { return a * b; })
                | binop("div", [](auto a, auto b) { return a / b; })
                | int_ [ push ]
                ;

            auto parser  = skip(space) [ *(eps [ dump ] >> instr) >> eps/*post-skip*/ ];
            auto machine = with<stack_tag>(stack_) [parser];

            ok = parse(f, l, machine);
        }

        if (!ok)
            std::cout << "Parse failed\n";
        if (f != l)
            std::cout << "Unparsed program data: '" << std::string(f,l) << "'\n";
    }

Of course, it prints the same result.

  • it has none of the flaws of the Qi version
  • it compiles much faster (2.9s vs 9.2s!)
  • Note: X3 requires C++14
+4

The first approach can be dead simple in plain C++:

    int main() {
        Machine<int> machine;

        std::for_each(
            std::istream_iterator<std::string> { std::cin }, {},
            [&](auto& instr) { machine.process(instr); }
        );
    }

This exploits the fact that reading whitespace-separated words from the stream is good enough as a lexer (tokenizer).

Now we implement process in the simplest way possible:

    static const char* opcodes[] = { "add", "sub", "mul", "div", "pop" };
    auto op = find(begin(opcodes), end(opcodes), instr);

    enum { add, sub, mul, div, pop, other };

    switch (op - opcodes) {
        case add: execute(Add{}); break;
        case sub: execute(Sub{}); break;
        case mul: execute(Mul{}); break;
        case div: execute(Div{}); break;
        case pop: execute(Pop{}); break;
        case other: {
            istringstream iss(instr);
            value_type v;
            if (iss >> v)
                execute(v);
            else
                throw runtime_error("Invalid instruction '" + instr + "'");
        }
    }
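
For completeness, here is a self-contained sketch of what the Machine<T> used above might look like; the answer's real version is behind the Coliru link, so the private pop() helper, the trace() member and the place where the debug trace is emitted are my own assumptions. Dropping it into the same translation unit as the main() shown above yields a complete interpreter:

    // Sketch of Machine<T>: process() is the switch shown above, trace() produces the
    // "Executing ..." lines quoted below. Details beyond process/execute/value_type are guesses.
    #include <algorithm>
    #include <cassert>
    #include <deque>
    #include <iostream>
    #include <iterator>
    #include <sstream>
    #include <stdexcept>
    #include <string>

    struct Add {}; struct Sub {}; struct Mul {}; struct Div {}; struct Pop {};

    template <typename T>
    struct Machine {
        using value_type = T;

        void process(std::string const& instr) {
            using namespace std;
            static const char* opcodes[] = { "add", "sub", "mul", "div", "pop" };
            auto op = find(begin(opcodes), end(opcodes), instr);

            // enumerator renamed from pop to pop_op to avoid clashing with the pop() helper below
            enum { add, sub, mul, div, pop_op, other };
            switch (op - opcodes) {
                case add:    execute(Add{}); break;
                case sub:    execute(Sub{}); break;
                case mul:    execute(Mul{}); break;
                case div:    execute(Div{}); break;
                case pop_op: execute(Pop{}); break;
                case other: {
                    istringstream iss(instr);
                    value_type v;
                    if (iss >> v) execute(v);
                    else throw runtime_error("Invalid instruction '" + instr + "'");
                }
            }
            trace(instr); // debug trace, printed after each instruction
        }

      private:
        void execute(value_type v) { stack_.push_back(v); }
        void execute(Add) { auto b = pop(), a = pop(); stack_.push_back(a + b); }
        void execute(Sub) { auto b = pop(), a = pop(); stack_.push_back(a - b); }
        void execute(Mul) { auto b = pop(), a = pop(); stack_.push_back(a * b); }
        void execute(Div) { auto b = pop(), a = pop(); stack_.push_back(a / b); }
        void execute(Pop) { pop(); }

        value_type pop() {
            assert(!stack_.empty());
            auto v = stack_.back();
            stack_.pop_back();
            return v;
        }

        void trace(std::string const& instr) const {
            std::cout << "Executing " << instr << ": ";
            std::copy(stack_.begin(), stack_.end(),
                      std::ostream_iterator<value_type>(std::cout, " "));
            std::cout << "\n";
        }

        std::deque<value_type> stack_;
    };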

Adding a bit of debug tracing, we get the following output for the program "1 2 add 3 1 sub mul":

    Executing 1: 1
    Executing 2: 1 2
    Executing add: 3
    Executing 3: 3 3
    Executing 1: 3 3 1
    Executing sub: 3 2
    Executing mul: 6

Live on coliru

Using Boost Spirit

I added that as a separate answer.

+2

We have already seen a plain standard-library approach in the other answer.

It executed the instructions right away, on the fly.

This time, let's create a parser that builds an AST (abstract syntax tree). For our simple stack machine, that is just a list of instructions; let's call it Tape.

Using Boost Spirit

I would advise against using a lexer. Lexers are supported in Spirit V2 (though not, or not yet, in X3). In practice they tend to complicate things, and Spirit knows how to rewind the input on a mismatch, so you can simply attempt a production tentatively and try the next one if it turns out not to be the right "token".
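
To make that rewinding concrete, here is a tiny toy example of my own (not taken from the answer): the first alternative consumes "add" and then fails on int_, yet the second alternative still sees the input from the very beginning.

    #include <boost/spirit/home/x3.hpp>
    #include <cassert>
    #include <string>

    int main() {
        namespace x3 = boost::spirit::x3;

        std::string const input = "add all";
        auto f = input.begin();

        bool ok = x3::phrase_parse(f, input.end(),
              (x3::lit("add") >> x3::int_)        // matches "add", then fails on "all"
            | (x3::lit("add") >> x3::lit("all")), // input was rewound, so this succeeds
            x3::space);

        assert(ok && f == input.end());
    }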

Here is what using the Spirit grammar should look like:

    Tape program;

    boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

    if (parse(f, l, Parser::program, program)) {
        std::cout << "Parsed " << program.size() << " instructions\n";
    } else {
        std::cout << "Parse failed\n";
    }

Now the AST types:

    struct Add {};
    struct Sub {};
    struct Mul {};
    struct Div {};
    struct Pop {};

    using Value = int;
    using Instr = boost::variant<Add, Sub, Mul, Div, Pop, Value>;
    using Tape  = std::vector<Instr>;

Simple, correct.

Grammar

In X3, writing the grammar is pretty easy. Top down:

    auto instr   = opcode_ | int_;
    auto program = skip(space) [*instr];

Now all we need to do is teach it to recognize the opcodes. (Both opcode_ and int_ expose an attribute compatible with Instr, so *instr synthesizes a std::vector<Instr>, which is exactly our Tape; that is why the parse call above can fill program directly.) A first attempt would be:

    struct opcodes : symbols<Instr> {
        opcodes() { this->add("add", Add{})("sub", Sub{})("mul", Mul{})("div", Div{})("pop", Pop{}); }
    } opcode_;

An experienced Spirit guru will spot a problem here: opcode_ is not treated as a single token, and nothing guards against matching only part of a longer identifier. For instance, "additional" would still match, as Add followed by an unparsed "itional", and nothing makes the keyword a lexeme with respect to the surrounding skipper (think "a dd").

Fortunately, X3 makes it easy to create directives like this on the fly:

    auto opcode_ = [] {
        struct opcodes : symbols<Instr> {
            opcodes() { this->add("add", Add{})("sub", Sub{})("mul", Mul{})("div", Div{})("pop", Pop{}); }
        } codes_;

        return lexeme[codes_ >> !graph];
    }();

So now both holes are fixed.
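
As a quick sanity check (the test harness is mine; only the parser expression comes from above): the bare symbol table happily matches the "add" prefix of "additional", while the lexeme[codes_ >> !graph] version rejects the whole word.

    #include <boost/spirit/home/x3.hpp>
    #include <boost/variant.hpp>
    #include <iostream>
    #include <string>

    struct Add {}; struct Sub {};
    using Instr = boost::variant<Add, Sub>;

    int main() {
        namespace x3 = boost::spirit::x3;

        struct opcodes : x3::symbols<Instr> {
            opcodes() { this->add("add", Add{})("sub", Sub{}); }
        } codes_;

        std::string const input = "additional";
        Instr out;

        auto f = input.begin();
        bool bare = x3::phrase_parse(f, input.end(), codes_, x3::space, out);
        // bare == true, but f now points at "itional": the keyword leaked out of the word

        f = input.begin();
        bool guarded = x3::phrase_parse(f, input.end(),
                                        x3::lexeme[codes_ >> !x3::graph], x3::space, out);
        // guarded == false: "add" followed by another graphic character is not an opcode

        std::cout << std::boolalpha << bare << " " << guarded << "\n"; // prints: true false
    }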

Full demo

Live on coliru

    #include <iostream>
    #include <deque>
    #include <boost/spirit/home/x3.hpp>
    #include <boost/spirit/include/support_istream_iterator.hpp>

    struct Add {};
    struct Sub {};
    struct Mul {};
    struct Div {};
    struct Pop {};

    using Value = int;
    using Instr = boost::variant<Add, Sub, Mul, Div, Pop, Value>;

    struct Machine {
        using result_type = void;
        std::deque<Value> stack_;

        void operator()(Instr instr) { boost::apply_visitor(*this, instr); }

        void operator()(Add) {
            assert(stack_.size()>=2);
            auto op2 = stack_.back(); stack_.pop_back();
            auto op1 = stack_.back(); stack_.pop_back();
            stack_.push_back(op1 + op2);
        }
        void operator()(Sub) {
            assert(stack_.size()>=2);
            auto op2 = stack_.back(); stack_.pop_back();
            auto op1 = stack_.back(); stack_.pop_back();
            stack_.push_back(op1 - op2);
        }
        void operator()(Mul) {
            assert(stack_.size()>=2);
            auto op2 = stack_.back(); stack_.pop_back();
            auto op1 = stack_.back(); stack_.pop_back();
            stack_.push_back(op1 * op2);
        }
        void operator()(Div) {
            assert(stack_.size()>=2);
            auto op2 = stack_.back(); stack_.pop_back();
            auto op1 = stack_.back(); stack_.pop_back();
            assert(op2 != 0);
            stack_.push_back(op1 / op2);
        }
        void operator()(Value v) { stack_.push_back(v); }
        void operator()(Pop) {
            assert(stack_.size()>=1);
            stack_.pop_back();
        }

        void trace() const {
            using namespace std; // debug trace
            copy(stack_.begin(), stack_.end(), ostream_iterator<Value>(cout, " "));
            cout << "\n";
        }
    };

    using Tape = std::vector<Instr>;

    namespace Parser {
        using namespace boost::spirit::x3;

        auto opcode_ = [] {
            struct opcodes : symbols<Instr> {
                opcodes() { this->add("add", Add{})("sub", Sub{})("mul", Mul{})("div", Div{})("pop", Pop{}); }
            } codes_;
            return lexeme[codes_ >> !graph];
        }();

        auto instr   = opcode_ | int_; // TODO
        auto program = skip(space) [*instr];
    }

    int main() {
        Tape program;

        boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

        if (parse(f, l, Parser::program, program)) {
            std::cout << "Parsed " << program.size() << " instructions\n";
        } else {
            std::cout << "Parse failed\n";
        }

        if (f != l)
            std::cout << "Unparsed program data: '" << std::string(f,l) << "'\n";

        Machine machine;
        for (auto instr : program) {
            machine(instr);
            machine.trace();
        }
    }

Prints

    Parsed 7 instructions
    1
    1 2
    3
    3 3
    3 3 1
    3 2
    6

Summary

The main gains here:

  • We define our grammar declaratively. For a more complex grammar this can be a huge advantage.
  • We get backtracking for free, so there is no need to split off tokens ahead of time.

    Note: this is a PEG grammar. Backtracking only amounts to trying the next sibling alternative, or failing the current rule (so that the parent rule may in turn try its next sibling alternative).

    This is significantly different from backtracking as in regular expressions. You will notice the difference with, among other things, Kleene-* parser expressions: in PEG grammars they are always greedy and never give back even a single element (similar to "maximal munch" rules). See the sketch after this list for a concrete example.

  • We no longer have a messy switch. It is effectively hidden inside the variant visitor (see apply_visitor).

  • The instruction execution code is practically unmodified, but we renamed execute to operator() so that it models the Visitor concept.
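
Here is the sketch referred to above, illustrating the PEG greediness point (the parser expressions are my own examples, not part of the answer): the star consumes every int and never gives one back, so a trailing int_ has nothing left to match; a regex engine would backtrack here and succeed.

    #include <boost/spirit/home/x3.hpp>
    #include <iostream>
    #include <string>

    int main() {
        namespace x3 = boost::spirit::x3;

        std::string const input = "1 2 3";

        auto f = input.begin();
        bool plus_ok = x3::phrase_parse(f, input.end(), +x3::int_, x3::space)
                    && f == input.end();  // true: the list parser eats all three ints

        f = input.begin();
        bool star_then_one = x3::phrase_parse(f, input.end(),
                                              *x3::int_ >> x3::int_, x3::space);
        // false: the star already consumed "1 2 3", so the final int_ finds nothing

        std::cout << std::boolalpha << plus_ok << " " << star_then_one << "\n"; // true false
    }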

+2
source

Source: https://habr.com/ru/post/1236504/

