Faster parsing of a code file

I wrote a rather complicated parser for a stack-based language. It loads a file into memory and then walks through it, comparing tokens to decide whether each one is an operand or an instruction.

Every time I have to parse a new operand / instruction, I std::copy memory from the file buffer into a std::string, and then do:

    if (parsed_string.compare("add") == 0) {
        /* handle addition */
    } else if (parsed_string.compare("sub") == 0) {
        /* handle subtraction */
    } else {
        /* This is an operand */
    }

Unfortunately, all of these copies slow down parsing.

How should I handle this to avoid all of these copies? I always thought that I did not need a tokenizer, since the language itself and the logic are pretty simple.

Edit: here is the code where the copies happen for the various operands and instructions:

    // This function accounts for 70% of the total time of the program
    std::string Parser::read_as_string(size_t start, size_t end) {
        std::vector<char> file_memory(end - start);
        read_range(start, end - start, file_memory);
        std::string result(file_memory.data(), file_memory.size());
        return result; // Intended to be consumed; NRVO applies, no std::move needed
    }

    void Parser::read_range(size_t start, size_t size, std::vector<char>& destination) {
        if (destination.size() < size)
            destination.resize(size); // Allocate necessary space
        std::copy(file_in_memory.begin() + start,
                  file_in_memory.begin() + start + size,
                  destination.begin());
    }
4 answers

This copying is not required. You can work on slices.

    struct StrSlice {
        StrSlice(const std::string& embracingStr, std::size_t startIx, std::size_t length)
            : begin_(embracingStr.data() + startIx),
              end_(embracingStr.data() + startIx + length)
        {}

        StrSlice(const char* begin, const char* end) : begin_(begin), end_(end) {}

        // Define some more constructors here
        // Be careful about implicit conversions
        // ...

        // Define lots of comparison routines with other strings here
        bool operator==(const char* str) const {
            const std::size_t len = end_ - begin_;
            return std::strncmp(begin_, str, len) == 0 && str[len] == '\0';
        }
        bool operator==(const StrSlice& other) const {
            return end_ - begin_ == other.end_ - other.begin_
                && std::equal(begin_, end_, other.begin_);
        }

        // You can take a slice of a slice in O(1) time
        StrSlice subslice(std::size_t startIx, std::size_t length) const {
            assert(startIx + length <= static_cast<std::size_t>(end_ - begin_));
            const char* subsliceBegin = begin_ + startIx;
            const char* subsliceEnd = subsliceBegin + length;
            return StrSlice(subsliceBegin, subsliceEnd);
        }

    private:
        const char* begin_;
        const char* end_;
    };

I hope you get the idea. Of course, any such slice is invalidated by any change to the underlying string, in particular by a reallocation. But it sounds like your string does not change unless you load a new file.


Most likely it is not only the copying, but also the cascade of string comparisons (assuming you have more instructions than the two you showed).

You could try a lookup table (for example, a std::map or std::unordered_map) that maps each command to an enumeration value you can switch on. So instead of:

    if (parsed_string.compare("add") == 0) {
        /* handle addition */
    } else if (parsed_string.compare("sub") == 0) {
        /* handle subtraction */
    }
    ...
    else {
        /* This is an operand */
    }

You would do:

    const auto it = keywords.find(parsed_string);
    if (it != keywords.end()) {
        switch (it->second) {
        case kAdd:
            // handle addition
            break;
        case kSub:
            // handle subtraction
            break;
        ...
        }
    } else {
        // handle operand
    }

With more than a handful of keywords this results in significantly fewer string comparisons, and the copies may then weigh less in the profile. If they still do, this suggestion can be combined with the answers that use "slices" or "views" into the actual data to avoid copying as well.


How about this:

    std::string Parser::read_as_string(size_t start, size_t end)
    {
        return file_in_memory.substr(start, end - start);
    }

Your read_as_string function does nothing the standard substr does not already do, apart from adding overhead. (Note that substr's second parameter is a length, not an end offset.)


Comparing the input prefix against constant strings for the keywords is easy to code, but certainly not fast: with N keywords you perform O(N) string comparisons, and if the strings have average length L, that is O(N * L) character comparisons. Such comparisons also cannot recognize numbers, identifiers or string literals, for which there is no constant string to compare against. (And copying the prefix first, as your example shows, does not help.)

What you should do is build a finite-state automaton (FSA) to implement your lexer. This solution is used by virtually every production parser/compiler on the planet, because FSAs are very fast: a well-designed FSA inspects each character of the input exactly once, which is pretty hard to beat.

You can create such an FSA manually or use a tool.

See http://en.wikipedia.org/wiki/Lexical_analysis for background and for a list of commonly used lexer generators.


Source: https://habr.com/ru/post/1237417/
