When should I not use regular expressions?

After some research, I realized that it is not possible to parse recursive structures (such as HTML or XML) with regular expressions. Is it possible to compile a comprehensive list of everyday coding scenarios in which I should avoid regular expressions because the task simply cannot be accomplished with them? Assume that the regular expression engine in question is not PCRE.

+6
3 answers

Do not use regular expressions if:

  • the language you are trying to parse is not a regular language, or
  • parsers are available that are specifically designed for the data you are trying to parse.

Parsing HTML and XML with regular expressions is usually a bad idea, both because they are not regular languages and because libraries already exist that can parse them.

As another example, if you need to check whether an integer is in the range 0-255, the code is easier to understand if you use the language’s library functions to parse the text into an integer and then check its numerical value, rather than trying to write a regular expression that matches this range.
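To illustrate the 0-255 case, here is a minimal Python sketch that puts both approaches side by side. The names `is_byte_value` and `BYTE_PATTERN` are made up for this example, and the pattern is just one common way to write such a range check.

```python
import re

# Regex version: each alternative covers one slice of the range
# (250-255, 200-249, 100-199, 0-99).
BYTE_PATTERN = re.compile(r"25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]")


def is_byte_value(text: str) -> bool:
    """Parse-then-compare version: True for decimal integers 0-255."""
    # Accept plain ASCII digits only, so signs, spaces and empty
    # strings are rejected before int() is called.
    if not (text.isascii() and text.isdigit()):
        return False
    return int(text) <= 255


if __name__ == "__main__":
    for candidate in ["0", "42", "255", "256", "-1", "abc"]:
        print(candidate, is_byte_value(candidate),
              bool(BYTE_PATTERN.fullmatch(candidate)))
```

The two versions agree on canonical input, but the intent of `int(text) <= 255` is visible at a glance, while the pattern has to be decoded alternative by alternative. They also differ on details such as leading zeros: this particular pattern rejects "007", while the parsed version accepts it.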

+26

I will plagiarize myself from my blog post, “When to use and when not to use regular expressions”...

Public websites should not allow users to enter regular expressions as search queries. Exposing the full power of regular expressions to the general public through a website’s search engine can have a devastating effect. There is such a thing as a regular expression denial of service (ReDoS) attack, and it should be avoided at all costs.
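One way to keep user input from ever reaching the regex engine as a pattern is to treat the query as literal text. The sketch below is in Python rather than .NET, and `find_matches` is a hypothetical helper written for this example. It relies on `re.escape`, which neutralizes metacharacters before the pattern is compiled.

```python
import re


def find_matches(user_query: str, documents: list[str]) -> list[str]:
    """Case-insensitive literal search; user_query is never run as a regex."""
    # re.escape turns characters such as ( ) + * into literals, so a
    # query like "(a+)+" cannot trigger catastrophic backtracking.
    pattern = re.compile(re.escape(user_query), re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


if __name__ == "__main__":
    docs = ["Price: (a+)+ units", "aaaaaaaaaaaaaaaaaaaaaaaa!", "plain text"]
    print(find_matches("(a+)+", docs))
    print(find_matches("PLAIN", docs))
```

Here the query "(a+)+" matches only the document that contains those five characters literally. If users really do need pattern search, that is a separate design problem (timeouts, restricted syntax, or an engine without backtracking), which this sketch does not attempt to solve.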

HTML/XML parsing should not be done with regular expressions. First of all, regular expressions are designed to handle regular languages, the simplest class in the Chomsky hierarchy. Now, with the advent of balancing group definitions in the .NET regular expression engine, you can venture into slightly more complex territory and do a few things with XML or HTML in controlled situations. However, not very much. Parsers are available for both XML and HTML that will do the job more easily, efficiently, and reliably. In .NET, XML can be handled the old XmlDocument way, or even more easily with LINQ to XML. For HTML, there is the HTML Agility Pack.
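The same point holds outside .NET. As a rough illustration in Python (not the .NET libraries named above), the standard `xml.etree.ElementTree` module handles arbitrarily nested elements that a classic regular expression cannot track. The sample document and the `max_depth` helper are invented for this example.

```python
import xml.etree.ElementTree as ET

# A small document with a list nested inside a list item.
DOCUMENT = (
    "<list>"
    "<item>one<list><item>nested</item></list></item>"
    "<item>two</item>"
    "</list>"
)


def max_depth(element, depth=1):
    """Return the depth of the deepest element below (and including) element."""
    return max([depth] + [max_depth(child, depth + 1) for child in element])


root = ET.fromstring(DOCUMENT)

# Direct children only: the parser knows which <item> belongs to which <list>.
top_level_items = [item.text for item in root.findall("item")]

# Every <item> at any nesting level, in document order.
all_items = [item.text for item in root.iter("item")]

if __name__ == "__main__":
    print(top_level_items)
    print(all_items)
    print(max_depth(root))
```

Telling the top-level items apart from the nested one requires knowing how many `<list>` elements are currently open, which is exactly the kind of counting that puts this problem outside regular languages.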

Conclusion

Regular expressions: use them. I still maintain that in many cases they can save the programmer a lot of time and effort. Of course, given infinite time and resources, it is almost always possible to create a procedural solution that is more efficient than the equivalent regular expression.

Your decision to forgo a regular expression should be based on three things:

1.) Is the regex so slow in your scenario that it has become a bottleneck?

2.) Is your procedural solution actually faster and easier to write than the regular expression?

3.) Is there a specialized parser that will do the job better?

+7

My rule is to use regular expressions only if no other solution exists. If a parser already exists (for example, for XML or HTML), or you are just looking for literal strings rather than patterns, there is no need to use regular expressions.

Always ask yourself: “Can I solve this without using regular expressions?” The answer to this question will tell you whether to use regular expressions.
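For the "strings, not patterns" case, plain string methods are often all that is needed. A small Python sketch follows. The function names and the `key = value` line format are assumptions made up for this example.

```python
def has_error(log_line: str) -> bool:
    """Literal substring check; no pattern matching involved."""
    return "ERROR" in log_line


def parse_setting(line: str):
    """Split a 'key = value' line at the first '='; None if there is no '='."""
    key, separator, value = line.partition("=")
    if not separator:
        return None
    return key.strip(), value.strip()


if __name__ == "__main__":
    print(has_error("2024-01-01 ERROR disk full"))
    print(parse_setting("timeout = 30"))
    print(parse_setting("url = http://example.com/?a=b"))
```

Because `str.partition` splits at the first separator only, a value that itself contains "=" stays intact, with no escaping or grouping to get wrong.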

+2

Source: https://habr.com/ru/post/898053/

