Regular expression for extracting operands from mathematical expressions

No existing question on SO covers my specific issue, and I don't know much about regex. I am creating an expression parser in Java using the java.util.regex classes (Pattern and Matcher). I want to extract operands, arguments, operators, characters and function names from an expression and store them in an ArrayList. I am currently using this logic:

    // This is just for testing; later the expression will be provided by the user
    String string = "2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)";
    List<String> res = new ArrayList<>();
    // This pattern string is built by a function from the operator names,
    // which means the user can add custom operators and custom functions
    Pattern pattern = Pattern.compile("(\\Q^\\E|\\Q/\\E|\\Q-\\E|\\Q-\\E|\\Q+\\E|\\Q*\\E|\\Q)\\E|\\Q)\\E|\\Q(\\E|\\Q(\\E|\\Q%\\E|\\Q!\\E)");
    Matcher m = pattern.matcher(string);
    int pos = 0;
    while (m.find()) {
        // Text between the previous operator and this one is an operand or a name
        if (pos != m.start()) {
            res.add(string.substring(pos, m.start()));
        }
        res.add(m.group());
        pos = m.end();
    }
    // Whatever follows the last operator is the final token
    if (pos != string.length()) {
        addToTokens(res, string.substring(pos)); // helper from my own code, not shown here
    }
    for (String s : res) {
        System.out.println(s);
    }

Output:

 2 ! + atan2 ( 3 + 9 , 2 + 3 ) - 2 * PI + 3 / 3 - 9 - 12 % 3 * sin ( 9 - 9 ) + ( 2 + 6 / 2 ) 

The problem is that now the expression may contain a matrix with a custom format. I want to consider each Matrix as an Operand or Argument in the case of functions.

Input 1:

 String input_1 = "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6";

The output should be:

 2 + 3 - 9 * [{2+3,2,6},{7,2+3,2+3i}] + 9 * 6 

Input 2:

 String input_2 = "{[2,5][9/8,func(2+3)]}+9*8/5";

The output should be:

 {[2,5][9/8,func(2+3)]} + 9 * 8 / 5 

Input 3:

 String input_3 = "<[2,9,2.36][2,3,2!]>*<[2,3,9][23+9*8/8,2,3]>";

The output should be:

 <[2,9,2.36][2,3,2!]> * <[2,3,9][23+9*8/8,2,3]> 

I want the ArrayList to contain each operand, operator, argument, function name and character at its own index. How can I achieve this using a regular expression? Expression validation is not required.

+5
2 answers

I think you can try something like:

 (?<matrix>(?:\[[^\]]+\])|(?:<[^>]+>)|(?:\{[^\}]+\}))|(?<function>\w+(?=\())|(\d+[eE][-+]\d+)|(?<operand>\w+)|(?<operator>[-+\/*%])|(?<symbol>.) 

Demo

The elements are captured in named capture groups. If you do not need them, you can use the shorter version:

 \[[^\]]+\]|<[^>]+>|\{[^\}]+\}|\d+[eE][-+]\d+|\w+(?=\()|\w+|[-+\/*%]|. 


The alternative \[[^\]]+\]|<[^>]+>|\{[^\}]+\} matches an opening bracket ( { , [ or < ), then one or more characters that are not the corresponding closing bracket, then the closing bracket ( } , ] or > ). So as long as brackets of the same type are not nested inside each other, there is no problem. Java implementation:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Test {
        public static void main(String[] args) {
            String[] expressions = {
                "2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)",
                "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6",
                "{[2,5][9/8,func(2+3)]}+9*8/5",
                "<[2,9,2.36][2,3,2!]>*<[2,3,9][23 + 9 * 8 / 8, 2, 3]>"
            };
            Pattern pattern = Pattern.compile("(?<matrix>(?:\\[[^]]+])|(?:<[^>]+>)|(?:\\{[^}]+}))|(?<function>\\w+(?=\\())|(?<operand>\\w+)|(?<operator>[-+/*%])|(?<symbol>.)");
            for (String expression : expressions) {
                // Each successive match is one token of the expression
                List<String> elements = new ArrayList<String>();
                Matcher matcher = pattern.matcher(expression);
                while (matcher.find()) {
                    elements.add(matcher.group());
                }
                for (String element : elements) {
                    System.out.println(element);
                }
                System.out.println("\n\n\n");
            }
        }
    }

Explanation of alternatives:

  • \[[^\]]+\]|<[^>]+>|\{[^\}]+\} - match an opening bracket of a given type, one or more characters that are not the closing bracket of that type, and then the closing bracket of that type,
  • \d+[eE][-+]\d+ - match digits, then e or E , followed by a + or - sign, and then more digits, to capture elements like 2e+3 ,
  • \w+(?=\() - match one or more word characters (A-Za-z0-9_) only if they are followed by ( , to match function names such as sin ,
  • \w+ - match one or more word characters (A-Za-z0-9_) to match operands,
  • [-+\/*%] - match a single character from the character class to match operators,
  • . - match any single remaining character to catch everything else.

The order of the alternatives is very important. The last alternative, . , matches any character, so it has to come last. The same applies to \w+(?=\() and \w+ : the second matches everything the first one does, so the more specific alternative has to come first. However, if you do not need to distinguish between functions and operands, \w+ alone is enough for both.
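To illustrate why the order matters, here is a minimal sketch that compares the two word alternatives in both orders. The class name AlternativeOrderDemo and the helper kindOfFirstToken are made up for this illustration; only the two alternatives themselves come from the pattern above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternativeOrderDemo {
    // Function alternative first, as in the answer's pattern.
    static final Pattern FUNCTION_FIRST =
            Pattern.compile("(?<function>\\w+(?=\\())|(?<operand>\\w+)");

    // Same alternatives swapped: \w+ succeeds before the lookahead is ever tried.
    static final Pattern OPERAND_FIRST =
            Pattern.compile("(?<operand>\\w+)|(?<function>\\w+(?=\\())");

    // Hypothetical helper: reports which named group matched the first token.
    static String kindOfFirstToken(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        if (!m.find()) {
            return "none";
        }
        return m.group("function") != null ? "function" : "operand";
    }

    public static void main(String[] args) {
        System.out.println(kindOfFirstToken(FUNCTION_FIRST, "sin(9-9)")); // function
        System.out.println(kindOfFirstToken(OPERAND_FIRST, "sin(9-9)"));  // operand
        System.out.println(kindOfFirstToken(FUNCTION_FIRST, "PI+3"));     // operand
    }
}
```

With the order swapped, sin is still extracted as a token, but it is reported as an operand because the regex engine takes the first alternative that succeeds at a given position.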

In the longer pattern, the (?<name> ... ) part of each alternative is a named capture group. In the demo you can see how it sorts the matched fragments into groups such as operand, operator, function, etc.
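If you want the token type as well as its text, the named groups can be queried after each match. The following is only a sketch: the class name TokenKinds, the GROUPS array and the "kind:text" output format are assumptions made for this example, and the pattern is the Java version from above with the closing brackets escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenKinds {
    // Group names used in the pattern, listed in the order they appear.
    static final String[] GROUPS = {"matrix", "function", "operand", "operator", "symbol"};

    static final Pattern PATTERN = Pattern.compile(
            "(?<matrix>(?:\\[[^\\]]+\\])|(?:<[^>]+>)|(?:\\{[^}]+\\}))"
            + "|(?<function>\\w+(?=\\())"
            + "|(?<operand>\\w+)"
            + "|(?<operator>[-+/*%])"
            + "|(?<symbol>.)");

    // Returns one "kind:text" entry per token, for example "operand:2".
    static List<String> classify(String expression) {
        List<String> result = new ArrayList<>();
        Matcher m = PATTERN.matcher(expression);
        while (m.find()) {
            // Exactly one named group is non-null for each match.
            for (String name : GROUPS) {
                if (m.group(name) != null) {
                    result.add(name + ":" + m.group(name));
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String entry : classify("2*sin(PI)+[1,2]")) {
            System.out.println(entry);
        }
    }
}
```

Note that characters such as ! , ( and , fall into the symbol group, because the operator character class in this pattern only lists - + / * and % .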

+1

With regular expressions, you cannot match balanced brackets nested to an arbitrary depth.

For example, in the second input, {[2,5][9/8,func(2+3)]} , you need to pair the opening brace with its closing brace, and to do that you have to keep track of how many inner braces, brackets, parentheses, etc. have been opened and closed. This cannot be done with regular expressions.
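To make the limitation concrete, here is a small sketch that runs the bracket alternative from the first answer against two inputs. The class name NestedBracketDemo, the helper firstMatch and the second test input are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedBracketDemo {
    // Bracket alternative from the first answer: open, non-closing characters, close.
    static final Pattern BRACKETS =
            Pattern.compile("\\[[^\\]]+\\]|<[^>]+>|\\{[^}]+\\}");

    // Hypothetical helper: returns the first bracketed chunk found, or null.
    static String firstMatch(String input) {
        Matcher m = BRACKETS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Different bracket types inside each other: matched as one chunk.
        System.out.println(firstMatch("[{2+3,2,6},{7,2+3,2+3i}]+9"));
        // prints [{2+3,2,6},{7,2+3,2+3i}]

        // The same bracket type nested: the match stops at the first ']'.
        System.out.println(firstMatch("[[1,2],[3,4]]+9"));
        // prints [[1,2]
    }
}
```

The three inputs in the question only nest brackets of different types, which is why the simple pattern works for them.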

If, on the other hand, you simplify the problem by dropping the balancing requirement, then regular expressions can probably handle it.
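If the balancing requirement cannot be dropped, one alternative is a small hand-written scanner that counts bracket depth instead of using a regular expression. The sketch below is an assumption about how that could look, not code from the question: the class name DepthTokenizer and its rules (a matrix starts at [ , { or < ; a run of letters, digits, _ and . is one token; everything else is a single-character token) are chosen for this example. It only counts depth and does not check that the bracket types actually match, which fits the "validation is not required" condition.

```java
import java.util.ArrayList;
import java.util.List;

class DepthTokenizer {
    private static final String OPENERS = "[{<";
    private static final String CLOSERS = "]}>";

    static List<String> tokenize(String expression) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < expression.length()) {
            char c = expression.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // ignore whitespace between tokens
                continue;
            }
            int start = i;
            if (OPENERS.indexOf(c) >= 0) {
                // Matrix: consume characters until the depth returns to zero
                // (or the input ends, since no validation is done).
                int depth = 0;
                do {
                    char d = expression.charAt(i++);
                    if (OPENERS.indexOf(d) >= 0) {
                        depth++;
                    } else if (CLOSERS.indexOf(d) >= 0) {
                        depth--;
                    }
                } while (depth > 0 && i < expression.length());
            } else if (isWordChar(c)) {
                // Number, constant or function name: consume the whole run.
                while (i < expression.length() && isWordChar(expression.charAt(i))) {
                    i++;
                }
            } else {
                // Operator, parenthesis, comma or any other single character.
                i++;
            }
            tokens.add(expression.substring(start, i));
        }
        return tokens;
    }

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '.';
    }

    public static void main(String[] args) {
        System.out.println(tokenize("{[2,5][9/8,func(2+3)]}+9*8/5"));
        // prints [{[2,5][9/8,func(2+3)]}, +, 9, *, 8, /, 5]
    }
}
```

Because the depth counter treats all three bracket types alike, this also keeps a matrix such as [[1,2],[3,4]] in one token, which the single-level pattern cannot do.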

+2

Source: https://habr.com/ru/post/1233481/

