Why does this regular expression match the space in the last match?

I have the following text:

2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂

I would like to match each molecule, including its coefficient. The regular expression below almost works, but the space character right before the last match is included in the match, which it should not be. Here is the regex that I use:

(([0-9]* ??\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*))

If you look at this regex101 link, it might be easier to see what my problem is: https://regex101.com/r/hK7jY6/1

2 answers

Update

If your strings are valid chemical formulas, why bother with subscripts, digits and letters at all? They are all non-whitespace characters. Since a letter or a ( is required at the start, put them in the [a-z(] character class, and then add \S* (zero or more non-whitespace characters):

 /(?:\d+ )?[a-z(]\S*/gi 

See the regex demo. The (?:...)? construct is an optional non-capturing group, that is, a group used only for grouping and not for capturing (storing the matched subpattern in a memory buffer).
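A minimal sketch of this pattern in Python, assuming the `re` module and using `re.IGNORECASE` in place of the `i` flag (`findall` already returns every match, so no `g` flag is needed). The variable names are only for illustration.

```python
import re

# The equation from the question.
equation = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

# Optional "digits + space" coefficient, then a letter or "(",
# then any run of non-whitespace characters.
pattern = re.compile(r"(?:\d+ )?[a-z(]\S*", re.IGNORECASE)

# The pattern has no capturing groups, so findall returns whole matches.
terms = pattern.findall(equation)
print(terms)
# ['2 HCl', '12 Na', '3 (Na₃Cl₂)₂₄', '2 NaCl', 'H₂']
```

The `+` and `→` signs are skipped because they are neither digits, letters, nor `(`.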

Original answer explaining the root cause

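You made the digits and the space at the start two separate optional subpatterns. Because [0-9]* can match an empty string and the space is optional on its own, a space that is directly followed by a letter can start a match. Instead, make the digits and the space obligatory, but wrap them together in one optional group: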

 (?:[0-9]+ )?\(*([a-z]+[₀-₉]*)+\)*[₀-₉]* 

See the regex demo.

Your [0-9]* ?? has become (?:[0-9]+ )?. Note that you do not need the lazy ?? quantifier here; the greedy ? works just as well. I also removed the 2 redundant outer groups (...).

Since the group (?:[0-9]+ )? is optional as a whole, the space can only be matched when digits come right before it. If there are no digits, the match starts with zero or more ( characters, and then a letter [a-z] must be present (if there is no (, the letter is the first character of the match).
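To see the difference, here is a small sketch in Python that runs the pattern from the question and the fixed one side by side. It assumes the `re` module with `re.IGNORECASE`; the helper `full_matches` is a hypothetical name used only for this illustration.

```python
import re

equation = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

# Pattern from the question (without the two redundant outer groups):
# digits and space are optional independently of each other.
original = r"[0-9]* ??\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*"

# Fixed pattern: digits and space are optional only as a unit.
fixed = r"(?:[0-9]+ )?\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*"

def full_matches(pattern, text):
    """Return the full text of every match (group 0), not the inner group."""
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]

print(full_matches(original, equation))
# ['2 HCl', '12 Na', '3 (Na₃Cl₂)₂₄', '2 NaCl', ' H₂']   <- leading space
print(full_matches(fixed, equation))
# ['2 HCl', '12 Na', '3 (Na₃Cl₂)₂₄', '2 NaCl', 'H₂']
```

Only the last term is affected, because it is the only place where a space is directly followed by a letter with no coefficient in front of it.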

Let me break it down:

  • (?:[0-9]+ )? - an optional sequence of one or more digits followed by a space
  • \(* - zero or more ( (perhaps you meant ? , that is, at most one)
  • ([a-z]+[₀-₉]*)+ - one or more sequences of one or more letters, each followed by zero or more subscript digits
  • \)* - zero or more ) (perhaps you meant ? , that is, at most one)
  • [₀-₉]* - zero or more subscript digits
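The breakdown above can also be written into the pattern itself. This is a sketch using Python's `re.VERBOSE` flag, which is my choice for illustration and not part of the original regex. Two assumptions to note: in verbose mode a literal space has to be written as `[ ]`, and the inner group is made non-capturing so that `findall` returns whole matches. `TERM` is a hypothetical name.

```python
import re

TERM = re.compile(
    r"""
    (?:[0-9]+[ ])?        # optional coefficient: one or more digits, then a space
    \(*                   # zero or more opening parentheses
    (?:[a-z]+[₀-₉]*)+     # one or more runs of letters, each with optional subscript digits
    \)*                   # zero or more closing parentheses
    [₀-₉]*                # zero or more subscript digits after the parentheses
    """,
    re.VERBOSE | re.IGNORECASE,
)

equation = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"
print(TERM.findall(equation))
# ['2 HCl', '12 Na', '3 (Na₃Cl₂)₂₄', '2 NaCl', 'H₂']
```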

If you also want to avoid matching unbalanced fragments such as (Ca or H) , you must split \(*...\)* into two alternatives, one without parentheses and one with both, as follows:

 (?:[0-9]+ )?(?:(?:[a-z]+[₀-₉]*)+|\((?:[a-z]+[₀-₉]*)+\))[₀-₉]* 

See another demo.
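A quick way to check the effect of the alternation is to feed both variants some unbalanced input. This is a sketch in Python, assuming the `re` module with `re.IGNORECASE`; the names `loose`, `strict` and `full_matches` are hypothetical.

```python
import re

# Parentheses are independent: any number of "(" and ")" is accepted.
loose = r"(?:[0-9]+ )?\(*(?:[a-z]+[₀-₉]*)+\)*[₀-₉]*"

# Parentheses come in pairs: either none at all, or one "(" and one ")".
strict = r"(?:[0-9]+ )?(?:(?:[a-z]+[₀-₉]*)+|\((?:[a-z]+[₀-₉]*)+\))[₀-₉]*"

def full_matches(pattern, text):
    """Return the full text of every match."""
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]

print(full_matches(loose, "(Ca"))    # ['(Ca']  unbalanced "(" is included
print(full_matches(strict, "(Ca"))   # ['Ca']   the stray "(" is left out
print(full_matches(loose, "H)"))     # ['H)']
print(full_matches(strict, "H)"))    # ['H']

equation = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"
print(full_matches(strict, equation))
# ['2 HCl', '12 Na', '3 (Na₃Cl₂)₂₄', '2 NaCl', 'H₂']
```

Note that the strict variant does not reject the unbalanced fragment entirely; it still matches the part that is valid without the stray parenthesis.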


Although Wiktor's answer is very informative, I think I found a simpler way to do this.

([0-9]+ )*[a-z\(₀-₉\)]+

This will match all parts of the equation, as far as I can tell.

Demo
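For reference, a short sketch of this pattern in Python, assuming the `re` module with `re.IGNORECASE`. Because the pattern contains a capturing group, the sketch reads `group(0)` from `finditer` instead of relying on `findall`. The variable names are hypothetical.

```python
import re

equation = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

# Optional coefficient(s), then any mix of letters, parentheses and subscript digits.
simple = re.compile(r"([0-9]+ )*[a-z\(₀-₉\)]+", re.IGNORECASE)

terms = [m.group(0) for m in simple.finditer(equation)]
print(terms)
# ['2 HCl', '12 Na', '3 (Na₃Cl₂)₂₄', '2 NaCl', 'H₂']

# The character class does not enforce any order, so this is accepted too.
accepts_nonsense = simple.fullmatch("₂)(") is not None
print(accepts_nonsense)  # True
```

It gives the same result on this equation, but it accepts the listed characters in any order, so it is more permissive than the patterns in the other answer.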

Update

Please see Wiktor's answer; his versions are better than this one.


Source: https://habr.com/ru/post/1242442/

