While processing a large text database with Perl, I ran into the problem of matching chemical formulas with regular expressions. I saw these two previous topics, but the answers suggested there are too permissive for my requirements.
In particular, my (admittedly limited) research led me to this publication, which provides a regex for the currently accepted chemical symbols. I'm copying it here for reference:
[BCFHIKNOPSUVWY] | [ISZ][nr] | [ACELP][ru] | A[cglmst] | B[aehikr] | C[adeflmnos] | D[bsy] | Es | F[elmr] | G[ade] | H[efgos] | Kr | L[aiv] | M[cdgnot] | N[abdehiop] | O[gs] | P[abdmot] | R[abe-hnu] | S[bcegim] | T[abcehilms] | Xe | Yb
(Thus, for example, C, Cm and Cn will match, but Cg and Cx will not.)
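As a quick sanity check of the symbol regex on its own, here is a minimal Perl sketch. It assumes the `C[...]` class includes `m` and `n` (so that Cm and Cn match, as stated above), uses the `/x` modifier so the spaces between alternatives are ignored, and introduces a hypothetical helper `is_element` that is not part of any earlier answer.

```perl
use strict;
use warnings;

# Element-symbol alternation from the cited publication. The /x modifier
# makes the whitespace between alternatives insignificant.
# Assumption: the C class contains m and n so that Cm and Cn are accepted.
my $element = qr/
    [BCFHIKNOPSUVWY] | [ISZ][nr] | [ACELP][ru] | A[cglmst] | B[aehikr]
  | C[adeflmnos] | D[bsy] | Es | F[elmr] | G[ade] | H[efgos] | Kr | L[aiv]
  | M[cdgnot] | N[abdehiop] | O[gs] | P[abdmot] | R[abe-hnu] | S[bcegim]
  | T[abcehilms] | Xe | Yb
/x;

# Hypothetical helper: true if the whole string is exactly one element symbol.
sub is_element {
    my ($symbol) = @_;
    return $symbol =~ /\A(?:$element)\z/ ? 1 : 0;
}

# Try the examples mentioned above.
for my $symbol (qw(C Cm Cn Cg Cx)) {
    printf "%-2s %s\n", $symbol,
        is_element($symbol) ? "matches" : "does not match";
}
```

The `\A` and `\z` anchors matter here: Perl tries alternatives from left to right, and the single-letter class comes first, so an unanchored match against `Cl` would stop after `C`. The same ordering has to be kept in mind when this regex is embedded in a larger pattern.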
As in the previous questions, I also need to match numbers, balanced parentheses and balanced square brackets, as in, for example, C2H6O and (CH3)2CFCOO(CH2)2Si(CH3)2Cl.
So, how do I combine the previous solutions with this large regular expression for real element symbols, so that only valid chemical formulas match?
(If it isn't too much trouble, an explanation of how to read such a regular expression in human terms would be much appreciated, although it is not required.)