Strict regex for matching chemical formulas

While processing a large text database with Perl, I ran into the problem of using regular expressions to match chemical formulas. I saw these two previous topics, but the suggested answers there are too weak for my requirements.

In particular, my (admittedly limited) research led me to this publication , which provides a regex for the currently accepted chemical symbols, which I'm copying here for reference

[BCFHIKNOPSUVWY] | [ISZ] [nr] | [ACELP] [ru] | A [cglmst] | B [aehikr] | C [adeflos] | D [bsy] | Es | F [elmr] | G [ade] | H [efgos] | Kr | L [aiv] | M [cdgnot] | N [abdehiop] | O [gs] | P [abdmot] | R [abe-hnu] | S [bcegim] | T [abcehilms] | Xe | Yb

(Thus, for example C, Cmand Cnwill pass through but not Cgor Cx).

As in the previous questions, I also need to match numbers, complete sets of brackets and complete sets of square brackets, so, for example, C2H6Oand (CH3)2CFCOO(CH2)2Si(CH3)2Cl.

So, how do I combine the previous solutions with a large regular expression to match real chemical elements to strictly match the chemical formula?

(If you donโ€™t add too much hassle, truly talking about how to analyze a regular expression in a human way would be very appreciated, although not necessary.)

+4
source share
2 answers

Brief

, , , ( ). .


, OP :

  • , ,
  • 2 , 1

- , ,


(?(DEFINE)
  (?# Periodic elements )
  (?<Hydrogen>H)
  (?<Helium>He)
  (?<Lithium>Li)
  (?<Beryllium>Be)
  (?<Boron>B)
  (?<Carbon>C)
  (?<Nitrogen>N)
  (?<Oxygen>O)
  (?<Fluorine>F)
  (?<Neon>Ne)
  (?<Sodium>Na)
  (?<Magnesium>Mg)
  (?<Aluminum>Al)
  (?<Silicon>Si)
  (?<Phosphorus>P)
  (?<Sulfur>S)
  (?<Chlorine>Cl)
  (?<Argon>Ar)
  (?<Potassium>K)
  (?<Calcium>Ca)
  (?<Scandium>Sc)
  (?<Titanium>Ti)
  (?<Vanadium>V)
  (?<Chromium>Cr)
  (?<Manganese>Mn)
  (?<Iron>Fe)
  (?<Cobalt>Co)
  (?<Nickel>Ni)
  (?<Copper>Cu)
  (?<Zinc>Zn)
  (?<Gallium>Ga)
  (?<Germanium>Ge)
  (?<Arsenic>As)
  (?<Selenium>Se)
  (?<Bromine>Br)
  (?<Krypton>Kr)
  (?<Rubidium>Rb)
  (?<Strontium>Sr)
  (?<Yttrium>Y)
  (?<Zirconium>Zr)
  (?<Niobium>Nb)
  (?<Molybdenum>Mo)
  (?<Technetium>Tc)
  (?<Ruthenium>Ru)
  (?<Rhodium>Rh)
  (?<Palladium>Pd)
  (?<Silver>Ag)
  (?<Cadmium>Cd)
  (?<Indium>In)
  (?<Tin>Sn)
  (?<Antimony>Sb)
  (?<Tellurium>Te)
  (?<Iodine>I)
  (?<Xenon>Xe)
  (?<Cesium>Cs)
  (?<Barium>Ba)
  (?<Lanthanum>La)
  (?<Cerium>Ce)
  (?<Praseodymium>Pr)
  (?<Neodymium>Nd)
  (?<Promethium>Pm)
  (?<Samarium>Sm)
  (?<Europium>Eu)
  (?<Gadolinium>Gd)
  (?<Terbium>Tb)
  (?<Dysprosium>Dy)
  (?<Holmium>Ho)
  (?<Erbium>Er)
  (?<Thulium>Tm)
  (?<Ytterbium>Yb)
  (?<Lutetium>Lu)
  (?<Hafnium>Hf)
  (?<Tantalum>Ta)
  (?<Tungsten>W)
  (?<Rhenium>Re)
  (?<Osmium>Os)
  (?<Iridium>Ir)
  (?<Platinum>Pt)
  (?<Gold>Au)
  (?<Mercury>Hg)
  (?<Thallium>Tl)
  (?<Lead>Pb)
  (?<Bismuth>Bi)
  (?<Polonium>Po)
  (?<Astatine>At)
  (?<Radon>Rn)
  (?<Francium>Fr)
  (?<Radium>Ra)
  (?<Actinium>Ac)
  (?<Thorium>Th)
  (?<Protactinium>Pa)
  (?<Uranium>U)
  (?<Neptunium>Np)
  (?<Plutonium>Pu)
  (?<Americium>Am)
  (?<Curium>Cm)
  (?<Berkelium>Bk)
  (?<Californium>Cf)
  (?<Einsteinium>Es)
  (?<Fermium>Fm)
  (?<Mendelevium>Md)
  (?<Nobelium>No)
  (?<Lawrencium>Lr)
  (?<Rutherfordium>Rf)
  (?<Dubnium>Db)
  (?<Seaborgium>Sg)
  (?<Bohrium>Bh)
  (?<Hassium>Hs)
  (?<Meitnerium>Mt)
  (?<Darmstadtium>Ds)
  (?<Roentgenium>Rg)
  (?<Copernicium>Cn)
  (?<Nihonium>Nh)
  (?<Flerovium>Fl)
  (?<Moscovium>Mc)
  (?<Livermorium>Lv)
  (?<Tennessine>Ts)
  (?<Oganesson>Og)
  (?# Regex )
  (?<Element>(?&Actinium)|(?&Silver)|(?&Aluminum)|(?&Americium)|(?&Argon)|(?&Arsenic)|(?&Astatine)|(?&Gold)|(?&Barium)|(?&Beryllium)|(?&Bohrium)|(?&Bismuth)|(?&Berkelium)|(?&Bromine)|(?&Boron)|(?&Calcium)|(?&Cadmium)|(?&Cerium)|(?&Californium)|(?&Chlorine)|(?&Curium)|(?&Copernicium)|(?&Cobalt)|(?&Chromium)|(?&Cesium)|(?&Copper)|(?&Carbon)|(?&Dubnium)|(?&Darmstadtium)|(?&Dysprosium)|(?&Erbium)|(?&Einsteinium)|(?&Europium)|(?&Iron)|(?&Flerovium)|(?&Fermium)|(?&Francium)|(?&Fluorine)|(?&Gallium)|(?&Gadolinium)|(?&Germanium)|(?&Helium)|(?&Hafnium)|(?&Mercury)|(?&Holmium)|(?&Hassium)|(?&Hydrogen)|(?&Indium)|(?&Iridium)|(?&Iodine)|(?&Krypton)|(?&Potassium)|(?&Lanthanum)|(?&Lithium)|(?&Lawrencium)|(?&Lutetium)|(?&Livermorium)|(?&Moscovium)|(?&Mendelevium)|(?&Magnesium)|(?&Manganese)|(?&Molybdenum)|(?&Meitnerium)|(?&Sodium)|(?&Niobium)|(?&Neodymium)|(?&Neon)|(?&Nihonium)|(?&Nickel)|(?&Nobelium)|(?&Neptunium)|(?&Nitrogen)|(?&Oganesson)|(?&Osmium)|(?&Oxygen)|(?&Protactinium)|(?&Lead)|(?&Palladium)|(?&Promethium)|(?&Polonium)|(?&Praseodymium)|(?&Platinum)|(?&Plutonium)|(?&Phosphorus)|(?&Radium)|(?&Rubidium)|(?&Rhenium)|(?&Rutherfordium)|(?&Roentgenium)|(?&Rhodium)|(?&Radon)|(?&Ruthenium)|(?&Antimony)|(?&Scandium)|(?&Selenium)|(?&Seaborgium)|(?&Silicon)|(?&Samarium)|(?&Tin)|(?&Strontium)|(?&Sulfur)|(?&Tantalum)|(?&Terbium)|(?&Technetium)|(?&Tellurium)|(?&Thorium)|(?&Titanium)|(?&Thallium)|(?&Thulium)|(?&Tennessine)|(?&Uranium)|(?&Vanadium)|(?&Tungsten)|(?&Xenon)|(?&Ytterbium)|(?&Yttrium)|(?&Zirconium)|(?&Zinc))
  (?<Num>(?:[1-9]\d*)?)
  (?<ElementGroup>(?:(?&Element)(?&Num))+)
  (?<ElementParenthesesGroup>\((?&ElementGroup)+\)(?&Num))
  (?<ElementSquareBracketGroup>\[(?:(?:(?&ElementParenthesesGroup)(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+)|(?:(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+(?&ElementParenthesesGroup)))\](?&Num))
)
^((?<Brackets>(?&ElementSquareBracketGroup))|(?<Parentheses>(?&ElementParenthesesGroup))|(?<Group>(?&ElementGroup)))+$

  • (?(DEFINE)) ( ).
  • Element | , 1., , , ( , , C Ca)
  • ElementGroup : Element, , ( Num)
      • C - Element
      • CH - Element, Element
      • CH3 - Element, Element Num
      • O2 - Element, Num
      • N0 - 0
      • N01 - Num , 1-9
      • A -
      • C - -
  • ElementParenthesesGroup ElementGroup ( ), ElementGroup
      • (CH) - ElementGroup
      • (CH3) - ElementGroup
      • (CH3NO4) - ElementGroup,
      • (CH3N04)2 - ElementGroup, , Num
      • (CH[NO4]) - ElementGroup ElementParenthesesGroup
  • ElementSquareBracketGroup ElementParenthesesGroup ElementGroup [ ], ElementParenthesesGroup (ElementParenthesesGroup ElementGroup)
      • [CH3(NO4)] - ElementParenthesesGroup ElementParenthesesGroup ElementGroup
      • [(NO4)CH]2 - ElementParenthesesGroup ElementParenthesesGroup ElementGroup, Num
      • [(NO4)(CH3)] - ElementParenthesesGroup ElementParenthesesGroup ElementGroup
      • [(NO4)] - , [ ]
      • [NO4] - ElementParenthesesGroup

, , .

, :

  • g -
  • x - .
  • ( ), m

. Regex x, ( x). , . , (CH3)2CFCOO(CH2)2Si(CH3)2Cl, .

+5

. , @atoms. , , :

my ($atoms_regex) = map qr/$_/, join '|', map quotemeta, sort @atoms;

( , , quotemeta, | .)

@atoms.

, . , , :

my $chemical_formula_regex = qr/
  (?&item)++
  (?(DEFINE)
    (?<item> (?: \((?&item)++\) | \[(?&item)++\] | $atoms_regex ) [0-9]* )
  )
/x;
  • (?(DEFINE) ...) (?<name> ...). . (?&name). .

  • /x . !

  • ++ + , . , .

+5

Source: https://habr.com/ru/post/1685610/


All Articles