Search for all chemical symbols in a file

I have a file containing many chemical formulas, and I need to mark up any text that is a chemical formula. I want to search the file for every combination of at least one chemical symbol and at least one number, and wrap it in \chemical{}. For instance, H2O becomes \chemical{H2O}, and FeS2 becomes \chemical{FeS2}.

  • Formulas are delimited by spaces ( ) or forward slashes (/). For example, /Ar becomes /\chemical{Ar}, but Arizona should not be turned into \chemical{Ar}izona.
  • Combinations that do not contain numbers should be ignored.
  • I found this list, which I believe contains all possible chemical symbols: "Ac, Ag, Al, Am, Ar, As, At, Au, B, Ba, Be, Bh, Bi, Bk, Br, C, Ca, Cd, Ce, Cf, Cl, Cm, Cn, Co, Cr, Cs, Cu, Db, Ds, Dy, Er, Es, Eu, F, Fe, Fm, Fr, Ga, Gd, Ge, H, He, Hf, Hg, Ho, Hs, I, In, Ir, K, Kr, La, Li, Lr, Lu, Md, Mg, Mn, Mo, Mt, N, Na, Nb, Nd, Ne, Ni, No, Np, O, Os, P, Pa, Pb, Pd, Pm, Po, Pr, Pt, Pu, Ra, Rb, Re, Rf, Rg, Rh, Rn, Ru, S, Sb, Sc, Se, Sg, Si, Sm, Sn, Sr, Ta, Tb, Tc, Te, Th, Ti, Tl, Tm, U, Uuh, Uuo, Uup, Uuq, Uus, Uut, V, W, Xe, Y, Yb, Zn, Zr".

How can I find all the chemical formulas appearing in a file?

+4
2 answers

I would use Perl. It is more tedious than exciting. You build a regular expression that lists all the symbols as alternatives, and then build a more complex regular expression from that and a few other bits and pieces:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $symbols = "Ac|Ag|Al|Am|Ar|As|At|Au|B|Ba|Be|Bh|Bi|Bk|Br|C|Ca|Cd|Ce|Cf|Cl|Cm|Cn|Co|Cr|Cs|Cu|Db|Ds|Dy|Er|Es|Eu|F|Fe|Fm|Fr|Ga|Gd|Ge|H|He|Hf|Hg|Ho|Hs|I|In|Ir|K|Kr|La|Li|Lr|Lu|Md|Mg|Mn|Mo|Mt|N|Na|Nb|Nd|Ne|Ni|No|Np|O|Os|P|Pa|Pb|Pd|Pm|Po|Pr|Pt|Pu|Ra|Rb|Re|Rf|Rg|Rh|Rn|Ru|S|Sb|Sc|Se|Sg|Si|Sm|Sn|Sr|Ta|Tb|Tc|Te|Th|Ti|Tl|Tm|U|Uuh|Uuo|Uup|Uuq|Uus|Uut|V|W|Xe|Y|Yb|Zn|Zr";

    # $1 and $3 capture the delimiters (space or slash); $2 is the formula.
    my $regex = qr{ ([/ ]) ( (?:$symbols) (?: \d (?:$symbols) )* \d? ) ([ /]) }x;
    print STDERR "$regex\n";    # Debugging aid: show the compiled pattern

    while (<>) {
        # Each match consumes its delimiters, so neighbours that share a
        # single space cannot both match in one pass; hence two passes.
        s/$regex/$1\\chemical{$2}$3/g;    # Handles first and third (, ...) in H2O CO2 H2SO4
        s/$regex/$1\\chemical{$2}$3/g;    # Handles second (fourth, ...)
        print;
    }

The first capture matches the space or slash before the symbol. The second capture is the formula itself; it is built from the long alternation in $symbols, which is interpolated twice. The (?:...) groups are for grouping only, not for capturing. The pattern looks for a chemical symbol, followed by zero or more sequences of a digit and another symbol, optionally with a trailing digit. Note that this is what you specified, but it skips compounds such as H2SO4, CO2, KMnO4, etc. You can pick those up with a simple adaptation:

    my $regex = qr{ ([/ ]) ( (?:$symbols) (?: \d? (?:$symbols) )* \d? ) ([ /]) }x;

This accepts at most a single digit at each position in a compound. That works for many formulas, but the longer hydrocarbons fail as soon as a count reaches two digits: CH4, C2H6 and C3H8 match, but C4H10 does not. Again, you can handle this by replacing the 0-or-1 quantifier ? with the 0-or-more quantifier * . You still have problems with compounds followed by commas in lists, compounds at the beginning of a line, compounds at the end of a line, and so on. Your specification strictly rules those out.

You would do better to replace the first and third captures with \b, which marks a word boundary: the position between a word character and a non-word character. The chemical formula is then treated as a whole word. This deals with the problems with commas and with the beginning and end of a line, but it selects more than you specified.

    my $regex = qr{ \b ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) \b }x;
    printf "$regex\n";

    while (<>) {
        s/$regex/\\chemical{$1}/g;
        print $_;
    }

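Note that this formulation does not need the substitution to be applied twice. Because \b matches a position rather than a character, neighbouring formulas no longer compete for the same delimiter, so one pass is enough, and the result is definitely cleaner.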

+5

Using awk:

    awk 'BEGIN {
        strElements = "Ac Ag Al Am Ar As At Au B Ba Be Bh Bi Bk Br C Ca Cd Ce Cf Cl Cm Cn Co Cr Cs Cu Db Ds Dy Er Es Eu F Fe Fm Fr Ga Gd Ge H He Hf Hg Ho Hs I In Ir K Kr La Li Lr Lu Md Mg Mn Mo Mt N Na Nb Nd Ne Ni No Np O Os P Pa Pb Pd Pm Po Pr Pt Pu Ra Rb Re Rf Rg Rh Rn Ru S Sb Sc Se Sg Si Sm Sn Sr Ta Tb Tc Te Th Ti Tl Tm U Uuh Uuo Uup Uuq Uus Uut V W Xe Y Yb Zn Zr"
        n = split(strElements, arrElements)
        # split() fills arrElements[1..n], so the loop is 1-based
        for (i = 1; i <= n; i++)
            hashElements[arrElements[i]] = 1
    }
    {
        for (i = 1; i <= NF; i++) {
            # Strip a leading slash before testing the word
            str = substr($i, 1, 1) == "/" ? substr($i, 2) : $i
            n = split(str, elements, "[0123456789]+")
            # A trailing digit leaves an empty last piece; ignore it
            if (n > 1 && elements[n] == "") n--
            while (n > 0) {
                if (!(elements[n] in hashElements)) break
                n--
            }
            if (n == 0)
                $i = (substr($i, 1, 1) == "/" ? "/" : "") "\\chemical{" str "}"
        }
        print
    }' your_file

The idea of the script is as follows:

  • Create a hash of all elements (in awk , arrays are associative).
  • For each line, take one word at a time, split it on runs of digits, and check whether each piece is an element.
  • If so, wrap the word in the desired markup.

Of course, some extra logic is needed to handle the special character / ; the script only deals with a slash at the start of a word.
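Because each digit-free piece must be exactly one element, words such as FeS2 or H2SO4 (pieces "FeS" and "SO") are not recognised by this approach. A possible variation, sketched here under stated assumptions, lets each piece be a run of one or more symbols and also requires at least one digit, as the question asks. The `tag_words` function name and the abbreviated symbol list are hypothetical, and slash handling is left out to keep the sketch short.

```shell
# tag_words: read text on stdin, wrap whitespace-separated words that
# consist only of element symbols and digits and contain a digit.
tag_words() {
  awk 'BEGIN {
      # Abbreviated symbol list, for illustration only (assumption);
      # substitute the full list from the question in real use.
      syms = "Ar|Fe|H|K|Mn|O|S"
      piece_re = "^(" syms ")+$"      # a piece is one or more symbols
  }
  {
      for (i = 1; i <= NF; i++) {
          n = split($i, pieces, /[0-9]+/)
          if (n < 2) continue         # no digit in the word: leave it
          ok = 1
          for (j = 1; j <= n; j++) {
              # A trailing digit leaves an empty last piece; skip it
              if (j == n && pieces[j] == "") continue
              if (pieces[j] !~ piece_re) ok = 0
          }
          if (ok) $i = "\\chemical{" $i "}"
      }
      print
  }'
}

printf '%s\n' 'water H2SO4 and FeS2 in Ar' | tag_words
```

As in the original script, assigning to a field makes awk rebuild the line with single spaces between words, so runs of whitespace in the input are not preserved.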

+1

Source: https://habr.com/ru/post/1400793/
