I would use Perl. It is more tedious than exciting. You create a regular expression containing all the alternative symbols, and then build a more complex regular expression from that and a few other bits and pieces:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $symbols = "Ac|Ag|Al|Am|Ar|As|At|Au|B|Ba|Be|Bh|Bi|Bk|Br|C|Ca|Cd|Ce|Cf|Cl|Cm|Cn|Co|Cr|Cs|Cu|Db|Ds|Dy|Er|Es|Eu|F|Fe|Fm|Fr|Ga|Gd|Ge|H|He|Hf|Hg|Ho|Hs|I|In|Ir|K|Kr|La|Li|Lr|Lu|Md|Mg|Mn|Mo|Mt|N|Na|Nb|Nd|Ne|Ni|No|Np|O|Os|P|Pa|Pb|Pd|Pm|Po|Pr|Pt|Pu|Ra|Rb|Re|Rf|Rg|Rh|Rn|Ru|S|Sb|Sc|Se|Sg|Si|Sm|Sn|Sr|Ta|Tb|Tc|Te|Th|Ti|Tl|Tm|U|Uuh|Uuo|Uup|Uuq|Uus|Uut|V|W|Xe|Y|Yb|Zn|Zr";
    #my $symbols = "Ac|Ag|Al|...|Y|Yb|Zn|Zr";

    my $regex = qr{ ([/ ]) ( (?:$symbols) (?: \d (?:$symbols) )* \d? ) ([ /]) }x;

    # Debugging aid: show the compiled regex
    printf "$regex\n";

    while (<>) {
        s/$regex/$1\\chemical{$2}$3/g;   # Handles first and third (, ...) in H2O CO2 H2SO4
        s/$regex/$1\\chemical{$2}$3/g;   # Handles second (fourth, ...)
        print $_;
    }

The first capture matches the space or slash before the symbol. The second capture is built from the ugly, doubly interpolated $symbols string. The (?:...) groups are for grouping, not capturing. The pattern looks for a chemical symbol, followed by zero or more sequences of a digit and another symbol, possibly with a trailing digit. Note that this is what you specified, but it misses compounds such as H2SO4, CO2, KMnO4, etc. You can pick those up with a simple adaptation:

    my $regex = qr{ ([/ ]) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) ([ /]) }x;

This makes the digits after each symbol optional. Allowing only a single optional digit works for many compounds, but not for some of the longer hydrocarbons: CH4, C2H6, C3H8, C4H10, ... You handle those by using zero-or-more * rather than zero-or-one ? after \d, which is why the pattern uses \d*. You still have problems with commas after compounds in lists, compounds at the start of a line, compounds at the end of a line, and so on. Your specification rules these out.
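To make the difference between ? and * concrete, here is a minimal sketch that tests a few formulas against both variants. The shortened symbol list and the helper matches() are assumptions made for illustration; they are not part of the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumption: a shortened symbol list, enough for the sample formulas only.
my $symbols = "C|H|K|Mn|O|S";

# Variant allowing at most one digit after each symbol.
my $single = qr{ ([/ ]) ( (?:$symbols) (?: \d? (?:$symbols) )* \d? ) ([ /]) }x;
# Variant allowing any number of digits after each symbol.
my $multi  = qr{ ([/ ]) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) ([ /]) }x;

# Hypothetical helper: pad the formula with spaces and report whether
# the whole padded string matches the given regex (1) or not (0).
sub matches {
    my ($re, $formula) = @_;
    return " $formula " =~ /^$re$/ ? 1 : 0;
}

for my $formula (qw(H2SO4 KMnO4 C3H8 C4H10)) {
    printf "%-6s single=%d multi=%d\n", $formula,
        matches($single, $formula), matches($multi, $formula);
}
```

Of these four samples, only C4H10 should be rejected by the single-digit variant, because the two-digit count 10 cannot be consumed by \d?.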
You would probably do better to replace the first and third captures with \b, which marks the boundary between a word character and a non-word character, so that the chemical formula is treated as a word. This deals with the problems with commas and with the start and end of a line, but it selects more than you specified.

    my $regex = qr{ \b ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) \b }x;

    # Debugging aid: show the compiled regex
    printf "$regex\n";

    while (<>) {
        s/$regex/\\chemical{$1}/g;
        print $_;
    }

Note that this formulation does not need the substitution to be applied twice; once is enough, so it is definitely cleaner.
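The reason for the difference is that the delimiter-capturing pattern consumes the space after one compound, so that space is no longer available as the leading delimiter of the next compound in the same pass. A minimal sketch of this, assuming a shortened symbol list and a made-up sample line:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumption: a shortened symbol list, enough for the sample line only.
my $symbols = "C|H|O|S";

# Pattern that captures the surrounding space or slash.
my $delimited = qr{ ([/ ]) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) ([ /]) }x;
# Pattern that uses zero-width word boundaries instead.
my $bounded   = qr{ \b ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) \b }x;

# Hypothetical input with three adjacent compounds.
my $line = " H2O CO2 H2SO4 ";

# One pass with captured delimiters skips every second compound.
(my $one_pass = $line) =~ s/$delimited/$1\\chemical{$2}$3/g;
# A second pass over the result picks up the ones that were skipped.
(my $two_pass = $one_pass) =~ s/$delimited/$1\\chemical{$2}$3/g;
# A single pass with \b wraps all three, since \b consumes no characters.
(my $word = $line) =~ s/$bounded/\\chemical{$1}/g;

print "one pass:  [$one_pass]\n";
print "two pass:  [$two_pass]\n";
print "with \\b:   [$word]\n";
```

With the full symbol list, the \b form would presumably also wrap ordinary words that happen to be symbols, such as "I" or "He", which is the sense in which it selects more than the original specification.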