Python regex to replace uncoded ampersands in text

Question

I work with an upstream system that sometimes sends me text intended for HTML / XML output with ampersands that are not encoded:

 str1 = "Stay at this B&B"
 str2 = "He&#8217;s going to Texas A&M"
 str3 = "He&#8217;s going to a B&amp;B and then Texas A&M"

I need to replace the unencoded ampersands with &amp; while preserving those that are part of character references or are already encoded.

(Fixing the upstream system is not an option, and since the text sometimes arrives partially encoded, re-encoding the entire string is not something I can do. I would just like to fix this problem and get on with my life.)

This regular expression catches the cases reasonably well, but I am having trouble working out the syntax for re.sub:

 re.findall("&[^#|amp]", str3)

I am not sure how to replace the text correctly; I have a feeling that it will involve re.group, but this is a weak spot in my regex-fu.

Any help is appreciated.

+4

python regex

Scott Jan 4 '12 at 17:46

3 answers

If an ampersand is part of a character entity, it can be any named entity (not just &amp;), a decimal entity, or a hexadecimal entity. This should cover it:

 re.sub(r'&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)', r'&amp;', your_string)
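As a rough sketch of how that substitution behaves on the three sample strings from the question (the names UNENCODED_AMP and encode_bare_ampersands are made up for illustration; only the pattern comes from the answer):

```python
import re

# Pattern from the answer above: an & that is NOT followed by a named,
# decimal, or hexadecimal character reference.
# UNENCODED_AMP and encode_bare_ampersands are hypothetical names.
UNENCODED_AMP = re.compile(r'&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)')

def encode_bare_ampersands(text):
    """Replace each & that does not start a character reference with &amp;."""
    return UNENCODED_AMP.sub('&amp;', text)

str1 = "Stay at this B&B"
str2 = "He&#8217;s going to Texas A&M"
str3 = "He&#8217;s going to a B&amp;B and then Texas A&M"

for s in (str1, str2, str3):
    print(encode_bare_ampersands(s))
# Stay at this B&amp;B
# He&#8217;s going to Texas A&amp;M
# He&#8217;s going to a B&amp;B and then Texas A&amp;M
```

Note that the lookahead only checks the shape of a reference, not whether the name is a real entity, so something like "&foo;" would be left untouched as well.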

+9

Alan moore Jan 4 '12 at 18:16

The first guy was close:

 re.sub(r"&(?!#\d{4};|amp;)", "&amp;", your_string)

0

odgrim Jan 4 '12 at 18:12

Andrew Clark · Accepted Answer · 2012-01-04T17:47:58+0000

I would suggest using a negative lookahead for this. The lookahead makes the match fail if the & is followed by #xxxx; (where each x is a digit) or by amp;, so the pattern matches only standalone & characters, which are then replaced with &amp;.

 re.sub(r"&(?!#\d{4};|amp;)", "&amp;", your_string)
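A quick check of that pattern, as a sketch only (fix_narrow is a hypothetical helper name). It handles the sample text from the question, but because the lookahead recognizes only four-digit decimal references and amp;, other references get encoded a second time:

```python
import re

# Pattern from the answer above; fix_narrow is a hypothetical name.
NARROW_AMP = re.compile(r"&(?!#\d{4};|amp;)")

def fix_narrow(text):
    """Encode every & not followed by a 4-digit decimal reference or amp;."""
    return NARROW_AMP.sub("&amp;", text)

print(fix_narrow("He&#8217;s going to a B&amp;B and then Texas A&M"))
# He&#8217;s going to a B&amp;B and then Texas A&amp;M

# References outside those two shapes are treated as bare ampersands:
print(fix_narrow("&#169; 2012 &lt;Acme&gt;"))
# &amp;#169; 2012 &amp;lt;Acme&amp;gt;
```

If the upstream text can contain other named entities, shorter or longer decimal references, or hexadecimal references, the lookahead would need to be widened to cover them.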
