Python regex to replace unencoded ampersands in text

I work with an upstream system that sometimes sends me text intended for HTML / XML output with ampersands that are not encoded:

str1 = "Stay at this B&B"
str2 = "He’s going to Texas A&M"
str3 = "He’s going to a B&B and then Texas A&M"

I need to replace unencoded ampersands with &amp; while preserving those that are part of character references or are already encoded.

(Fixing the upstream system is not an option, and since the text sometimes arrives partially encoded, re-encoding the entire string is not something I can do. I would just like to fix this problem and get on with my life.)

This regular expression catches them well enough, but I am having trouble working out the syntax for re.sub:

 re.findall("&[^#|amp]", str3) 

I am not sure how to substitute the text correctly; I have a feeling it will involve re.group, but this is a weak spot in my regex-fu.

Any help is appreciated.

+4
3 answers

I would suggest using a negative lookahead for this. The match fails if the & is followed by #xxxx; (where x is a digit) or amp;, so it matches only bare & characters and replaces them with &amp;.

 re.sub(r"&(?!#\d{4};|amp;)", "&amp;", your_string) 
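
As a rough illustration, the lookahead above can be tried on strings like the ones in the question. The helper name escape_bare_ampersands and the third sample string are made up for this sketch; only the pattern comes from the answer.

```python
import re

# Pattern from the answer: an ampersand NOT followed by "#dddd;" or "amp;".
BARE_AMP = re.compile(r"&(?!#\d{4};|amp;)")

def escape_bare_ampersands(text):
    """Replace bare ampersands with &amp; (hypothetical helper name)."""
    return BARE_AMP.sub("&amp;", text)

samples = [
    "Stay at this B&B",
    "He’s going to a B&B and then Texas A&M",
    "Already encoded: B&amp;B and &#8217;",  # assumed example, not from the question
]
for s in samples:
    print(escape_bare_ampersands(s))
```

Note that this pattern only protects &amp; and four-digit decimal references, so other entities such as &lt; would still be rewritten to &amp;lt;.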
+4

If an ampersand is part of a character entity, it can be any named entity (not just &amp;), a decimal entity, or a hexadecimal entity. This should cover it:

 re.sub(r'&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)', r'&amp;', your_string) 
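
A small sketch of how this broader pattern behaves on partially encoded input. The function name fix_ampersands and the sample string are assumptions for illustration only.

```python
import re

# Pattern from the answer: skip ampersands that start a named,
# decimal, or hexadecimal character reference.
ENTITY_SAFE_AMP = re.compile(r"&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)")

def fix_ampersands(text):
    """Encode only the ampersands that do not begin an entity (hypothetical helper)."""
    return ENTITY_SAFE_AMP.sub("&amp;", text)

# Assumed sample mixing bare and already-encoded ampersands.
mixed = "B&B, Texas A&amp;M, &lt;b&gt;, &#169;, &#x2019;, AT&T"
fixed = fix_ampersands(mixed)
print(fixed)

# Running the fix a second time should change nothing.
assert fix_ampersands(fixed) == fixed
```

One caveat: the pattern does not check that a name is a real entity, so text such as "Q&A;" is left untouched because it merely looks like one.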
+9

The first answer was close:

 re.sub(r"&(?!#\d{4};|amp;)", "&amp;", your_string) 
0

Source: https://habr.com/ru/post/1389323/

