I work with an upstream system that sometimes sends me text intended for HTML / XML output with ampersands that are not encoded:
str1 = "Stay at this B&B"
str2 = "He&
I need to replace the unencoded ampersands with &amp; while preserving those that are part of character references or are already encoded.
(Fixing the upstream system is not an option, and since the text sometimes arrives partially encoded, re-encoding the whole string is not something I can do. I would just like to fix this one problem and get on with my life.)
This regular expression catches them well enough, but I'm having trouble figuring out the syntax for re.sub:
re.findall("&[^#|amp]", str3)
I am not sure how to do the replacement correctly; I have a feeling that it will involve re.group, but this is a weak spot in my regex-fu.
Any help is appreciated.
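One possible direction, offered as a minimal sketch rather than a definitive fix: use a negative lookahead with re.sub, so that an ampersand is replaced only when it does not start something that looks like a character reference. Note that "[^#|amp]" in the pattern above is a character class, so it matches "&" followed by any single character other than "#", "|", "a", "m" or "p", and it consumes that character as well; it would miss the ampersand in "salt&pepper" and any ampersand at the very end of the string. The names BARE_AMP and escape_bare_ampersands and the sample strings below are made up for illustration. The sketch assumes that anything of the form &name;, &#123; or &#x1F; is an intentional reference; it does not check the name against the real list of HTML entities, so text such as "AT&T;" would be left untouched.

```python
import re

# Match "&" only when it is NOT followed by something shaped like a
# character reference:
#   name;      named entity such as &amp; or &apos; (name is not validated)
#   #123;      decimal numeric reference
#   #x1F;      hexadecimal numeric reference
# The lookahead consumes nothing, so only the "&" itself is replaced.
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(text):
    """Replace unencoded ampersands with &amp;, leaving references alone."""
    return BARE_AMP.sub("&amp;", text)

# Hypothetical sample inputs, not taken from the upstream system.
samples = [
    "Stay at this B&B",
    "He&#39;s going to Harvard & Yale",
    "Tom &amp; Jerry",
]
for s in samples:
    print(escape_bare_ampersands(s))
```

Because the check is a lookahead rather than part of the match, no group handling is needed in the replacement, and running the function twice on the same text gives the same result as running it once.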