I am using a regex to replace quotes inside an input string. My data contains two "types" of quotation marks -
" and “
There is a very subtle difference between the two. I currently mention both of these types explicitly in my regex
\"*\“*
I am afraid that future data may contain yet another “type” of quote, on which my regular expression would fail. How many different types of quotes are there? Is there a way to normalize them all to a single type so that my regex does not break on unseen data?
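To make the "normalize to one type" idea concrete, here is a minimal sketch using only built-in string methods. The names DOUBLE_QUOTES, SINGLE_QUOTES, QUOTE_TABLE and normalize_quotes are made up for illustration, and the character lists are hand-picked common cases, not a complete inventory of every quote-like character in Unicode.

```python
# Hypothetical, non-exhaustive lists of quote-like characters, written as
# Unicode escapes so the source file itself stays pure ASCII.
DOUBLE_QUOTES = u"\u201c\u201d\u201e\u201f\u00ab\u00bb\u301d\u301e\uff02"
SINGLE_QUOTES = u"\u2018\u2019\u201a\u201b\u2039\u203a\uff07"

# Translation table: code point -> plain ASCII replacement.
QUOTE_TABLE = {}
for ch in DOUBLE_QUOTES:
    QUOTE_TABLE[ord(ch)] = u'"'
for ch in SINGLE_QUOTES:
    QUOTE_TABLE[ord(ch)] = u"'"


def normalize_quotes(text):
    """Map the listed quote variants in a unicode string to ASCII quotes."""
    return text.translate(QUOTE_TABLE)
```

After this step a regex would only need to mention the ASCII " and ' characters; any variant that shows up later could be added to the lists without touching the regex.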
Edit -
My input comes from HTML files, and I unescape HTML entities and URL-encoded characters after decoding each line as ASCII
escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))
where line is each line of the HTML file. I have to pass 'ignore' when decoding as ASCII, because the files in my database do not all share the same encoding, and I do not know a file's encoding before reading it.
Edit2
I cannot do this using the replace function. I tried replace('"', ''), but it does not replace the other type of quote, '“'. If I add that character in another replace call, it raises a non-ASCII character error.
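For reference, a sketch of the replace approach that avoids typing the curly quote into the source file. It assumes the error comes from the non-ASCII literal sitting in a source file with no encoding declaration, which is only a guess; the names LEFT_DOUBLE_QUOTE, RIGHT_DOUBLE_QUOTE and strip_double_quotes are hypothetical.

```python
# The curly quotes are written as Unicode escapes, so the source file
# contains only ASCII characters.
LEFT_DOUBLE_QUOTE = u"\u201c"
RIGHT_DOUBLE_QUOTE = u"\u201d"


def strip_double_quotes(text):
    """Remove straight and curly double quotes from a unicode string."""
    for quote in (u'"', LEFT_DOUBLE_QUOTE, RIGHT_DOUBLE_QUOTE):
        text = text.replace(quote, u"")
    return text
```

This expects text to already be a unicode string, such as the result of the unescape call above, rather than raw bytes.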
Condition
External libraries are not allowed; only the Python standard library can be used.