Matching two types of quotation marks in a regular expression

I use regex to replace quotes inside the input string. My data contains two "types" of quotation marks -

" and " 

There is a very subtle difference between the two. I currently explicitly mention both of these types in my regex

 \"*\"* 

I am afraid that future data may contain yet another “type” of quotation mark, which would make my regular expression fail. How many different types of quotes are there? Is there a way to normalize them to a single type so that my regex does not break on unseen data?

Edit -

My input consists of HTML files, and I unescape HTML entities and URL escapes, decoding each line as ASCII:

 escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore'))) 

where line is each line in the HTML file. I need to 'ignore' non-ASCII characters because the files in my database do not all share the same encoding, and I do not know the encoding before reading a file.

Edit2

I cannot do this with the replace function. I tried replace('"', ''), but it does not replace the other type of quote. If I add the other quote in a second replace call, it raises a non-ASCII character error.

Condition

External libraries are not allowed; only the Python standard library can be used.

+3
3 answers

It turns out there is a much simpler way to do this. Just add the 'u' prefix to the regular expression literal in your Python source, so that it becomes a Unicode string.

 regexp = ur'\"*\"*' 

Make sure you pass the re.UNICODE flag when you compile, search, or match the regular expression against your string.

 re.findall(regexp, string, re.UNICODE) 

Do not forget to include

 #!/usr/bin/python
 # -*- coding: utf-8 -*-

at the beginning of the source file, so that non-ASCII string literals can be written in it.
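As a minimal sketch of this approach, the snippet below matches the straight double quote and the two curly double quotes. Which curly quotes actually occur in the asker's data is an assumption, and the names QUOTE_RE and strip_double_quotes are hypothetical. The quotes are written as \u escapes so the source stays pure ASCII, and the __future__ import is meant to let the same file run on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # makes plain literals Unicode on Python 2

import re

# Assumed set of quotes: straight double quote, left and right curly
# double quotes. Written as escapes so the file itself stays ASCII.
QUOTE_RE = re.compile('["\u201c\u201d]', re.UNICODE)


def strip_double_quotes(text):
    """Remove straight and curly double quotes from a Unicode string."""
    return QUOTE_RE.sub('', text)


sample = 'He said \u201chello\u201d and "bye"'
cleaned = strip_double_quotes(sample)
```

On Python 3 the re.UNICODE flag is already the default for str patterns, so passing it is harmless but redundant there.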

0

I don't think there is a “quotation mark” character class in Python's regex implementation, so you have to do the matching yourself.

You can keep a list of common Unicode quotation mark characters ( here is a list for a good start ) and build the part of the regular expression that matches quotation marks programmatically.
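One way this could look, as a rough sketch in Python 3: keep the code points in a list and join them into a character class. The list QUOTE_CODE_POINTS is a hypothetical starter set rather than a complete inventory, and build_quote_class is an illustrative helper name.

```python
import re

# Hypothetical starter list; extend it as new quote characters turn up.
QUOTE_CODE_POINTS = [0x0022, 0x0027, 0x2018, 0x2019, 0x201C, 0x201D]


def build_quote_class(code_points):
    """Return a regex character class matching any of the given code points."""
    chars = ''.join(re.escape(chr(cp)) for cp in code_points)
    return '[' + chars + ']'


quote_re = re.compile(build_quote_class(QUOTE_CODE_POINTS))

# Finds both curly and straight quotes, in order of appearance.
found = quote_re.findall('\u201cIt\u2019s "fine"\u201d')
```

Keeping the code points in data rather than in the pattern itself means a newly discovered quote character only requires appending one number to the list.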

+3

I can only help you with the original question about quotes. As it turns out, Unicode defines many properties for each character, and all of them are available in the Unicode Character Database. Quotation_Mark is one of these properties.

How many different types of quotes are there?

29, according to Unicode, see below.

The Unicode standard provides the definitive text file on Unicode properties, PropList.txt, which includes a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot use \p{QuotationMark}. However, creating a regular expression character class is trivial:

 // placed on multiple lines for readability, remove spaces
 // and then place in your regex in place of the current quotes
 [\u0022 \u0027 \u00AB \u00BB
  \u2018 \u2019 \u201A \u201B \u201C \u201D \u201E \u201F
  \u2039 \u203A
  \u300C \u300D \u300E \u300F \u301D \u301E \u301F
  \uFE41 \uFE42 \uFE43 \uFE44
  \uFF02 \uFF07 \uFF62 \uFF63]
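To address the normalization part of the question, here is a hedged Python 3 sketch that puts the 29 code points listed above into one compiled character class and replaces every match with a single ASCII quote. The names QUOTATION_MARKS and normalize_quotes are illustrative, and mapping single quotes and apostrophes to a double quote is an assumption that may not suit every data set.

```python
import re

# The 29 Quotation_Mark code points listed above, as one character class.
QUOTATION_MARKS = re.compile(
    '[\u0022\u0027\u00AB\u00BB'
    '\u2018\u2019\u201A\u201B\u201C\u201D\u201E\u201F'
    '\u2039\u203A'
    '\u300C\u300D\u300E\u300F\u301D\u301E\u301F'
    '\uFE41\uFE42\uFE43\uFE44'
    '\uFF02\uFF07\uFF62\uFF63]'
)


def normalize_quotes(text, replacement='"'):
    """Replace every listed quotation mark with a single replacement string."""
    return QUOTATION_MARKS.sub(replacement, text)


normalized = normalize_quotes('\u00ABa\u00BB \u201Cb\u201D \u300Cc\u300D')
```

After normalization, a regex only needs to mention the one replacement character. Passing replacement='' removes the quotes instead of unifying them.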

As tchrist noted above, you can save yourself some trouble by using Matthew Barnett's regex library, which supports \p{QuotationMark}.

+1

Source: https://habr.com/ru/post/911588/

