Finding all Unicode hyphens in Python

I am trying to extract specific text from PDFs that were converted to text files. The PDFs were obtained from various sources, and I do not know how they were created.

The pattern I want to extract is just two digits, then a hyphen, then two more digits, for example 12-34. So I wrote a simple regex, \d\d-\d\d, and expected it to work.

However, when I tested it, I found that it missed some matches. I later noticed that the text contains at least two other hyphen-like characters, \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d, and it worked.

My question is: since I'm going to process so many PDFs that I can't know which other hyphen variants will appear, is there a regex that covers all “hyphens” and hopefully looks better than [-\u2212\xad]?

1 answer

The solution you ask for in the question title implies a whitelist approach: you need to collect the characters that you consider hyphen-like.

You can refer to the Unicode general category Punctuation, Dash (Pd), which lists the dash and hyphen characters defined in Unicode.

With the PyPI regex module, you can use \p{Pd} to match any Unicode dash.

Or, if you are limited to re, use this character class:

[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
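If a hard-coded class like the one above is awkward to maintain, a similar class can be built from Python's own Unicode database. The following is a minimal sketch using only the standard library; the names dash_chars, DASH_CLASS and PATTERN are made up for illustration. Note that the two characters from the question, \u2212 (MINUS SIGN, category Sm) and \xad (SOFT HYPHEN, category Cf), are not in the Pd category, so the sketch adds them explicitly.

```python
import re
import sys
import unicodedata


def dash_chars():
    """Return every character whose Unicode general category is Pd."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    ]


# Pd does not include U+2212 MINUS SIGN or U+00AD SOFT HYPHEN,
# so they are appended by hand (an assumption based on the question).
DASH_CLASS = "[" + "".join(re.escape(c) for c in dash_chars()) + "\u2212\u00ad]"

# Two digits, one hyphen-like character, two digits.
PATTERN = re.compile(r"\d\d" + DASH_CLASS + r"\d\d")

text = "12-34 56\u221278 90\u00ad12 34\u201356 7890"
print(PATTERN.findall(text))
```

The exact set of Pd characters depends on the Unicode version bundled with the Python interpreter, so the generated class may differ slightly from the hard-coded one.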

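You can extend this list with other Unicode characters that have "minus" in their Unicode names, such as \u2212 (MINUS SIGN), which is not in the Pd category.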
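One way to find such candidates is to search the Unicode character names available through the standard unicodedata module. This is only a sketch; chars_named is a hypothetical helper, and the result still needs a manual review because it also contains characters that are not hyphen-like (for example U+00B1 PLUS-MINUS SIGN).

```python
import sys
import unicodedata


def chars_named(*keywords):
    """List (code point, name) pairs whose Unicode name contains any keyword."""
    found = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed or unassigned code points yield the empty default.
        name = unicodedata.name(chr(cp), "")
        if any(keyword in name for keyword in keywords):
            found.append((f"U+{cp:04X}", name))
    return found


for code, name in chars_named("MINUS", "HYPHEN"):
    print(code, name)
```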

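A blacklist approach means that you specify which characters must not appear between the two pairs of digits. To match any non-whitespace character, use \S. To match any punctuation or symbol character, use (?:[^\w\s]|_).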
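The two blacklist-style patterns can be compared side by side. This is a small sketch with made-up names (ANY_NON_SPACE, PUNCT_OR_SYMBOL) and a made-up sample string. Keep in mind that \S also accepts letters and digits, so the second pattern is usually the safer choice for this task.

```python
import re

# Any non-whitespace character between the digit pairs
# (this includes letters and digits).
ANY_NON_SPACE = re.compile(r"\d\d\S\d\d")

# Only punctuation or symbol characters (anything that is neither a
# word character nor whitespace, plus the underscore).
PUNCT_OR_SYMBOL = re.compile(r"\d\d(?:[^\w\s]|_)\d\d")

sample = "12-34 56\u221278 90\u00ad12 12a34 12 34"
print(ANY_NON_SPACE.findall(sample))
print(PUNCT_OR_SYMBOL.findall(sample))
```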


Source: https://habr.com/ru/post/1693972/

