Finding all Unicode hyphens in Python

Question

I am trying to extract specific text from PDFs that were converted to text files. The PDFs came from various sources, and I do not know how they were created.

The pattern I want to extract is simply two digits, a hyphen, and two more digits, for example 12-34. So I wrote the simple regex \d\d-\d\d and expected it to work.

However, when I tested it, some matches were missed. I later found that the text contains at least two other hyphens, represented as \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d, and it worked.

My question is: since I'm going to process so many PDFs that I can’t know which other hyphen variants will turn up, is there a regex that covers all “hyphens”, ideally one that looks nicer than [-\u2212\xad]?

python regex

Kenneth l Feb 22 '18 at 9:19

1 answer

Wiktor Stribiżew · Accepted Answer · 2018-02-22T09:29:56+0000

The solution you ask for in the question title implies a whitelist approach, which means you need to find the characters that you consider hyphen-like.

You can refer to the Unicode category Punctuation, Dash (Pd), which lists the dash and hyphen characters that Unicode assigns to it.

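With the PyPI regex module, you can use \p{Pd} to match any character in that Unicode category.

If you can only use re, use this character class: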

[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
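As a hedged alternative to typing the class by hand, the sketch below builds an equivalent class from the standard library's unicodedata tables, so it follows whatever Unicode version the running Python ships with. The names build_dash_class and DASH_CLASS are illustrative and not part of the original answer.

```python
import re
import sys
import unicodedata


def build_dash_class():
    """Return a regex character class covering every code point in category Pd."""
    dashes = [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    ]
    # re.escape keeps the ASCII hyphen from being read as a range operator.
    return "[" + "".join(re.escape(ch) for ch in dashes) + "]"


DASH_CLASS = build_dash_class()  # built once at import time
pattern = re.compile(r"\d\d" + DASH_CLASS + r"\d\d")

# ASCII hyphen-minus, U+2010 HYPHEN and U+2013 EN DASH are all in Pd.
print(pattern.findall("12-34 56\u201078 90\u201312"))
```

Note that this class covers only category Pd, so on its own it does not match U+2212 or U+00AD from the question.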

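You can extend this list with other Unicode characters that have "minus" in their Unicode names. Neither of the extra characters from the question is in category Pd: U+2212 MINUS SIGN is a math symbol (Sm) and U+00AD SOFT HYPHEN is a format character (Cf), so both have to be added explicitly.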
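A minimal sketch of extending the whitelist, assuming the two extra characters reported in the question (U+2212 and U+00AD) are the ones to add. The helper chars_with_name_containing is a hypothetical convenience for listing candidates to review by hand; it is not from the original answer.

```python
import re
import sys
import unicodedata

# Body of the Pd character class given above (without the brackets).
PD_CLASS_BODY = (
    r"\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B"
    r"\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D"
)
# Extras from the question: U+2212 MINUS SIGN and U+00AD SOFT HYPHEN.
EXTRA_BODY = r"\u2212\u00AD"

pattern = re.compile(r"\d\d[" + PD_CLASS_BODY + EXTRA_BODY + r"]\d\d")


def chars_with_name_containing(keyword):
    """List characters whose Unicode name contains the given keyword."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Unnamed code points yield the default "" and are skipped.
        if keyword in unicodedata.name(ch, ""):
            found.append(ch)
    return found


# Candidates only: this also returns characters such as PLUS-MINUS SIGN,
# so the result is meant to be reviewed, not pasted into the class blindly.
minus_candidates = chars_with_name_containing("MINUS")

print(pattern.findall("12\u221234 56\xad78 90-12 11\u201422"))
```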

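The alternative is a blacklist approach: instead of listing the hyphens, you specify what must not appear between the two pairs of digits. To allow any non-whitespace character there, use \S. To allow any punctuation or symbol character, use (?:[^\w\s]|_).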
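To compare the two patterns \S and (?:[^\w\s]|_) as separators between the digit pairs, here is a small sketch. The sample string and variable names are made up for illustration.

```python
import re

# Any non-whitespace character between the two digit pairs.
any_nonspace = re.compile(r"\d\d\S\d\d")

# Any character that is neither a word character nor whitespace, or an
# underscore (which \w would otherwise exclude from the first branch).
punct_or_symbol = re.compile(r"\d\d(?:[^\w\s]|_)\d\d")

sample = "12-34 56\u221278 90x12 11_22 12345"

# \S also accepts letters and digits, so "90x12" and "12345" match.
print(any_nonspace.findall(sample))

# The stricter pattern rejects "90x12" and "12345".
print(punct_or_symbol.findall(sample))
```

The \S variant is the most permissive and will also pick up plain five-digit runs, so the second pattern is usually the safer choice when the separator should never be a letter or digit.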
