I am trying to extract specific text from PDFs that have been converted to text files. The PDFs came from various sources, and I do not know how they were created.
The pattern I was trying to extract was just two digits, then a hyphen, and then two more digits, for example 12-34. So I wrote a simple regex, \d\d-\d\d, and expected it to work.
However, when I tested this, I found that it missed some hits. I later noticed that the text contains at least two other hyphen-like characters, represented as \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d, and it worked.
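To make the difference concrete, here is a minimal sketch of the two patterns side by side. It assumes Python's re module (the \u2212 and \xad notation suggests Python, but that is a guess), and the sample string is made up for illustration.

```python
import re

# Made-up sample: an ASCII hyphen-minus, a minus sign (U+2212),
# and a soft hyphen (U+00AD), each between two pairs of digits.
sample = "12-34 56\u221278 90\xad12"

# Original pattern: only the ASCII hyphen-minus is accepted.
narrow = re.compile(r"\d\d-\d\d")

# Widened pattern: the leading "-" in the class is a literal hyphen.
wide = re.compile(r"\d\d[-\u2212\xad]\d\d")

narrow_hits = narrow.findall(sample)
wide_hits = wide.findall(sample)

print(narrow_hits)
print([ascii(hit) for hit in wide_hits])
```

The narrow pattern finds only the first token, while the widened one finds all three.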
My question: since I will be extracting text from many PDFs and do not know what other hyphen variants may appear, is there a regular expression that covers all “hyphens”, ideally one that looks cleaner than [-\u2212\xad]?
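One possible direction, sketched here under assumptions rather than offered as a definitive answer: build the character class from the Unicode "Pd" (dash punctuation) category with Python's standard unicodedata module. The minus sign (U+2212) is in category Sm and the soft hyphen (U+00AD) is in Cf, so neither is covered by Pd and both are added explicitly. The names build_dash_class, DASH, and NUMBER_PAIR are hypothetical.

```python
import re
import sys
import unicodedata


def build_dash_class(extra="\u2212\u00ad"):
    """Return a regex character class of all Pd characters plus `extra`.

    `extra` defaults to MINUS SIGN and SOFT HYPHEN, which are not in Pd.
    """
    dashes = [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    ]
    # dict.fromkeys removes duplicates while keeping order.
    chars = dict.fromkeys(dashes + list(extra))
    # re.escape keeps "-" from being read as a range operator.
    return "[" + "".join(re.escape(c) for c in chars) + "]"


DASH = build_dash_class()
NUMBER_PAIR = re.compile(r"\d\d" + DASH + r"\d\d")

# Made-up sample: hyphen-minus, minus sign, soft hyphen, en dash, no dash.
sample = "12-34 56\u221278 90\xad12 34\u201356 1234"
hits = NUMBER_PAIR.findall(sample)
print([ascii(hit) for hit in hits])
```

The exact set of Pd characters depends on the Unicode version bundled with the interpreter, and other look-alikes outside Pd may still turn up in real PDFs, so the extra argument is there to extend the class as new cases are found.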