Is it possible to use regular expressions with pdfquery?

Question

Is it possible to use a regular expression to find text in a PDF (using pdfquery or another tool)?

I know that we can do this:

    import pdfquery

    pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
    pdf.load()
    # Locate the "Cash" label, then read the text in the box just below it.
    label = pdf.pq('LTTextLineHorizontal:contains("Cash")')
    left_corner = float(label.attr('x0'))
    bottom_corner = float(label.attr('y0'))
    cash = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' %
                  (left_corner, bottom_corner - 30,
                   left_corner + 150, bottom_corner)).text()
    print(cash)
    # 179,000.00

But we need something like this:

    pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
    pdf.load()
    label = pdf.pq('LTTextLineHorizontal:regex("\d{1,3}(?:,\d{3})*(?:\.\d{2})?")')
    cash = str(label.attr('x0'))
    print(cash)
    # 179,000.00

python regex pdfminer

Dayvid oliveira Oct 13 '15 at 19:57

1 answer

Dayvid oliveira · Answer 1 · 2015-10-13T21:19:33+0000

This is not exactly a regex search, but it lets you format or filter the selected text with a regular expression:

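    import re
    import pdfquery

    def regex_function(pattern, text):
        # Return the first capturing group of the match, or None.
        re_obj = re.search(pattern, text)
        if re_obj is not None and len(re_obj.groups()) > 0:
            return re_obj.group(1)
        return None

    pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
    pdf.load()
    # SOME_PATTERN_HERE is a placeholder: supply a pattern with at least one capturing group.
    pdf.extract([
        ('with_parent', 'LTPage[pageid=1]'),
        ('with_formatter', 'text'),
        # The custom formatter receives the matched PyQuery object, so pass its text.
        ('year', 'LTTextLineHorizontal:contains("Form 1040A (")',
         lambda match: regex_function(SOME_PATTERN_HERE, match.text())),
    ])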
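The helper can be tried on plain strings before wiring it into `pdf.extract`. The sketch below is only an illustration: the name `first_group`, the pattern `YEAR_PATTERN`, and the sample line are all assumptions, not taken from the PDF or from pdfquery.

```python
import re

def first_group(pattern, text):
    """Return the first capturing group of the first match, or None."""
    found = re.search(pattern, text)
    if found is not None and found.groups():
        return found.group(1)
    return None

# Hypothetical pattern: a four-digit year inside parentheses.
YEAR_PATTERN = r"\((\d{4})\)"

# Made-up sample line, standing in for text extracted from a page.
sample = "Form 1040A (2007)"
year = first_group(YEAR_PATTERN, sample)
print(year)
```

A pattern without a capturing group makes the helper return `None`, so the pattern passed to the formatter needs at least one group.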

I have not tested the following, but it may also work:

    def regex_function_filter_here():
        # Here you could use some regex. PyQuery exposes the current element as `this`.
        return float(this.get('width', 0)) * float(this.get('height', 0)) > 40000

    pdf.pq('LTPage[page_index="1"] *').filter(regex_function_filter_here)
    # [<LTTextBoxHorizontal>, <LTRect>, <LTRect>]
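The regex test that such a filter would run can be sketched on plain strings. This is a minimal sketch, assuming the text of each line has already been pulled out of the PDF; `lines_matching` and `sample_lines` are hypothetical names and data, and the pattern is the one from the question. It uses `fullmatch` so that a line such as "Form 1040A (2007)" is not accepted just because it contains digits.

```python
import re

# Amount pattern from the question: 1-3 digits, optional ",ddd" groups, optional ".dd".
AMOUNT = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

def lines_matching(lines, pattern=AMOUNT):
    """Return the stripped lines whose entire text matches the pattern."""
    stripped = (line.strip() for line in lines)
    return [text for text in stripped if pattern.fullmatch(text)]

# Made-up sample data standing in for text lines extracted from a page.
sample_lines = ["Cash", "179,000.00", "Form 1040A (2007)", " 1,250 "]
amounts = lines_matching(sample_lines)
print(amounts)
```

With pdfquery, the input list could presumably be built from the text of the elements selected by `pdf.pq('LTTextLineHorizontal')`; that step is an assumption and is not tested here.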
