I am parsing some PDF files in Python. These PDF files are visually organized into rows and columns. The pdftohtml script converts these PDF files to XML, producing a flat list of <text> tags with no hierarchy. My code then has to sort the <text> tags back into rows.
Since each <text> has attributes such as "top" and "left" coordinates, I wrote code that collects the <text> elements with the same "top" coordinate into a list. Each such list is effectively a single row.
My code first iterates through the page, finds all the unique "top" values, and adds them to a list of tops. It then iterates over this list of tops. For each unique top value, it searches for all elements that have this "top" value and adds them to a row list.
    for side in page:
        tops = list( set( [ d['top'] for d in side ] ) )
        tops.sort()
        for top in tops:
            row = []
            for blob in side:
                if int(blob['top']) == int(top):
                    row.append(blob)
            rows.append(row)
This code works great for most of the PDF files that I process. But there are cases where elements located on the same row have slightly different top values that are off by one or two.
I am trying to adapt my code to be a little more fuzzy.
The comparison at the bottom seems easy enough to fix. Something like this:
    for blob in side:
        rangeLower = int(top) - 2
        rangeUpper = int(top) + 2
        thisTop = int(blob['top'])
        if rangeLower <= thisTop <= rangeUpper :
            row.append(blob)
But the list of unique top values that I create first is a problem. The code I'm using is
tops = list( set( [ d['top'] for d in side ] ) )
In these edge cases, I get a list like:
[925, 946, 966, 995, 996, 1015, 1035]
How can I adapt this code to avoid having both 995 and 996 in the list? I want to end up with only one value when the integers are within 1 or 2 of each other.
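To make the goal concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the top values can be converted with int(), sorts them, and keeps a value only when it is more than a given tolerance above the last value kept. The names merge_close_tops and tolerance are hypothetical and not part of my original code.

```python
def merge_close_tops(tops, tolerance=2):
    """Keep one representative for each run of values that lie within
    `tolerance` of the last representative kept (hypothetical helper)."""
    merged = []
    # Deduplicate and sort numerically so neighbours can be compared.
    for top in sorted(set(int(t) for t in tops)):
        # Keep this value only if it is too far from the last kept one.
        if not merged or top - merged[-1] > tolerance:
            merged.append(top)
    return merged


tops = merge_close_tops([925, 946, 966, 995, 996, 1015, 1035])
print(tops)  # [925, 946, 966, 995, 1015, 1035]
```

Because each representative is the lowest value of its run, it can be combined with a ranged comparison like the one above. Values that chain closely (for example 10, 11, 13) could still match two representatives under a plus-or-minus 2 comparison, so this is only a starting point.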
python sorting list
Kirkman14 Apr 17 '14 at 18:17 2014-04-17 18:17