How to remove almost duplicate integers from a list?

I am parsing some PDF files in Python. These PDFs are visually laid out in rows and columns. The pdftohtml script converts them to an XML format full of <text> tags that have no hierarchy, so my code then needs to sort the <text> tags back into rows.

Since each <text> tag has attributes such as "top" and "left" coordinates, I wrote code that appends <text> elements with the same "top" coordinate to a list. This list is effectively one row.

My code first iterates through the page, finds all the unique "top" values, and adds them to a tops list. It then iterates over this tops list. For each unique top value, it searches for all elements that have this "top" value and adds them to a row list.

    for side in page:
        tops = list(set([d['top'] for d in side]))
        tops.sort()
        for top in tops:
            row = []
            for blob in side:
                if int(blob['top']) == int(top):
                    row.append(blob)
            rows.append(row)

This code works great for most of the PDF files that I process. But there are cases where elements on the same row have slightly different top values that are off by one or two.

I am trying to adapt my code to be a little more fuzzy.

The comparison at the bottom seems easy enough to fix. Something like this:

    for blob in side:
        rangeLower = int(top) - 2
        rangeUpper = int(top) + 2
        thisTop = int(blob['top'])
        if rangeLower <= thisTop <= rangeUpper:
            row.append(blob)
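The same range check can also be written with abs(). This is only a minimal sketch; `same_row` and `tolerance` are hypothetical names that do not appear in the original code, and the "top" values are assumed to be strings or integers that int() accepts:

```python
def same_row(blob_top, row_top, tolerance=2):
    """Return True when two 'top' values differ by at most `tolerance`."""
    return abs(int(blob_top) - int(row_top)) <= tolerance


print(same_row('996', '995'))   # True
print(same_row('1015', '995'))  # False
```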

But the list of unique top values that I build first is the problem. The code I'm using is

    tops = list(set([d['top'] for d in side]))

In these edge cases, I get a list like:

    [925, 946, 966, 995, 996, 1015, 1035]

How can I adapt this code to avoid having both "995" and "996" in the list? I want to end up with only one value when the integers are within 1 or 2 of each other.

python sorting list
Apr 17 '14 at 18:17
2 answers
  • Sort the list to put close values next to each other.
  • Use reduce to filter each value based on the previously kept value.

The code:

    >>> from functools import reduce  # required on Python 3
    >>> tops = [925, 946, 966, 995, 996, 1015, 1035]
    >>> threshold = 2
    >>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
    [925, 946, 966, 995, 1015, 1035]

With several adjacent values:

    >>> tops = range(10)
    >>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
    [0, 3, 6, 9]
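The result [0, 3, 6, 9] comes from comparing each value with the last value that was kept. For contrast, here is a rough sketch of a variant that compares each value with its immediate predecessor in sorted order, so a whole chain of close values collapses into its first member. `collapse_chains` is a hypothetical name used only for illustration:

```python
def collapse_chains(values, threshold=2):
    """Keep a value only when it is more than `threshold` above the value
    immediately before it in sorted order (hypothetical helper)."""
    kept = []
    prev = None  # previous value in sorted order, kept or not
    for v in sorted(values):
        if prev is None or v > prev + threshold:
            kept.append(v)
        prev = v
    return kept


print(collapse_chains(range(10)))  # [0]
print(collapse_chains([925, 946, 966, 995, 996, 1015, 1035]))
# [925, 946, 966, 995, 1015, 1035]
```

For the list in the question both strategies give the same answer; they only differ when many values sit close together in a run.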

Edit

reduce can be a bit cumbersome to read, so here is a simpler approach:

    res = []
    for item in sorted(tops):
        if len(res) == 0 or item > res[-1] + threshold:
            res.append(item)
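The same idea can be applied to the elements themselves, building the rows in one pass instead of first filtering the tops and then matching against them. The following is only a sketch under some assumptions: `group_rows` is a hypothetical name, each element is assumed to behave like a dict with "top" and "left" values that int() accepts, and the "text" key in the sample data is made up for the demonstration.

```python
def group_rows(blobs, threshold=2):
    """Group blobs into rows. A blob joins the current row when its 'top'
    is at most `threshold` above the first 'top' seen in that row."""
    rows = []
    anchor = None  # 'top' of the first blob in the current row
    for blob in sorted(blobs, key=lambda b: int(b['top'])):
        top = int(blob['top'])
        if anchor is None or top > anchor + threshold:
            rows.append([])  # start a new row
            anchor = top
        rows[-1].append(blob)
    # Order each row from left to right.
    for row in rows:
        row.sort(key=lambda b: int(b['left']))
    return rows


# Made-up sample data for illustration only.
side = [
    {'top': '995', 'left': '40', 'text': 'a'},
    {'top': '996', 'left': '10', 'text': 'b'},
    {'top': '1015', 'left': '10', 'text': 'c'},
]
rows = group_rows(side)
print([[b['text'] for b in row] for row in rows])  # [['b', 'a'], ['c']]
```

Because each blob is compared only against the anchor of the current row, it lands in exactly one row. A symmetric check of plus or minus 2 against the filtered tops could match the same blob to two neighbouring tops when values are densely packed.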
Apr 17 '14 at 18:29

The answer by @njzk2 also works, but this function shows more explicitly what is happening and is easier to understand:

    >>> def sort(list):
    ...     list.sort()  # sort in ascending order
    ...     # walk the indices from the end down to 1 (index 0 has no predecessor)
    ...     for k in reversed(range(1, len(list))):
    ...         # only catches neighbours that differ by exactly 1
    ...         if list[k] - 1 == list[k - 1]:
    ...             del list[k - 1]  # remove the smaller of the pair
    ...     return list
    ...
    >>> tops = [925, 946, 966, 995, 996, 1015, 1035]
    >>> sort(tops)
    [925, 946, 966, 996, 1015, 1035]
Apr 17 '14 at 18:40