How to remove almost duplicate integers from a list?

Question

I'm parsing some PDF files in Python. These PDFs are visually organized into rows and columns. The pdftohtml script converts them to an XML format full of loose <text> tags that have no hierarchy. My code then needs to sort these <text> tags back into rows.

Since each <text> tag has attributes such as "top" and "left" coordinates, I wrote code that appends <text> elements with the same "top" coordinate to a list. This list is effectively one row.

My code first iterates over the page, finds all the unique "top" values, and appends them to a list of tops. It then iterates over this tops list. For each unique top value, it searches for all elements that have that "top" value and adds them to a row list.

    for side in page:
        tops = list( set( [ d['top'] for d in side ] ) )
        tops.sort()
        for top in tops:
            row = []
            for blob in side:
                if int(blob['top']) == int(top):
                    row.append(blob)
            rows.append(row)

This code works great for most of the PDF files that I process. But there are cases where elements on the same row have slightly different top values that are off by one or two.

I'm trying to adapt my code to be a little fuzzier.

The comparison at the bottom seems easy enough to fix. Something like this:

    for blob in side:
        rangeLower = int(top) - 2
        rangeUpper = int(top) + 2
        thisTop = int(blob['top'])
        if rangeLower <= thisTop <= rangeUpper:
            row.append(blob)

But the list of unique top values that I create first is a problem. The code I'm using is

  tops = list( set( [ d['top'] for d in side ] ) )

In these edge cases, I get a list like:

 [925, 946, 966, 995, 996, 1015, 1035]

How can I adapt this code to avoid having both 995 and 996 in the list? I want to end up with only one value when the integers are within 1 or 2 of each other.

python sorting list

Kirkman14 Apr 17 '14 at 18:17

2 answers

@njzk2's answer also works, but this function shows what is actually happening and is easier to understand:

    >>> def sort(list):
    ...     list.sort()  # sort in ascending order, in place
    ...     # walk the indices from last to first; stopping at 1 avoids
    ...     # comparing list[0] with list[-1]
    ...     for k in reversed(range(1, len(list))):
    ...         if list[k] - 1 == list[k-1]:  # previous value is exactly one less
    ...             del list[k-1]  # remove the previous value
    ...     return list
    ...
    >>> tops = [925, 946, 966, 995, 996, 1015, 1035]
    >>> sort(tops)
    [925, 946, 966, 996, 1015, 1035]
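Note that the check above only catches neighbours that differ by exactly 1, while the question asks for values within 1 or 2 of each other. A minimal sketch of the same backwards-walking idea with a tolerance follows; the name `drop_close_neighbours` and the `threshold` parameter are hypothetical, not part of the original answer, and the function works on a sorted copy instead of modifying its argument:

```python
def drop_close_neighbours(values, threshold=2):
    """Return a sorted copy of `values` in which, of any two neighbours
    at most `threshold` apart, only the larger one is kept."""
    result = sorted(values)  # sorted copy; the caller's list is untouched
    # Walk indices from last to first, stopping at 1 so that
    # result[k - 1] never wraps around to the end of the list.
    for k in reversed(range(1, len(result))):
        if result[k] - result[k - 1] <= threshold:
            del result[k - 1]
    return result


tops = [925, 946, 966, 995, 996, 1015, 1035]
print(drop_close_neighbours(tops))
```

Like the original function, this keeps the larger value of each close pair (996 rather than 995).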

AJ Uppal Apr 17 '14 at 18:40

njzk2 · Accepted Answer · 2014-04-17 18:29

Sort the list to put close values next to each other.
Use reduce to keep a value only if it is more than the threshold above the last value kept.

The code:

    >>> from functools import reduce  # required on Python 3; also works on Python 2.6+
    >>> tops = [925, 946, 966, 995, 996, 1015, 1035]
    >>> threshold = 2
    >>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
    [925, 946, 966, 995, 1015, 1035]

With several adjacent values:

    >>> tops = range(10)
    >>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
    [0, 3, 6, 9]

Edit

reduce can be a bit cumbersome to read, so here is a simpler approach:

    res = []
    for item in sorted(tops):
        if len(res) == 0 or item > res[-1] + threshold:
            res.append(item)
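The same comparison could also be applied to the elements directly, which would skip building the list of tops altogether. The following is only a sketch under assumptions: `group_rows` is a hypothetical helper, and the sample `side` data (including the "text" key) is made up for illustration. It assumes each element is a dict whose "top" value can be passed to int(), as in the question's code.

```python
def group_rows(blobs, threshold=2):
    """Group blobs into rows. A blob starts a new row when its 'top' is
    more than `threshold` above the 'top' that started the current row."""
    rows = []
    row_top = None  # 'top' of the first blob in the current row
    for blob in sorted(blobs, key=lambda b: int(b['top'])):
        top = int(blob['top'])
        if row_top is None or top > row_top + threshold:
            rows.append([])  # start a new row
            row_top = top
        rows[-1].append(blob)
    return rows


# Hypothetical sample data, shaped like the parsed <text> attributes.
side = [
    {'top': '995', 'left': '10', 'text': 'a'},
    {'top': '1015', 'left': '10', 'text': 'c'},
    {'top': '996', 'left': '80', 'text': 'b'},
]
rows = group_rows(side)
print([[blob['text'] for blob in row] for row in rows])
```

Here the blobs at 995 and 996 end up in one row and the blob at 1015 in another. Blobs within a row keep their input order because sorted() is stable, so a further sort on "left" may be wanted.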
