How to combine 2 lists uniquely

I work with extremely long lists and am trying to come up with an iterative solution that combines two lists into unique pairs.

For example, I have lists

    a = [TF1, Tar1]
    b = [Tar1, TF1]

I need the following iterator (if possible) containing tuples:

    (TF1, Tar1)
    (TF1, TF1)
    (Tar1, Tar1)

This excludes (Tar1, TF1) because the opposite ordering has already been added.

My current approach is to loop through both lists and use a dictionary to keep track of what has already been added. This takes up a huge amount of RAM, since list a has 12,000 elements and list b has 15,000. The resulting dictionary holds about a * b / 2 entries, which in this case is about 90M.
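To make the approach concrete, here is a minimal sketch of the tracking-dictionary idea described above (the function name is illustrative, and a set is used in place of a dictionary since only membership is needed):

```python
from itertools import product

def naive_unique_pairs(a, b):
    """Yield each pair from a x b, skipping a pair if its reversed
    ordering was already emitted. The `seen` set grows toward
    len(a) * len(b) / 2 entries, which is the memory problem."""
    seen = set()
    for x, y in product(a, b):
        if (y, x) in seen:
            continue
        seen.add((x, y))
        yield x, y
```

On the example above this yields exactly the three desired tuples, but the `seen` set is what blows up the RAM for long lists.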

Any suggestions are welcome. Thanks

+5
4 answers

Basically, the problem arises from the elements common to both lists. If you handle the combinations of common and unique elements separately, the problem is solved.

i.e. you need to create the following cartesian products

    a_unique X b_unique
    a_unique X b_common
    a_common X b_unique
    a_common X b_common

Of the four cases, the last is the problematic one, since it is the one that creates non-unique pairs. On second thought, the last Cartesian product restricted to unique pairs is simply a choice of 2 elements from a_common.

Finally, the separation of the elements can be accomplished by creating a set from each list and then comparing the two:

    >>> # Sample lists
    >>> a = ['C0','C1','C2','A0','A1','A2']
    >>> b = ['C0','C1','C2','B0','B1','B2']
    >>> from itertools import product, combinations, chain
    >>> # Create sets for O(1) lookup
    >>> a_key = set(a)
    >>> b_key = set(b)
    >>> # Segregate the elements of both lists into common and unique
    >>> common = a_key & b_key
    >>> a = {'common': common, 'unique': a_key - common}
    >>> b = {'common': common, 'unique': b_key - common}
    >>> # Create the Cartesian products for all the cases
    >>> list(chain.from_iterable([product(a['unique'], b['unique']),
    ...                           product(a['unique'], b['common']),
    ...                           product(a['common'], b['unique']),
    ...                           combinations(a['common'], 2)]))
    [('A0', 'B0'), ('A0', 'B1'), ('A0', 'B2'), ('A1', 'B0'), ('A1', 'B1'), ('A1', 'B2'), ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B2'), ('A0', 'C0'), ('A0', 'C1'), ('A0', 'C2'), ('A1', 'C0'), ('A1', 'C1'), ('A1', 'C2'), ('A2', 'C0'), ('A2', 'C1'), ('A2', 'C2'), ('C0', 'B0'), ('C0', 'B1'), ('C0', 'B2'), ('C1', 'B0'), ('C1', 'B1'), ('C1', 'B2'), ('C2', 'B0'), ('C2', 'B1'), ('C2', 'B2'), ('C0', 'C1'), ('C0', 'C2'), ('C1', 'C2')]
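Since the question asks for an iterator, the same split can be wrapped in a generator so the full list is never materialized. This is a sketch based on the answer above; note that, like the answer, it omits same-element pairs such as (C0, C0) — substitute combinations_with_replacement if you need those:

```python
from itertools import chain, combinations, product

def unique_product(a, b):
    # Split into common and unique elements, then chain the four
    # Cartesian cases lazily; the common x common case uses
    # combinations so each unordered pair appears only once.
    a_key, b_key = set(a), set(b)
    common = a_key & b_key
    return chain(
        product(a_key - common, b_key - common),
        product(a_key - common, common),
        product(common, b_key - common),
        combinations(common, 2),
    )
```

Because the inputs are converted to sets, pair order is arbitrary, but no reversed duplicate is ever produced.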
+2

To generate pairs iteratively, you need to look at the itertools.product function:

    >>> l1 = [1, 2, 3]
    >>> l2 = [1, 3, 7]
    >>> import itertools
    >>> list(itertools.product(l1, l2))
    [(1, 1), (1, 3), (1, 7), (2, 1), (2, 3), (2, 7), (3, 1), (3, 3), (3, 7)]

However, I don't think it is possible to remove the duplicate pairs without keeping track of those you have already seen.

To remove duplicates in memory, I would sort the tuples and make them a set:

    >>> pairs = list(itertools.product(l1, l2))
    >>> set(map(tuple, map(sorted, pairs)))
    set([(1, 2), (2, 7), (1, 3), (3, 3), (2, 3), (1, 7), (3, 7), (1, 1)])

If you need memory use to stay low and can use the disk, I would suggest a disk-backed merge sort (external sorting). While iterating over the result of itertools.product, sort each pair and write it to disk. Then merge-sort the files and read back the sorted list, removing duplicates (since they will be contiguous).
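A sketch of that disk-backed approach, using sorted chunks in temporary files merged with heapq.merge. The chunk size and the assumption that elements are whitespace-free strings are mine, not the answer's:

```python
import heapq
import itertools
import tempfile

def external_unique_pairs(a, b, chunk_size=100_000):
    """Write sorted chunks of canonicalized (sorted) pairs to
    temporary files, then merge them and drop duplicates, which
    are contiguous after sorting. Only one chunk is in RAM at
    a time. Assumes elements are strings without whitespace."""
    chunks = []
    pairs = (tuple(sorted(p)) for p in itertools.product(a, b))
    while True:
        chunk = sorted(itertools.islice(pairs, chunk_size))
        if not chunk:
            break
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{x} {y}\n" for x, y in chunk)
        f.seek(0)  # rewind so heapq.merge can read the lines back
        chunks.append(f)
    last = None
    for line in heapq.merge(*chunks):  # lazy k-way merge of sorted files
        if line != last:
            yield tuple(line.split())
            last = line
```

heapq.merge iterates the files line by line, so the merge phase also stays out of RAM; the trade-off is that the output order is the canonical sorted order, not the product order.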

+1

I think you can avoid duplicates without saving all the values you have generated so far. Instead, check which of the values you generate will later be generated again in reverse order, and track only those. If you don't have a large number of collisions, this requires significantly less memory (although in the worst case it is still O(M*N)).

Here's how I would do it:

    import itertools

    def product_without_reversed_duplicates(a, b):
        a_set = set(a)
        b_set = set(b)
        dupes = set()
        for x, y in itertools.product(a, b):
            if (x, y) not in dupes:   # take (x, y) only if it is not a dupe of a previous item
                yield x, y
                if x in b_set and y in a_set:  # test if (y, x) will be generated later
                    dupes.add((y, x))          # if so, add it to the set to be skipped

Note that this assumes a and b have no internal duplicates and that you want to keep the order of the product as much as possible (only skipping reversed pairs). If duplicates within a or b are possible, you need to iterate over itertools.product(a_set, b_set) instead of what I have above. That, however, will give you the results in an arbitrary order. You can work around this with an extra step that deduplicates a and b while keeping them in order, but I'll leave it to you to figure out the code for that if you need it.
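One common way to do that order-preserving deduplication, shown here as an illustrative addition (it relies on dicts preserving insertion order, guaranteed since Python 3.7):

```python
def dedupe_keep_order(seq):
    # dict.fromkeys keeps only the first occurrence of each key,
    # in insertion order, so converting back to a list dedupes
    # the sequence without reordering it.
    return list(dict.fromkeys(seq))
```

You would apply it to a and b before taking the product.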

+1

A bit tricky, but here is a way to do it with O(n) extra memory.

    xs = ['a', 'b', 'd']
    ys = ['b', 'a', 'c']

    def unique(seq):
        seen = set()
        seen_add = seen.add
        return [x for x in seq if not (x in seen or seen_add(x))]

    xs = unique(xs)
    ys = unique(ys)
    ys_set = set(ys)  # build once, rather than on every inner iteration

    x_added = set()
    for x in xs:
        for y in ys:
            if y in x_added and x in ys_set:
                continue
            print(x, y)
        x_added.add(x)

Output:

    a b
    a a
    a c
    b b
    b c
    d b
    d a
    d c

Basically, we know a pair has already been produced if y is one of the x values emitted so far and x is one of the ys, since in that case we already iterated over every y for that earlier x. Deduplicating the inputs first just makes the special cases easier to handle.

+1

Source: https://habr.com/ru/post/1205482/

