How to speed up 4 million set intersections?

I am an inexperienced programmer working on several bioinformatics exercises in Python.

One problem involves counting the elements in the intersection of each pair of name groups and storing the counts in a dictionary. There are two lists of 2,000 name groups each; the names in each group are Latin species names. For instance:

list_of_name_groups_1 = [
    ['Canis Lupus', 'Canis Latrans'],
    ['Euarctos Americanus', 'Lynx Rufus'],
    ...
]
list_of_name_groups_2 = [
    ['Nasua Narica', 'Odocoileus Hemionus'],
    ['Felis Concolor', 'Peromyscus Eremicus'],
    ['Canis Latrans', 'Cervus Canadensis']
    ...
]

And I need a dictionary that contains the sizes of all intersections between pairs of name groups, for example:

>>> intersections
{ (0, 0): 0, (0, 1): 0, (0, 2): 1, (1, 0): 0, (1, 1): 0, (1, 2): 0,
  (2, 0): 1, (2, 1): 0, (2, 2): 0 }

('Canis Latrans' occurs in element 0 of the first list and element 2 of the second list.)

I have an implementation of an algorithm that works, but it works too slowly.

overlap = {}
for i, names_1 in enumerate(list_of_name_groups_1):
    for j, names_2 in enumerate(list_of_name_groups_2):
        overlap[(i, j)] = len(set(names_1) & set(names_2))

Is there a faster way to count the number of elements at given intersections?



First, constructing a Python set is relatively expensive. Your code builds two fresh sets for every pair of groups: with two lists of 2000 groups each, that is 2000 x 2000 x 2 = 8 million set constructions, when only 4000 distinct sets are ever needed.

So, build the 4000 sets once, up front:

list_of_name_tuples_1 = [("a", "aa"), ("b", "bbb"), ("c", "cc", "ccc")]
list_of_name_tuples_2 = [("a", "AA"), ("b", "BBB"), ("c", "cc", "CCC")]
name_sets_1 = [set(i) for i in list_of_name_tuples_1]
name_sets_2 = [set(i) for i in list_of_name_tuples_2]

overlap = {}
for l1, s1 in zip(list_of_name_tuples_1, name_sets_1):
    for l2, s2 in zip(list_of_name_tuples_2, name_sets_2):
        overlap[(l1, l2)] = len(s1 & s2)

(Note that the tuples, not the sets, serve as the dictionary keys: Python lists and sets are unhashable and can't be dict keys, but tuples can.)

(This assumes Python 3, where zip() returns a lazy iterator. On Python 2, use itertools.izip() instead, which avoids building an intermediate list.)
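As a quick check of the claim that repeated set construction dominates, here is a benchmark sketch of my own (the data shapes and sizes are made up for illustration, not taken from the answer):

```python
import timeit

# Small synthetic name groups: groups_1[i] and groups_2[j] share all 5
# names when i == j and none otherwise.
groups_1 = [tuple(f"name{i}_{k}" for k in range(5)) for i in range(100)]
groups_2 = [tuple(f"name{j}_{k}" for k in range(5)) for j in range(100)]

def naive():
    # Rebuilds both sets inside the inner loop, as in the question.
    overlap = {}
    for g1 in groups_1:
        for g2 in groups_2:
            overlap[(g1, g2)] = len(set(g1) & set(g2))
    return overlap

def precomputed():
    # Builds each set exactly once, as the answer suggests.
    sets_1 = [set(g) for g in groups_1]
    sets_2 = [set(g) for g in groups_2]
    overlap = {}
    for g1, s1 in zip(groups_1, sets_1):
        for g2, s2 in zip(groups_2, sets_2):
            overlap[(g1, g2)] = len(s1 & s2)
    return overlap

assert naive() == precomputed()  # same results either way
print("naive:      ", timeit.timeit(naive, number=10))
print("precomputed:", timeit.timeit(precomputed, number=10))
```

On my understanding of CPython's costs, the precomputed variant should come out noticeably faster; the exact ratio depends on group sizes and the Python version.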

Second, you can make overlap a dict of dicts instead of a single flat dict keyed by pairs:

list_of_name_tuples_1 = [("a", "aa"), ("b", "bbb"), ("c", "cc", "ccc")]
list_of_name_tuples_2 = [("a", "AA"), ("b", "BBB"), ("c", "cc", "CCC")]
name_sets_1 = [set(i) for i in list_of_name_tuples_1]
name_sets_2 = [set(i) for i in list_of_name_tuples_2]

overlap = {}
for l1, s1 in zip(list_of_name_tuples_1, name_sets_1):
    d = overlap.setdefault(l1, {})
    for l2, s2 in zip(list_of_name_tuples_2, name_sets_2):
        d[l2] = len(s1 & s2)

With this layout, a lookup is overlap[l1][l2] instead of overlap[(l1, l2)] (roughly the same cost for a single access), but when you need many lookups for the same l1, you can hoist d = overlap[l1] out of the loop and then use the cheaper d[l2].
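For illustration, here is a hypothetical nested overlap dict shaped like the one the snippet above builds (the values match the "a"/"b"/"c" example data):

```python
# Nested layout: overlap[l1] is an inner dict mapping l2 -> intersection size.
overlap = {
    ("a", "aa"):  {("a", "AA"): 1, ("b", "BBB"): 0, ("c", "cc", "CCC"): 0},
    ("b", "bbb"): {("a", "AA"): 0, ("b", "BBB"): 1, ("c", "cc", "CCC"): 0},
}

l1 = ("a", "aa")
d = overlap[l1]                 # one lookup for l1, hoisted out of the loop...
total = sum(d[l2] for l2 in d)  # ...then many cheap lookups on the inner dict
print(total)  # prints 1
```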


Another idea: encode each name group as a bit mask over the universe of names, and intersect with bitwise AND. For example, if names are numbered by bit position, a group containing the names at bits 1 and 2 but not bit 3 is encoded as (0 << 3) + (1 << 2) + (1 << 1) = 6. The size of an intersection is then the number of set bits in the AND of two masks, which avoids building Python sets in the inner loop.
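A minimal sketch of that bitmask idea, applied to the question's example data (the encoding helpers here are my own illustration, not code from the answer):

```python
list_of_name_groups_1 = [
    ['Canis Lupus', 'Canis Latrans'],
    ['Euarctos Americanus', 'Lynx Rufus'],
]
list_of_name_groups_2 = [
    ['Nasua Narica', 'Odocoileus Hemionus'],
    ['Felis Concolor', 'Peromyscus Eremicus'],
    ['Canis Latrans', 'Cervus Canadensis'],
]

# Assign each distinct name across both lists a bit position.
all_names = {name
             for groups in (list_of_name_groups_1, list_of_name_groups_2)
             for group in groups
             for name in group}
bit_of = {name: i for i, name in enumerate(all_names)}

def to_mask(group):
    """Encode a name group as an int with one bit per member name."""
    mask = 0
    for name in group:
        mask |= 1 << bit_of[name]
    return mask

masks_1 = [to_mask(g) for g in list_of_name_groups_1]
masks_2 = [to_mask(g) for g in list_of_name_groups_2]

overlap = {}
for i, m1 in enumerate(masks_1):
    for j, m2 in enumerate(masks_2):
        # Intersection size == number of set bits in the AND of the masks.
        overlap[(i, j)] = bin(m1 & m2).count("1")

print(overlap[(0, 2)])  # prints 1 ('Canis Latrans' is shared)
```

On Python 3.10+ you could use `(m1 & m2).bit_count()` instead of `bin(...).count("1")`, which is faster.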


The crucial observation is that most of these intersections are empty, so you don't need to compute them all. The plan:

  • compute intersections only for pairs of groups that share at least one name,
  • let everything else default to 0.

Brute force performs 2000 x 2000 == 4,000,000 intersections, each of which builds Python sets, which is slow. If, say, only 1000 pairs of groups actually share a name, you compute just those and let a defaultdict supply 0 for all the rest.

My back-of-the-envelope estimate is that you can improve performance by 4x or better if relatively few names are shared between the two lists. The more names the lists have in common, the smaller the improvement.

I've used a few things here that may be new to you: generator expressions, enumerate(), defaultdict, membership testing with in, and itertools functions. Consider it a push into the deep end. Happy researching, and let me know if you'd like anything explained.

from collections import defaultdict
import itertools

list_of_name_groups_1 = [
    ['Canis Lupus', 'Canis Latrans'],
    ['Euarctos Americanus', 'Lynx Rufus'],
]
list_of_name_groups_2 = [
    ['Nasua Narica', 'Odocoileus Hemionus'],
    ['Felis Concolor', 'Peromyscus Eremicus'],
    ['Canis Latrans', 'Cervus Canadensis']
]

def flatten(list_of_lists):
    return itertools.chain.from_iterable(list_of_lists)


def unique_names(list_of_name_groups):
    return set(flatten(list_of_name_groups))


def get_matching_name_groups(name, list_of_name_groups):
    return (list_index for list_index, name_group
            in enumerate(list_of_name_groups)
            if name in name_group)

list1_candidates = set()
list2_candidates = set()
common_names = unique_names(list_of_name_groups_1) & unique_names(list_of_name_groups_2)
for name in common_names:
    list1_candidates.update(tuple(get_matching_name_groups(name, list_of_name_groups_1)))
    list2_candidates.update(tuple(get_matching_name_groups(name, list_of_name_groups_2)))
intersections = defaultdict(lambda: 0)
for i, j in itertools.product(list1_candidates, list2_candidates):
    intersections[(i, j)] = len(set(list_of_name_groups_1[i]) & set(list_of_name_groups_2[j]))
print(intersections)

$ python intersections.py
defaultdict(<function <lambda> at 0x0000000000DC7620>, {(0, 2): 1})

Source: https://habr.com/ru/post/1648987/

