Finding an item in a list with 2 MILLION items - Python

I have a list of strings with 1.9-2 MILLION elements.

The following code:

items = [...]
item_in_list = items[-1] in items  # linear scan: worst case checks every element

takes 0.1 seconds

With sqlite3 it takes 0.7 seconds


Now the problem is that I need to perform this check 1 million times, and I would like to complete it in minutes, not days. (At 0.1 seconds per check, 1 million checks is roughly 28 hours.)

More precisely, I am trying to synchronize the contents of a CSV file with the calculated values in the database.


Any ideas? It would be great :)

+3
4 answers

Don't test membership against a plain list; put the values in a set. A set is backed by a hash table, so each lookup is O(1) on average instead of a linear scan.

Here is a quick benchmark comparing the two approaches:

import random
from timeit import Timer

def random_strings(size):
    # Generate `size` random strings of 3-8 distinct lowercase letters.
    alpha = 'abcdefghijklmnopqrstuvwxyz'
    min_len = 3
    max_len = 8
    strings = []
    for count in range(size):
        current = ''.join(random.sample(alpha, random.randint(min_len, max_len)))
        strings.append(current)
    return strings

string_list_1 = random_strings(10000)
string_list_2 = random_strings(10000)

def string_test():
    # O(n*m): each `in` test scans string_list_2 from the start.
    common = [x for x in string_list_1 if x in string_list_2]
    return common

def set_test():
    # O(n+m): build two hash-based sets, then intersect them.
    string_set_1 = frozenset(string_list_1)
    string_set_2 = frozenset(string_list_2)
    common = string_set_1 & string_set_2
    return common

string_timer = Timer("__main__.string_test()", "import __main__")
set_timer = Timer("__main__.set_test()", "import __main__")
print(string_timer.timeit(10))
# 22.6108954005
print(set_timer.timeit(10))
#  0.0226439453

The set version is roughly a thousand times faster on these 10,000-element lists, and the gap only grows with the size of the data.

Building the sets has a one-time cost, but after that every membership test is effectively constant time. In your case, build the set once from the 2 million values and run all 1 million checks against it.
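
To connect this to the original CSV-versus-database sync, here is a minimal sketch; the file name, table, and column names (values.db, calculated, value, data.csv) are hypothetical placeholders for whatever the real schema is:

import csv
import sqlite3

# Hypothetical names; adjust to your own schema.
conn = sqlite3.connect('values.db')
db_values = {row[0] for row in conn.execute('SELECT value FROM calculated')}

# Stream the CSV once and collect its first column into a set.
with open('data.csv', newline='') as f:
    csv_values = {row[0] for row in csv.reader(f)}

missing_from_db = csv_values - db_values  # present in the CSV, absent from the DB
stale_in_db = db_values - csv_values      # present in the DB, absent from the CSV

Both set differences run in roughly linear time, so the whole comparison should take seconds rather than days.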

+4

If you insist on keeping a list, it has to be SORTED, and then you can binary-search it in O(log n) per lookup. A set is still the simpler option.
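
For reference, a minimal sketch of binary search on a sorted list with the standard-library bisect module (the sample items are made up):

import bisect

items = sorted(['pear', 'apple', 'orange', 'banana'])

def in_sorted_list(sorted_items, x):
    # bisect_left finds the insertion point in O(log n);
    # x is present iff the element already at that point equals x.
    i = bisect.bisect_left(sorted_items, x)
    return i < len(sorted_items) and sorted_items[i] == x

print(in_sorted_list(items, 'apple'))   # True
print(in_sorted_list(items, 'grape'))   # False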

+1

A plain list is the wrong data structure for repeated membership tests.

What I would do instead:

  • Load the 2 million values into a set (or dict) once.
  • Run each of the million checks as a membership test against that set.
  • Each test is then a single cheap lookup instead of a scan over the whole list.

Update:

As mentioned in the comments, sets and dicts do not use binary trees; they use hash tables. This should be faster than a list, and in fact probably even faster than a binary search.
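
If each database row maps a key to a calculated value (an assumption about the asker's schema, not something stated in the question), a dict gives the same O(1) average lookup and also hands back the value:

# Hypothetical: db_rows is an iterable of (key, calculated_value) pairs.
db_rows = [('a', 1), ('b', 2), ('c', 3)]
calculated = dict(db_rows)    # hash table: O(1) average lookup

key = 'b'
if key in calculated:         # membership test, no linear scan
    value = calculated[key]   # and the calculated value comes for free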

0

Off the top of my head, with so little information about why you are doing this several million times:

1.) Can you import the CSV into a table and perform the validations in SQL? (See the sketch below.)

2.) What about sorting and indexing the list for quick access?

Cheers, P
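
Expanding on point 1, a minimal sketch under assumed names (values.db, a calculated table with a value column, data.csv): import the CSV into a temporary indexed table and let SQL do the whole comparison in one pass.

import csv
import sqlite3

conn = sqlite3.connect('values.db')  # hypothetical database file
conn.execute('CREATE TEMP TABLE csv_rows (value TEXT PRIMARY KEY)')

with open('data.csv', newline='') as f:  # hypothetical CSV file
    conn.executemany('INSERT OR IGNORE INTO csv_rows VALUES (?)',
                     ([row[0]] for row in csv.reader(f)))

# Rows present in the CSV but missing from the database:
missing = conn.execute(
    'SELECT value FROM csv_rows EXCEPT SELECT value FROM calculated'
).fetchall()

The PRIMARY KEY gives the temporary table an index, and EXCEPT lets SQLite compute the difference without a Python-level loop.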

0
source

Source: https://habr.com/ru/post/1780496/

