Fast list membership

I have a very large list called "data" and I need to respond to queries equivalent to

if (x in data[a:b]): 

for different values โ€‹โ€‹of a, b and x.

Is data preprocessing possible to quickly execute these queries

+4
source share
3 answers

Idea

you can create a dict . For each element, a sorted list of positions where it occurs is stored.

To answer the query: the first element of a binary search that is greater than or equal to a , check if it exists and is less than b

pseudo code

Preprocessing:

 from collections import defaultdict byvalue = defaultdict(list) for i, x in enumerate(data): byvalue[x].append(i) 

Query:

 def has_index_in_slice(indices, a, b): r = bisect.bisect_left(indices, a) return r < len(indices) and indices[r] < b def check(byvalue, x, a, b): indices = byvalue.get(x, None) if not indices: return False return has_index_in_slice(indices, a, b) 

The complexity of O(log N) per query is here if we assume that list and dict have the difficulty of "getting by index" O (1).

+4
source

Yes, you can pre-process these fragments in sets, thereby doing a membership search O(1) instead of O(n) :

 check = set(data[a:b]) if x in check: # do something if y in check: # do something else 
+1
source

Put the list in the database and use the built-in indexing, optimization and caching. For example, from the PostgreSQL manual:

After creating the index, no further intervention is required: the system will update the index when the table is changed, and this will use the index in queries when it thinks it will be more efficient than scanning a sequential table.

But you can also use sqlite for simplicity (and availability in the Python standard library). From the Python documentation regarding indexing :

The Row instance serves as a highly optimized row_factory for connecting objects. It tries to simulate a tuple in most of its functions.

It supports matching access by column name and index, iteration, presentation, equality testing and len ().

And elsewhere on this page:

The row provides both index-based and case-insensitive name-based access to columns with almost free memory overhead. It will probably be better than your own dictionary-based approach or even a db_row-based solution.

0
source

Source: https://habr.com/ru/post/1495144/


All Articles