Fast list membership

Question

Fast list membership

I have a very large list called "data" and I need to respond to queries equivalent to

if (x in data[a:b]):

for different values of a, b and x.

Is data preprocessing possible to quickly execute these queries

+4

performance python

phoenix Aug 3 '13 at 21:44

source share

3 answers

Yes, you can pre-process these fragments in sets, thereby doing a membership search O(1) instead of O(n) :

 check = set(data[a:b]) if x in check: # do something if y in check: # do something else

+1

Tim pietzcker Aug 3 '13 at 21:48

source share

Put the list in the database and use the built-in indexing, optimization and caching. For example, from the PostgreSQL manual:

After creating the index, no further intervention is required: the system will update the index when the table is changed, and this will use the index in queries when it thinks it will be more efficient than scanning a sequential table.

But you can also use sqlite for simplicity (and availability in the Python standard library). From the Python documentation regarding indexing :

The Row instance serves as a highly optimized row_factory for connecting objects. It tries to simulate a tuple in most of its functions.
It supports matching access by column name and index, iteration, presentation, equality testing and len ().

And elsewhere on this page:

The row provides both index-based and case-insensitive name-based access to columns with almost free memory overhead. It will probably be better than your own dictionary-based approach or even a db_row-based solution.

0

Paulo almeida Aug 3 '13 at 23:17

source share

Riad · Accepted Answer · 2013-08-03T21:47:59+0000

Idea

you can create a dict . For each element, a sorted list of positions where it occurs is stored.

To answer the query: the first element of a binary search that is greater than or equal to a , check if it exists and is less than b

pseudo code

Preprocessing:

 from collections import defaultdict byvalue = defaultdict(list) for i, x in enumerate(data): byvalue[x].append(i)

Query:

 def has_index_in_slice(indices, a, b): r = bisect.bisect_left(indices, a) return r < len(indices) and indices[r] < b def check(byvalue, x, a, b): indices = byvalue.get(x, None) if not indices: return False return has_index_in_slice(indices, a, b)

The complexity of O(log N) per query is here if we assume that list and dict have the difficulty of "getting by index" O (1).

Fast list membership

Idea

pseudo code

More articles: