Quick line in list search

Question

Quick line in list search

Using Python 3, I have a list containing more than 100,000 lines (list1), each of which is no more than 300 characters. I also have a list of over 9 million substrings (list2). I want to count how many substring elements in list2 appear. For instance,

list1 = ['cat', 'caa', 'doa', 'oat'] list2 = ['at', 'ca', 'do']

I want the function to return (displayed in list2):

 [2, 2, 1]

This is usually very simple and requires very little. However, due to the large size of the lists, I have performance problems. I want to find the fastest way to return this list of counters.

I have tried lists, generators, maps, loops of all kinds, and I have yet to find a quick way to complete this easy task. What is theoretically the fastest way to achieve this, is it advisable to follow the steps O(len(list2)) very quickly?

+6

performance python string list

user1104160 Dec 18 '11 at 4:21

source share

3 answers

puk · Answer 1 · 2011-12-18T04:39:42+0000

set M = len(list1) and N = len(list2)

For each of the N entries in list2 you will need to perform M comparisons with the items in list1 . This is the worst running time of O(M x N) . If you take it further, let each entry in list2 have a length of 1, and each entry in list1 will have a length of 300, then you will get an O(300M x N) runtime.

If performance is really a problem, try dynamic programming. Here is the start:

1) sort list2 in increasing order of length:

 ['scorch', 'scorching', 'dump', 'dumpster', 'dumpsters']

2) sort them in the sublists so that each previous record is a subset of the existing record as follows:

 [['scorch', 'scorching'] , ['dump', 'dumpster', 'dumpsters']]

3) Now, if you compare with list1 and 'scorch' , there is none, then you do not need to search for 'scorching' . Similarly, if 'dump' is not there, neither 'dumpster' or 'dumpsters'

Note that the worst execution time of the program remains the same.

tobigue · Answer 2 · 2011-12-18T22:40:11+0000

I believe that this problem can be solved in linear time using the Aho Corasick string . See this answer for more information (maybe you get ideas from other answers to this question too - this is almost the same task, and I think that Aho Korasik is theoretically the fastest way to solve this problem).

You will need to change the machine to match strings accordingly, so that instead of returning a match, it will increase the counter of each substring by one. (This should be only a minor modification).

hughdbrown · Answer 3 · 2011-12-18T15:57:03+0000

Not sure how to avoid using any O (n ** 2) algorithm. Here is a simple implementation.

 >>> def some_sort_of_count(list1, list2): >>> return [sum(x in y for y in list1) for x in list2] >>> >>> list1 = ['cat', 'caa', 'doa', 'oat'] >>> list2 = ['at', 'ca', 'do'] >>> some_sort_of_count(list1, list2) [2, 2, 1]

Quick line in list search

More articles: