Can a Python dictionary comprehension be used to create a dictionary of substrings and their locations?

Given a string of characters, I want to create a dictionary of all n-character substrings contained in the string, where the dictionary key is a substring and the value is a list. The first element of the list is the number of occurrences of the substring, and the second element is the list of starting positions of those occurrences.

For example, with n=3 the string 'abcdabcxdabc' leads to this dictionary:

 {'abc': [3, [0, 4, 9]], 'cda': [1, [2]], 'dab': [2, [3, 8]], 'bcd': [1, [1]], 'cxd': [1, [6]], 'bcx': [1, [5]], 'xda': [1, [7]]} 

The code below works and is efficient, since it only goes through the string once, but I wonder whether there is a more elegant and/or more Pythonic way to do this, perhaps using a dictionary comprehension. I am new to Python and still trying to figure out when it makes sense (or is even possible) to use constructs like comprehensions.

    text = 'abcdabcxdabc'
    n = 3
    d = {}
    for i in range(len(text) - n + 1):
        sub = text[i:i + n]
        if sub in d:
            d[sub][0] += 1
            d[sub][1].append(i)
        else:
            d[sub] = [1, [i]]
    print(d)

Update: Thanks for all the answers. On the whole, they confirm my suspicion that this is too complicated to be done efficiently in a single comprehension (but thanks to volcano for showing that it is possible if efficiency is not a concern). Thanks to RemcoGerlich and Ignacio Vazquez-Abrams for pointing me to defaultdict; I will have to dig into it more. My timing results show that initializing a defaultdict has a bit more overhead than a plain dict, but the runtime can be slightly faster, at least for this example. (The timing results are posted in a separate answer below.) Now I wonder whether there are situations where a plain dict would be preferred over a defaultdict. Also, thanks to Narcolei for pointing me to the timeit functions.
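For example, one behavioral difference between the two (a small sketch of my own, not taken from any of the answers): a defaultdict inserts a key as a side effect of merely looking it up, while a plain dict raises KeyError instead.

    from collections import defaultdict

    dd = defaultdict(list)
    _ = dd['missing']      # the lookup alone inserts 'missing': [] into dd
    print(len(dd))         # 1 -- the key now exists

    plain = {}
    # plain['missing']     # a plain dict would raise KeyError here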

5 answers

The problem is that v[0] depends on the length of v[1], which means that either the operation generating v[1] would have to run twice, or the dictionary would have to be iterated over afterwards to fill in v[0], replacing the dummy value entered the first time.

Another problem is that a dict comprehension assumes that the entire key and value are available at once, which means that you would need to run a list comprehension to collect all the indices for each substring, which makes the whole operation O(n^2).

The only optimization I would make is to change how d is created so that you don't have to check for the key:

 d = collections.defaultdict(lambda: [0, []]) 
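With this, the membership test in the loop disappears; a minimal sketch of the resulting loop (assuming `import collections` and reusing the variables from the question):

    import collections

    text = 'abcdabcxdabc'
    n = 3
    d = collections.defaultdict(lambda: [0, []])  # missing keys start as [0, []]
    for i in range(len(text) - n + 1):
        sub = text[i:i + n]
        d[sub][0] += 1        # occurrence count
        d[sub][1].append(i)   # starting positions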

This is scary, but yes, it can be done (I include only the offsets; the number of occurrences can be derived from the list of offsets):

    In [83]: my_str = 'abcdabcxdabc'

    In [84]: n = 3

    In [85]: {substr: [my_str.replace(substr, ' '*n, c).index(substr) for c in xrange(my_str.count(substr))]
        ....:  for substr in set(my_str[idx:idx+n] for idx in xrange(len(my_str)-n))}
    Out[85]: {'abc': [0, 4, 9], 'bcd': [1], 'bcx': [5], 'cda': [2], 'cxd': [6], 'dab': [3, 8], 'xda': [7]}
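For anyone puzzled by the replace trick: my_str.replace(substr, ' '*n, c) blanks out the first c occurrences, so .index(substr) then finds occurrence number c+1. A tiny sketch of my own:

    my_str = 'abcdabcxdabc'
    print(my_str.replace('abc', '   ', 1))                # '   dabcxdabc' -- first occurrence blanked out
    print(my_str.replace('abc', '   ', 1).index('abc'))   # 4 -- index of the second occurrence
    print(my_str.replace('abc', '   ', 2).index('abc'))   # 9 -- index of the third occurrence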

As @Ignacio said, any comprehension that tries to solve this problem will have quadratic runtime, as @volcano's answer shows. The cleanest way to solve the problem is this:

    def substrings(text, n):
        d = collections.defaultdict(list)
        for i in xrange(len(text) - n + 1):
            d[text[i:i+n]].append(i)
        return d

Note that you do not need to store the count, since you can just do len(d['abc']) to get the number of occurrences of 'abc'.
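If you do want the exact [count, positions] structure from the question, it is easy to derive afterwards; a small sketch (my own addition), reusing the substrings function above:

    d = substrings('abcdabcxdabc', 3)
    result = {sub: [len(positions), positions] for sub, positions in d.items()}
    # result['abc'] == [3, [0, 4, 9]]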

Here are some timings of this approach compared to the comprehension:

    >>> import collections
    >>>
    >>> def substrings(text, n):
    ...     d = collections.defaultdict(list)
    ...     for i in xrange(len(text) - n + 1):
    ...         d[text[i:i+n]].append(i)
    ...     return d
    ...
    >>> def substrings2(text, n):
    ...     return {substr: [text.replace(substr, ' '*n, c).index(substr) for c in xrange(text.count(substr))]
    ...             for substr in set(text[idx:idx+n] for idx in xrange(len(text)-n))}
    ...
    >>> text = 'abcdabcxdabc'
    >>>
    >>> %timeit substrings(text, 3)
    100000 loops, best of 3: 9.51 µs per loop
    >>> %timeit substrings2(text, 3)
    10000 loops, best of 3: 26.3 µs per loop
    >>> text = text * 100
    >>> %timeit substrings(text, 3)
    1000 loops, best of 3: 440 µs per loop
    >>> %timeit substrings2(text, 3)
    100 loops, best of 3: 8.68 ms per loop

Notice how the comprehension's time increases by a factor of roughly 300 when the input size increases by a factor of 100, while the defaultdict version scales roughly linearly.


I implemented a couple of variations using defaultdict and parts of volcano's comprehension, and ran some timing tests.

Original version (test1):

    d1 = {}
    for i in range(len(text) - n + 1):
        sub = text[i:i + n]
        if sub in d1:
            d1[sub][0] += 1
            d1[sub][1].append(i)
        else:
            d1[sub] = [1, [i]]

First variation (test2):

    d = defaultdict(lambda: [0, []])
    for i in range(len(text) - n + 1):
        sub = text[i:i + n]
        d[sub][0] += 1
        d[sub][1].append(i)

Second variation (test3):

    d = defaultdict(lambda: [0, []])
    for i, sub in [(i, text[i:i + n]) for i in range(len(text) - n + 1)]:
        d[sub][0] += 1
        d[sub][1].append(i)

Third variation (test4):

    d = {sub: [text.replace(sub, ' '*n, c).index(sub) for c in range(text.count(sub))]
         for sub in set(text[i:i + n] for i in range(len(text) - n + 1))}

Here are the timing results (showing the execution time per loop, in seconds):

 text = "abcdabcxdabc": 10000 loops, best of 3, function test1: 7.37486786334e-06 10000 loops, best of 3, function test2: 1.02725863892e-05 10000 loops, best of 3, function test3: 1.16522984082e-05 10000 loops, best of 3, function test4: 1.98546753287e-05 text = "abcdabcxdabc"*10: 10000 loops, best of 3, function test1: 7.16965834334e-05 10000 loops, best of 3, function test2: 6.8394193171e-05 10000 loops, best of 3, function test3: 7.63521044367e-05 10000 loops, best of 3, function test4: 0.00016625460554 text = "abcdabcxdabc"*100: 1000 loops, best of 3, function test1: 0.000708709217238 1000 loops, best of 3, function test2: 0.000623426932274 1000 loops, best of 3, function test3: 0.000695915822531 1000 loops, best of 3, function test4: 0.00419154787196 text = "abcdabcxdabc"*1000: 1000 loops, best of 3, function test1: 0.00700270379835 1000 loops, best of 3, function test2: 0.00615744327104 1000 loops, best of 3, function test3: 0.00712429980398 1000 loops, best of 3, function test4: 0.296075626815 

The original version and the first two variations appear to be O(n), while the third is closer to O(n^2). I think I will go with the second variation, since it is the most compact of the O(n) versions.
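For reference, here is a sketch of the kind of timeit harness that can produce per-loop numbers like the ones above; the helper name run_test, the loop counts, and the assumption that the snippets are wrapped as functions test1 .. test4 taking (text, n) are my own choices, not part of the original tests:

    import timeit

    def run_test(func, text, n, loops=10000, repeat=3):
        # Time func(text, n), take the best of `repeat` runs, report seconds per loop.
        timer = timeit.Timer(lambda: func(text, n))
        best = min(timer.repeat(repeat=repeat, number=loops)) / loops
        print("%d loops, best of %d, function %s: %g" % (loops, repeat, func.__name__, best))

    # Example usage (assuming the snippets above are wrapped as test1 .. test4):
    # for f in (test1, test2, test3, test4):
    #     run_test(f, "abcdabcxdabc" * 100, 3, loops=1000)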


For the record, another approach:

    >>> n, s = 3, 'abcdabcxdabc'
    >>> L = [(s[i:i+n], i) for i in range(len(s)-n+1)]
    >>> L
    [('abc', 0), ('bcd', 1), ('cda', 2), ('dab', 3), ('abc', 4), ('bcx', 5), ('cxd', 6), ('xda', 7), ('dab', 8), ('abc', 9)]
    >>> d = {t: [i for u, i in L if u == t] for t, _ in L}
    >>> d
    {'abc': [0, 4, 9], 'bcd': [1], 'cda': [2], 'dab': [3, 8], 'bcx': [5], 'cxd': [6], 'xda': [7]}
    >>> {k: (len(v), v) for k, v in d.items()}
    {'abc': (3, [0, 4, 9]), 'bcd': (1, [1]), 'cda': (1, [2]), 'dab': (2, [3, 8]), 'bcx': (1, [5]), 'cxd': (1, [6]), 'xda': (1, [7])}

In one line:

    >>> {k: (len(v), v) for L in ([(s[i:i+n], i) for i in range(len(s)-n+1)],) for k, v in ((t, [i for u, i in L if u == t]) for t, _ in L)}
    {'abc': (3, [0, 4, 9]), 'bcd': (1, [1]), 'cda': (1, [2]), 'dab': (2, [3, 8]), 'bcx': (1, [5]), 'cxd': (1, [6]), 'xda': (1, [7])}
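The `for L in (...,)` part is a trick for binding an intermediate result to a name inside a comprehension: iterating over a one-element tuple runs the rest of the comprehension once with L bound to that list. A toy sketch of the same trick in isolation (my own example):

    # Bind [i * i for i in range(10)] to the name xs inside the comprehension
    # by "iterating" over a one-element tuple containing it.
    even_squares = [y for xs in ([i * i for i in range(10)],) for y in xs if y % 2 == 0]
    print(even_squares)  # [0, 4, 16, 36, 64]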

What I would do in the "real world":

    >>> def substrings(s, n):
    ...     d = {}
    ...     tis = ((s[i:i+n], i) for i in range(len(s)-n+1))
    ...     for t, i in tis:
    ...         d.setdefault(t, []).append(i)
    ...     return {k: (len(v), v) for k, v in d.items()}
    ...
    >>> substrings(s, n)
    {'abc': (3, [0, 4, 9]), 'bcd': (1, [1]), 'cda': (1, [2]), 'dab': (2, [3, 8]), 'bcx': (1, [5]), 'cxd': (1, [6]), 'xda': (1, [7])}

The "real world" version differs from the one-liner in one important respect: the dict is built in O(n) rather than O(n^2).


Source: https://habr.com/ru/post/959322/

