Delete a dictionary key if it is a substring of any other key

I am learning Python and I have a performance problem. From one dictionary, I want to delete every key that

  • is a substring of another key.

I do not want to delete a key merely because

  • it is a substring of itself.

My keys are unique strings from 3 to 50 characters long. The dictionary I work with has 100,000 or more items, which means billions of comparisons. Since this is an O(n²) problem, should I stop trying to optimize this code, or is there still room for improvement?

A dictionary is preferred, but I am open to other types.

For example: 'hello' contains 'he' and 'ell'. I want to remove the keys 'he' and 'ell' and keep 'hello'. I would like to remove prefixes, suffixes, and substrings in the middle of other keys alike.

The keys are generated one by one and added to the dictionary; afterwards reduce_dict(dictionary) is executed. My guess is that testing while keys are being added will be just as slow as testing afterwards with a function like the one below:

    def reduce_dict(dictionary):
        reduced = dictionary.copy()
        for key in dictionary:
            for key2 in dictionary:
                if key != key2 and key2 in key:
                    reduced.pop(key2, 0)
        return reduced
5 answers

I think you can build the list of "good" keys (those that are not substrings of any other key) in a slightly optimized way:

    # keys = yourDict.keys(), e.g.
    keys = ['low', 'el', 'helloworld', 'something', 'ellow', 'thing', 'blah', 'thingy']
    # flt is [[key, is_substring], ...] sorted by key length, longest first
    flt = [[x, 0] for x in sorted(keys, key=len, reverse=True)]
    for i in range(len(flt)):
        p = flt[i]
        if p[1]:  # already marked as a substring of a longer key
            continue
        for j in range(i + 1, len(flt)):  # iterate over the shorter strings
            q = flt[j]
            if not q[1] and q[0] in p[0]:  # not yet marked and is a substring
                q[1] = 1  # mark for removal
    goodkeys = set(x[0] for x in flt if not x[1])
    print(goodkeys)  # e.g. {'helloworld', 'something', 'thingy', 'blah'}

Now removal is trivial:

    newdict = {k: olddict[k] for k in goodkeys}

Given that your strings are fairly short, you can keep a hash table of all possible substrings of every key. This lets you find all keys containing a given substring in constant time per lookup, so the whole reduction takes O(n) passes over the dictionary; the trade-off is that you make insertion more expensive, since you have to generate the set of substrings of each new key.
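To make this concrete, here is a minimal sketch of that idea. The function name `reduce_dict_via_index` and the `owners` index are my own illustrative choices, not from the answer; with keys up to 50 characters, each key contributes at most ~1,275 substrings to the index.

```python
from collections import defaultdict

def reduce_dict_via_index(dictionary):
    """Keep only keys that are not substrings of any other key,
    using an index from every substring to the keys containing it."""
    owners = defaultdict(set)  # substring -> set of keys that contain it
    for key in dictionary:
        for i in range(len(key)):
            for j in range(i + 1, len(key) + 1):
                owners[key[i:j]].add(key)
    # a key survives only if no *other* key contains it
    return {k: v for k, v in dictionary.items() if owners[k] == {k}}

print(reduce_dict_via_index({'hello': 1, 'he': 2, 'ell': 3, 'world': 4}))
# {'hello': 1, 'world': 4}
```

Note that the index costs O(total substrings) memory, which for 100,000 keys of length 50 can reach ~10⁸ entries, so this trades memory for lookup speed.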


If, instead of key2 in key (that is, "key2 is a substring of key"), you relax your requirement to "key2 is a prefix of key" (as your examples suggest), you can use a trie to check prefixes efficiently. See this answer.

First define make_trie , as in the answer above:

    _end = '_end_'

    def make_trie(*words):
        root = dict()
        for word in words:
            current_dict = root
            for letter in word:
                current_dict = current_dict.setdefault(letter, {})
            current_dict.setdefault(_end, _end)
        return root

Then define a function similar to in_trie from the answer above, but checking whether the key is a strict prefix of another key:

    def is_strict_prefix_of_word_in_trie(trie, word):
        current_dict = trie
        for letter in word:
            if letter in current_dict:
                current_dict = current_dict[letter]
            else:
                return False
        # every dictionary key is itself in the trie, so the _end marker
        # is always present at this node; the word is a strict prefix of
        # another word exactly when the node also continues past the marker
        return any(k != _end for k in current_dict)

Finally, complete your deletions like this:

    def reduce_dict(dictionary):
        # note the *: make_trie takes the words as separate arguments
        trie = make_trie(*dictionary.keys())
        reduced = dictionary.copy()
        for key in dictionary:
            if is_strict_prefix_of_word_in_trie(trie, key):
                reduced.pop(key, 0)
        return reduced

Or you can use a dictionary comprehension:

    def reduce_dict(dictionary):
        trie = make_trie(*dictionary.keys())
        return {key: value for (key, value) in dictionary.items()
                if not is_strict_prefix_of_word_in_trie(trie, key)}
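Putting the pieces together, here is a self-contained sketch (the helpers are repeated so it runs on its own; the strict-prefix check is written so that a key which is itself in the trie can still be reported as a prefix of a longer key). It also shows the limitation the answer mentions: only prefixes are removed, not arbitrary substrings.

```python
_end = '_end_'

def make_trie(*words):
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[_end] = _end
    return root

def is_strict_prefix_of_word_in_trie(trie, word):
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    # strict prefix iff the node continues beyond the end-of-word marker
    return any(k != _end for k in node)

d = {'hello': 1, 'he': 2, 'hell': 3, 'ell': 4, 'world': 5}
trie = make_trie(*d.keys())
reduced = {k: v for k, v in d.items()
           if not is_strict_prefix_of_word_in_trie(trie, k)}
# 'he' and 'hell' are prefixes of 'hello' and are dropped;
# 'ell' is a substring of 'hello' but NOT a prefix, so it survives:
# reduced == {'hello': 1, 'ell': 4, 'world': 5}
```

Building the trie is O(total characters), and each key is checked in O(len(key)), so the whole reduction is linear in the input size rather than quadratic — at the cost of missing non-prefix substrings.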

If the dictionary is static, IMHO it is useless to optimize the operation: it will run only once, and in less time than you would need to carefully optimize and test the optimization.

If the dictionary is dynamic, you could attach a timestamp to each value, if that makes sense, to keep track of the keys that have already been cleaned. When you start the cleaning process again, you then have 2 sets of keys: the already-processed ones (size n1) and the new ones (size n2). You only compare:

  • whether a new key is a substring of an old key
  • whether an old key is a substring of a new key
  • whether a new key is a substring of another new key

That gives n2 * (n2 + 2 * n1) comparisons. If n1 >> n2, this is about O(2 * n1 * n2).

Alternatively, if adding an item to the dictionary is not a time-constrained operation (i.e. not interactive), you can run the test at each insertion, in O(2n) per key, without storing anything extra (neither kept-key lists nor timestamps).

In fact, if you clean your dictionary once with the trivial O(n²) algorithm and then check each new key as it is generated, you can safely assume that none of the existing keys is a substring of another. You just need to check:

  • whether the new key is a substring of an existing key — n operations in the worst case (but probably the most common outcome)
  • whether an existing key is a substring of the new key — n operations in every case

The only requirement is that you never try to add a key before the cleanup triggered by the previous one has finished. That is guaranteed if a single thread in a single process uses the dictionary; otherwise you need synchronization.
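A minimal sketch of that incremental scheme, assuming the dictionary already satisfies the no-substring invariant (the helper name `add_key` is mine, not from the answer):

```python
def add_key(d, key, value):
    """Insert key into d, preserving the invariant that no key is a
    substring of another. Returns False if the key was rejected.
    Costs at most 2n substring checks per insertion."""
    for existing in d:
        if key != existing and key in existing:
            return False  # new key is contained in an existing key
    # drop existing keys contained in the new key (snapshot first,
    # since we cannot delete while iterating over the dict)
    for existing in [k for k in d if k != key and k in key]:
        del d[existing]
    d[key] = value
    return True

d = {}
add_key(d, 'hello', 1)       # accepted
add_key(d, 'he', 2)          # rejected: 'he' is inside 'hello'
add_key(d, 'helloworld', 3)  # accepted; evicts 'hello'
# d == {'helloworld': 3}
```

Each call is O(n * L) for n keys of length up to L, so 100,000 insertions still total O(n²) work overall — but it is spread over the insertions instead of one big batch at the end.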


Since the keys are strings, you can use the find method to detect substrings and delete their keys.

If d is a dictionary,

    d = {'hello': 1, 'he': 2, 'llo': 3, 'world': 4, 'wor': 5, 'ld': 6, 'python': 2.7}
    # iterate over snapshots of the keys, since we mutate d inside the loops
    for key in list(d.keys()):
        for sub in list(d.keys()):
            if key.find(sub) >= 0 and key != sub:
                # pop instead of del: sub may already have been removed
                d.pop(sub, None)

d will then be

 {'python': 2.7, 'world': 4, 'hello': 1} 

Source: https://habr.com/ru/post/1200812/

