Exact set for keywords in the text

The text contains keywords and the start / end position of their occurrences. Keywords may partially overlap, for example. "something" β†’ "something" / "some" / "thing":

keywords_occurences = { "key_1": [(11, 59)], "key_2": [(24, 46), (301, 323), (1208, 1230), (1673, 1695)], "key_3": [(24, 56), (1208, 1240)], ... } 

I need to choose one position for each keyword so that they do not overlap, to solve this case:

 key_1: 11-59 key_2: 301-323 (or 1673-1695, it does not matter) key_3: 1208-1240 

If not, select the maximum number of unique non-overlapping keys.

It looks like a "hit" type problem, but I cannot find a description of the algorithm.

+4
source share
1 answer

I think the following code does what you want.

 #!/usr/bin/env python # keyword occurrences -> [('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())] kw_all_occ = {"key_1" : [(11, 59)], "key_2" : [(24, 56), (301, 333), (1208, 1240), (1673, 1705)], "key_3" : [(24, 46), (1208, 1230)]} def non_overlapping_occ(occ): # dictionary with all keyword occurrences all_occ = dict({}) all_occ.update(occ) # list with the first non overlapping occurrences of every keyword -> # [('key_1', (start_1, end_1)),('key_2', (start_2, end_2)),...] result = [] # Sort keywords by length -> [(22, 'key_3'), (32, 'key_2'), (48, 'key_1')] kw_lengths = [] for k, v in all_occ.iteritems(): kw_lengths.append((v[0][1] - v[0][0], k)) kw_lengths.sort() while len(kw_lengths): # Current longest keyword longest_keyword = kw_lengths.pop(-1)[1] try: result.append((longest_keyword, all_occ[longest_keyword][0])) # Remove occurrences overlapping any occurrence of the current # longest_keyword value for item in all_occ[longest_keyword]: start = item[0] end = item[1] for l, k in kw_lengths: v = all_occ[k] all_occ[k] = filter(lambda x: (x[0] > end) | (x[1] < start), v) except IndexError: result.append((longest_keyword, ())) return result print non_overlapping_occ(kw_all_occ) 

He produces the following conclusion:

 vicent@deckard :~$ python prova.py [('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())] 

Please note that I do not use sets in code. Your question simply suggests that kits can help solve the problem, so I understand that using kits is not mandatory for you.

Also note that the code has not been deeply tested, but it seems to work very well (it also correctly refers to the keywords in your question. In fact, these cases can be resolved using simpler but less general code).

+1
source

Source: https://habr.com/ru/post/1432829/


All Articles