How can I find a set of keywords in a document, with all or some of them within a certain distance of each other?

I have a set of about 10 keywords. I want to search a very long document and check not just whether the keywords occur in the text, but whether all of them, or some subset of them, occur close together, for example within 3 sentences, or within 30 words, or by some other proximity measure. How can I do that? My first thought was to write Python code that finds one of the keywords and then checks whether any of the other keywords appears within about three lines of text, but that would take a lot of processing and seems inefficient.

+4
4 answers

I would suggest solving this by building a map (hash): use each word as a key, and append the word's location to the list stored as that key's value.

For the text "The quick brown fox jumps over the lazy dog", this results in the model shown below (in JSON format).

Note: Here all words are added to the index as if they were written in lower case.

{
    "document": [
        {
            "key": "the",
            "value": [
                {
                    "location": 1
                },
                {
                    "location": 7
                }
            ]
        },
        {
            "key": "quick",
            "value": [
                {
                    "location": 2
                }
            ]
        },
        {
            "key": "brown",
            "value": [
                {
                    "location": 3
                }
            ]
        },
        {
            "key": "fox",
            "value": [
                {
                    "location": 4
                }
            ]
        },
        {
            "key": "jumps",
            "value": [
                {
                    "location": 5
                }
            ]
        },
        {
            "key": "over",
            "value": [
                {
                    "location": 6
                }
            ]
        },
        {
            "key": "lazy",
            "value": [
                {
                    "location": 8
                }
            ]
        },
        {
            "key": "dog",
            "value": [
                {
                    "location": 9
                }
            ]
        }
    ] 
}

Once this index is built, it is easy to see how far apart different words are. For example, the word "the" occurs at locations 1 and 7.

Also, the number of times a word appears in the text is simply the number of locations stored for that word.
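
A minimal sketch of building such an index in Python; the whitespace tokenization is a simplifying assumption (real text would also need punctuation handling):

from collections import defaultdict

def build_index(text):
    # Map each lower-cased word to the list of 1-based positions where it occurs
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        index[word].append(position)
    return index

index = build_index("The quick brown fox jumps over the lazy dog")
print(index["the"])       # [1, 7] -> the two occurrences are 6 words apart
print(len(index["fox"]))  # 1 occurrence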


+1

The idea is to scan the document with a sliding window of the given length, keeping track of which keywords currently fall inside the window. As soon as every keyword is present in the window at the same time, you have a match. This runs in O(len(document)) time and uses O(len(window)) extra space.

A Python implementation:

from collections import defaultdict

def isInProximityWindow(doc, keywords, windowLen):
    words = doc.split()
    wordsLen = len(words)
    if windowLen > wordsLen:
        windowLen = wordsLen

    keywordsLen = len(keywords)
    allKeywordLocs = defaultdict(set)  # keyword -> positions currently inside the window
    numKeywordsInWindow = 0
    locKeyword = {}                    # position -> keyword found at that position
    for i in range(wordsLen):
        windowContents = sorted(k for k in allKeywordLocs if allKeywordLocs[k])
        print("On beginning of iteration #%i, window contains '%s'" % (i, ','.join(windowContents)))

        # Evict the keyword (if any) that just slid out of the window
        oldKeyword = locKeyword.pop(i - windowLen, None)
        if oldKeyword:
            keywordLocs = allKeywordLocs[oldKeyword]
            keywordLocs.remove(i - windowLen)
            if not keywordLocs:
                print("'%s' fell out of window" % oldKeyword)
                numKeywordsInWindow -= 1
        word = words[i]
        print("Next word is '%s'" % word)
        if word in keywords:
            locKeyword[i] = word
            keywordLocs = allKeywordLocs[word]
            if not keywordLocs:
                print("'%s' fell in window" % word)
                numKeywordsInWindow += 1
                if numKeywordsInWindow == keywordsLen:
                    return True
            keywordLocs.add(i)
    return False

Example run:

>>> isInProximityWindow("the brown cow jumped over the moon and the red fox jumped over the black dog", {"fox", "over", "the"}, 4)
On beginning of iteration #0, window contains ''
Next word is 'the'
'the' fell in window
On beginning of iteration #1, window contains 'the'
Next word is 'brown'
On beginning of iteration #2, window contains 'the'
Next word is 'cow'
On beginning of iteration #3, window contains 'the'
Next word is 'jumped'
On beginning of iteration #4, window contains 'the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #5, window contains 'over'
Next word is 'the'
'the' fell in window
On beginning of iteration #6, window contains 'over,the'
Next word is 'moon'
On beginning of iteration #7, window contains 'over,the'
Next word is 'and'
On beginning of iteration #8, window contains 'over,the'
'over' fell out of window
Next word is 'the'
On beginning of iteration #9, window contains 'the'
Next word is 'red'
On beginning of iteration #10, window contains 'the'
Next word is 'fox'
'fox' fell in window
On beginning of iteration #11, window contains 'fox,the'
Next word is 'jumped'
On beginning of iteration #12, window contains 'fox,the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #13, window contains 'fox,over'
Next word is 'the'
'the' fell in window
True
+3

Benchmark setup:

  • Python 3.4 on Windows
  • 150 …, 5 to 16 … each
  • 10 search words
  • window length of 75
  • 50 …, 514 …

The word generator:

import numpy as np

def generator(gen_salt):
    # word(i), n_distinct_words and n_words are defined elsewhere in the benchmark
    words = [word(i) for i in range(n_distinct_words)]
    np.random.seed(123)

    for i in range(int(n_words)):
        yield words[np.random.randint(0, n_distinct_words)]

The search function, where words is a generator, search_words is a set and window_len is an int:

from collections import deque
from time import time

def deque_window(words, search_words, window_len):
    start = time()
    result = []
    pos = 0

    # Bounded deque: appending beyond window_len silently drops the oldest word
    window = deque([], window_len)

    for word in words:
        window.append(word)

        if word in search_words:
            # Check whether every search word is currently inside the window
            all_found = True
            for search_word in search_words:
                if search_word not in window:
                    all_found = False
                    break

            if all_found:
                result.append(pos)

        pos += 1

    return result, time() - start

The timings were 31 and 48 seconds; a large share of that time is spent in the per-word randint call, so drawing words from a pre-generated list would make it noticeably faster.

The total number of characters in the generated text:

sum(len(w) for w in words)
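
A minimal usage sketch tying the generator and deque_window together; the word() helper and all parameter values below are hypothetical stand-ins, not the benchmark's real setup:

import numpy as np

# Hypothetical stand-ins for the pieces the benchmark defines elsewhere
n_distinct_words = 50
n_words = 100000

def word(i):
    return "w%d" % i

search_words = {word(i) for i in range(10)}
matches, elapsed = deque_window(generator(0), search_words, 75)
print("%d window positions matched in %.2f s" % (len(matches), elapsed))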
+1

Take a look at Apache Solr.

Apache Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.

It can handle this kind of proximity search and scales to terabytes of indexed text.
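
As a rough sketch: Lucene/Solr query syntax supports proximity ("slop") queries of the form "word1 word2"~N, which match the terms within N positions of each other. The core name, field name and local URL below are assumptions for illustration:

import requests

# Proximity (slop) query: "fox" and "dog" within 30 positions of each other
params = {"q": 'text:"fox dog"~30', "wt": "json"}
resp = requests.get("http://localhost:8983/solr/docs/select", params=params)
print(resp.json()["response"]["numFound"], "matching documents")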

0

Source: https://habr.com/ru/post/1611721/

