Use Python to search a single .txt file for a list of words or phrases (and show context)

Basically, as the title says. I am new to Python and love to learn by seeing and doing.

I would like to create a script that searches a text document (for example, text copied and pasted from a news article) for specific words or phrases. Ideally, the list of words and phrases would be stored in a separate file.

It would be great to get some context with the results. Perhaps the script could print the 50 characters of text before and after each search term it finds. It would also be great if it showed the line on which each term was found.

Any pointers on how to code this, or even code examples, would be much appreciated.

2 answers

Despite the frequently expressed antipathy toward regular expressions among many in the Python community, they are a valuable tool for the appropriate use cases, which certainly include identifying words and phrases (thanks to the \b "word boundary" element in regular expression patterns). Alternatives based on plain string processing are much more of a problem: for example, .split() uses whitespace as the separator and so, annoyingly, leaves punctuation attached to the adjacent words.
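
To make the difference concrete, here is a minimal sketch comparing the two approaches. The sample sentence is made up for illustration.

```python
import re

text = "The cat, smiling, ran away from the catalog."

# Plain whitespace splitting keeps punctuation glued to the words,
# so the token is 'cat,' and a lookup for 'cat' fails.
tokens = text.split()
print(tokens)
print("cat" in tokens)

# \b matches at word boundaries, so the comma is not a problem
# and "catalog" is not reported as a hit for "cat".
hits = re.findall(r"\bcat\b", text)
print(hits)
```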

If REs are OK, I would recommend something like:

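import re
import sys

def main():
  if len(sys.argv) != 3:
    print("Usage: %s fileofstufftofind filetofinditin" % sys.argv[0])
    sys.exit(1)

  # One word or phrase per line; blank lines are skipped, because an
  # empty pattern (r'\b\b') would match at every word boundary.
  with open(sys.argv[1]) as f:
    patterns = [r'\b%s\b' % re.escape(s.strip()) for s in f if s.strip()]
  regex = re.compile('|'.join(patterns))

  # Report every line that contains at least one of the words or phrases,
  # numbering lines from 1.
  with open(sys.argv[2]) as f:
    for i, s in enumerate(f, start=1):
      if regex.search(s):
        print("Line %s: %r" % (i, s))

main()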

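The first argument is the path to a file containing the words or phrases to find (one per line), and the second is the path to the file to search in. It would be easy, if desired, to make the search case-insensitive (perhaps optionally, via a command-line switch), and so on.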

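Some explanation for readers unfamiliar with REs:

The \b item in the patterns ensures that there are no accidental matches (if you search for "cat" or "dog", you will not get an accidental hit on "catalog" or "underdog", and you will not miss the hit in "The cat, smiling, ran away" just because some splitting approach decided that the word there was "cat," with the comma attached).

The | item means "or", so, for example, from a text file with these contents (two lines):

cat
dog

this forms the pattern '\bcat\b|\bdog\b', which finds either "cat" or "dog" (as stand-alone words, ignoring punctuation, but rejecting hits inside longer words).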
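
The cat/dog pattern can be checked in isolation with a short sketch. The list `words` stands in for the lines of the word-list file, and the sample strings are invented for illustration.

```python
import re

words = ["cat", "dog"]  # stands in for the two lines of the word-list file

# Same construction as in the script: wrap each word in \b...\b, join with |.
pattern = "|".join(r"\b%s\b" % re.escape(w) for w in words)
regex = re.compile(pattern)
print(pattern)

samples = [
    "The cat, smiling, ran away",  # stand-alone word followed by a comma
    "See the catalog",             # "cat" inside a longer word
    "An underdog story",           # "dog" inside a longer word
    "dog-eared pages",             # stand-alone word followed by a hyphen
]
matches = [bool(regex.search(s)) for s in samples]
for s, m in zip(samples, matches):
    print("%-30r %s" % (s, m))
```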

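re.escape escapes punctuation so that it is matched literally, rather than with the special meaning it would normally have in an RE pattern.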
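
The question also asked for roughly 50 characters of context around each hit, plus the line number. The script above prints whole matching lines, so what follows is only one possible sketch of that extension, not part of the original answer. It assumes the whole text fits in memory and that the phrase list is not empty; `find_with_context` and its `width` parameter are hypothetical names chosen for this example.

```python
import re

def find_with_context(text, phrases, width=50):
    """Yield (line_number, matched_text, context) for every hit in text."""
    regex = re.compile("|".join(r"\b%s\b" % re.escape(p) for p in phrases))
    for match in regex.finditer(text):
        # Line numbers are 1-based: count the newlines before the hit.
        line_number = text.count("\n", 0, match.start()) + 1
        # Take up to `width` characters on either side of the hit.
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        context = text[start:end].replace("\n", " ")
        yield line_number, match.group(0), context

sample = "First line.\nThe cat sat on the mat.\nA 3.5 inch floppy, not 345."
results = list(find_with_context(sample, ["cat", "3.5"], width=10))
for line_number, found, context in results:
    print("Line %d: %r in ...%s..." % (line_number, found, context))
```

Because of re.escape, the dot in "3.5" is matched literally, so "345" in the sample is not reported.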

+6

A simpler, set-based approach that does not use regular expressions:

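import sys

words = "foo bar baz frob"

word_set = set(words.split())
# Usage: python script.py filetosearch
with open(sys.argv[1]) as f:
    for line_number, line in enumerate(f, start=1):
        # A non-empty intersection means the line contains one of the words.
        if word_set.intersection(line.split()):
            print("%d:%s" % (line_number, line.strip()))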

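A few notes:

  • The words are stored in a set rather than a list. Sets are built for fast membership tests and intersections (a lookup in a set is O(1) on average, while in a list it is O(n)).

  • The file is read line by line through enumerate, which yields the line number together with each line. sys.argv is the list of command-line arguments; sys.argv[0] is the name of the Python script.

  • intersection returns the set of words the two collections have in common. An empty set is false in a boolean context; a non-empty set is True (i.e. at least one word was found), so the line is printed.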
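
The set test from the script can be tried on its own. This is a small sketch with invented sample lines; note how the last line is missed because split() leaves the comma attached to the word.

```python
word_set = set("foo bar baz frob".split())

lines = [
    "nothing to see here",
    "a foo walks into a bar",
    "foo, with a comma",       # split() yields 'foo,', which is not in the set
]

for line in lines:
    common = word_set.intersection(line.split())
    # An empty set is falsy, so `if common:` is True only when a word was found.
    print("%-26r -> %s" % (line, sorted(common)))
```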

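Things you could improve (left as exercises):

  • Read the word list from a file (for example, one named by sys.argv[2]) instead of hard-coding it. Note that sets are filled with add and update (rather than append and extend, as lists are).

  • This only finds single words, not phrases (which contain spaces). To handle phrases you need a different test, for example any(phrase in line for phrase in set_of_phrases). (That is a substring test, so it is slower and also matches inside longer words.)

  • To show context, keep track of the neighbouring lines (say, prev_line and next_line) and print them along with the matching line. That means the loop has to run one line ahead: the line just read is next_line, and the test is applied to the line before it.

  • In Pythonic Python you rarely index with i-1, i and i+1 (as you might in other languages). Python's iteration tools let you walk over the items themselves. If you need each item together with its neighbours, a generator such as the following does it: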

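    def context_generator(iterable):
        """Yield (previous, current, following) for each element of iterable.

        Missing neighbours are None; elements themselves must not be None.
        """
        prev, current, nxt = None, None, None
        for element in iterable:
            # Slide the three-item window forward by one element.
            prev, current, nxt = current, nxt, element
            if current is not None:
                yield prev, current, nxt
        # The last element has no successor.
        if nxt is not None:
            yield current, nxt, None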
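
As a rough sketch of how the suggestions above might fit together, the following combines a set of phrases, the any(...) substring test and the neighbour generator. `search_with_neighbours` and the sample data are hypothetical, and the generator is repeated (with `next` renamed to `nxt` so it does not shadow the built-in) only to keep the example self-contained.

```python
def context_generator(iterable):
    """Yield (previous, current, following) for each element; gaps are None."""
    prev, current, nxt = None, None, None
    for element in iterable:
        prev, current, nxt = current, nxt, element
        if current is not None:
            yield prev, current, nxt
    if nxt is not None:
        yield current, nxt, None

def search_with_neighbours(lines, phrases):
    """Yield (line_number, prev_line, line, next_line) for each matching line."""
    numbered = enumerate(lines, start=1)
    for prev, current, nxt in context_generator(numbered):
        number, line = current
        # Substring test: handles phrases, but also hits inside longer words.
        if any(phrase in line for phrase in phrases):
            yield (number, prev[1] if prev else None, line, nxt[1] if nxt else None)

phrases = set()
phrases.update(["ran away", "dog"])  # sets grow with add/update, not append/extend

lines = ["It was quiet.", "The cat ran away.", "Nobody noticed.", "An underdog won."]
results = list(search_with_neighbours(lines, phrases))
for number, before, line, after in results:
    print("%d: %r (before: %r, after: %r)" % (number, line, before, after))
```

The hit on "An underdog won." shows the cost of the plain substring test: "dog" is found inside a longer word, which the \b-based regular expression in the first answer would reject.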
    
+3

Source: https://habr.com/ru/post/1749250/

