Iterating through a string one word at a time in Python

I have a string buffer holding the contents of a huge text file, and I have to search for specific words and phrases in that buffer. What is an efficient way to do this?

I tried using the re module, but since the text I have to search through is huge, it takes a lot of time.

I am given a dictionary of words and phrases.

I iterate over each file, read it line by line, check for every word and phrase from the dictionary, and increment that entry's count in the dictionary whenever a key is found.

One small optimization I made was to sort the dictionary of phrases/words from the largest number of words to the smallest. Then, at each word's starting position in the string buffer, I compare against that sorted list, and once one phrase is found I stop looking at the others (since the longest match is the one we want). A rough sketch of this idea is shown below.
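Roughly, the matching step I have in mind looks like this (only a sketch; entries_by_length and the simple tokenization are made up for illustration, while data and dictionary_entity are as in my code further down):

# Longest entries first, so a multi-word phrase wins over its prefix.
entries_by_length = sorted(dictionary_entity,
                           key=lambda e: len(e.split()), reverse=True)
tokens = data.lower().split()

for i in range(len(tokens)):
    for entry in entries_by_length:
        entry_words = entry.split()   # re-splitting each time is wasteful; sketch only
        if tokens[i:i + len(entry_words)] == entry_words:
            dictionary_entity[entry] += 1
            break                     # longest match found, skip shorter entries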

Can someone tell me how to walk through the lowercased string buffer one word at a time (i.e. iterate the string buffer word by word)?

Also, is there any other optimization that can be done on this?

data = str(file_content)
for j in dictionary_entity.keys():
    # str.count() returns 0 when nothing is found, never -1
    cnt = data.count(j + " ")
    if cnt:
        dictionary_entity[j] = dictionary_entity[j] + cnt
f.close()
+3
8 answers

Iterating word by word through the contents of a file (The Wizard of Oz from Project Gutenberg, in my case), three different ways:

from __future__ import with_statement
import time
import re
from cStringIO import StringIO

def word_iter_std(filename):
    start = time.time()
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word
    print 'iter_std took %0.6f seconds' % (time.time() - start)

def word_iter_re(filename):
    start = time.time()
    with open(filename) as f:
        txt = f.read()
    for word in re.finditer(r'\w+', txt):
        yield word.group(0)  # finditer yields match objects; return the text
    print 'iter_re took %0.6f seconds' % (time.time() - start)

def word_iter_stringio(filename):
    start = time.time()
    with open(filename) as f:
        io = StringIO(f.read())
    for line in io:
        for word in line.split():
            yield word
    print 'iter_io took %0.6f seconds' % (time.time() - start)

woo = '/tmp/woo.txt'

for word in word_iter_std(woo): pass
for word in word_iter_re(woo): pass
for word in word_iter_stringio(woo): pass

Result:

% python /tmp/junk.py
iter_std took 0.016321 seconds
iter_re took 0.028345 seconds
iter_io took 0.016230 seconds
+7

This sounds like a problem where a trie would really help, ideally a compressed one such as a Patricia/radix trie. As long as the whole dictionary of words/phrases fits in the trie, this greatly reduces the work per lookup: you take the start of a word in the buffer, descend the trie as far as it keeps matching, and increment the counter in the node where the longest match ends; then you move on to the next word and repeat. The advantage is that each descent through the trie effectively searches the entire dictionary at once (a lookup costs roughly O(m), where m is the average length of a word/phrase in your dictionary).

If memory is a problem, you can always split the dictionary into several tries (for example, one for words/phrases starting with a-l and another for m-z) and make a pass over the corpus with each trie.
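A minimal sketch of that idea, using a plain nested-dict trie rather than a real Patricia/radix trie (build_trie and count_matches are made-up names for illustration):

def build_trie(phrases):
    # Store each dictionary entry word by word; '#count' marks where an entry ends.
    trie = {}
    for phrase in phrases:
        node = trie
        for word in phrase.split():
            node = node.setdefault(word, {})
        node['#count'] = 0
    return trie

def count_matches(words, trie):
    # Walk the token list; at each position take the longest dictionary match.
    i = 0
    while i < len(words):
        node, j, last = trie, i, None
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if '#count' in node:
                last = (node, j)      # longest match seen so far
        if last:
            last[0]['#count'] += 1
            i = last[1]               # skip past the matched phrase
        else:
            i += 1
    return trie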

+1

If re cannot do this fast enough, it will be hard to do much better in pure Python; you have to read through the whole buffer either way. It would help to see the regular expression you are actually using (can you post it?) and some background on what the text and the dictionary look like.

0

If you have something on the order of 2,000,000 entries to look for, it may be more effective to turn the problem around: make a single pass over the text and count every word you see, something like:

word_counts = {}
for word in corpus:            # corpus: iterable of words from the text
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

Then, once word_counts is built, you look up each of your dictionary words in it instead of scanning the whole text for each one... (multi-word phrases would still need separate handling).
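With word_counts in hand, the lookup step could be as simple as this (dictionary_entity is the counter dict from the question; this covers single words only, not phrases):

for entry in dictionary_entity:
    dictionary_entity[entry] += word_counts.get(entry, 0)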

0

As xyld said, I don't think you can beat the speed of re, although it would help if you posted your regular expression and maybe some sample code. All I can add is: try profiling before you optimize; you may be surprised where the time actually goes. I use hotshot to profile my Python code and I'm pretty happy with it. A good introduction to Python profiling: http://onlamp.com/pub/a/python/2005/12/15/profiling.html.
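For example, a minimal hotshot session might look like this (process_buffer is a stand-in name for whatever function does the searching; the other names are from the question):

import hotshot, hotshot.stats

prof = hotshot.Profile('search.prof')        # write timing data to this file
prof.runcall(process_buffer, data, dictionary_entity)
prof.close()

stats = hotshot.stats.load('search.prof')    # load as pstats-style statistics
stats.sort_stats('time', 'calls')
stats.print_stats(20)                        # show the top 20 entries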

0

If re is not giving you the performance you need, you may be building the whole list of matches at once with findall(); finditer() hands them back lazily instead. For example:

>>> for i in re.finditer(r'\w+', 'Hello, this is a sentence.'):
...     print i.group(0)
...     
Hello
this
is
a
sentence
0
#!/usr/bin/env python
import re

# Build a large test string: 100,000 copies of a sentence, with the target
# phrase inserted halfway through and a sentinel word appended at the end.
s = ''
for i in xrange(0, 100000):
    s = s + 'Hello, this is a sentence. '
    if i == 50000:
        s = s + " my phrase "

s = s + 'AARRGH'

print len(s)

# One combined pattern: the phrase alternative first, then single words.
itr = re.compile(r'(my phrase)|(\w+)').finditer(s)
for w in itr:
    if w.group(0) == 'AARRGH':
        print 'Found AARRGH'
    elif w.group(0) == "my phrase":
        print 'Found "my phrase"'

Running this, we get

$ time python itrword.py
2700017
Found "my phrase"
Found AARRGH

real    0m0.616s
user    0m0.573s
sys     0m0.033s

But each "phrase" explicitly added to the regular expression will affect performance - up to 50% slower than just "\ w +", according to my rough measurements.

0

Have you considered the Natural Language Toolkit (NLTK)? It includes many nice features for working with a text corpus, and it also has the FreqDist class, which behaves both like a dict (it has keys) and like a list (you can slice it).
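A rough sketch with NLTK (assumes NLTK is installed; FreqDist details vary between NLTK versions, so treat this as illustrative):

from nltk import FreqDist

words = open('/tmp/woo.txt').read().lower().split()
fdist = FreqDist(words)

print fdist['dorothy']    # dict-like: count for a single word
print fdist.keys()[:10]   # list-like: older NLTK returns keys most-frequent-first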

0

Source: https://habr.com/ru/post/1744045/

