How to find word sequences in Python?

I have a large text file like this one:
http://www.fullbooks.com/The-Jacket-Star-Rover-1.html
With awk, I can do this:

cat example.txt | awk '{ print substr($0, index($0,$3)) }' | tr -sc "[A-Z][a-z][0-9]'" '[\012*]' | awk -- 'first!=""&&second!="" { print first,second,$0; } { first=second; second=$0; }' | sort | uniq -c | sort -nr | head -n20

The output is the top 20 most frequent sequences of three consecutive words:

 13 in the jacket
 11 I was a
 10 of the Yard
 10 me in the
  8 Captain of the
  7 times and places
  7 the Captain of
  7 in the prison
  7 in the dungeons
  7 in San Quentin
  7 I had been
  6 other times and
  6 hours in the
  6 are going to
  5 twenty four hours
  5 to take me
  5 the rest of
  5 the forty lifers
  5 the Board of
  5 that I had

Beginning with:

with open('example.txt') as raw:
    # replace newlines with spaces so words on adjacent lines are not glued together
    text = raw.read().replace('\n', ' ')
words = text.split()
...............

How can I get the same result with Python 3?

2 answers

That is a good start for counting the frequency of single words, but not of word sequences. I would:

  • read the file and split it as you do.
  • create triplets and feed them to collections.Counter (using tuple so that they are hashable)
  • filter and sort to display the triplets that occur at least 5 times

like this:

import collections

with open('example.txt') as raw:
    words = raw.read().split()

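# range stops at len(words) - 2 so that the last full triplet is included
c = collections.Counter(tuple(words[i:i+3]) for i in range(len(words) - 2))
# keep triplets seen at least 5 times, most frequent first
for x in sorted(((k, v) for k, v in c.items() if v >= 5), key=lambda x: x[1], reverse=True):
    print(x)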
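
As a variation on the steps above, the sliding window can be built with zip, and Counter.most_common can take the top entries, which mirrors the `sort -nr | head -n20` step of the awk pipeline. This is only a sketch: the helper name top_trigrams and the sample sentence are illustrative assumptions, not part of the original answer.

```python
import collections


def top_trigrams(words, n=20):
    """Return the n most common runs of three consecutive words."""
    # zip stops at the shortest input, so every full window is covered
    # and no explicit range arithmetic is needed.
    triplets = zip(words, words[1:], words[2:])
    return collections.Counter(triplets).most_common(n)


sample = "in the jacket and in the jacket again".split()
for triplet, count in top_trigrams(sample, n=3):
    # Same layout as `uniq -c`: the count first, then the words.
    print(f"{count:4d} {' '.join(triplet)}")
```

For the real file, words would be the list produced by the read-and-split step.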

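Now, str.split() does not handle punctuation well (for example, "Hello, World" is split into "Hello," and "World"), so you may want to split on non-alphanumeric characters instead: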

# requires "import re"; the raw string avoids an invalid-escape warning for \W
words = [x for x in re.split(r"\W", raw.read()) if x]

Result (with the regex split rather than str.split):

(('in', 'the', 'jacket'), 19)
(('of', 'the', 'Yard'), 13)
(('Captain', 'of', 'the'), 12)
(('I', 'was', 'a'), 12)
(('me', 'in', 'the'), 11)
(('in', 'the', 'prison'), 11)
(('in', 'the', 'dungeons'), 10)
(('hours', 'in', 'the'), 9)
(('in', 'San', 'Quentin'), 9)
(('I', 'don', 't'), 8)
(('He', 'was', 'a'), 8)
(('are', 'going', 'to'), 8)
(('I', 'had', 'been'), 7)
(('I', 'have', 'been'), 7)
(('in', 'order', 'to'), 7)
(('times', 'and', 'places'), 7)
(('five', 'pounds', 'of'), 7)
(('and', 'I', 'have'), 7)
(('the', 'Captain', 'of'), 7)
(('Darrell', 'Standing', 's'), 6)
(('I', 'did', 'not'), 6)
(('five', 'years', 'of'), 6)
(('Warden', 'Atherton', 'and'), 6)
(('Board', 'of', 'Directors'), 6)
(('thirty', 'five', 'pounds'), 6)
(('that', 'I', 'had'), 6)
(('pounds', 'of', 'dynamite'), 6)
(('other', 'times', 'and'), 6)
(('of', 'San', 'Quentin'), 5)
(('the', 'forty', 'lifers'), 5)
(('and', 'Captain', 'Jamie'), 5)
(('I', 'Darrell', 'Standing'), 5)
(('in', 'the', 'dungeon'), 5)
(('going', 'to', 'take'), 5)
...

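Note that you may also want to lowercase the words, so that triplets differing only in case are counted together ("In the woods" vs "in the woods").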
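
Splitting on punctuation and folding case can be combined in a small tokenizer. The following is a sketch under assumptions: the function name tokenize and the character class are hypothetical choices. The class keeps apostrophes inside words, as the `tr` step of the awk pipeline in the question appears to, so that "don't" stays one word instead of becoming 'don' and 't'.

```python
import re


def tokenize(text, lowercase=True):
    """Split text into words made of ASCII letters, digits and apostrophes."""
    if lowercase:
        # Fold case so that "In the" and "in the" count as the same words.
        text = text.lower()
    # findall returns only the matches, so no empty strings need filtering.
    return re.findall(r"[A-Za-z0-9']+", text)


print(tokenize("Hello, World! I don't know."))
```

The resulting list can be passed to collections.Counter in place of the output of str.split().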


An alternative that counts lowercased triplets in a plain dictionary:

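import re

frequency = {}
with open('example.txt') as raw:
    # split on non-word characters, drop empty strings, and lowercase
    words = [word.lower() for word in re.split(r"\W", raw.read()) if word]

for index, word in enumerate(words):
    # stop two words before the end so that a full triplet is available
    if index < (len(words) - 2):
        triplet = (word, words[index+1], words[index+2])
        if triplet in frequency:
            frequency[triplet] += 1
        else:
            frequency[triplet] = 1

# prints every triplet with its count, in insertion order (not sorted)
for triplet, rank in frequency.items():
    print(triplet, str(rank))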
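
Since the question mentions a large file, the same counting could also be done line by line without loading the whole text into memory. This is a minimal sketch, not part of the original answer: the function name count_trigrams_streaming is hypothetical, and r"\w+" is assumed to be an acceptable definition of a word.

```python
import collections
import re


def count_trigrams_streaming(lines):
    """Count three-word runs from an iterable of lines, e.g. an open file."""
    counts = collections.Counter()
    # A deque with maxlen=3 drops the oldest word automatically, so
    # triplets that span a line break are still counted.
    window = collections.deque(maxlen=3)
    for line in lines:
        for word in re.findall(r"\w+", line.lower()):
            window.append(word)
            if len(window) == 3:
                counts[tuple(window)] += 1
    return counts


# Any iterable of strings works; a file object would be passed the same way.
counts = count_trigrams_streaming(["In the jacket\n", "in the jacket\n"])
print(counts.most_common(1))
```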

Source: https://habr.com/ru/post/1688220/

