How to efficiently extract literal words from a continuous string?

Question

Possible duplicate:
How to split text without spaces into a list of words?

People's comments, parsed from HTML, contain a lot of text, but it has no delimiter characters. For example: thumbgreenappleactiveassignmentweeklymetaphor. Evidently the string contains "thumb", "green", "apple", etc. I also have a large dictionary to check whether a word is valid. So what is the fastest way to extract these words?

+2

python algorithm extract text-extraction

Peiyun Jul 20 '12 at 9:39

2 answers

A naive approach: check every substring against the dictionary.

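# Load the dictionary, one word per line.
with open('/usr/share/dict/words') as f:
    words = set(f.read().split())
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
# Test every substring s[i:i+j] against the dictionary.
# Note: these bounds never test a substring that ends at the last character of s.
for i in range(len(s) - 1):
    for j in range(1, len(s) - i):
        if s[i:i+j] in words:
            print(s[i:i+j])

With /usr/share/dict/words as the dictionary and the inner loop changed to for j in range(3, len(s) - i): (minimum word length 3), the output is: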

thumb
hum
green
nap
apple
plea
lea
act
active
ass
assign
assignment
sign
men
twee
wee
week
weekly
met
eta
tap
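
The output above also shows the limits of plain substring lookup: many hits overlap, and "metaphor" is missing because the loop bounds stop one character short of the end of the string. A minimal sketch of a bounded variant follows, assuming a small hypothetical word set (WORDS) and a hypothetical helper name (find_words); it caps the substring length at the longest dictionary word and includes substrings that end at the last character.

```python
# Hypothetical toy dictionary; in practice load a real word list instead.
WORDS = {"thumb", "green", "apple", "active", "assignment", "weekly", "metaphor"}


def find_words(s, words, min_len=3):
    """Return (start, word) for every dictionary word found inside s."""
    max_len = max(len(w) for w in words)
    hits = []
    for i in range(len(s)):
        # The end index j runs up to len(s) inclusive and never looks
        # further ahead than the longest dictionary word.
        for j in range(i + min_len, min(len(s), i + max_len) + 1):
            if s[i:j] in words:
                hits.append((i, s[i:j]))
    return hits


hits = find_words("thumbgreenappleactiveassignmentweeklymetaphor", WORDS)
print([word for _, word in hits])
```

With a real dictionary this still returns overlapping candidates such as "hum" or "nap", so it only lists possible words and does not choose a segmentation.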

+4

eumiro Jul 20 '12 at 9:45

Generic Human · Accepted Answer · 2012-07-21T00:03:16+0000

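I'm not sure a naive algorithm would serve your purpose well, as eumiro points out, so I'll describe a slightly more complex one.

The idea: the best way to proceed is to model the distribution of the output. A good first approximation is to assume that all words are independently distributed. Then you only need to know the relative frequency of each word. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.

Once the model is fixed, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of its individual words, and it is easy to compute with dynamic programming. Instead of using the probability directly, we use a cost defined as the logarithm of the inverse of the probability, to avoid overflows.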
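
To make the Zipf-based cost concrete, here is a small sketch under stated assumptions: ranked is a hypothetical frequency-ordered list (most frequent first) and sentence_cost is a hypothetical helper, neither of which appears in the answer. It only illustrates that cost = -log(probability) = log(n log N), so rarer words cost more and the cost of a candidate split is the sum of its word costs.

```python
import math

# Hypothetical frequency-ordered list, most frequent word first.
ranked = ["the", "of", "and", "apple", "metaphor"]
N = len(ranked)

# Zipf-style estimate: P(rank n) is roughly 1 / (n * log N),
# so cost = -log P = log(n * log N). Ranks start at 1.
cost = {w: math.log((n + 1) * math.log(N)) for n, w in enumerate(ranked)}


def sentence_cost(tokens):
    """Sum of word costs; words outside the list get an infinite cost."""
    return sum(cost.get(t, math.inf) for t in tokens)


print(cost["the"], cost["metaphor"])
print(sentence_cost(["the", "apple"]))
```

Summing logarithmic costs is equivalent to multiplying probabilities, which is why minimizing the total cost picks the most likely split under this model.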

import math

# Build a cost dictionary, assuming Zipf law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k,math.log((i+1)*math.log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
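
Since words-by-frequency.txt may not be at hand, the following self-contained sketch shows the same dynamic-programming idea on a tiny hypothetical ranked list (RANKED). It is not the answer's exact code: make_segmenter and segment are illustrative names, and this variant stores the length of the best last word at each position instead of recomputing it during backtracking.

```python
import math

# Hypothetical frequency-ordered word list, most frequent first.
RANKED = ["the", "a", "at", "on", "attack", "tack", "thumb", "green", "apple"]


def make_segmenter(ranked_words):
    n = len(ranked_words)
    # Same Zipf-style cost as above: log(rank * log N).
    wordcost = {w: math.log((i + 1) * math.log(n)) for i, w in enumerate(ranked_words)}
    maxword = max(len(w) for w in ranked_words)

    def segment(s):
        # cost[i]: minimal cost of splitting s[:i]; back[i]: length of its last word.
        cost, back = [0.0], [0]
        for i in range(1, len(s) + 1):
            c, k = min(
                (cost[i - k] + wordcost.get(s[i - k:i], math.inf), k)
                for k in range(1, min(i, maxword) + 1)
            )
            cost.append(c)
            back.append(k)
        # Walk the stored word lengths backwards to recover the split.
        out, i = [], len(s)
        while i > 0:
            out.append(s[i - back[i]:i])
            i -= back[i]
        return " ".join(reversed(out))

    return segment


segment = make_segmenter(RANKED)
print(segment("attackon"))         # attack on
print(segment("thumbgreenapple"))  # thumb green apple
```

In the first call "attack on" beats "at tack on" because one moderately frequent word costs less than two words whose costs add up, which is the effect the frequency model is meant to capture.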

The examples use a 125k-word dictionary that I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.

: , , , odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery whetherthewordisreasonablesowhatsthefastestwayofextractionalot.
: , html, , , . , , .. , , .

: itwasadarkandstormynighttherainfellintorrentsexceptatocial intervalswhenchwatchhecked theaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatceneliesrattlingalhousetopsandfiercelyagitatingthescantyflameamphelthstthgrgledagainstthedarkness.
: , , , , , , , .

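The segmentation is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter; otherwise the results will be very bad.

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.

If you need to process a very large consecutive string, it would be reasonable to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This keeps memory usage to a minimum and will almost certainly have no effect on quality.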
