Python - need a fast algorithm that removes all words in a file that are derived from other words

We have a file called wordlist.txt that contains 1,876 KB of alphabetized words, all of which are longer than 4 letters, with one carriage return between each new two-letter prefix group (ab, ac, ad, etc.); the words themselves are also separated by returns:

wfile = open("wordlist.txt", "r+") 

I want to create a new file containing only words that are not derived from other, smaller words. For example, the word list contains "abuse, abused, abuser, abusers, abuses, etc.". The new file should contain only the word "abuse", because it is the "lowest common denominator" (if you will) among all those words. Similarly, the word "rodeos" would be deleted because it contains the word "rodeo".
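To make the intent concrete, here is a tiny made-up excerpt (these lines are illustrative, not taken from the real file) and what should survive the filtering:

 abuse    -> kept (no shorter word in the list is a prefix of it)
 abused   -> dropped (starts with "abuse")
 abuser   -> dropped (starts with "abuse")
 abusers  -> dropped (starts with "abuse")
 abuses   -> dropped (starts with "abuse")
 rodeo    -> kept
 rodeos   -> dropped (starts with "rodeo")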

I tried this implementation:

 def root_words(wordlist):
     result = []
     base = wordlist[1]
     for word in wordlist:
         if not word.startswith(base):
             result.append(base)
             print base
         base = word
     result.append(base)
     return result

 def main():
     wordlist = []
     wfile = open("wordlist.txt", "r+")
     for line in wfile:
         wordlist.append(line[:-1])
     wordlist = root_words(wordlist)
     newfile = open("newwordlist.txt", "r+")
     newfile.write(wordlist)

But it always froze on my computer. Any solutions?

+4
3 answers

I would do something like this:

 def bases(words):
     base = next(words)
     yield base
     for word in words:
         if word and not word.startswith(base):
             yield word
             base = word

 def get_bases(infile, outfile):
     with open(infile) as f_in:
         words = (line.strip() for line in f_in)
         with open(outfile, 'w') as f_out:
             f_out.writelines(word + '\n' for word in bases(words))
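For completeness, one way to invoke it (the file names are the ones from the question; the __main__ guard is my addition, not part of the original answer):

 if __name__ == '__main__':
     # read wordlist.txt, write only the root words to newwordlist.txt
     get_bases("wordlist.txt", "newwordlist.txt")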

This goes through the corncob list of 58,000 words in a fifth of a second on my fairly old laptop. It is old enough to have only one gig of memory.

 $ time python words.py

 real    0m0.233s
 user    0m0.180s
 sys     0m0.012s

It uses iterators everywhere it can to go easy on memory. You could probably squeeze out a bit more performance by slicing off the end of each line instead of using strip to get rid of the newlines, as sketched below.
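A sketch of that tweak, as a drop-in replacement for the strip-based generator inside get_bases (this assumes every line really does end in a single newline, as the question's file format suggests):

 # line[:-1] just chops the trailing newline; strip() also scans for other
 # surrounding whitespace, which is slightly more work per line.
 words = (line[:-1] for line in f_in)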

Also note that this relies on your input being sorted and non-empty. That was part of the stated preconditions, though, so I don't feel too bad about it ;)
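If you did want to guard against an empty file, one possible variation (my own tweak, not part of the original answer) is to give next() a default instead of letting it raise StopIteration:

 def bases(words):
     base = next(words, None)   # None instead of StopIteration on empty input
     if base is None:
         return                 # nothing to yield
     yield base
     for word in words:
         if word and not word.startswith(base):
             yield word
             base = word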

+3

One possible improvement is to load the words into a database, so the full input file never has to sit in RAM. Another option is to process the words as they are read from the file and write the results out without loading everything into memory.

The following example processes the file as it is read, without first loading everything into memory.

 def root_words(f, out):
     # strip the newline so startswith() compares the words themselves,
     # not "word\n" against "otherword\n"
     base = f.readline().strip()
     for line in f:
         word = line.strip()
         if not word.startswith(base):
             out.write(base + "\n")
             base = word
     out.write(base + "\n")

 def main():
     wfile = open("wordlist.txt", "r+")
     newfile = open("newwordlist.txt", "w")
     root_words(wfile, newfile)
     wfile.close()
     newfile.close()

The memory complexity of this solution is O(1), since the base variable is the only thing you need to process the file. This can be done because the file is sorted alphabetically.
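As for the database option mentioned above, a rough sqlite3 sketch (my own illustration, not from the answer; the table and file names are made up) could look like this. It keeps the word list on disk and walks an ORDER BY query one row at a time:

 import sqlite3

 def root_words_db(infile, outfile, dbfile="words.db"):
     # Load the word list into an on-disk SQLite table so the full list
     # never has to sit in a Python list.
     conn = sqlite3.connect(dbfile)
     conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT)")
     conn.execute("DELETE FROM words")  # make repeated runs idempotent
     with open(infile) as f:
         conn.executemany("INSERT INTO words VALUES (?)",
                          ((line.strip(),) for line in f))
     conn.commit()

     base = None
     with open(outfile, "w") as out:
         # Iterating over the cursor pulls one row at a time into Python.
         for (word,) in conn.execute("SELECT word FROM words ORDER BY word"):
             if base is None or not word.startswith(base):
                 out.write(word + "\n")
                 base = word
     conn.close()

For a file of this size the plain streaming version above is simpler and faster; the database route mainly pays off if you want to keep the words around and query them again later.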

+2

Since the list is in alphabetical order, this does the trick (it takes 0.4 seconds with 5 MB of data, so there shouldn't be a problem with 1.8 MB):

 res = [" "] with open("wordlist.txt","r") as f: for line in f: tmp = line.strip() if tmp.startswith(res[-1]): pass else: res.append(tmp) with open("newlist.txt","w") as f: f.write('\n'.join(res[1:])) 
+1

Source: https://habr.com/ru/post/1336920/

