I posted this as a comment, but thought I might as well flesh it out into a full answer with some explanation:
You want to use str.split()
to split the string into words, and then stem each word:
for word in text.split(" "): PorterStemmer().stem_word(word)
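As written, that loop computes each stem and then throws it away. Here is a minimal sketch that collects the stems instead. It assumes a toy `toy_stem` function as a hypothetical stand-in for `PorterStemmer().stem_word`, so it runs without any stemming library installed; it is not a real Porter stemmer.

```python
def toy_stem(word):
    # Hypothetical stand-in for PorterStemmer().stem_word:
    # strips a few common suffixes from longer words.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "the cats were jumping and played"

# Split into words, stem each one, and keep the results in a list.
stems = [toy_stem(word) for word in text.split(" ")]
print(stems)  # ['the', 'cat', 'were', 'jump', 'and', 'play']
```

Note that text.split(" ") yields empty strings when the text contains consecutive spaces; text.split() with no argument splits on any run of whitespace instead.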
Since you want a single string containing all the stemmed words, it is then trivial to join the stems back together. To do this easily and efficiently, we use str.join()
and a generator expression:
" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))
Edit:
For your other problem:
with open("/path/to/file.txt") as f:
    words = {line.strip() for line in f}  # strip the trailing newline from each line
Here we open the file using the with
statement (this is the best way to open files, since it closes them correctly, even on exceptions, and is more readable) and read the contents into a set. We use a set because we do not care about word order or duplicates, and it will be more efficient later. I am assuming one word per line. If that is not the case and the words are separated by commas or whitespace, then using str.split()
as we did before (with appropriate arguments) is probably a good plan.
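For the comma-separated case, a rough sketch could look like the following. The file contents here are made up for illustration, and io.StringIO stands in for the open file so the example is self-contained; with a real file you would iterate over `f` from the with statement in the same way.

```python
import io

# Stand-in for an open file; the contents are a made-up example.
f = io.StringIO("the, and, of\na, an\n")

words = set()
for line in f:
    # Split each line on commas, strip surrounding whitespace and the
    # trailing newline, and skip any empty pieces.
    words.update(part.strip() for part in line.split(",") if part.strip())

print(sorted(words))  # ['a', 'an', 'and', 'of', 'the']
```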
stems = (PorterStemmer().stem_word(word) for word in text.split(" "))
" ".join(stem for stem in stems if stem not in words)
Here we use the if clause of a generator expression to ignore stems that are in the set of words loaded from the file. Membership checks on a set are O(1) on average, so this should be relatively efficient.
Edit 2:
To remove the words before they are stemmed, it's even simpler:
" ".join(PorterStemmer().stem_word(word) for word in text.split(" ") if word not in words)
Removing the given words is simple:
filtered_words = [word for word in unfiltered_words if word not in set_of_words_to_filter]
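Filtering after stemming and filtering before stemming can give different results, because the first compares stems against the set and the second compares the original words. The sketch below illustrates this. It again uses a toy `toy_stem` function as a hypothetical stand-in for `PorterStemmer().stem_word`, and a made-up set of words.

```python
def toy_stem(word):
    # Hypothetical stand-in for PorterStemmer().stem_word.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = {"jump", "cats"}  # made-up words to remove
text = "cats jump jumping"

# Filter AFTER stemming: the stems are compared against the set.
stems = (toy_stem(word) for word in text.split(" "))
after = " ".join(stem for stem in stems if stem not in words)

# Filter BEFORE stemming: the original words are compared against the set.
before = " ".join(toy_stem(word) for word in text.split(" ") if word not in words)

print(after)   # cat
print(before)  # jump
```

Which one you want depends on whether the file lists stems or unstemmed words.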