I need to stem Portuguese text. To do this, I tokenize the string with the nltk.word_tokenize() function, then stem each word individually, and finally rebuild the line. It works, but not well: it is too slow. How can I make it faster? The text is about 2 million words long.
tokens = nltk.word_tokenize(portugueseString)
textAux = ""
for token in tokens:
    tokenAux = stemmer.stem(token)
    textAux = textAux + " " + tokenAux
print(textAux)
Sorry for the bad English, and thanks!