Python: fast string concatenation

I need to run a stemmer over Portuguese strings. To do this, I tokenize the string with the nltk.word_tokenize() function and then stem each word individually. After that, I rebuild the string. It works, but it works slowly. How can I make it faster? The string is about 2 million words long.

    tokenAux=""
    tokens = nltk.word_tokenize(portugueseString)
        for token in tokens:
            tokenAux = token
            tokenAux = stemmer.stem(token)    
            textAux = textAux + " "+ tokenAux
    print(textAux)

Sorry for the bad English, and thanks!

+4
3 answers

Building a string with + in a loop is slow because Python strings are immutable: every concatenation creates a brand-new string and copies the old one into it. Use a generator expression (or a list comprehension) together with join instead. Here is a timing comparison of the original loop against a generator expression with join.

The test string my_text has len(my_text) -> 444399.

Timing the original loop with timeit:

%%timeit
tokenAux=""
textAux=""
tokens = nltk.word_tokenize(my_text)
for token in tokens:
    tokenAux = token
    tokenAux = stemmer.stem(token)    
    textAux = textAux + " "+ tokenAux

Result:

1 loop, best of 3: 6.23 s per loop

And the generator expression with join:

%%timeit 
' '.join(stemmer.stem(token) for token in nltk.word_tokenize(my_text))

Result:

1 loop, best of 3: 2.93 s per loop
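
One more variant worth trying: CPython's str.join materializes a generator argument into a list internally before joining, so passing a list comprehension directly can be slightly faster still (not benchmarked here):

    %%timeit
    ' '.join([stemmer.stem(token) for token in nltk.word_tokenize(my_text)])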
+3

Strings in Python are immutable. Look at this fragment:

textAux = ""
for token in tokens:
    # something important ...
    textAux = textAux + " "+ tokenAux

On every pass through the loop, a brand-new string is created and the whole of textAux is copied into it, so the total work grows quadratically with the length of the text.

Collect the stemmed tokens in a list instead, and join them once at the end; appending to a list is cheap. Like this:

textAux = []  # we declare a list for storing the stemmed tokens
tokens = nltk.word_tokenize(portugueseString)
for token in tokens:
    tokenAux = stemmer.stem(token)
    textAux.append(tokenAux)  # we add the new token to the resulting list

result = " ".join(textAux)  # join the list using a space as separator
print(result)

:)


+1

You could write the string to a text file, read it back with PySpark, and then perform the necessary operation on each word. This lets you carry out the operations in parallel, as sketched below.
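
A minimal sketch of that approach, assuming the text is saved to a file portuguese.txt and that NLTK's RSLPStemmer (the Portuguese stemmer) is in use; the file name and stemmer are placeholders, not details from the answer:

    # Sketch only: requires pyspark and nltk on the driver and the workers,
    # plus the RSLP stemmer data: nltk.download('rslp')
    from pyspark.sql import SparkSession
    from nltk.stem import RSLPStemmer

    spark = SparkSession.builder.appName("stemming").getOrCreate()
    sc = spark.sparkContext

    stemmer = RSLPStemmer()

    stems = (sc.textFile("portuguese.txt")            # one RDD element per line
               .flatMap(lambda line: line.split())    # split each line into words
               .map(lambda word: stemmer.stem(word))  # stem the words in parallel
               .collect())

    result = " ".join(stems)

Plain split() is used instead of word_tokenize so that each worker does not also need NLTK's tokenizer data.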

You can also use the multiprocessing module:
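
For example, a sketch with multiprocessing.Pool; the input file, chunk size, and choice of stemmer are assumptions:

    import multiprocessing

    import nltk
    from nltk.stem import RSLPStemmer

    stemmer = RSLPStemmer()  # assumed stemmer; needs nltk.download('rslp')

    def stem_word(word):
        return stemmer.stem(word)

    if __name__ == "__main__":
        # placeholder input: read the big Portuguese string from a file
        with open("portuguese.txt", encoding="utf-8") as f:
            portugueseString = f.read()

        # word_tokenize needs the punkt data: nltk.download('punkt')
        tokens = nltk.word_tokenize(portugueseString)
        with multiprocessing.Pool() as pool:
            # a large chunksize keeps inter-process overhead low
            # when there are millions of tokens
            stems = pool.map(stem_word, tokens, chunksize=10000)

        print(" ".join(stems))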

0

Source: https://habr.com/ru/post/1681747/

