Splitting sentences into words in Python

I am trying to split sentences into words.

words = content.lower().split()

This gives me a list of words like:

'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'

and with this code:

def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)

    return clean_word_list

I get something like:

'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'

Notice "morningthe" in the list: the original text had "--" between the two words, and removing it glued them together. How can I split them into two words instead, i.e. "morning", "the"?

+4
5 answers

I would suggest a regex-based solution:

import re

def to_words(text):
    return re.findall(r'\w+', text)

It finds all words, that is, runs of word characters (letters, digits, and underscores), and skips punctuation, delimiters, and whitespace.

>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']

Note that if you only iterate over the words, re.finditer is probably the better choice: it returns an iterator of match objects, so the entire list of words is never built at once.
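A minimal sketch of the re.finditer variant, assuming the same r'\w+' pattern as above; the name iter_words is made up for illustration:

```python
import re

def iter_words(text):
    """Yield words one at a time instead of building a full list."""
    for match in re.finditer(r'\w+', text):
        # match.group() is the text matched by the whole pattern
        yield match.group()

for word in iter_words("The morning-the evening"):
    print(word)
```

Wrapping the call in list(...) gives the same result as to_words, so this only pays off when the words are consumed one by one.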

+3

You can use itertools.groupby together with str.isalpha:

>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'

>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']

PS: A regex-based solution also works.


To the OP: if you only want to split on "--" as well, replace each '-' with a space ' ' before calling split. For example:

words = content.lower().replace('-', ' ').split()

Here words will hold the list you want.
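The same idea can be stretched to cover other punctuation by mapping each symbol to a space before splitting. This is only a sketch: using string.punctuation as the symbol set and the name split_words are assumptions, not part of the answer above.

```python
import string

# Translation table: every ASCII punctuation character becomes a space.
# (string.punctuation is an assumed stand-in for the asker's symbol list.)
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def split_words(content):
    # Lowercase, turn punctuation into spaces, then split on whitespace.
    return content.lower().translate(PUNCT_TO_SPACE).split()

print(split_words("Evening, and there was morning--the first day."))
```

Because apostrophes are punctuation too, a word like "don't" comes out as two pieces with this approach.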

+3

Trying to do this with a regex alone can go wrong, for example:

>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']

Instead, take a look at nltk.
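If a full tokenizer such as nltk is more than you need, one possible middle ground is a pattern that allows apostrophes inside a word. This is a sketch using only the standard re module; the pattern and the name words_with_apostrophes are illustrative choices, and it will not handle every edge case a real tokenizer does.

```python
import re

# A word is a run of word characters, optionally followed by
# one or more groups of an apostrophe plus more word characters.
WORD_RE = re.compile(r"\w+(?:'\w+)*")

def words_with_apostrophes(text):
    return WORD_RE.findall(text)

print(words_with_apostrophes("Don't read O'Rourke's books!"))
```

Hyphens and other punctuation still act as separators, so "morning--the" is split into "morning" and "the".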

+1

Besides the solutions already presented, you can also improve your clean_up_list function to do a better job.

def clean_up_list(word_list):
    clean_word_list = []
    # Move the list out of loop so that it doesn't
    # have to be initiated every time.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"

    for word in word_list:
        current_word = ''
        for char in word:
            if char in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += char

        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)

    return clean_word_list

In fact, you can apply the body of the for word in word_list: loop to the whole sentence directly and get the same result, provided whitespace is also treated as a separator.
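A sketch of that whole-sentence variant follows. It assumes whitespace counts as a separator alongside the symbols, and it uses string.punctuation as a stand-in for the symbol list; the name split_sentence is hypothetical.

```python
import string

def split_sentence(sentence, symbols=string.punctuation):
    words = []
    current_word = ''
    for char in sentence:
        # Both symbols and whitespace end the current word.
        if char in symbols or char.isspace():
            if current_word:
                words.append(current_word)
                current_word = ''
        else:
            current_word += char
    if current_word:
        # Append a possible trailing word.
        words.append(current_word)
    return words

print(split_sentence('evening, and there was morning--the first day.'))
```

This removes the need to call split() first, since the loop handles spaces itself.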

0

You can also do this:

import re

def word_list(text):
  # Raw string avoids an invalid-escape warning; filter(None, ...) drops empty strings.
  return list(filter(None, re.split(r'\W+', text)))

print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))

Output:

['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']
0

Source: https://habr.com/ru/post/1668049/

