Separation of sentences in python

Question

I am trying to split sentences into words.

words = content.lower().split()

This gives me a list of words like:

'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'

Then, after cleaning it up with this code:

def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        # Characters to strip; the backslash is escaped explicitly.
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list

I get something like:

'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'

Notice "morningthe" in the list: the original text had "--" between the two words, and removing it glued them together. How can I split them into two words instead, for example "morning", "the"?

python split python-3.x python-2.7

Yun tae hwang Jan 27 '17 at 21:57

5 answers

FlipTack · Answer 1 · 2017-01-27T22:02:14+0000

I would suggest a regex-based solution:

import re

def to_words(text):
    return re.findall(r'\w+', text)

It finds all words, that is, runs of word characters (letters, digits and underscores), and skips punctuation, delimiters and spaces.

>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']

Note that if you only iterate over the words, re.finditer is probably a better choice: it returns an iterator, so you do not have to build the entire list of words at once.
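A minimal sketch of the re.finditer variant mentioned above. The function name iter_words is hypothetical and not part of the answer; it simply wraps the same r'\w+' pattern in a generator.

```python
import re

def iter_words(text):
    """Yield words one at a time instead of building a full list."""
    for match in re.finditer(r'\w+', text):
        # match.group() is the matched run of word characters
        yield match.group()

for word in iter_words("The morning-the evening"):
    print(word)
```

This only pays off when the words are consumed once, for example while counting them; if you need random access, the list from re.findall is simpler.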

Moinuddin Quadri · Answer 2 · 2017-01-27T22:05:44+0000

You can use itertools.groupby together with str.isalpha to group consecutive alphabetic characters:

>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'

>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']

PS: A regex-based solution is also possible; this is just an alternative.

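For the OP: if all you need is to also split on --, you can replace '-' with ' ' before calling split, like this:

words = content.lower().replace('-', ' ').split()

words will then hold 'morning' and 'the' as separate items.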
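Note that replacing hyphens before splitting still leaves punctuation attached to tokens such as 'evening,' and 'day.'. A rough sketch of one way to finish the job, assuming it is enough to strip punctuation from both ends of each token (str.strip with string.punctuation is a substitution here, not part of the answer):

```python
import string

content = 'Evening, and there was morning--the first day.'

# Turn hyphens into spaces first so "morning--the" becomes two tokens.
tokens = content.lower().replace('-', ' ').split()

# Strip leading/trailing punctuation from each token and drop empties.
words = [t.strip(string.punctuation) for t in tokens]
words = [w for w in words if w]
print(words)
```

Punctuation inside a token (as in "don't") is left alone by this approach.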

John Machin · Answer 3 · 2017-01-27T22:23:26+0000

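Be careful: a simple regex also splits words at apostrophes, which is rarely what you want. For example: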

>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']

Take a look at nltk.
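If pulling in a tokenizer library is too much for your case, one possible workaround is a pattern that tolerates apostrophes inside a word. This is only a sketch under that assumption: the pattern and the name to_words_keep_apostrophes are hypothetical, and it does not handle typographic apostrophes or quoted words.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# apostrophe-joined runs, so "Don't" and "O'Rourke's" stay whole.
WORD_WITH_APOSTROPHES = re.compile(r"\w+(?:'\w+)*")

def to_words_keep_apostrophes(text):
    # The group is non-capturing, so findall returns the full matches.
    return WORD_WITH_APOSTROPHES.findall(text)

print(to_words_keep_apostrophes("Don't read O'Rourke's books!"))
```

Hyphens and other punctuation still act as separators, so "morning--the" continues to split into two words.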

Ares ou · Answer 4 · 2017-01-27T22:33:06+0000

Besides the solutions already presented, you can also improve your clean_up_list function so that it splits on symbols instead of just deleting them.

def clean_up_list(word_list):
    clean_word_list = []
    # Define the symbols outside the loop so that the string doesn't
    # have to be re-created for every word.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"

    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]

        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)

    return clean_word_list

Actually, you can apply the body of the for word in word_list: loop to the whole sentence and get the same result, as long as whitespace is also treated as a separator.
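As a sketch of that idea, the same loop can run over the raw sentence directly. The function name split_on_symbols is hypothetical, and treating whitespace as a separator is an added assumption, since the original symbols string does not contain a space.

```python
def split_on_symbols(sentence):
    # Same symbol set as clean_up_list, plus whitespace (added assumption)
    separators = set("~!@#$%^&*()_+`{}|\"?><-=\\][';/.,") | set(" \t\n")
    words = []
    current_word = ''
    for char in sentence:
        if char in separators:
            # A separator ends the current word, if there is one
            if current_word:
                words.append(current_word)
                current_word = ''
        else:
            current_word += char
    if current_word:
        # Append a possible trailing word
        words.append(current_word)
    return words

print(split_on_symbols('evening, and there was morning--the first day.'))
```

With this version there is no need to call split() first, because spaces end a word in the same way as any other symbol.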

Jason baker · Answer 5 · 2017-01-28T03:45:08+0000

You can also do this:

import re

def word_list(text):
  # Raw string avoids the invalid "\W" escape; filter(None, ...) drops empty strings.
  return list(filter(None, re.split(r'\W+', text)))

print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))

Output:

['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']
