Creating a list of each word from a text file without spaces, punctuation

Question

I have a long text file (script). I want to turn this text file into a list (where each word is separated) so that I can later review it.

At the moment, the code I have is:

    file = open('screenplay.txt', 'r')
    words = list(file.read().split())
    print words

I think this works to split all the words into a list, but I am having trouble removing all the extra stuff, such as commas and periods at the end of words. I also want to convert uppercase letters to lowercase (because I want to be able to search in lowercase and have both uppercase and lowercase words show up). Any help would be fantastic :)

+4

python

Tom f Aug 08 '13 at 20:57

7 answers

Brionius · Answer 1 · 2013-08-08T21:12:58+0000

This is a job for regular expressions!

For instance:

    import re

    file = open('screenplay.txt', 'r')
    # .lower() returns a version with all upper case characters replaced with lower case characters.
    text = file.read().lower()
    file.close()
    # replaces anything that is not a lowercase letter, a space, or an apostrophe with a space:
    text = re.sub('[^a-z\ \']+', " ", text)
    words = list(text.split())
    print words
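For anyone on Python 3, here is a minimal sketch of the same idea wrapped in a function. The name load_words and the utf-8 encoding are my assumptions, not part of the answer; the with-block just makes sure the file gets closed.

```python
import re


def load_words(path):
    """Read a text file and return its words, lowercased, punctuation removed."""
    # Assumption: the file is UTF-8 text. The with-block closes the file for us.
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    # Replace anything that is not a-z, a space, or an apostrophe with a space.
    text = re.sub(r"[^a-z ']+", " ", text)
    return text.split()
```

You would call it as load_words('screenplay.txt'). Note that, like the original pattern, this only keeps ASCII letters, so accented characters are treated as separators.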

Colonel panic · Answer 2 · 2013-08-08T21:15:33+0000

Try the algorithm from fooobar.com/questions/19946 / ... , i.e. split the text on whitespace, then strip the punctuation. This carefully removes punctuation from the edges of words without harming the apostrophes inside words such as we're.

    >>> import string
    >>> text
    "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"
    >>> text.split()
    ["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]
    >>> [word.strip(string.punctuation) for word in text.split()]
    ['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']

You might also want to add .lower() to each word.
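Putting the strip and the .lower() together could look roughly like this. It is only a sketch for Python 3; the function name strip_words is made up for illustration, and it also drops tokens that consist of nothing but punctuation.

```python
import string


def strip_words(text):
    """Split on whitespace, strip punctuation from the edges, lowercase each word."""
    words = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:  # skip tokens that were nothing but punctuation
            words.append(word)
    return words


sample = "'Oh, you can't help that,' said the Cat: 'we're all mad here.'"
print(strip_words(sample))
```

Apostrophes inside words such as can't and we're survive because str.strip only touches the two ends of each token.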

unutbu · Answer 3 · 2013-08-08T21:03:44+0000

The screenplay should be short enough to be read into memory all at once. If so, you can remove all the punctuation using the translate method. Finally, you can create your list simply by splitting on whitespace using str.split:

    import string

    with open('screenplay.txt', 'rb') as f:
        content = f.read()
        content = content.translate(None, string.punctuation).lower()
        words = content.split()

    print words

Note that this will change Mr.Smith to mrsmith. If you want it to become ['mr', 'smith'], you could replace all punctuation with spaces, and then use str.split:

    def using_translate(content):
        table = string.maketrans(
            string.punctuation,
            ' '*len(string.punctuation))
        content = content.translate(table).lower()
        words = content.split()
        return words

One problem that a positive regex pattern such as [a-z]+ might run into is that it will only match ASCII characters. If there were accented letters in the file, the words would get split apart. Gruyère will become ['Gruy','re'].

You can fix this by using re.split to split on punctuation. For instance,

    import re

    def using_re(content):
        words = re.split(r"[ %s\t\n]+" % (string.punctuation,), content.lower())
        return words

However, using str.translate is faster:

    In [72]: %timeit using_re(content)
    100000 loops, best of 3: 9.97 us per loop

    In [73]: %timeit using_translate(content)
    100000 loops, best of 3: 3.05 us per loop
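The translate snippets above are Python 2: string.maketrans and the two-argument translate(None, ...) are not available on str in Python 3. A rough Python 3 equivalent, assuming the file is read as text rather than bytes, could look like this (the names PUNCT_TO_SPACE and using_translate_py3 are mine):

```python
import string

# Map every ASCII punctuation character to a space (str.maketrans is the Python 3 spelling).
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def using_translate_py3(content):
    """Replace punctuation with spaces, lowercase, then split on whitespace."""
    return content.translate(PUNCT_TO_SPACE).lower().split()


print(using_translate_py3("Mr.Smith ate Gruyère, twice."))
```

Accented letters are left alone here because only the characters in string.punctuation are mapped. The apostrophe is one of them, though, so can't would come out as ['can', 't'].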

Brian mikey halbert · Answer 4 · 2013-08-08T21:03:28+0000

Use the replace method.

 mystring = mystring.replace(",", "")

If you want a more elegant solution that you will use many times, read up on regular expressions. Most languages support them, and they are extremely useful for more complex replacements and such.
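Since replace handles one substring per call, covering several punctuation marks means one call per character. A small sketch of that, where the helper name remove_chars and the default character list are only illustrative assumptions:

```python
def remove_chars(text, unwanted=",.!?;:"):
    """Remove each unwanted character with one str.replace call per character."""
    for ch in unwanted:
        text = text.replace(ch, "")
    return text


print(remove_chars("Hello, world! Ready?").lower().split())
```

This is fine for a handful of characters; for a long list, translate or a regular expression is probably less work.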

6502 · Answer 5 · 2013-08-08T21:04:58+0000

You can use a simple regular expression to create a set with all words (sequences of one or more alphabetic characters):

    import re
    words = set(re.findall("[a-z]+", f.read().lower()))

Using set, each word will be included only once.

Using findall alone, without set, will give you all the words in order.
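To make the list versus set difference concrete, here is a sketch for Python 3. I swapped in the pattern [^\W\d_]+ so that accented letters are matched as well; that pattern and the function names are my choice, not part of the answer.

```python
import re

# In Python 3, \w is Unicode-aware for str patterns, so this matches runs of
# word characters minus digits and underscore (roughly: letters, accents included).
WORD = re.compile(r"[^\W\d_]+")


def words_in_order(text):
    """All words, lowercased, in order of appearance (duplicates kept)."""
    return WORD.findall(text.lower())


def unique_words(text):
    """Each distinct word once; order is not preserved."""
    return set(words_in_order(text))


text = "The Cat sat. The cat ate Gruyère."
print(words_in_order(text))
print("cat" in unique_words(text))
```

The set is handy for the lowercase search from the question, since a membership test like "cat" in unique_words(text) matches both Cat and cat. Apostrophes still split words here, so can't becomes two tokens.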

Tiago martins · Answer 6 · 2013-08-08T21:13:02+0000

You can use a dictionary to specify which characters you do not want and what to replace them with, then transform the string based on those choices.

    replaceChars = {'.':'', ',':'', ' ':''}
    print reduce(lambda x, y: x.replace(y, replaceChars[y]), replaceChars, "ABC3.2,1,\nCda1,2,3....".lower())

Output:

    abc321
    cda123
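In Python 3, reduce is no longer a builtin and has to be imported from functools. A sketch of the same dictionary approach under that assumption (the wrapper name apply_replacements is hypothetical):

```python
from functools import reduce

replace_chars = {".": "", ",": "", " ": ""}


def apply_replacements(text, table):
    """Apply every key -> value replacement in table to text, one key at a time."""
    # Iterating over a dict yields its keys, so each reduce step replaces one key.
    return reduce(lambda s, ch: s.replace(ch, table[ch]), table, text)


print(apply_replacements("ABC3.2,1,\nCda1,2,3....".lower(), replace_chars))
```

Keep in mind that mapping the space to an empty string glues words together, so for a word list you would probably leave spaces out of the dictionary and call split() afterwards.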

MatLecu · Answer 7 · 2013-08-08T21:15:39+0000

You can try something like this. You may need to refine the regexp.

    import re
    text = file.read()
    words = map(lambda x: re.sub("[,.!?]", "", x).lower(), text.split())
