The screenplay should be short enough to be read into memory all at once. If it is, you can remove all the punctuation with the translate method, and then build your list by splitting on whitespace with str.split:
    import string

    # Python 2: str.translate(None, chars) deletes every character in chars.
    with open('screenplay.txt', 'rb') as f:
        content = f.read()
    content = content.translate(None, string.punctuation).lower()
    words = content.split()
    print words
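The snippet above relies on Python 2's two-argument `str.translate`. On Python 3, a minimal sketch of the same idea builds a deletion table with `str.maketrans`. The helper name `words_without_punctuation` and the sample sentence are made up for illustration.

```python
import string

# The third argument to str.maketrans lists the characters to delete.
DELETE_PUNCTUATION = str.maketrans('', '', string.punctuation)

def words_without_punctuation(content):
    """Strip ASCII punctuation, lowercase, and split on whitespace."""
    return content.translate(DELETE_PUNCTUATION).lower().split()

print(words_without_punctuation("Hello, Mr.Smith! How are you?"))
# ['hello', 'mrsmith', 'how', 'are', 'you']
```

On Python 3 the file would be opened in text mode (`'r'`, ideally with an explicit encoding) so that `content` is a `str`.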
Note that this turns Mr.Smith into mrsmith. If you want it to become ['mr', 'smith'] instead, you can replace every punctuation character with a space and then use str.split:
    import string

    def using_translate(content):
        # Map every punctuation character to a space (Python 2 string.maketrans).
        table = string.maketrans(
            string.punctuation, ' ' * len(string.punctuation))
        content = content.translate(table).lower()
        words = content.split()
        return words
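`string.maketrans` no longer exists on Python 3. A rough equivalent, assuming Python 3 and text (`str`) input, uses the `str.maketrans` static method. The name `using_translate_py3` is hypothetical.

```python
import string

# Map each punctuation character to a space; both arguments must have
# the same length, so the spaces are repeated to match.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, ' ' * len(string.punctuation))

def using_translate_py3(content):
    return content.translate(PUNCTUATION_TO_SPACE).lower().split()

print(using_translate_py3("Good morning, Mr.Smith."))
# ['good', 'morning', 'mr', 'smith']
```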
One problem with a positive regex pattern such as [a-z]+ is that it only matches ASCII characters. If the file contained accented letters, those words would be split apart: Gruyère would become ['Gruy', 're'].
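The split is easy to reproduce. This small sketch (Python 3) uses `[A-Za-z]+` so that the capital G is kept; the function name `ascii_words` and the sample text are illustrative only.

```python
import re

def ascii_words(text):
    # [A-Za-z]+ matches runs of unaccented ASCII letters only.
    return re.findall(r"[A-Za-z]+", text)

# \u00e8 is the precomposed letter "e with grave accent".
print(ascii_words("Gruy\u00e8re cheese"))
# ['Gruy', 're', 'cheese']
```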
You can fix this by using re.split to split on punctuation instead. For instance,
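    import re
    import string

    def using_re(content):
        # re.escape keeps characters such as ] and \ literal inside the class.
        words = re.split(r"[ %s\t\n]+" % (re.escape(string.punctuation),), content.lower())
        return words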
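As a check that splitting on separators leaves accented words intact, here is a hedged Python 3 sketch. It differs slightly from the function above: it uses `\s` for whitespace and filters out the empty strings that `re.split` returns when the text starts or ends with a separator. The name `split_on_punctuation` is made up.

```python
import re
import string

# Split on runs of whitespace or ASCII punctuation; re.escape keeps
# characters such as ] and \ literal inside the character class.
SEPARATORS = re.compile(r"[\s%s]+" % re.escape(string.punctuation))

def split_on_punctuation(content):
    # Drop the '' entries produced by leading or trailing separators.
    return [w for w in SEPARATORS.split(content.lower()) if w]

print(split_on_punctuation("Gruy\u00e8re, please. Thanks, Mr.Smith!"))
# ['gruyère', 'please', 'thanks', 'mr', 'smith']
```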
However, using str.translate is faster:
    In [72]: %timeit using_re(content)
    100000 loops, best of 3: 9.97 us per loop

    In [73]: %timeit using_translate(content)
    100000 loops, best of 3: 3.05 us per loop
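Outside IPython, a comparison along these lines can be run with the standard `timeit` module. This is only a sketch under assumptions: it uses Python 3 variants of the two functions (with a precompiled pattern and table), and the sample string is made up, so absolute numbers will not match the figures above and will vary by machine and interpreter.

```python
import re
import string
import timeit

TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
PATTERN = re.compile(r"[\s%s]+" % re.escape(string.punctuation))

def using_translate(content):
    return content.translate(TABLE).lower().split()

def using_re(content):
    # Filter out '' produced by leading or trailing separators.
    return [w for w in PATTERN.split(content.lower()) if w]

sample = "Good morning, Mr.Smith. Is the screenplay ready?"  # made-up input

for func in (using_translate, using_re):
    seconds = timeit.timeit(lambda: func(sample), number=100000)
    print("%s: %.3f s for 100000 calls" % (func.__name__, seconds))
```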