In addition to nltk.corpus.words, which @salvadordali highlighted:
    >>> from nltk.corpus import words
    >>> print words.readme()
    Wordlists
    en: English, http://en.wikipedia.org/wiki/Words_(Unix)
    en-basic: 850 English words: CK Ogden in The ABC of Basic English (1932)
    >>> print words.words()[:10]
    [u'A', u'a', u'aa', u'aal', u'aalii', u'aam', u'Aani', u'aardvark', u'aardwolf', u'Aaron']
Note that nltk.corpus.words is a list of words without frequencies, so it is not really a corpus of natural text.
The nltk.corpus package contains various corpora, some of which are English corpora; see http://www.nltk.org/nltk_data/ for details. For instance, nltk.corpus.brown:
    >>> from nltk.corpus import brown
    >>> brown.words()[:10]
    [u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']
To get a word list from a corpus of natural text:
    >>> wordlist = set(brown.words())
    >>> print len(wordlist)
    56057
    >>> wordlist_lowercased = set(i.lower() for i in brown.words())
    >>> print len(wordlist_lowercased)
    49815
Note that brown.words() contains words in both lower and upper case, as natural text does, which is why the lowercased set is smaller.
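The effect of lowercasing on the size of the word list can be sketched without downloading any corpus. This is a minimal illustration in Python 3 syntax (the sessions above are Python 2); `sample_tokens` and `build_wordlists` are hypothetical names, and the token list is made up to stand in for brown.words():

```python
# Hypothetical sample tokens; in practice this could be brown.words().
sample_tokens = ["The", "jury", "said", "the", "Jury", "said", "Friday"]


def build_wordlists(tokens):
    """Return (case-sensitive, lowercased) sets of unique tokens."""
    case_sensitive = set(tokens)
    lowercased = {token.lower() for token in tokens}
    return case_sensitive, lowercased


wordlist, wordlist_lowercased = build_wordlists(sample_tokens)
print(len(wordlist), len(wordlist_lowercased))  # prints: 6 4
```

Whether to lowercase depends on the use case: it merges "The" and "the", but it also merges proper nouns with common words.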
In most cases, a word list is not very useful without frequencies, so you can use FreqDist to count how often each word occurs:
    >>> from nltk import FreqDist
    >>> from nltk.corpus import brown
    >>> frequency_list = FreqDist(i.lower() for i in brown.words())
    >>> frequency_list.most_common()[:10]
    [(u'the', 69971), (u',', 58334), (u'.', 49346), (u'of', 36412), (u'and', 28853), (u'to', 26158), (u'a', 23195), (u'in', 21337), (u'that', 10594), (u'is', 10109)]
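The output above includes punctuation tokens such as ',' and '.'. If only words are wanted, one option is to filter them out before counting. The sketch below uses collections.Counter from the standard library as a stand-in for FreqDist (in NLTK 3, FreqDist is built on Counter, so the same generator can be passed to either); `sample_tokens` and `word_frequencies` are hypothetical names, and the filter rule (keep tokens with at least one alphabetic character) is an assumption, not something NLTK prescribes:

```python
from collections import Counter

# Hypothetical sample tokens; in practice this could be brown.words().
sample_tokens = ["The", "jury", "said", ",", "the", "Jury", "said", "the", "Friday", "."]


def word_frequencies(tokens):
    """Count lowercased tokens, skipping tokens with no alphabetic characters."""
    return Counter(
        token.lower()
        for token in tokens
        if any(ch.isalpha() for ch in token)
    )


frequency_list = word_frequencies(sample_tokens)
print(frequency_list.most_common(1))  # prints: [('the', 3)]
```

Passing the count directly, as in most_common(10), gives the same result as slicing most_common()[:10].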
See http://www.nltk.org/book/ch01.html for more information on how to access and process corpora in NLTK.