Does NLTK have a list of English words?

Is there a way to get a list of English words from the Python NLTK library? I looked, but the only thing I found was wordnet from nltk.corpus. Based on the documentation, though, it does not do what I need (it finds synonyms for a word).

I know how to build a list of these words myself (this answer describes it in detail), so I am wondering whether I can do this using only the NLTK library.

2 answers

Yes, from nltk.corpus import words

And check using:

    >>> "fine" in words.words()
    True

Ref: Section 4.1 (Wordlist Corpora), Chapter 2 of Natural Language Processing with Python.
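If many words need to be checked, it may help to build a set once instead of testing against words.words() each time, since `in` on a list-like sequence scans it from the start. The following is a minimal sketch under that assumption; `build_vocabulary`, `is_known_word`, and `sample_tokens` are hypothetical names, and the short sample list only stands in for words.words() so the snippet runs without the corpus installed.

```python
def build_vocabulary(tokens):
    """Return a frozenset of lowercased tokens for fast membership tests."""
    return frozenset(token.lower() for token in tokens)


def is_known_word(word, vocabulary):
    """Case-insensitive lookup against a vocabulary built above."""
    return word.lower() in vocabulary


# Hypothetical stand-in data. With NLTK and its "words" corpus available,
# one would pass words.words() (from nltk.corpus import words) instead.
sample_tokens = ["A", "a", "aardvark", "Aaron", "fine"]
vocabulary = build_vocabulary(sample_tokens)

print(is_known_word("Fine", vocabulary))
print(is_known_word("qwertyish", vocabulary))
```

Lowercasing merges entries such as 'A' and 'a', so this is only appropriate when case does not matter for the check.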


In addition to nltk.corpus.words, which @salvadordali pointed out:

    >>> from nltk.corpus import words
    >>> print(words.readme())
    Wordlists
    en: English, http://en.wikipedia.org/wiki/Words_(Unix)
    en-basic: 850 English words: CK Ogden in The ABC of Basic English (1932)
    >>> print(words.words()[:10])
    [u'A', u'a', u'aa', u'aal', u'aalii', u'aam', u'Aani', u'aardvark', u'aardwolf', u'Aaron']

Note that nltk.corpus.words is a plain list of words without frequencies, so it is not exactly a corpus of natural text.

The corpus package contains various corpora, some of which are English corpora; see http://www.nltk.org/nltk_data/. For instance, nltk.corpus.brown:

    >>> from nltk.corpus import brown
    >>> brown.words()[:10]
    [u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']

To get a list of words from a natural text corpus:

    >>> wordlist = set(brown.words())
    >>> print(len(wordlist))
    56057
    >>> wordlist_lowercased = set(i.lower() for i in brown.words())
    >>> print(len(wordlist_lowercased))
    49815

Note that brown.words() contains words in both lower and upper case, as natural text does.

In most cases, a word list is not very useful without frequencies, so you can use FreqDist :

    >>> from nltk import FreqDist
    >>> from nltk.corpus import brown
    >>> frequency_list = FreqDist(i.lower() for i in brown.words())
    >>> frequency_list.most_common()[:10]
    [(u'the', 69971), (u',', 58334), (u'.', 49346), (u'of', 36412), (u'and', 28853), (u'to', 26158), (u'a', 23195), (u'in', 21337), (u'that', 10594), (u'is', 10109)]
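The top entries above include punctuation tokens such as ',' and '.'. If only words are wanted, one option is to filter tokens before counting. The sketch below uses collections.Counter from the standard library (in NLTK 3, FreqDist is built on Counter, so most_common behaves the same way); `word_frequencies` and `sample_tokens` are hypothetical, and the tiny token list merely stands in for brown.words().

```python
from collections import Counter


def word_frequencies(tokens):
    """Count lowercased, purely alphabetic tokens.

    str.isalpha() drops punctuation and numbers, but it also drops
    hyphenated tokens such as "well-known", which may not be desired.
    """
    return Counter(token.lower() for token in tokens if token.isalpha())


# Hypothetical stand-in data; with NLTK installed one would pass
# brown.words() (from nltk.corpus import brown) instead.
sample_tokens = ["The", "jury", "said", ",", "the", "jury", ".", "The", "end", "."]
frequencies = word_frequencies(sample_tokens)

print(frequencies.most_common(2))
```

The same generator expression can be passed to FreqDist directly if the NLTK-specific methods are needed.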

See http://www.nltk.org/book/ch01.html for more information on how to access and process corpora in NLTK.


Source: https://habr.com/ru/post/982023/

