How to use book functions (e.g. concordance) in NLTK?

I am working through this wonderful tutorial.

I downloaded a collection called book:

 >>> import nltk
 >>> nltk.download()

and imported texts:

 >>> from nltk.book import *
 *** Introductory Examples for the NLTK Book ***
 Loading text1, ..., text9 and sent1, ..., sent9
 Type the name of the text or sentence to view it.
 Type: 'texts()' or 'sents()' to list the materials.
 text1: Moby Dick by Herman Melville 1851
 text2: Sense and Sensibility by Jane Austen 1811

Then I can run commands on these texts:

 >>> text1.concordance("monstrous") 

How can I run these NLTK commands on my own dataset? Are these collections the same as a book object in Python?

2 answers

You are right that it is rather difficult to find documentation for the book.py module, so we have to get our hands dirty and look at the code (see here). Judging from book.py, doing the concordance and all the other fancy stuff from the book module takes two steps:

First, you need to put your raw texts into an NLTK corpus class; see Creating a new corpus with NLTK for more details.

Second, you read the corpus words into the NLTK Text class. Then you can use the functions that you see at http://nltk.org/book/ch01.html

 import os
 from nltk.corpus import PlaintextCorpusReader
 from nltk.text import Text

 # For example, I create two example text files
 text1 = '''
 This is a story about a foo bar. Foo likes to go to the bar and his last name is also bar. At home, he kept a lot of gold chocolate bars.
 '''
 text2 = '''
 One day, foo went to the bar in his neighborhood and was shot down by a sheep, a blah blah black sheep.
 '''

 # Creating the corpus
 corpusdir = './mycorpus/'
 os.makedirs(corpusdir, exist_ok=True)  # the directory must exist before writing
 with open(corpusdir + 'text1.txt', 'w') as fout:
     fout.write(text1)
 with open(corpusdir + 'text2.txt', 'w') as fout:
     fout.write(text2)

 # Read the example corpus into the NLTK corpus class.
 mycorpus = PlaintextCorpusReader(corpusdir, '.*')

 # Read the NLTK corpus into the NLTK Text class,
 # where the book-like concordance search is available
 mytext = Text(mycorpus.words())
 mytext.concordance('foo')

NOTE: You can use other NLTK CorpusReaders and even specify custom paragraph/sentence/word tokenizers and an encoding, but for now we will stick to the defaults.
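As a rough illustration of that note, here is a minimal sketch that passes a custom word tokenizer and an explicit encoding to PlaintextCorpusReader. The temporary directory, the file name sample.txt, the sample sentence, and the choice of RegexpTokenizer(r"\w+") are assumptions made for the demo, not part of the original answer.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text
from nltk.tokenize import RegexpTokenizer

# Hypothetical corpus directory with one small file, created only for this demo
corpusdir = tempfile.mkdtemp()
with open(os.path.join(corpusdir, "sample.txt"), "w", encoding="utf-8") as fout:
    fout.write("Foo likes the bar. Foo's last name is also Bar!")

# Keep only runs of word characters, so punctuation never becomes a token
word_only = RegexpTokenizer(r"\w+")

mycorpus = PlaintextCorpusReader(
    corpusdir,
    r".*\.txt",                # only pick up .txt files
    word_tokenizer=word_only,  # replaces the default word tokenizer
    encoding="utf-8",
)

words = list(mycorpus.words())
mytext = Text(words)
mytext.concordance("bar")
```

With this tokenizer the punctuation marks are not part of the token list, which may or may not be what you want for a concordance; the default tokenizer keeps them as separate tokens.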


From the Text Analysis with NLTK Cheatsheet at blogs.princeton.edu: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf

Work with your own texts:

Open the file for reading

 file = open('myfile.txt') 

Before starting Python, make sure you are in the right directory - or provide a complete path specification.

Read the file:

 t = file.read() 

Tokenize the text:

 tokens = nltk.word_tokenize(t) 

Convert to NLTK text object:

 text = nltk.Text(tokens) 
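Put together, the steps above might look like the sketch below. The helper name load_text, the temporary sample file, and its contents are hypothetical. It also swaps in wordpunct_tokenize for nltk.word_tokenize, on the assumption that you may not have downloaded the tokenizer models that word_tokenize relies on; use word_tokenize if you have them.

```python
import os
import tempfile

import nltk
from nltk.tokenize import wordpunct_tokenize


def load_text(path, encoding="utf-8"):
    """Read a plain-text file and wrap its tokens in an nltk.Text object."""
    # The with-block closes the file even if reading fails
    with open(path, encoding=encoding) as f:
        raw = f.read()
    # Regex-based tokenizer: splits into word and punctuation tokens
    tokens = wordpunct_tokenize(raw)
    return nltk.Text(tokens)


# Hypothetical sample file, written only so the sketch runs end to end
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, "myfile.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("Foo went to the bar. The bar was closed, so foo went home.")

text = load_text(sample_path)
text.concordance("bar")
```

The resulting text object supports the same methods as text1 from nltk.book, for example text.count("bar").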

Source: https://habr.com/ru/post/949785/

