Extract all nouns from a text file using nltk

Is there a more efficient way to do this? My code reads a text file and extracts all nouns.

    import nltk

    File = open(fileName)  # open the file
    lines = File.read()  # read all lines
    sentences = nltk.sent_tokenize(lines)  # tokenize into sentences
    nouns = []  # empty list to hold all nouns
    for sentence in sentences:
        for word, pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
            if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
                nouns.append(word)

How can I reduce the time complexity of this code? Is there a way to avoid the nested loops?

Thanks in advance!

+5
4 answers

If you are open to options other than NLTK, check out TextBlob. It extracts all nouns and noun phrases easily:

    >>> from textblob import TextBlob
    >>> txt = """Natural language processing (NLP) is a field of computer science,
    ... artificial intelligence, and computational linguistics concerned with the
    ... interactions between computers and human (natural) languages."""
    >>> blob = TextBlob(txt)
    >>> print(blob.noun_phrases)
    [u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']
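If you also want the individual nouns and not just the phrases, TextBlob exposes (word, POS) pairs through its tags property. A minimal sketch (the exact tags depend on the default tagger):

    from textblob import TextBlob

    # Collect single-word nouns from TextBlob's (word, POS) pairs.
    blob = TextBlob("Natural language processing is a field of computer science.")
    nouns = [word for word, pos in blob.tags if pos.startswith('NN')]
    print(nouns)  # e.g. ['language', 'processing', 'field', 'computer', 'science']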
+7
    import nltk

    lines = 'lines is some string of words'

    # function to test if something is a noun
    is_noun = lambda pos: pos[:2] == 'NN'

    # do the nlp stuff
    tokenized = nltk.word_tokenize(lines)
    nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

    print(nouns)
    # ['lines', 'string', 'words']

Useful tip: building a list with a list comprehension is often faster than appending items to a list with .insert() or .append() inside a for loop.
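For a rough illustration, a hypothetical micro-benchmark; absolute numbers will vary with machine and Python version, but the comprehension usually wins:

    import timeit

    # Build the same 1000-element list two ways and time each.
    loop = "xs = []\nfor i in range(1000):\n    xs.append(i)"
    comp = "xs = [i for i in range(1000)]"

    print(timeit.timeit(loop, number=10000))  # typically the slower of the two
    print(timeit.timeit(comp, number=10000))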

+5

There is no redundancy in your code: you read the file once, and you visit each sentence and each tagged word exactly once. However you write the code (for example, with comprehensions), you will only hide the nested loops, not skip any of the processing.

The only potential improvement is in space complexity: instead of reading the whole file at once, you could read it in chunks. But since you need to process whole sentences at a time, it is not as simple as reading and processing one line at a time; so I wouldn't bother unless your files are whole gigabytes long; for short files it will make no difference.
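If you ever did need it, a minimal sketch of chunked processing, under the assumption (mine, not guaranteed in general) that sentences never cross a blank-line paragraph boundary:

    import nltk

    def nouns_in(text):
        # Tag one chunk of text and yield its nouns.
        for sentence in nltk.sent_tokenize(text):
            for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence)):
                if pos.startswith('NN'):
                    yield word

    def iter_nouns(path):
        # Read the file paragraph by paragraph instead of all at once.
        with open(path) as f:
            paragraph = []
            for line in f:
                if line.strip():
                    paragraph.append(line)
                elif paragraph:
                    yield from nouns_in(' '.join(paragraph))
                    paragraph = []
            if paragraph:  # trailing paragraph with no final blank line
                yield from nouns_in(' '.join(paragraph))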

In short, your loops are fine. There are a couple of things in the code you could clean up (for example, the if clause that matches the POS tags), but it won't change anything complexity-wise.
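For instance, the chained comparisons could become a set-membership test; a small standalone sketch (purely for readability, not speed):

    # Sample pos_tag-style output, used here so the snippet runs standalone.
    tagged = [('lines', 'NNS'), ('is', 'VBZ'), ('string', 'NN')]

    NOUN_TAGS = {'NN', 'NNP', 'NNS', 'NNPS'}
    nouns = [word for word, pos in tagged if pos in NOUN_TAGS]
    print(nouns)  # ['lines', 'string']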

+2

I am not an NLP expert, but I think you are already pretty close, and there is probably no way to do better than the time complexity of these outer loops here.

Recent versions of NLTK have a built-in function, nltk.tag.pos_tag_sents, that does what you are doing manually, and it returns a list of lists of tagged words.
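A minimal sketch of the same noun extraction using it (the iteration over every word still happens, just batched into one call):

    import nltk

    text = "NLTK is a toolkit. It tags words in sentences."
    # Tokenize into sentences, then into words, and tag everything at once.
    sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    tagged_sents = nltk.tag.pos_tag_sents(sentences)

    nouns = [word for sent in tagged_sents
                  for word, pos in sent if pos.startswith('NN')]
    print(nouns)  # e.g. ['NLTK', 'toolkit', 'words', 'sentences']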

+1

Source: https://habr.com/ru/post/1235429/

