NLTK: convert a tokenized sentence to synset format

I am looking to get the similarities between one word and each word in a sentence using NLTK.

NLTK can compute the similarity between two specific words, as shown below. This method requires a specific reference to the word, in this case "dog.n.01", where dog is a noun and we want its first (01) sense in WordNet.

    from nltk.corpus import wordnet

    dog = wordnet.synset('dog.n.01')
    cat = wordnet.synset('cat.n.01')
    print(dog.path_similarity(cat))
    >> 0.2

The problem is that I need the part-of-speech information for each word in the sentence. NLTK can tag each word in a sentence with its part of speech, as shown below. However, these part-of-speech tags ("NN", "VB", "PRP", ...) do not correspond to the format that synset accepts as a parameter.

    from nltk import pos_tag, word_tokenize

    text = word_tokenize("They refuse to permit us to obtain the refuse permit")
    pos_tag(text)
    >> [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

Is it possible to get synset-formatted data from pos_tag() in NLTK? By synset format, I mean a format like dog.n.01.

2 answers

You can use a simple conversion function:

    from nltk.corpus import wordnet as wn

    def penn_to_wn(tag):
        if tag.startswith('J'):
            return wn.ADJ
        elif tag.startswith('N'):
            return wn.NOUN
        elif tag.startswith('R'):
            return wn.ADV
        elif tag.startswith('V'):
            return wn.VERB
        return None

After tagging a sentence, you can use this function to map each word in the sentence to a synset. Here is an example:

    from nltk.stem import WordNetLemmatizer
    from nltk import pos_tag, word_tokenize

    sentence = "I am going to buy some gifts"
    tagged = pos_tag(word_tokenize(sentence))

    synsets = []
    lemmatzr = WordNetLemmatizer()

    for token in tagged:
        wn_tag = penn_to_wn(token[1])
        if not wn_tag:
            continue  # no WordNet counterpart (pronouns, determiners, ...)

        lemma = lemmatzr.lemmatize(token[0], pos=wn_tag)
        candidates = wn.synsets(lemma, pos=wn_tag)
        if candidates:  # some lemmas have no synset for this part of speech
            synsets.append(candidates[0])  # take the first listed sense

    print(synsets)

Result: [Synset('be.v.01'), Synset('travel.v.01'), Synset('buy.v.01'), Synset('gift.n.01')]
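
If what is needed is literally a name string such as dog.n.01, the same first-letter rule can be applied without touching WordNet at all. The sketch below is only an illustration and not part of NLTK: synset_name is a hypothetical helper, it assumes the word has already been lemmatized, and it assumes the first sense (01) is the one wanted. The letters 'a', 'n', 'r' and 'v' are the values of wn.ADJ, wn.NOUN, wn.ADV and wn.VERB.

```python
# Values of wn.ADJ, wn.NOUN, wn.ADV and wn.VERB, keyed by the first
# letter of the Penn Treebank tag (the same rule as penn_to_wn).
PENN_PREFIX_TO_WN = {"J": "a", "N": "n", "R": "r", "V": "v"}


def synset_name(lemma, penn_tag, sense=1):
    """Hypothetical helper: build a name such as 'dog.n.01'.

    Returns None when the tag has no WordNet counterpart.
    Assumes `lemma` is already lemmatized.
    """
    wn_pos = PENN_PREFIX_TO_WN.get(penn_tag[:1])
    if wn_pos is None:
        return None
    return "{}.{}.{:02d}".format(lemma.lower(), wn_pos, sense)


tagged = [("They", "PRP"), ("refuse", "VBP"), ("the", "DT"), ("permit", "NN")]
names = [synset_name(word, tag) for word, tag in tagged]
print(names)  # [None, 'refuse.v.01', None, 'permit.n.01']
```

Whether a generated name actually resolves depends on WordNet: wn.synset() raises an error for a lemma or sense number it does not contain, so treat the result as a candidate name.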


You can use the alternative form, wordnet.synsets, which takes the part of speech as a separate argument and returns a list of matching synsets:

    wordnet.synsets('dog', pos=wordnet.NOUN)

You still need to translate the tags returned by pos_tag into those supported by WordNet. Unfortunately, I don't know of a built-in dictionary that does this, so (unless I'm missing such a mapping table) you will need to create your own (you can do this once and pickle it for later reloading).

See http://www.nltk.org/book/ch05.html, section 1, for how to get help on a specific tagset. For example, nltk.help.upenn_tagset('N.*') confirms that the UPenn tagset (which I believe is the default for pos_tag) uses 'N' followed by something to identify variations of what WordNet treats as wordnet.NOUN.
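
A hand-built table of the kind described above could be as small as the sketch below. It is an illustration under assumptions: the names PENN_TO_WN and to_wordnet_pos are made up here, the POS letters are the values of wordnet.ADJ, wordnet.NOUN, wordnet.ADV and wordnet.VERB, and pickle is just one way to store the table for reuse.

```python
import pickle

# Hypothetical hand-built table: Penn Treebank tag prefix -> WordNet POS letter.
PENN_TO_WN = {
    "JJ": "a",  # JJ, JJR, JJS -> wordnet.ADJ
    "NN": "n",  # NN, NNS, NNP, NNPS -> wordnet.NOUN
    "RB": "r",  # RB, RBR, RBS -> wordnet.ADV
    "VB": "v",  # VB, VBD, VBG, VBN, VBP, VBZ -> wordnet.VERB
}


def to_wordnet_pos(penn_tag, table=PENN_TO_WN):
    """Return the WordNet POS letter for a Penn tag, or None if unmapped."""
    return table.get(penn_tag[:2])


# Build once, serialize, and reload later (round-tripped in memory here).
restored = pickle.loads(pickle.dumps(PENN_TO_WN))

tagged = [("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
          ("the", "DT"), ("refuse", "NN"), ("permit", "NN")]
for word, tag in tagged:
    print(word, tag, to_wordnet_pos(tag, restored))
```

Keying on two letters rather than one is a design choice: it keeps RP (particle) out of the adverb class, which a first-letter rule on 'R' would include.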

I have not tried http://www.nltk.org/_modules/nltk/tag/mapping.html, but it might be what you need, so give it a try.
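
As far as I can tell, that module converts treebank tags to the coarse universal tagset (NOUN, VERB, ADJ, ADV, ...), and pos_tag accepts tagset='universal' to return such tags directly. Under that assumption, the remaining step to WordNet is a four-entry lookup. In the sketch below the tagged input is written by hand rather than produced by NLTK, and UNIVERSAL_TO_WN and wordnet_pos_pairs are illustrative names.

```python
# Universal tag -> WordNet POS letter (the values of wordnet.NOUN, etc.).
UNIVERSAL_TO_WN = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def wordnet_pos_pairs(universal_tagged):
    """Keep only the words whose universal tag has a WordNet POS."""
    return [
        (word, UNIVERSAL_TO_WN[tag])
        for word, tag in universal_tagged
        if tag in UNIVERSAL_TO_WN
    ]


# Hand-written input in the shape of pos_tag(tokens, tagset='universal').
universal_tagged = [("They", "PRON"), ("refuse", "VERB"), ("to", "PRT"),
                    ("permit", "VERB"), ("the", "DET"), ("refuse", "NOUN"),
                    ("permit", "NOUN")]
print(wordnet_pos_pairs(universal_tagged))
# [('refuse', 'v'), ('permit', 'v'), ('refuse', 'n'), ('permit', 'n')]
```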


Source: https://habr.com/ru/post/980005/

