The easiest way to achieve this is to compare the similarity of the corresponding word embeddings (the most common implementation of this is Word2Vec).
Word2Vec is a way of representing the semantic meaning of a token in a vector space, which allows you to compare word meanings without requiring a large dictionary / thesaurus such as WordNet.
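For example, here is a minimal sketch of such a comparison using the gensim library (an assumption here; any Word2Vec implementation works, and the toy corpus and parameters are only illustrative):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; real use needs far more text for meaningful vectors.
sentences = [
    ["apples", "and", "oranges", "are", "fruit"],
    ["bananas", "and", "apples", "grow", "on", "trees"],
    ["oranges", "are", "citrus", "fruit"],
]

# vector_size/epochs are the gensim 4.x parameter names.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Cosine similarity between the two word vectors, in [-1, 1].
print(model.wv.similarity("apples", "oranges"))
```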
One of the problems with regular Word2Vec implementations is that they do not distinguish between different senses of the same word. For example, the word "bank" will have the same Word2Vec representation in all of these sentences:
- The riverbank was dry.
- The bank lent me money.
- The plane banked to the left.
The word "bank" has the same vector in each of these cases, even though you might want the different senses to be treated differently, as the sketch below illustrates.
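To make the limitation concrete, here is a small sketch under the same gensim assumption as above: a standard Word2Vec vocabulary holds exactly one entry per surface form, so every occurrence of "bank" maps to a single vector.

```python
from gensim.models import Word2Vec

# Toy corpus containing "bank" in two different senses.
sentences = [
    ["the", "bank", "lent", "me", "money"],
    ["we", "sat", "on", "the", "bank", "of", "the", "river"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# The vocabulary has exactly one entry for "bank", so both senses share it.
print(model.wv.key_to_index["bank"])  # a single index, whatever the context
print(model.wv["bank"][:5])           # the one and only "bank" vector
```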
One way to solve this problem is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and possibly other features) of the token, allowing you to distinguish between different senses of a word.
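As a hedged sketch of what this looks like with the standalone sense2vec package from Explosion (an assumption; the pretrained vector archive path and its exact keys are placeholders for vectors you would download separately):

```python
from sense2vec import Sense2Vec

# Load pretrained sense2vec vectors from disk; the path is a placeholder.
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

# Keys combine the surface form with a sense tag (typically the part of
# speech), so different senses get separate entries and separate vectors.
noun_key = "bank|NOUN"
if noun_key in s2v:
    # Nearest neighbours of the NOUN sense of "bank".
    print(s2v.most_similar(noun_key, n=5))
```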
A great Python library for this is spaCy. It is similar to NLTK, but much faster because it is written in Cython (20 times faster for tokenization and 400 times faster for tagging). It also has built-in Sense2Vec embeddings, so you can accomplish your similarity task without needing other libraries.
It's as simple as:

```python
import spacy

# Load an English model that includes word vectors; the original snippet
# used spacy.load('en'), but current spaCy versions use named models such
# as 'en_core_web_md' for similarity.
nlp = spacy.load("en_core_web_md")

apples, and_, oranges = nlp("apples and oranges")
print(apples.similarity(oranges))
```
It is free and has a liberal license!