The easiest way to achieve this is to compare the similarity of the corresponding word embeddings (the most common implementation of this is Word2Vec).
Word2Vec is a way of representing the semantic meaning of a token in a vector space, which allows you to compare word meanings without requiring a large dictionary / thesaurus such as WordNet.
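For example, here is a minimal sketch of such a comparison using the gensim library (an assumption here; any Word2Vec implementation works, and the toy corpus and parameters are only illustrative):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; real use needs far more text for meaningful vectors.
sentences = [
    ["apples", "and", "oranges", "are", "fruit"],
    ["bananas", "and", "apples", "grow", "on", "trees"],
    ["oranges", "are", "citrus", "fruit"],
]

# vector_size/epochs are the gensim 4.x parameter names.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Cosine similarity between the two word vectors, in [-1, 1].
print(model.wv.similarity("apples", "oranges"))
```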
One of the problems with regular Word2Vec implementations is that they do not distinguish between different senses of the same word. For example, the word "bank" will have the same Word2Vec representation in all of these sentences:
- The riverbank was dry.
- The bank lent me money.
- The plane banked to the left.
The word "bank" has the same vector in each of these cases, even though you might want the different senses to be treated differently, as the sketch below illustrates.
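To make the limitation concrete, here is a small sketch under the same gensim assumption as above: a standard Word2Vec vocabulary holds exactly one entry per surface form, so every occurrence of "bank" maps to a single vector.

```python
from gensim.models import Word2Vec

# Toy corpus containing "bank" in two different senses.
sentences = [
    ["the", "bank", "lent", "me", "money"],
    ["we", "sat", "on", "the", "bank", "of", "the", "river"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# The vocabulary has exactly one entry for "bank", so both senses share it.
print(model.wv.key_to_index["bank"])  # a single index, whatever the context
print(model.wv["bank"][:5])           # the one and only "bank" vector
```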
One way to solve this problem is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and possibly other features) of the token, allowing you to distinguish between different senses of a word.
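As a hedged sketch of what this looks like with the standalone sense2vec package from Explosion (an assumption; the pretrained vector archive path and its exact keys are placeholders for vectors you would download separately):

```python
from sense2vec import Sense2Vec

# Load pretrained sense2vec vectors from disk; the path is a placeholder.
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

# Keys combine the surface form with a sense tag (typically the part of
# speech), so different senses get separate entries and separate vectors.
noun_key = "bank|NOUN"
if noun_key in s2v:
    # Nearest neighbours of the NOUN sense of "bank".
    print(s2v.most_similar(noun_key, n=5))
```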
A great Python library for this is spaCy. It is similar to NLTK, but much faster because it is written in Cython (20 times faster for tokenization and 400 times faster for tagging). It also has built-in Sense2Vec embeddings, so you can accomplish your similarity task without needing other libraries.
It's as simple as:

```python
import spacy

# Load an English model that includes word vectors; the original snippet
# used spacy.load('en'), but current spaCy versions use named models such
# as 'en_core_web_md' for similarity.
nlp = spacy.load("en_core_web_md")

apples, and_, oranges = nlp("apples and oranges")
print(apples.similarity(oranges))
```
It is free and has a liberal license!