Calculate the cosine similarity given 2 sentence strings

From Python: tf-idf-cosine: to find document similarity, it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are there any ways to calculate the cosine similarity between two strings?

s1 = "This is a foo bar sentence ." s2 = "This sentence is similar to a foo bar sentence ." s3 = "What is this string ? Totally not related to the other two lines ." cosine_sim(s1, s2) # Should give high cosine similarity cosine_sim(s1, s3) # Shouldn't give high cosine similarity value cosine_sim(s2, s3) # Shouldn't give high cosine similarity value 
+66
python string nlp similarity cosine-similarity
Mar 02 '13 at 10:06
3 answers

A simple pure-Python implementation would be as follows:

import re
import math
from collections import Counter

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

text1 = 'This is a foo bar sentence .'
text2 = 'This sentence is similar to a foo bar sentence .'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)

print('Cosine:', cosine)

Prints:

 Cosine: 0.861640436855 

The cosine formula used here is described here.

This does not include tf-idf weighting of the words, but to use tf-idf you need a reasonably large corpus from which to estimate the tf-idf weights.
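As a rough sketch of how tf-idf weighting could be added on top of the code above (reusing its WORD pattern and get_cosine; the tiny three-sentence "corpus" and the smoothed idf formula below are illustrative assumptions, not part of the original code):

def tf_idf_vector(text, corpus_texts):
    # Weight each term's raw count (tf) by a smoothed inverse document frequency (idf).
    counts = Counter(WORD.findall(text))
    n_docs = len(corpus_texts)
    vec = {}
    for term, tf in counts.items():
        df = sum(1 for doc in corpus_texts if term in WORD.findall(doc))
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # one common smoothing variant
        vec[term] = tf * idf
    return vec

# Tiny stand-in corpus; meaningful tf-idf weights need far more text than this.
corpus = ['This is a foo bar sentence .',
          'This sentence is similar to a foo bar sentence .',
          'What is this string ? Totally not related to the other two lines .']

v1 = tf_idf_vector(corpus[0], corpus)
v2 = tf_idf_vector(corpus[1], corpus)
print('Weighted cosine:', get_cosine(v1, v2))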

You can also develop it further by using a more sophisticated way of extracting words from a piece of text, stemming or lemmatizing them, and so on.
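For example, a crude sketch of such normalisation (the lowercase-and-strip-suffixes rule below is purely illustrative; a real system would use a proper stemmer or lemmatizer, e.g. from NLTK):

def normalise(word):
    # Very crude normalisation: lowercase and strip a few common suffixes.
    word = word.lower()
    for suffix in ('ing', 'ly', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def text_to_vector_normalised(text):
    return Counter(normalise(w) for w in WORD.findall(text))

print(text_to_vector_normalised('The dogs were barking loudly'))
# Counter({'the': 1, 'dog': 1, 'were': 1, 'bark': 1, 'loud': 1})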

+151
Mar 02 '13 at 12:40

The short answer is "no, this cannot be done in a principled way that works even remotely well". This is an unsolved problem in natural language processing research and also, as it happens, the subject of my doctoral work. I will very briefly summarize where we stand and point you to a few publications:

The meaning of words

The most important assumption here is that it is possible to obtain a vector that represents each word in the sentence in question. This vector is usually chosen to capture the contexts in which the word can appear. For example, if we only consider the three contexts "eat", "red" and "fluffy", the word "cat" might be represented as [98, 1, 87], because if you read a very long piece of text (a few billion words is not uncommon by today's standards), the word "cat" would appear very often in the context of "fluffy" and "eat", but not so often in the context of "red". In the same way, "dog" might be represented as [87, 2, 34] and "umbrella" as [1, 13, 0]. Picturing these vectors as points in three-dimensional space, "cat" is clearly closer to "dog" than to "umbrella", so "cat" also means something more similar to "dog" than to "umbrella".
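As a minimal sketch using the toy context counts above (the numbers are the illustrative ones from this paragraph, not real corpus statistics):

import math

def cosine(u, v):
    # Cosine similarity between two equal-length context-count vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Counts in the contexts ("eat", "red", "fluffy") from the paragraph above.
cat, dog, umbrella = [98, 1, 87], [87, 2, 34], [1, 13, 0]

print(cosine(cat, dog))       # ~0.94: "cat" and "dog" appear in similar contexts
print(cosine(cat, umbrella))  # ~0.06: "umbrella" does not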

This line of work has been investigated since the early 90s (for example, this work by Grefenstette) and has yielded some surprisingly good results. For example, here are a few random entries from a thesaurus I built recently by having my computer read Wikipedia:

theory -> analysis, concept, approach, idea, method
voice -> vocal, tone, sound, melody, singing
james -> william, john, thomas, robert, george, charles

These lists of similar words were obtained entirely without human intervention - you feed in the text and come back a few hours later.

The problem with phrases

You may ask why we are not doing the same thing for longer phrases, such as "foxes love fruit". The reason is that we do not have enough text. In order to reliably establish what X is similar to, we need to see many examples of X being used in context. When X is a single word such as "voice", this is not too hard. However, as X gets longer, the chances of finding natural occurrences of X get exponentially smaller. For comparison, Google has about 1B pages containing the word "fox" and not a single page containing "foxes love fruit", despite the fact that it is a perfectly valid English sentence and we all understand what it means.

Composition

To get around the data sparsity problem we want to perform composition, i.e. take the vectors for words, which are easy to obtain from real text, and combine them in a way that captures their meaning. The bad news is that, so far, nobody has managed to do this well.

The simplest and most obvious approach is to add or multiply the individual word vectors together. This has the undesirable side effect that "cats chase dogs" and "dogs chase cats" mean the same thing to your system. Also, if you are multiplying, you have to be extra careful or every sentence will end up represented as [0, 0, 0, ..., 0], which defeats the point.
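As a minimal sketch of that behaviour, with made-up three-dimensional word vectors (purely illustrative): addition gives the same result for either word order, and elementwise multiplication collapses to zeros as soon as the words' contexts do not overlap.

# Made-up 3-dimensional word vectors, purely for illustration.
vectors = {
    'cats': [2.0, 0.0, 1.0],
    'chase': [0.0, 3.0, 0.0],
    'dogs': [1.0, 0.0, 2.0],
}

def compose_add(words):
    # Sum the word vectors componentwise.
    return [sum(vals) for vals in zip(*(vectors[w] for w in words))]

def compose_mul(words):
    # Multiply the word vectors componentwise.
    out = [1.0, 1.0, 1.0]
    for w in words:
        out = [a * b for a, b in zip(out, vectors[w])]
    return out

print(compose_add(['cats', 'chase', 'dogs']))  # [3.0, 3.0, 3.0]
print(compose_add(['dogs', 'chase', 'cats']))  # identical: addition ignores word order
print(compose_mul(['cats', 'chase', 'dogs']))  # [0.0, 0.0, 0.0] -- everything cancels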

Further reading

I will not discuss the more sophisticated composition methods that have been proposed so far. I suggest you read Katrin Erk's "Vector space models of word meaning and phrase meaning: a survey". This is a very good high-level review to get you started. Unfortunately it is not freely available on the publisher's website; write to the author directly to get a copy. In that paper you will find references to many more specific methods. The more accessible ones are Mitchell and Lapata (2008) and Baroni and Zamparelli (2010).




Edit after the comment by @vpekar: The point of this answer is to emphasize that, although naive methods exist (e.g. addition, multiplication, surface similarity, etc.), they are fundamentally flawed and in general you should not expect great performance from them.

+51
Mar 02 '13 at 11:15

Thanks @vpekar for your implementation. It helped a lot. I just found that it misses the tf-idf weight while calculating the cosine similarity. Counter(words) returns a dictionary which maps each word to its number of occurrences.

cos(q, d) = sim(q, d) = (q · d) / (|q| |d|) = sum(qi * di, i = 1 to |V|) / (sqrt(sum(qi^2)) * sqrt(sum(di^2)))

  • qi is the tf-idf weight of term i in the query.
  • di is the tf-idf weight of term i in the document.
  • |q| and |d| are the lengths of q and d.
  • This is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

Please feel free to view my code here. But first you will need to download the Anaconda package. It will automatically set the Python path for you on Windows. Then add that Python interpreter to Eclipse.

+1
Feb 13 '15 at 22:56


