Calculate Euclidean distance between two vectors (bag of words) in Python

I use a dictionary to represent the word counts of an article.

For example, {"name": 2, "your": 10, "me": 20} represents that "name" appears twice, "your" appears 10 times, and "me" appears 20 times.

So, is there a good way to calculate the Euclidean distance between such vectors? The difficulty is that the vectors have different lengths: some contain certain words and others do not.

I know that I can write a long function for this; I am just looking for a simpler and smarter way. Thanks.

Edit: The goal is to get the similarities between the two articles and group them

2 answers

Something like:

 math.sqrt(sum((a[k] - b[k])**2 for k in a.keys())) 

Here a and b are dictionaries with the same keys. If you are going to compare these distances across different pairs of vectors, you must make sure that every vector is defined over exactly the same set of words; otherwise the distances are not comparable and the measure means nothing.

You can calculate the distance based only on the intersection:

 math.sqrt(sum((a[k] - b[k])**2 for k in set(a.keys()).intersection(set(b.keys())))) 

Another option is to use the union and treat missing words as having a count of 0:

 math.sqrt(sum((a.get(k, 0) - b.get(k, 0))**2 for k in set(a.keys()).union(set(b.keys())))) 

But you should think carefully about what the quantity you are computing actually means.
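To make the difference between the two variants concrete, here is a minimal sketch that wraps both in one function. The name euclidean_distance, the shared_only flag, and the sample word counts are all made up for illustration; they are not part of the question.

```python
import math

def euclidean_distance(a, b, shared_only=False):
    """Euclidean distance between two word-count dictionaries.

    shared_only=True  -> use only words present in both dicts (intersection).
    shared_only=False -> use every word from either dict, counting a
                         missing word as 0 (union).
    """
    keys = set(a) & set(b) if shared_only else set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

# Hypothetical word counts for two articles
doc1 = {"name": 2, "your": 10, "me": 20}
doc2 = {"name": 5, "your": 6, "cat": 12}

print(euclidean_distance(doc1, doc2))                    # union
print(euclidean_distance(doc1, doc2, shared_only=True))  # intersection
```

With these sample counts the intersection variant gives 5.0, while the union variant gives roughly 23.85, because "me" and "cat" each appear in only one article and dominate the sum. Which of the two is appropriate depends on whether a word missing from an article should count as evidence of dissimilarity.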


You can also use the cosine similarity between two vectors, as in this link: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/cos.html
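As a rough sketch of what that could look like for the same word-count dictionaries (the function name cosine_similarity and the choice to return 0.0 for an empty article are assumptions made here, not something taken from the linked page):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two word-count dictionaries.

    Words missing from one dictionary are treated as having count 0,
    so only words present in both contribute to the dot product.
    """
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an article with no words
    return dot / (norm_a * norm_b)
```

Unlike Euclidean distance, this value does not change when all counts in one article are multiplied by a constant, so two articles with the same word proportions but different lengths are treated as identical.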


Source: https://habr.com/ru/post/1482374/

