Database Supporting Fast Nearest Neighbor Queries

Is there a database that supports fast approximate nearest neighbor queries in multidimensional vector spaces?

I am looking for a database that matches the following use case:

  • Works on millions of points.
  • It works in hundreds of thousands of sizes.
  • Potentially uses spanning trees or location-sensitive hashes to index

Is there a reliable implementation of this method?

+8
source share
2 answers

There is an ANN library that works well for large dimensional arrays of data, but it is not a complete "database" and is not a distributed solution.

This is where the SpaceCurve script runs (without me) working on a commercial spatial database, so depending on your needs and budget, they can be useful.

As a tip: you should think deeply about what “closest neighbor” means when you talk about “hundreds of thousands of dimensions”. If you take a million random points in a 20-dimensional cube, the average distance between any two nearest neighbors is already about half the length of the edge of the cube.

It only worsens exponentially as measurements are added. When you talk about hundreds of dimensions, you really need incredibly large numbers of points (for example,> 10 30 ) if they are somewhat evenly distributed; and if they are distributed differently, you're better off with other classification approaches.

+3
source

You can look at Facebook Faiss.

From the documentation:

Faiss is a library for efficiently looking for similarities and clustering of dense vectors. It contains search algorithms in sets of vectors of any size, up to those that may not fit in RAM

Please note that this applies only to L2 (Euclidean) distances and point products.

Link to the project - Faiss

0
source

Source: https://habr.com/ru/post/953883/


All Articles