Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working on a chemistry / biology project. We are building a web application for quickly matching a user's experimental data against predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples, each containing a floating point value between 0.0 and 20.0 and an integer value between 1 and 18. For example: (7.2394, 2), (7.4011, 1), (9.9367, 3), etc. The user enters a similar list of tuples, and the web application should then return the 50 best matching database entries.

One thing is important: the search algorithm must tolerate discrepancies between the query data and the reference data, since both may contain small errors in the float values (NOT in the integer values). (The query data may contain errors because it comes from a real experiment, and the reference data because it is the result of a prediction.)
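
To make the requirement concrete, here is a minimal sketch of one possible scoring rule. It is only an illustration: the function name `match_score`, the tolerance `tol`, and the idea of simply counting matched tuples are all assumptions, not something we have settled on.

```python
def match_score(query, reference, tol=0.05):
    """Count query tuples that have a counterpart in the reference entry.

    A counterpart has exactly the same integer and a float within `tol`
    (hypothetical tolerance). Each reference tuple is used at most once.
    """
    unused = list(reference)
    score = 0
    for q_val, q_int in query:
        # Candidates: same integer, float within tolerance, not yet used.
        candidates = [
            (abs(q_val - r_val), idx)
            for idx, (r_val, r_int) in enumerate(unused)
            if r_int == q_int and abs(q_val - r_val) <= tol
        ]
        if candidates:
            _, best = min(candidates)  # closest float value wins
            unused.pop(best)
            score += 1
    return score


query = [(7.24, 2), (7.40, 1), (9.94, 3)]
reference = [(7.2394, 2), (7.4011, 1), (9.9367, 3)]
print(match_score(query, reference))
```

Scoring every entry this way is a linear scan over the database, which is exactly the cost we would like to reduce.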

Edit: moved text to an answer.

How can we efficiently rank 1 query against 1 million records?

+3
5 answers

1 ; , .

, , - , . , , ; , , . .

, ... , ( , , )?

, , - , , 1 . , Python . , . Python, .

import random

# 1,000,000 random (float, int) records for demonstration
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7, 9)
# obviously, you can use whatever distance expression you want;
# taking abs() of each term keeps opposite-sign errors from cancelling out
zz = [(abs(7 - x) + abs(9 - y), x, y) for x, y in r]
zz.sort()
# return the 50 best matches
best = [(x, y) for a, x, y in zz[:50]]
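
Since only the 50 best matches are needed, sorting the whole list does more work than necessary. A minimal sketch of the same search using `heapq.nsmallest` from the standard library follows; the `distance` function, the target `(7.0, 9)`, and the smaller record count are assumptions chosen for illustration.

```python
import heapq
import random

random.seed(0)  # fixed seed so the sketch is repeatable
records = [(random.uniform(0, 20), random.randint(1, 18)) for _ in range(100_000)]


def distance(record, target=(7.0, 9)):
    """Sum of absolute differences to a hypothetical target tuple."""
    x, y = record
    return abs(target[0] - x) + abs(target[1] - y)


# Keeps only the 50 smallest distances while scanning, instead of sorting everything.
best_50 = heapq.nsmallest(50, records, key=distance)
```

This is still a full scan of the records, but it avoids building and sorting a decorated copy of the list.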
+2

? , , . , . , . , , .

log (n)

+1

"" x-y , , / ( ).

.

0

, , , - float. . 0,1, 0,2, 0,3 0,4. , binning 50 200 , 0 18, 0 , . . . , . , , .
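
One way to read the binning idea is as an inverted index: each (integer, float bin) pair points to the records that contain it, and a query looks up its own bin plus the two neighbouring bins so that small float errors near a bin edge are not missed. The sketch below is an assumption about how this could look; `BIN_WIDTH`, `build_index`, and `candidates` are hypothetical names, and the bin width would need tuning to the real error size.

```python
from collections import defaultdict

BIN_WIDTH = 0.1  # assumed bin size; should be at least the expected float error


def bin_keys(value, label, width=BIN_WIDTH):
    """Return the key of the value's own bin plus both neighbouring bins."""
    b = int(value // width)
    return [(label, b - 1), (label, b), (label, b + 1)]


def build_index(database):
    """Map (integer label, bin number) -> set of record ids."""
    index = defaultdict(set)
    for rec_id, record in enumerate(database):
        for value, label in record:
            index[(label, int(value // BIN_WIDTH))].add(rec_id)
    return index


def candidates(index, query):
    """Count, per record, how many query tuples hit a matching or adjacent bin."""
    hits = defaultdict(int)
    for value, label in query:
        seen = set()
        for key in bin_keys(value, label):
            seen |= index.get(key, set())
        for rec_id in seen:
            hits[rec_id] += 1
    # Records with the most hits come first.
    return sorted(hits.items(), key=lambda kv: -kv[1])


database = [
    [(7.2394, 2), (7.4011, 1), (9.9367, 3)],
    [(1.5, 4), (3.25, 2)],
]
index = build_index(database)
print(candidates(index, [(7.24, 2), (7.40, 1)]))
```

The records with the highest hit counts could then be rescored with an exact distance, so the expensive comparison runs on a short candidate list rather than on all million entries.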

( ) , , , float. 1. , . .

- . . , (PCA),

0

Source: https://habr.com/ru/post/1733877/

