Need help (search algorithm)

I need help with this problem:

As input, I have a line that looks like Blue cat green eyes 2342342 , or maybe Cat blue eyes green 23242 or any other permutation of words.

In my DB table, I have some data. One of the columns is called, say, a keyword.

Here is an example of this table:

enter image description here

My task is to find the entry in my column of the database table, KEYWORDS, which matches some words from the input line.

For example: for the lines โ€œ Blue cat green eyes 2342342โ€ โ€œ Blue cats eyes green 23242โ€ and โ€œ Cat 23242 eyes blue greenโ€ the result should be โ€œblue catโ€ (the first row of my table). The only way to imagine how to solve this problem is as follows:

  • sequentially execute each word from a string.
  • Search for this every word with %like% in the table column.
  • If it is not found, this means that this word is not the key, and we are not interested in it.
  • If it is found once - excellent! Undoubtedly, this is what we are looking for.
  • If there are several results:
  • Of all the words in a string that have not yet been tested, each word is sequentially accepted.
  • Find this word with %like% in the results of step 2.
  • etc...

The graphical diagram of this algorithm is here

But it seems that this algorithm will work very slowly if there are a lot of entries in the table, and if my input line consists of a large number of words.

So my question is: Are there any special algorithms that can help solve this problem?

+4
source share
2 answers

You can use another table, for example

 ID KeywordID Word 1 1 blue 2 2 blue 3 1 cat 

and convert the string

 "Blue cat green eyes 2342342" 

in a series of indices and readings:

 SELECT KeywordID, COUNT(*) FROM ancillary WHERE Word IN ('blue','cat','green','eyes'...) 

This will perform a series of exact matches and return, say,

 KeywordID Count 1 2 2 1 

Then you know that a group of keywords with an identifier of 1 has two words, which means that a counter of 2 matches all of them. So the keyword 1 is executed. Group 2 also has two words (black, cat), but only one is found, and the match is there, but not completed.

If you also write down the size of the set of keywords along with the keyword identifier, then all the keywords from the same identifier will have the same key format, and you can also it:

 KeywordID KeywordSize Count 1 2 2 2 2 1 

and maybe even SELECT COUNT(*)/KeywordSize AS match ... ORDER BY match and have matching keywords by relevance.

Of course, if you have a KeywordID, you can find it in the keyword table.

Implementation

You want to add the black angry cat keyword list to your existing spreadsheet.

So, you blow this list of keywords into words: and get โ€œblackโ€, โ€œangryโ€ and โ€œcatโ€.

To insert a list of keywords usually in an existing table and get an identifier for this newly created row, suppose it is 1701.

Now you insert the words into a new table, which we call the "auxiliary". This table contains only the keyword row identifier of your main table, a single word and the size of the word list from which this word comes.

We know that we insert only 3 words for row 1701 of the table, so size = 3 and insert these tuples:

 (1701, 3, 'black') (1701, 3, 'cat') (1701, 3, 'angry') 

(They will receive a unique identifier, but this does not concern us).

Now after a while we will receive the offer, which is

 'Schroedinger cat is black and angry' 

We could first run a query against a list of null words that need to be removed, such as "is" and "and". But this is optional.

Then we could run as many queries as there are words, and thereby discover that not a single line contained Schroedinger anywhere, and we can drop it. But this is also not necessary.

Finally, we create a real query for the helper:

 SELECT KeywordID, COUNT(*) AS total, ListSize*100/COUNT(*) AS match FROM ancillary WHERE Word IN ('Schroedinger','cat','is','black','and','angry') GROUP BY KeywordID; 

WHERE will return, say, the following lines:

 (1234, 'black') -- from 'black cat' (1234, 'cat') -- from 'black cat' (1423, 'angry') -- from 'angry birds' (1701, 'cat') -- from 'black angry cat' (1701, 'angry') -- from 'black angry cat' (1701, 'black') -- from 'black angry cat' (1999, 'cat') -- from 'nice white cat' 

So GROUP will return the KeywordID these lines with its power:

 1423 1 50% 1701 3 100% 1234 2 100% 1999 1 33% 

Now you can sort descending relationships by coincidence, and then sort by descending list size (since a 100% of 3 words match is better than a 100% of 2 matches, and a 1 in 2 match is better than a 2 in 3 match):

 1701 3 100% -- our best match 1234 2 100% -- second runner 1423 1 50% 1999 1 33% 

You can also get the first table in a single query with an added matching ratio:

 SELECT mytable.*, total, match FROM mytable JOIN ( SELECT KeywordID, COUNT(*) AS total, ListSize*100/COUNT(*) AS match FROM ancillary WHERE Word IN ('Schroedinger','cat','is','black','and','angry') GROUP BY KeywordID ) AS ancil ON (mytable.KeywordID = ancil.KeywordID) ORDER BY match DESC, total DESC; 

The biggest cost is the exact match in the "helper", which should be indexed in the Word column.

+4
source

You can see the full-text search engine, for example, sphinx: http://sphinxsearch.com/

Or, otherwise, make a stored procedure by dividing the search string into keywords using the specified separator and look at the charindex of each keyword in the DB column (depending on your db control system).

+1
source

Source: https://habr.com/ru/post/1442291/


All Articles