Autocompletion Algorithm?

I mean the algorithm that is used to provide suggestions when a user enters a search query on Google.

I'm mainly interested in how Google's algorithm can: 1. Show the most important results (the most likely queries, not everything that matches) 2. Match substrings 3. Handle fuzzy matches

I know that you could use a trie or generalized trie to find matches, but that would not meet the above requirements ...

Related questions asked earlier here.

+45
algorithm data-structures autocomplete scalability autosuggest
May 25 '10 at 3:36
8 answers

For (heh) awesome fuzzy/partial string algorithms, check out Damn Cool Algorithms:

These do not replace tries, but rather avoid brute-force lookups in tries, which is still a huge win. Then you probably want to bound the size of the trie:

  • keep a trie of the last/top N words used globally;
  • for each user, keep a trie of the last/top N words for that user.

Finally, you want to avoid lookups whenever possible ...

  • Cache lookup results: if the user clicks through on any search result, you can serve it very quickly and then asynchronously fetch the full partial/fuzzy lookup.
  • Precompute search results: if the user has typed "appl", they are likely to continue to "apple" or "apply".
  • Prefetch data: for example, a web app can send a smaller set of results to the browser, small enough to make brute-force search in JS viable.
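The "precompute search results" idea above can be sketched as an offline job that maps every prefix of the popular queries to its most frequent completions, so the online path is a single dictionary lookup. This is a minimal illustration; all names and counts are made up:

```python
from collections import defaultdict

def build_suggestion_cache(query_counts, max_suggestions=3):
    """Map every prefix to its most popular full queries."""
    cache = defaultdict(list)
    # Sort once by descending popularity, then fill each prefix bucket.
    for query, _count in sorted(query_counts.items(), key=lambda kv: -kv[1]):
        for i in range(1, len(query) + 1):
            bucket = cache[query[:i]]
            if len(bucket) < max_suggestions:
                bucket.append(query)
    return dict(cache)

cache = build_suggestion_cache({"apple": 100, "apply": 60, "applet": 5, "banana": 40})
print(cache["appl"])   # most popular completions first
```

A real system would build this offline over query logs and cap the prefix depth; the structure itself is just a hash map from prefix to a short ranked list.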
+46
Jul 15 '11 at 16:24

I would just like to say... A good solution to this problem will incorporate more than a ternary search tree. You need ngrams and shingles (phrases). Errors at word boundaries also need to be detected: "hell o" should be "hello", and "whitesocks" should be "white socks"; these are preprocessing steps. If you do not preprocess the data correctly, you will not get valuable search results. Ternary search trees are a useful component for determining what a word is, and also for guessing adjacent words when the entered word is not a valid word in the index.

The Google algorithm performs phrase suggestion and correction. It also has some concept of context... If the first word you search for is weather-related and you join words together, as in "weatherforcst" and "monsoonfrcst" versus "deskfrcst", my guess is that behind the scenes the suggestion ranking shifts based on the first word encountered: "forecast" and "weather" are related words, so "forecast" gets a high rank under the did-you-mean guess.

Word-partial matching (ngrams), phrase terms (shingles), word proximity (word-clustering index), ternary search trees (word lookup).
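The word-boundary preprocessing described above ("whitesocks" should become "white socks") can be sketched as a dictionary-driven segmentation. The vocabulary here is a toy stand-in for a real word index:

```python
VOCAB = {"white", "socks", "sock", "hello"}

def segment(text, vocab=VOCAB):
    """Return a list of vocabulary words covering `text`, or None if impossible."""
    if not text:
        return []
    for i in range(len(text), 0, -1):       # prefer the longest matching prefix
        head, tail = text[:i], text[i:]
        if head in vocab:
            rest = segment(tail, vocab)
            if rest is not None:
                return [head] + rest
    return None

print(segment("whitesocks"))   # ['white', 'socks']
```

The "hell o" to "hello" case is the mirror image: strip the space and check whether the joined token is in the vocabulary. A production system would rank candidate segmentations by word frequency rather than taking the first greedy match.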

+4
Jan 26 '12 at 21:00

There are tools like Soundex and Levenshtein distance that you can use to find fuzzy matches that fall within a certain range.

Soundex finds words that sound similar, and Levenshtein distance finds words that are within a certain edit distance of another word.
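As a rough illustration of the two tools mentioned, here is a simplified Soundex (it ignores the classic h/w adjacency rule of the full specification) alongside a standard dynamic-programming Levenshtein distance:

```python
# Consonant groups of classic Soundex: b/f/p/v -> 1, c/g/j/k/q/s/x/z -> 2, etc.
CODES = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(word):
    """Simplified Soundex code: first letter plus up to three digits."""
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = CODES.get(word[0])
    for c in word[1:]:
        code = CODES.get(c)
        if code and code != prev:       # skip vowels and collapse repeats
            digits.append(str(code))
        prev = code
    return (first + "".join(digits) + "000")[:4]

def levenshtein(a, b):
    """Edit distance with a rolling single-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(soundex("Smith"), soundex("Smyth"))   # S530 S530
print(levenshtein("kitten", "sitting"))     # 3
```

For fuzzy autocomplete you would typically keep candidates whose distance to the typed string falls under a small threshold.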

+3
May 25 '10 at 3:39 a.m.

Check out the Firefox Awesome bar algorithm

Google Suggest is useful because it takes into account millions of popular queries plus your own past related queries.

It does not have a good completion algorithm / UI, though:

  • It doesn't do substrings.
  • It appears to be a fairly simple word-boundary prefix algorithm.
    For example: typing tomcat tut correctly suggests "tomcat tutorial". Now try tomcat rial (no suggestions) :-(
  • It doesn't support "did you mean?" as in Google search results.
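The word-boundary prefix behavior described above can be approximated in a few lines: each typed token must prefix-match a word of the suggestion, in order. This is a guess at the observed behavior, not Firefox's actual code:

```python
def word_prefix_match(typed, suggestion):
    """True if every typed token prefix-matches a suggestion word, in order."""
    words = suggestion.lower().split()
    for token in typed.lower().split():
        # Skip suggestion words until one starts with this token.
        while words and not words[0].startswith(token):
            words.pop(0)
        if not words:
            return False
        words.pop(0)
    return True

print(word_prefix_match("tomcat tut", "tomcat tutorial"))   # True
print(word_prefix_match("tomcat rial", "tomcat tutorial"))  # False
```

This is exactly why "tomcat rial" yields nothing: "rial" is not a prefix of any word, only a substring of "tutorial".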
+3
Jul 18 '10 at 21:09

Google's exact algorithm is unknown, but it is said to work by statistical analysis of user input. That approach is not suitable for most cases. More commonly, autocompletion is implemented using one of the following:

  • Trees. By indexing the searchable text in a tree structure (prefix tree, suffix tree, DAWG, etc.), you can execute very fast searches at the cost of memory. The tree traversal can be adapted for approximate matching.
  • Pattern partitioning. By partitioning the text into tokens (ngrams), you can search for occurrences of a pattern using a simple hashing scheme.
  • Filtering. Find a set of potential matches, and then apply a sequential algorithm to check each candidate.

See completely, a Java autocomplete library that implements some of the concepts above.
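A hedged sketch of combining "pattern partitioning" with "filtering": index character trigrams in a hash map, collect candidates that share a trigram with the query, then verify each candidate; here the verification is a plain substring check:

```python
from collections import defaultdict

def ngrams(s, n=3):
    """Set of character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(entries, n=3):
    """Hash each trigram to the set of entries containing it."""
    index = defaultdict(set)
    for entry in entries:
        for g in ngrams(entry, n):
            index[g].add(entry)
    return index

def search(query, index, n=3):
    # Gather candidates sharing any trigram, then filter precisely.
    candidates = set().union(*(index.get(g, set()) for g in ngrams(query, n)))
    return sorted(c for c in candidates if query in c)

idx = build_index(["tomcat tutorial", "tomcat manual", "apache tomcat"])
print(search("tomcat", idx))
```

Note that unlike a prefix trie, this finds the query anywhere in the entry, which covers the substring-matching requirement from the question.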

+2
Jul 01 '13 at 9:31

For substrings and fuzzy matches, the Levenshtein distance algorithm has worked quite well for me, though I admit it does not seem as polished as industry implementations of autocomplete/suggest. I think both Google and Microsoft's IntelliSense do a better job because they have refined the basic algorithm to weight the edit operations required to match dissimilar strings. For example, transposing two characters should probably count as only 1 operation, not 2 (an insert plus a delete).

Even so, I believe it is close enough. Here is an implementation in C# ...

    // This is the traditional Levenshtein Distance algorithm, though I've tweaked it to make
    // it more like Google autocomplete/suggest. It returns the number of operations
    // (insert/delete/substitute) required to change one string into another, with the
    // expectation that userTyped is only a partial version of fullEntry.
    // Gives us a measurement of how similar the two strings are.
    public static int EditDistance(string userTyped, string fullEntry)
    {
        if (userTyped.Length == 0)     // all entries are assumed to be fully legit possibilities
            return 0;                  // at this point, because the user hasn't typed anything.

        var inx = fullEntry.IndexOf(userTyped[0]);
        if (inx < 0)                   // If the 1st character doesn't exist anywhere in the entry,
            return Int32.MaxValue;     // it's not a possible match.

        var lastInx = inx;
        var lastMatchCount = 0;
    TryAgain:
        // Is there a better starting point?
        var len = fullEntry.Length - inx;
        var matchCount = 1;
        var k = 1;
        for (; k < len; k++)
        {
            if (k == userTyped.Length || userTyped[k] != fullEntry[k + inx])
            {
                if (matchCount > lastMatchCount)
                {
                    lastMatchCount = matchCount;
                    lastInx = inx;
                }
                inx = fullEntry.IndexOf(userTyped[0], inx + 1);
                matchCount = 0;
                if (inx > 0)
                    goto TryAgain;
                else
                    break;
            }
            else
                matchCount++;
        }
        if (k == len && matchCount > lastMatchCount)
            lastInx = inx;
        if (lastInx > 0)
            fullEntry = fullEntry.Substring(lastInx); // Jump to 1st character match, ignoring previous values

        // The start of the Levenshtein Distance algorithm.
        var m = userTyped.Length;
        var n = Math.Min(m, fullEntry.Length);
        int[,] d = new int[m + 1, n + 1];  // "distance" - meaning number of operations.

        for (var i = 0; i <= m; i++)
            d[i, 0] = i;                   // the distance of any first string to an empty second string
        for (var j = 0; j <= n; j++)
            d[0, j] = j;                   // the distance of any second string to an empty first string

        for (var j = 1; j <= n; j++)
            for (var i = 1; i <= m; i++)
                if (userTyped[i - 1] == fullEntry[j - 1])
                    d[i, j] = d[i - 1, j - 1];            // no operation required
                else
                    d[i, j] = Math.Min(
                        d[i - 1, j] + 1,                  // a deletion
                        Math.Min(
                            d[i, j - 1] + 1,              // an insertion
                            d[i - 1, j - 1] + 1));        // a substitution

        return d[m, n];
    }
+1
Feb 21 '14 at 17:40

I think it would be better to build a specialized trie rather than pursue a completely different data structure.

I could see this functionality implemented in a trie in which each leaf has a field reflecting the search frequency of its corresponding word.

The search query method would then display the descendant leaf nodes with the highest values, calculated by multiplying the distance to each descendant leaf node by the search frequency associated with it.

The data structure (and therefore the algorithm) Google uses is probably vastly more complicated, potentially taking into account a large number of other factors, such as search frequencies from your own account (and the time of day... and the weather... and the season... and the phase of the moon... and...). However, I believe the basic trie data structure can be extended to any specialized search preference by including additional fields in each node and using those fields in the search query method.
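The frequency-weighted trie described above could look roughly like this minimal sketch, ranking suggestions under a prefix by raw frequency only (the distance weighting mentioned earlier is left out for brevity):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.freq = 0          # > 0 marks the end of an indexed word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, freq):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.freq = freq

    def suggest(self, prefix, limit=3):
        # Walk down to the prefix node, then collect and rank all words below it.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []

        def collect(n, word):
            if n.freq:
                results.append((n.freq, word))
            for ch, child in n.children.items():
                collect(child, word + ch)

        collect(node, prefix)
        return [w for _, w in sorted(results, reverse=True)[:limit]]

t = Trie()
for word, freq in [("apple", 100), ("apply", 60), ("applet", 5), ("ape", 40)]:
    t.insert(word, freq)
print(t.suggest("appl"))   # ['apple', 'apply', 'applet']
```

Extra fields per node (per-user counts, recency, and so on) slot naturally into the ranking key, which is the extension the answer is proposing.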

0
Jul 16 '10 at 13:25

If you are looking for an overall design for the problem, try reading the contents of https://www.interviewbit.com/problems/search-typeahead/ .

They start by building autocomplete through a naive trie-based approach and then build upon it. They also explain optimization techniques such as sampling and offline updates to suit specific use cases.

To keep the solution scalable, you have to shard your data intelligently.
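A toy illustration of that sharding point, assuming a stable hash routes each prefix lookup to one of a fixed number of shards (all names here are made up):

```python
import hashlib

def shard_for(prefix, num_shards=8):
    """Deterministically map a query prefix to a shard id in [0, num_shards)."""
    digest = hashlib.md5(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards
```

Because the mapping is deterministic, every front-end node routes the same prefix to the same shard, and each shard only has to hold the suggestion index for its own slice of the prefix space.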

0
Aug 25 '16 at 11:26


