Fuzzy text search: a regular expression lookup generator?

I am wondering if there is a way to do fuzzy string matching in PHP: looking for a word in a long string and finding a potential match even if it is misspelled, for example a word that is off by a single character because of an OCR error.

I was thinking a regex generator could do this. Given the input "crazy", it would generate this regular expression:

.*((crazy)|(.+razy)|(c.+azy)|(cr.+zy)|(cra.+y)|(craz.+)).*

Then it will return all matches for the word or variations of the word.

How to build the generator: I would probably split the search string/word into an array of characters and build the regular expression by looping over that array with foreach, each time replacing the character at the current key (the letter's position in the string) with ".+".

Is this a good way to do a fuzzy text search, or is there a better way? What about some string comparison that gives me a score based on how close the match is? I am trying to see whether some poorly converted OCR text contains a short word.

3 answers

You could try pspell:

$p = pspell_new("en");
print_r(pspell_suggest($p, "crazzy"));

http://www.php.net/manual/en/function.pspell-suggest.php
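
If a closeness score is preferred over dictionary suggestions, PHP's built-in levenshtein() may be enough for short words. The following is only a sketch: the helper name fuzzyFindWords and the default threshold of one edit are assumptions made for illustration, not something from the question. It splits the text into words and keeps those within the given edit distance of the search word.

```php
<?php
// Hypothetical helper: returns [word => edit distance] for every word in $text
// that is within $maxDistance edits (insert, delete, substitute) of $needle.
function fuzzyFindWords(string $needle, string $text, int $maxDistance = 1): array
{
    $matches = [];
    // Split on runs of non-word characters; digits stay inside words ("cr4zy").
    $words = preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Compare case-insensitively.
        $distance = levenshtein(strtolower($needle), strtolower($word));
        if ($distance <= $maxDistance) {
            $matches[$word] = $distance;
        }
    }
    asort($matches); // closest matches first
    return $matches;
}

print_r(fuzzyFindWords('crazy', 'The cr4zy fox was crzy, not lazy or crazy.'));
```

Here "cr4zy" (one substitution) and "crzy" (one deletion) are reported along with the exact match, while "lazy" is two edits away and is left out. Unlike a wildcard regex, this also catches dropped or extra characters, which OCR output often has.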

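echo generateRegex("crazy");

// Builds a pattern that matches $word exactly or with any single character substituted.
// Assumes $word contains no regex metacharacters; wrap the result in delimiters
// (e.g. "/" . $regex . "/") before passing it to preg_match().
function generateRegex($word)
{
  $len = strlen($word);
  $regex = "\b((".$word.")";
  for($i = 0; $i < $len; $i++)
  {
    $temp = $word;
    $temp[$i] = '.'; // wildcard at position $i
    $regex .= "|(".$temp.")";
  }
  $regex = $regex.")\b";
  return $regex;
}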
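
To show how such a pattern might be applied to OCR text, here is a self-contained sketch. The function name buildFuzzyPattern, the "/" delimiters, the "i" flag and the use of preg_quote() are assumptions added for illustration. Like the generator above, it only tolerates one substituted character, not a missing or extra one.

```php
<?php
// Hypothetical variant of the generator: one alternative per position,
// each with a single-character wildcard, wrapped in delimiters for preg_*().
function buildFuzzyPattern(string $word): string
{
    // Exact match first; preg_quote() escapes any regex metacharacters.
    $alternatives = [preg_quote($word, '/')];
    for ($i = 0, $len = strlen($word); $i < $len; $i++) {
        $alternatives[] = preg_quote(substr($word, 0, $i), '/')
            . '.'
            . preg_quote(substr($word, $i + 1), '/');
    }
    // \b keeps matches on word boundaries; "i" makes the match case-insensitive.
    return '/\b(?:' . implode('|', $alternatives) . ')\b/i';
}

$text = 'He drove a crozy car through a crazy town, then a lazy one.';
preg_match_all(buildFuzzyPattern('crazy'), $text, $found);
print_r($found[0]); // the matched words: "crozy" and "crazy"
```

"lazy" is not matched because every alternative is exactly five characters long.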

Source: https://habr.com/ru/post/1722662/