SQL query that can find Typos in Arabic

I want to make a dictionary, and I need a query that can find words with an error (I mean Typos / spelling error or typo) if the query could not find the exact word, and then try to get with other spellings ..

So what is Arabic:. In Arabic, there is a letter with almost the same pronunciation, but another letter, people sometimes do not know which one should be used. For example, there are 4 kinds of Z in Arabic “ز / ظ / ذ / ض” the pronunciation is different (a little), but people will forget which one is the correct spelling. example of one word with different use of the letter "z":

مریز / مریض / مریظ / مریذ

Right - مریض

Here are other worlds that have more than one last:

z: ض / ز / ذ / ظ T: ت / ط / S: ث / س / ص / Gh: ق / غ 

So what is your idea? How should the request be?

I want if the user looked for "مریز" instead of showing him a 404 error (not found), search the database with a different letter (all Z), and then return the result if I found something.

+5
source share
3 answers

In German, we have the same problem regarding t and tt or dt - especially in names.

One way to get closer to this is to keep an extra normalized column containing the name / word with a fixed conversion.

  tt -> t dt -> t ß -> s ss -> s 

So the table will contain

  WORD | NORMALIZED schmitt | schmit schmidt | schmit 

At the time of the query, apply the same transformations to the query, and then compare with the normalized column.

+5
source

There is a Levenshtein distance algorithm (there are others) that reports the editing distance between two lines.

You can learn from this, try to find the most similar words in the dictionary compared to your input.

You can later assign weight to replacements based on the letter tuples you mentioned to refine your search.

Actually there is an implementation for MySQL that you should definitely check: https://www.artfulsoftware.com/infotree/qrytip.php?id=552
Most levenshtein + mysql questions here in SO point to this page.

+4
source

A simpler solution would be to use regular expressions inside a similar operator. For letters that may be misspelled, you can save variations in the regular expression pattern. For letters matching the z pattern, this is “[زذظض]". You can replace all the letters ز, ذ, ظ, ض with a wildcard, and then query for a similar operator:

 select * from searched_table where word like "%[مرى[زذظض%" 

After you find all versions of the word you are looking for, you can either show everything to the user or calculate the distance levenshtein (koshinae answer) and show the closest words.

Edit: for letter Z only, the request will look below

 set @word = 'مرىض'; -- take this text from user set @word = replace(@word, 'ذ', 'Z'); set @word = replace(@word, 'ظ', 'Z'); set @word = replace(@word, 'ض', 'Z'); set @word = replace(@word, 'ز', 'Z'); set @word = replace(@word, 'Z', '[زظضذ]'); set @word = Concat('%', @word, '%'); select @word; select * from mydb.searchTable where word like @word; 
+2
source

Source: https://habr.com/ru/post/1240221/


All Articles