Search engine database encoding in multiple languages

I have a MySQL database in which I store over 100,000 keywords, each in several languages. For example, the table has three columns: [id] [turkish (utf8_turkish_ci)] [german (utf8)]

Users can enter a German or Turkish word in the search field. If the user enters a German word, everything works and the Turkish word is printed, but how do I handle Turkish input? I ask because each language has its own additional characters, such as ä, ü, ö, ş, etc.

So should I use

mb_convert_encoding

to convert the string? But then how do I check whether it is a German or a Turkish string? I think that will be difficult. Or is the encoding of the tables wrong?

I am stuck on how to implement this so that the user can enter a keyword in either language.

1 answer

You have several problems to solve to get this right.

First, you chose the utf8 character set to store all your text. That is a good choice. If this is a new-for-2016 application, you might choose the utf8mb4 character set instead. Once you have chosen a character set, your users should be able to read your text.

Second, for searching and sorting (WHERE and ORDER BY) you need to choose the appropriate collation for each language. For modern German, utf8_general_ci will work well enough. utf8_unicode_ci works a little better if you need standard lexical ordering. Read this: http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

For modern Spanish, you should use utf8_spanish_ci. This is because in Spanish the N and Ñ characters are not considered the same. I do not know whether the general collation works for Turkish.

Note that you seem to be confusing the concepts of character set and collation in your question. You mentioned a collation for your Turkish column and a character set for your German column.
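To see the two concepts side by side, you can ask MySQL which character set and which collation each column actually uses. This is only a sketch: the table name keywords is a hypothetical placeholder, since the question does not give the real one.

```sql
-- "keywords" is a hypothetical table name used for illustration.
-- The Collation column of the output shows each column's collation.
SHOW FULL COLUMNS FROM keywords;

-- The same information, with character set and collation as separate fields.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
  FROM information_schema.COLUMNS
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME = 'keywords';
```

A column declared only as utf8 still has a collation; it simply takes the default one for that character set.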

You can explicitly specify the character set and collation in queries. For example, you can write

  WHERE _utf8 'München' COLLATE utf8_unicode_ci = table.name; 

In this expression, _utf8 'München' is a character constant, and

  constant COLLATE utf8_unicode_ci = table.name 

is a query specifier that contains an explicit collation name. Read this: http://dev.mysql.com/doc/refman/5.7/en/charset-collate.html
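Put into a complete statement, the same idea might look like the sketch below. The table name keywords and the column names turkish and german are assumptions taken loosely from the question, and the columns are assumed to be stored as utf8.

```sql
-- Hypothetical table and column names; columns are assumed to use utf8.
-- The explicit COLLATE clause overrides the column's default collation
-- for this one comparison.
SELECT id, turkish, german
  FROM keywords
 WHERE german = _utf8 'München' COLLATE utf8_unicode_ci;

-- An explicit collation can also be used for sorting.
SELECT id, german
  FROM keywords
 ORDER BY german COLLATE utf8_unicode_ci;
```

Keep in mind that a collation named in the query may differ from the one the column's index was built with, in which case MySQL may not be able to use that index for the comparison.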

Third, you can assign a default collation to each column, appropriate to its language. Default collations are baked into indexes, so they help speed up your searches.
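As a rough sketch of per-column collations, the table could be declared along these lines. The table name, column sizes, index names, and the sample search word are all assumptions made for illustration, and the connection is assumed to use utf8 as well.

```sql
-- Hypothetical table definition; names and sizes are illustrative only.
CREATE TABLE keywords (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  turkish VARCHAR(191) CHARACTER SET utf8 COLLATE utf8_turkish_ci NOT NULL,
  german  VARCHAR(191) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  -- Each index is built with its column's default collation.
  INDEX idx_turkish (turkish),
  INDEX idx_german (german)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- One lookup that covers both languages. When a column is compared with a
-- plain string literal, the column's collation is the one that applies,
-- so the search term does not have to be classified as German or Turkish.
SELECT id, turkish, german
  FROM keywords
 WHERE turkish = 'şeker' OR german = 'şeker';
```

With this layout, the application passes the user's input unchanged to both comparisons and lets each column apply its own language rules.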

Fourth, your users will need to use an appropriate input method (keyboard mapping, etc.) to enter data into your application. Turkish-language users, we hope, know how to type Turkish words.


Source: https://habr.com/ru/post/1243133/

