Accent Insensitive Order in Sphinx

I use Sphinx with the Thinking Sphinx plugin to search my data, which is stored in MySQL.

My data contains accented characters ("á", "é", "ã"), and I want them to be treated as equivalent to their unaccented counterparts ("a", "e", "a" respectively) when searching and ordering.

I got searching to work using a charset table (pastie.org/204316): searching for "AGUA" returns "ÁGUA". Ordering the results, however, does not work properly. For example, when searching for "AGUA", "ÁGUA" comes after "MUITA ÁGUA", but I want it to be sorted as if it were written with "A" rather than "Á".

The only solution I can think of is to index a new column that contains no accented characters and sort by it, using REPLACE (http://dev.mysql.com/doc/refman/5.4/en/string-functions.html#function_replace) to strip the accents. But I would need one REPLACE call for every possible accented character (and there are many of them), so this does not seem like a very general workaround.

Does anyone know of any better way to deal with this problem?

Thanks!

+4
3 answers

Sphinx handles sorting by string fields by collecting all the values in a list, sorting that list, and then storing each row's position in the list as an integer attribute. According to the docs, the list is sorted at the byte level, and this is not currently configurable.

Ideally the strings should be sorted differently, depending on the encoding and locale. For instance, if the strings are known to be Russian text in KOI8R encoding, sorting the bytes 0xE0, 0xE1, and 0xE2 should produce 0xE1, 0xE2 and 0xE0, because in KOI8R value 0xE0 encodes a character that is (noticeably) after characters encoded by 0xE1 and 0xE2. Unfortunately, Sphinx does not support that currently and will simply sort strings bytewise.

- from http://www.sphinxsearch.com/docs/current.html
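To make the bytewise behaviour concrete, here is a small plain-Ruby sketch (no Sphinx involved) that reproduces the ordering from the question and shows how sorting on an unaccented key changes it. The helper names `strip_accents`, `bytewise_sort` and `accent_insensitive_sort` are made up for this illustration, and it assumes UTF-8 strings and Ruby 2.2+ for `String#unicode_normalize`.

```ruby
# Hypothetical helper: decompose accented characters (NFKD) and drop the
# combining marks, so that "Á" becomes "A".
def strip_accents(str)
  str.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "")
end

# Compares strings byte by byte, which is how the quoted docs describe
# Sphinx's string ordinals.
def bytewise_sort(titles)
  titles.sort_by { |t| t.bytes }
end

# Sorts on the unaccented copy instead, like a separate sort column would.
def accent_insensitive_sort(titles)
  titles.sort_by { |t| strip_accents(t) }
end

titles = ["MUITA ÁGUA", "ÁGUA", "AGUA FRIA"]

p bytewise_sort(titles)
# => ["AGUA FRIA", "MUITA ÁGUA", "ÁGUA"]   ("Á" starts with byte 0xC3, after every ASCII letter)

p accent_insensitive_sort(titles)
# => ["ÁGUA", "AGUA FRIA", "MUITA ÁGUA"]
```

In UTF-8, "Á" is encoded as the bytes 0xC3 0x81, so any title starting with it lands after all titles starting with a plain ASCII letter, which is the behaviour seen in the question.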

So, there is no easy way to achieve this within Sphinx itself. A variation on your REPLACE() idea would be to keep a separate column and populate it with a callback in your model. This lets you handle the replacement in Ruby instead of MySQL, which is arguably a better fit.

 # Save an unaccented copy of your title. Normalise method borrowed from
 # http://stackoverflow.com/questions/522715/removing-accents-diacritics-from-string-while-preserving-other-special-chars-tri
 class MyModel < ActiveRecord::Base
   before_validation :update_sort_col

   private

   # Decompose accented characters (:kd), then drop every non-ASCII byte,
   # which removes the combining accent marks.
   def update_sort_col
     # "self." is required; a bare "sort_col = ..." only sets a local variable.
     self.sort_col = title.to_s.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n, '').to_s
   end
 end
+3

You can also index a custom SQL expression, so that you don't even need a new column in your database:

 indexes "LOWER(title)", :as => :title, :sortable => true 

It's raw SQL, so you can call REPLACE() in it.
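Since the expression is just a string, the nested REPLACE() calls do not have to be written by hand. Below is a rough sketch of one way to generate them; `ACCENT_MAP` and `unaccent_sql` are hypothetical names invented for this example, and the mapping covers only three characters, so it would need to be extended to match the real data.

```ruby
# Hypothetical mapping; extend it to cover the characters in your data.
# MySQL's REPLACE() is case-sensitive, so "á" and "Á" need separate entries.
ACCENT_MAP = { "Á" => "A", "É" => "E", "Ã" => "A" }.freeze

# Wraps a column name in one REPLACE() call per mapped character.
def unaccent_sql(column, map = ACCENT_MAP)
  map.reduce(column) do |expr, (accented, plain)|
    "REPLACE(#{expr}, '#{accented}', '#{plain}')"
  end
end

puts unaccent_sql("title")
# prints: REPLACE(REPLACE(REPLACE(title, 'Á', 'A'), 'É', 'E'), 'Ã', 'A')
```

The resulting string could then be passed in place of "LOWER(title)" in the indexes call shown above. The question's concern still applies: the expression grows by one REPLACE() per character.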

+1

Just index a lowercased version of the field using the following syntax. This is a very simple and elegant way to get case-insensitive sorting with Sphinx.

 indexes title, as: :title, sortable: :insensitive 
0

Source: https://habr.com/ru/post/1286593/