PostgreSQL full-text search tokenizer

I am trying to configure full-text search for localized content (in particular, Russian), and I have hit a problem: neither the default configuration nor my custom one handles uppercase letters. Example:

 SELECT * from to_tsvector('test_russian', '     ');
 > '':1 '':4 '':6 '':3 '':5 '':2

'На' ('on') is a stop word and should be removed, but it is not even lowercased in the resulting vector. If I pass a lowercase string, everything works correctly:

 SELECT * from to_tsvector('test_russian', '     ');
 > '':4 '':6 '':3 '':5 '':2

Of course, I could just pass lowercase strings myself, but the manual says:

The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words.

The test_russian configuration is defined as follows:

 CREATE TEXT SEARCH CONFIGURATION test_russian (COPY = 'russian');

 CREATE TEXT SEARCH DICTIONARY russian_simple (
     TEMPLATE = pg_catalog.simple,
     STOPWORDS = russian
 );

 CREATE TEXT SEARCH DICTIONARY russian_snowball (
     TEMPLATE = snowball,
     Language = russian,
     StopWords = russian
 );

 ALTER TEXT SEARCH CONFIGURATION test_russian
     ALTER MAPPING FOR word
     WITH russian_simple, russian_snowball;

But I get exactly the same results with the built-in russian config.
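
As a side check, ts_lexize can probe the dictionary directly (a minimal sketch, assuming the russian_simple dictionary above and that 'на' appears in the Russian stop word file):

 -- A stop word is recognized and dropped, so this returns {}:
 SELECT ts_lexize('russian_simple', 'на');
 -- On my instance the uppercase form is not case-folded first,
 -- so it comes back as a non-empty lexeme array instead:
 SELECT ts_lexize('russian_simple', 'На');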

I tried ts_debug, and the tokens are processed as word, just as I expected.
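
The check I ran was roughly this ('На' is just an arbitrary sample token):

 SELECT alias, token, dictionaries, lexemes
 FROM ts_debug('test_russian', 'На');
 -- alias comes back as 'word', with
 -- dictionaries = {russian_simple,russian_snowball}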

Any ideas?

1 answer

The problem is resolved. The cause was that the database had been initialized with the default ('C') LC_CTYPE and LC_COLLATE, under which lower() does not fold Cyrillic characters, so the simple dictionary never saw the lowercased token. We used

 initdb --locale=ru_RU.UTF-8 --lc-collate=ru_RU.UTF-8 --encoding=UTF-8 -U pgsql *PGSQL DATA DIR*

to recreate the instance and

 CREATE DATABASE "scratch"
     WITH OWNER "postgres"
     ENCODING 'UTF8'
     LC_COLLATE = 'ru_RU.UTF-8'
     LC_CTYPE = 'ru_RU.UTF-8';

to restore the database, and the simple dictionary now works.
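
If you want to confirm the same diagnosis on your own instance, something along these lines should show it (a quick sketch using the pg_database catalog):

 -- An affected database reports 'C' here:
 SELECT datcollate, datctype
 FROM pg_database
 WHERE datname = current_database();

 -- lower() relies on LC_CTYPE; under 'C' it leaves Cyrillic untouched:
 SELECT lower('На');  -- 'на' with ru_RU.UTF-8, 'На' with C

Note that CREATE DATABASE can only specify a collation different from the template's if you copy from template0.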


Source: https://habr.com/ru/post/951281/

