PostgreSQL full-text search tokenizer

I am trying to configure full-text search for localized content (in particular, Russian), and I have hit a problem: neither the default configuration nor my custom one handles uppercase letters. Example:

 SELECT * from to_tsvector('test_russian', '     ');
 > '':1 '':4 '':6 '':3 '':5 '':2

'На' ('on') is a stop word and should be removed, but it is not even lowercased in the resulting vector. If I pass a lowercase string, everything works correctly:

 SELECT * from to_tsvector('test_russian', '     ');
 > '':4 '':6 '':3 '':5 '':2

Of course, I could just pass lowercase strings myself, but the manual says:

The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words.

The test_russian configuration is defined as follows:

 CREATE TEXT SEARCH CONFIGURATION test_russian (COPY = 'russian');

 CREATE TEXT SEARCH DICTIONARY russian_simple (
     TEMPLATE = pg_catalog.simple,
     STOPWORDS = russian
 );

 CREATE TEXT SEARCH DICTIONARY russian_snowball (
     TEMPLATE = snowball,
     Language = russian,
     StopWords = russian
 );

 ALTER TEXT SEARCH CONFIGURATION test_russian
     ALTER MAPPING FOR word
     WITH russian_simple, russian_snowball;

But I get exactly the same results with the built-in russian config.
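
As a side check, ts_lexize can probe the dictionary directly (a minimal sketch, assuming the russian_simple dictionary above and that 'на' appears in the Russian stop word file):

 -- A stop word is recognized and dropped, so this returns {}:
 SELECT ts_lexize('russian_simple', 'на');
 -- On my instance the uppercase form is not case-folded first,
 -- so it comes back as a non-empty lexeme array instead:
 SELECT ts_lexize('russian_simple', 'На');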

I tried ts_debug, and the tokens are processed as word, just as I expected.
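
The check I ran was roughly this ('На' is just an arbitrary sample token):

 SELECT alias, token, dictionaries, lexemes
 FROM ts_debug('test_russian', 'На');
 -- alias comes back as 'word', with
 -- dictionaries = {russian_simple,russian_snowball}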

Any ideas?

1 answer

The problem is resolved. The cause was that the database had been initialized with the default ('C') LC_CTYPE and LC_COLLATE, under which lower() does not fold Cyrillic characters, so the simple dictionary never saw the lowercased token. We used

 initdb --locale=ru_RU.UTF-8 --lc-collate=ru_RU.UTF-8 --encoding=UTF-8 -U pgsql *PGSQL DATA DIR*

to recreate the instance and

 CREATE DATABASE "scratch"
     WITH OWNER "postgres"
     ENCODING 'UTF8'
     LC_COLLATE = 'ru_RU.UTF-8'
     LC_CTYPE = 'ru_RU.UTF-8';

to restore the database, and the simple dictionary now works.
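
If you want to confirm the same diagnosis on your own instance, something along these lines should show it (a quick sketch using the pg_database catalog):

 -- An affected database reports 'C' here:
 SELECT datcollate, datctype
 FROM pg_database
 WHERE datname = current_database();

 -- lower() relies on LC_CTYPE; under 'C' it leaves Cyrillic untouched:
 SELECT lower('На');  -- 'на' with ru_RU.UTF-8, 'На' with C

Note that CREATE DATABASE can only specify a collation different from the template's if you copy from template0.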


Source: https://habr.com/ru/post/951281/

