Getting the positions and number of occurrences of a lexeme from a tsvector

Is there a way to get a token's positions in a sentence, and its number of occurrences, from a tsvector?

Something like this:

SELECT * FROM get_position(to_tsvector('english', 'The Fat Rats'), to_tsquery('Rats')); 

will return 3

and

 SELECT * FROM get_occurrences(to_tsvector('english', 'The Fat Rats'), to_tsquery('Rats')); 

will return 1.

1 answer

The text representation of a tsvector lists the positions of every lexeme:

    test=# select to_tsvector('english', 'new bar in New York');
            to_tsvector
    ----------------------------
     'bar':2 'new':1,4 'york':5
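Applied to the sentence from the question, the same representation shows why the expected position is 3: the stop word "The" is dropped but still occupies position 1, and "Rats" is normalized to the lexeme 'rat'. A quick check, assuming the stock english configuration:

```sql
-- Stop words are removed from the result, but positions are not renumbered.
SELECT to_tsvector('english', 'The Fat Rats');
-- 'fat':2 'rat':3
```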

Below is an example function that relies on this text representation. It takes text parameters and converts them to tsvector internally, but it can easily be rewritten to accept a tsvector.

    CREATE OR REPLACE FUNCTION lexeme_occurrences (
        IN  _document text,
        IN  _word     text,
        IN  _config   regconfig,
        OUT lexeme_count     int,
        OUT lexeme_positions int[]
    ) RETURNS RECORD AS $$
    DECLARE
        -- Document and searched word, normalized with the same configuration
        _lexemes             tsvector := to_tsvector(_config, _document);
        _searched_lexeme     tsvector := strip(to_tsvector(_config, _word));
        -- Matches e.g. 'rat':3 and captures the comma-separated position list
        _occurrences_pattern text := _searched_lexeme::text || ':([0-9,]+)';
        _occurrences_list    text := substring(_lexemes::text, _occurrences_pattern);
    BEGIN
        SELECT count(a), array_agg(a::int)
        INTO lexeme_count, lexeme_positions
        FROM regexp_split_to_table(_occurrences_list, ',') a
        WHERE _searched_lexeme::text != ''; -- preventing false positives
        RETURN;
    END
    $$ LANGUAGE plpgsql;

Usage example:

    select * from lexeme_occurrences('The Fat Rats', 'rat', 'english');
     lexeme_count | lexeme_positions
    --------------+------------------
                1 | {3}
    (1 row)
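If the input is already a tsvector, parsing its text form may not be necessary. The sketch below is an alternative to the function above, not a rewrite of it. It assumes PostgreSQL 9.6 or later, where unnest(tsvector) and tsvector_to_array are available; the alias u and the output column names are chosen here only for illustration.

```sql
-- unnest(tsvector) returns one row per lexeme, with its positions as an array.
-- The searched word goes through the same configuration so that it is
-- normalized the same way as the document.
SELECT u.lexeme,
       cardinality(u.positions) AS lexeme_count,
       u.positions              AS lexeme_positions
FROM unnest(to_tsvector('english', 'new bar in New York')) AS u
WHERE u.lexeme = ANY (tsvector_to_array(to_tsvector('english', 'new')));
-- lexeme 'new', lexeme_count 2, lexeme_positions {1,4}
```

A word that does not occur in the document simply yields no rows here, whereas lexeme_occurrences returns a count of 0.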

Source: https://habr.com/ru/post/1200918/

