Getting the position and number of lexeme entries from tsvector

Question

Is there a way to get information about token positions in a sentence and the number of occurrences from tsvector?

Something like this:

SELECT * FROM get_position(to_tsvector('english', 'The Fat Rats'), to_tsquery('Rats'));

will return 3

and

 SELECT * FROM get_occurrences(to_tsvector('english', 'The Fat Rats'), to_tsquery('Rats'));

will return 1.

sql postgresql full-text-search

IgorekPotworek Aug 22 '14 at 11:12

1 answer

Tomasz siorek · Accepted Answer · 2014-08-23T16:31:04+0000

The text representation of a tsvector lists, for each lexeme, the positions at which it occurs:

 test=# select to_tsvector ( 'english', 'new bar in New York' );
         to_tsvector
 ----------------------------
  'bar':2 'new':1,4 'york':5
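As a rough illustration of how that text form can be mined, the query below pulls the position list for a single lexeme out with a regular expression and turns it into an integer array. It is only a sketch: the column aliases `position_list` and `positions` are made up for the example, and the pattern assumes the lexeme carries no weight labels (such as `3A`).

```sql
-- Extract the comma-separated position list that follows 'new': in the
-- text form of the tsvector, then split it into an int[].
-- Assumption: positions are plain digits (no weight letters like 1A).
SELECT substring(doc::text, '''new'':([0-9,]+)') AS position_list,
       string_to_array(
           substring(doc::text, '''new'':([0-9,]+)'), ','
       )::int[] AS positions
FROM (
    SELECT to_tsvector('english', 'new bar in New York') AS doc
) AS t;
-- position_list is '1,4' and positions is {1,4} for this input.
```

The length of the resulting array is the number of occurrences of the lexeme.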

Below is a sample function built on this. It takes text parameters and converts them to tsvector internally, but it can easily be rewritten to accept a tsvector.

 CREATE OR REPLACE FUNCTION lexeme_occurrences (
     IN _document text
   , IN _word text
   , IN _config regconfig
   , OUT lexeme_count int
   , OUT lexeme_positions int[]
 ) RETURNS RECORD AS $$
 DECLARE
     _lexemes tsvector := to_tsvector ( _config, _document );
     _searched_lexeme tsvector := strip ( to_tsvector ( _config, _word ) );
     _occurences_pattern text := _searched_lexeme::text || ':([0-9,]+)';
     _occurences_list text := substring ( _lexemes::text, _occurences_pattern );
 BEGIN
     SELECT count ( a ), array_agg ( a::int )
     FROM regexp_split_to_table ( _occurences_list, ',' ) a
     WHERE _searched_lexeme::text != '' -- preventing false positives
     INTO lexeme_count, lexeme_positions;
     RETURN;
 END
 $$ LANGUAGE plpgsql;

Usage example:

 select * from lexeme_occurrences ( 'The Fat Rats', 'rat', 'english' );
  lexeme_count | lexeme_positions
 --------------+------------------
             1 | {3}
 (1 row)
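The answer mentions that the function can be rewritten to accept a tsvector. One possible shape for that is sketched below. It is not the answer author's code: the name `lexeme_occurrences_tsv` is hypothetical, and instead of parsing the text form it relies on `unnest(tsvector)`, which assumes PostgreSQL 9.6 or later. If `_word` normalizes to several lexemes, only one of them is looked up.

```sql
-- Hypothetical variant that takes an already-built tsvector.
-- Assumes PostgreSQL 9.6+ for unnest(tsvector), which returns one row
-- per lexeme with columns (lexeme, positions, weights).
CREATE OR REPLACE FUNCTION lexeme_occurrences_tsv (
    IN  _lexemes tsvector,
    IN  _word text,
    IN  _config regconfig,
    OUT lexeme_count int,
    OUT lexeme_positions int[]
) RETURNS record AS $$
BEGIN
    -- Normalize the search word with the same configuration, then match
    -- it against the lexemes stored in the document vector.
    SELECT cardinality(d.positions), d.positions::int[]
      INTO lexeme_count, lexeme_positions
      FROM unnest(_lexemes) AS d
      JOIN unnest(to_tsvector(_config, _word)) AS w
        ON w.lexeme = d.lexeme
     LIMIT 1;

    -- No match (or a stripped vector without positions) leaves NULLs;
    -- report that as zero occurrences.
    lexeme_count := coalesce(lexeme_count, 0);
    RETURN;
END
$$ LANGUAGE plpgsql;

-- Usage, mirroring the example above:
SELECT *
FROM lexeme_occurrences_tsv(
    to_tsvector('english', 'The Fat Rats'), 'rats', 'english'
);
-- lexeme_count = 1, lexeme_positions = {3}
```

Because `unnest` returns positions and weights in separate columns, this version is not thrown off by weight labels, which the regular-expression approach does not account for.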
