Postgres text search tips for large text columns

I am new to databases and am looking for high-level tips.

Situation
I am building a database using Postgres 9.3. It contains a table in which log files are stored:

CREATE TABLE errorlogs (
    id SERIAL PRIMARY KEY,
    archive_id INTEGER NOT NULL REFERENCES archives,
    filename VARCHAR(256) NOT NULL,
    content TEXT
);

The text in the content column can range from 1 MB to 50 MB.

Problem
I would like to be able to do a fairly quick text search on the data in the "content" column (for example, WHERE content LIKE '%some_error%'). Right now the search is very slow (> 10 minutes to search 8206 rows).

I know that indexing is designed to solve exactly this problem, but I cannot seem to create an index - whenever I try, I get an error saying the index row is too large.

=# CREATE INDEX error_logs_content_idx ON errorlogs (content text_pattern_ops);
ERROR:  index row requires 1796232 bytes, maximum size is 8191

I was hoping for some tips on how to get around this problem. Can I change the maximum index size? Or should I not be trying to use Postgres for full-text search on text fields this large?

Any advice is greatly appreciated!

1 answer

Text search vectors cannot handle data that large - see the documented limits. Their strength is fuzzy matching, so a search for "swim" also hits "swims" and "swimming" in the same call. They are not meant to replace grep.
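
To make the stemming point concrete, here is a small illustration of my own (not from the original answer), using the built-in english configuration:

-- Illustration only: the english configuration stems 'swimming' and 'swims'
-- down to the lexeme 'swim', so all three forms match a query for 'swim'.
SELECT to_tsvector('english', 'swimming swims swim')
       @@ to_tsquery('english', 'swim');   -- returns true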

The reason for the limits is given in the source code as MAXSTRLEN (and MAXSTRPOS). Text search vectors are stored in one long, continuous array up to 1 MiB in length (the total of all characters for all unique lexemes). To access them, the ts_vector index structure allows 11 bits for the word length and 20 bits for its position in the array. These limits let the index structure fit into a 32-bit unsigned int.
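Concretely, 11 bits means a single lexeme can be at most 2^11 - 1 = 2047 bytes long, and 20 bits means its position in the array can be at most 2^20 - 1 = 1,048,575 - which is where the roughly 1 MiB ceiling comes from.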

You are probably running into one or both of these limits: either the file contains too many unique words, or words are repeated very frequently - both quite possible with a 50 MB log file of quasi-random data.
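
If you want to check which files are likely to be the problem, a rough diagnostic like the one below may help. This is my own sketch, not part of the original answer; it simply counts distinct whitespace-separated tokens per file (CROSS JOIN LATERAL requires Postgres 9.3, which you have).

-- Rough diagnostic: files with a very large number of distinct tokens are
-- the ones likely to exceed the tsvector limits described above.
-- Expect this to be slow on 50 MB rows; try it on a few large files first.
SELECT e.id,
       e.filename,
       octet_length(e.content) AS content_bytes,
       count(DISTINCT w.word)  AS distinct_words
FROM errorlogs e
CROSS JOIN LATERAL regexp_split_to_table(lower(e.content), E'\\s+') AS w(word)
GROUP BY e.id, e.filename, octet_length(e.content)
ORDER BY content_bytes DESC
LIMIT 10;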

Are you sure you need to store log files in a database? You are basically replicating the file system, and grep or python can do the searching there quite well. If you really do need to, though, you might consider something like this:

CREATE TABLE errorlogs (
    id SERIAL PRIMARY KEY,
    archive_id INTEGER NOT NULL REFERENCES archives,
    filename VARCHAR(256) NOT NULL
);

CREATE TABLE log_lines (
    line SERIAL PRIMARY KEY,
    errorlog INTEGER REFERENCES errorlogs (id),
    context TEXT,
    tsv TSVECTOR
);

CREATE INDEX log_lines_tsv_idx ON log_lines USING gin (tsv);

Here you treat each line of the log as a "document". To search, you would do something like

SELECT e.id, e.filename, g.line, g.context
FROM errorlogs e
JOIN log_lines g ON e.id = g.errorlog
WHERE g.tsv @@ to_tsquery('some & error');
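
One step the answer leaves out is how log_lines gets populated. A minimal sketch, assuming the original unsplit text is still available somewhere - here a hypothetical staging table errorlogs_raw(id, content) whose ids already exist in errorlogs - might look like this:

-- Sketch only: split each stored file into lines and vectorize each line.
-- errorlogs_raw(id, content) is a hypothetical staging table; adjust the
-- names to match your schema.
INSERT INTO log_lines (errorlog, context, tsv)
SELECT r.id,
       t.line_text,
       to_tsvector('english', t.line_text)
FROM errorlogs_raw r
CROSS JOIN LATERAL regexp_split_to_table(r.content, E'\n') AS t(line_text)
WHERE t.line_text <> '';

Once the text is split line by line, each tsvector stays well below the limits above, and the GIN index on tsv is what makes the @@ query fast.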

Source: https://habr.com/ru/post/981068/

