SQL optimization - count words per row - PostgreSQL

I am trying to update a large table (about 1M rows) in PostgreSQL with the number of words in a text field. This query works and sets the token_count field to the number of words (tokens) in the longtext column of my_table:

    UPDATE my_table mt
    SET token_count = (
        SELECT count(token)
        FROM (
            SELECT unnest(regexp_matches(t.longtext, E'\\w+', 'g')) AS token
            FROM my_table AS t
            WHERE mt.myid = t.myid
        ) AS tokens
    );

myid is the primary key of the table. The \\w+ is needed because I want to count words while ignoring special characters. For example, "A test . ; )" would return 5 with a space-based count, while 2 is the correct value. The problem is that it is terribly slow: 2 days were not enough to complete it over 1M rows. What would you do to optimize it? Is there any way to avoid the correlated subquery?

Is there a way to break the update into batches, using for example LIMIT and OFFSET?

Thanks for any advice.

Mulone

UPDATE: I measured the performance of the array-splitting approach, and the update will still be slow. So perhaps the solution is to parallelize it. If I run several queries from psql, only one of them runs and the others wait for it to finish. How can I parallelize an update?
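One way to parallelize (a sketch, assuming myid is a reasonably dense integer primary key) is to run several psql sessions, each updating a disjoint range of ids so that their row locks do not overlap; the range boundaries below are placeholders:

    -- Session 1 (psql connection #1)
    UPDATE my_table
    SET token_count = (
        SELECT count(*)
        FROM (SELECT unnest(regexp_matches(longtext, E'\\w+', 'g'))) s
    )
    WHERE myid BETWEEN 1 AND 250000;

    -- Session 2 (psql connection #2), and so on for the remaining ranges
    UPDATE my_table
    SET token_count = (
        SELECT count(*)
        FROM (SELECT unnest(regexp_matches(longtext, E'\\w+', 'g'))) s
    )
    WHERE myid BETWEEN 250001 AND 500000;

Because the WHERE clauses touch different rows, the sessions do not block each other the way identical whole-table updates do.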

+4
4 answers

Have you tried using array_length?

 UPDATE my_table mt SET token_count = array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1) 

http://www.postgresql.org/docs/current/static/functions-array.html

    # select array_length(regexp_split_to_array(trim(' some long text '), E'\\W+'), 1);
     array_length
    --------------
                3
    (1 row)
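One caveat worth checking (my own observation, not part of the answer): regexp_split_to_array produces an empty trailing element when the text ends in non-word characters, which inflates the count by one, since trim() only removes whitespace. Stripping leading and trailing non-word characters first avoids that:

    -- Trailing punctuation leaves an empty array element
    SELECT array_length(regexp_split_to_array(trim('A test . ; )'), E'\\W+'), 1);   -- returns 3

    -- Stripping leading/trailing non-word characters first gives the expected 2
    SELECT array_length(
             regexp_split_to_array(
               regexp_replace('A test . ; )', E'(^\\W+|\\W+$)', '', 'g'),
               E'\\W+'),
             1);                                                                    -- returns 2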
+7
 UPDATE my_table SET token_count = array_length(regexp_split_to_array(longtext, E'\\s+'), 1) 

Or your original query without the correlation:

 UPDATE my_table SET token_count = ( select count(*) from (select unnest(regexp_matches(longtext, E'\\w+','g'))) s ); 
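As a quick sanity check on the question's example string (a sketch of my own, not from the answer), the uncorrelated form counts only word tokens:

    -- regexp_matches(..., 'g') yields one row per word, so count(*) returns 2 here
    SELECT count(*)
    FROM (SELECT unnest(regexp_matches('A test . ; )', E'\\w+', 'g'))) s;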
+2

Using tsvector and ts_stat

Get statistics for a tsvector column:

 SELECT * FROM ts_stat($$ SELECT to_tsvector(t.longtext) FROM my_table AS t $$); 

I have no sample data to try it on, but it should work.

Sample data

    CREATE TEMP TABLE my_table AS
    SELECT $$A paragraph (from the Ancient Greek παράγραφος paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.$$::text AS longtext;

    SELECT * FROM ts_stat($$ SELECT to_tsvector(t.longtext) FROM my_table AS t $$);

         word     | ndoc | nentry
    --------------+------+--------
     παράγραφος   |    1 |      1
     written      |    1 |      1
     write        |    1 |      2
     unit         |    1 |      1
     sentenc      |    1 |      1
     self-contain |    1 |      1
     self         |    1 |      1
     point        |    1 |      1
     particular   |    1 |      1
     paragrapho   |    1 |      1
     paragraph    |    1 |      2
     one          |    1 |      1
     idea         |    1 |      1
     greek        |    1 |      1
     discours     |    1 |      1
     deal         |    1 |      1
     contain      |    1 |      1
     consist      |    1 |      1
     besid        |    1 |      2
     ancient      |    1 |      1
    (20 rows)
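Note that ts_stat gives corpus-wide statistics rather than a per-row count. If a per-row value is wanted via text search anyway, one possible variant (an assumption on my part, not something the answer proposes) is length(to_tsvector(...)); keep in mind it counts distinct lexemes, so repeated words collapse into one entry and, with language configurations, stop words are dropped:

    -- Counts distinct lexemes per row; the 'simple' configuration avoids stemming
    -- and stop-word removal, but duplicate words still collapse into one lexeme.
    UPDATE my_table
    SET token_count = length(to_tsvector('simple', longtext));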
+2
  • Make sure myid is indexed, as the first field in the index.

  • Consider doing this outside the database. It's hard to say without benchmarking, but the counting may be more expensive than a select-and-update round trip, so it may be worth it:

    • Use the COPY command (the Postgres equivalent of BCP) to dump the table data to a file efficiently.

    • Run a simple Perl script to do the counting. 1 million rows should take somewhere between a couple of minutes and 1 hour in Perl, depending on how slow your I/O is.

    • Use COPY to load the data back into the database (possibly into a temporary table, then update the main table from it; or, better yet, truncate the main table and COPY straight into it if you can afford the downtime).

  • For both your approach and the last step of my approach #2, update token_count in batches of about 5000 rows (for example, the equivalent of set rowcount 5000, looping the update and adding where token_count IS NULL to the query so already-processed rows are skipped); a sketch follows below.
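set rowcount is a Sybase/SQL Server idiom; a comparable batched update in PostgreSQL might look like the following sketch (the 5000-row batch size follows the answer, while driving the loop from a client or shell script until zero rows are affected is my assumption):

    -- One batch; repeat until UPDATE reports 0 rows affected
    UPDATE my_table
    SET token_count = array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1)
    WHERE myid IN (
        SELECT myid
        FROM my_table
        WHERE token_count IS NULL
        LIMIT 5000
    );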

-1

Source: https://habr.com/ru/post/1487106/

