SQL optimization - count words per row - PostgreSQL

I am trying to update a large table (about 1M rows) in PostgreSQL with the number of words in a text field. This query works and sets the token_count field to the number of words (tokens) in the longtext column of my_table:

    UPDATE my_table mt
    SET token_count = (
        SELECT count(token)
        FROM (
            SELECT unnest(regexp_matches(t.longtext, E'\\w+', 'g')) AS token
            FROM my_table AS t
            WHERE mt.myid = t.myid
        ) AS tokens
    );

myid is the primary key of the table. The \\w+ is needed because I want to count words while ignoring special characters. For example, "A test . ; )" would return 5 with a space-based count, while 2 is the correct value. The problem is that it is terribly slow: 2 days were not enough to complete it over 1M rows. What would you do to optimize it? Is there any way to avoid the correlated subquery?

Is there a way to break the update into batches, using for example LIMIT and OFFSET?

Thanks for any advice.

Mulone

UPDATE: I measured the performance of the array-splitting approach, and the update will still be slow. So perhaps the solution is to parallelize it. If I run several queries from psql, only one of them runs and the others wait for it to finish. How can I parallelize an update?
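One way to parallelize (a sketch, assuming myid is a reasonably dense integer primary key) is to run several psql sessions, each updating a disjoint range of ids so that their row locks do not overlap; the range boundaries below are placeholders:

    -- Session 1 (psql connection #1)
    UPDATE my_table
    SET token_count = (
        SELECT count(*)
        FROM (SELECT unnest(regexp_matches(longtext, E'\\w+', 'g'))) s
    )
    WHERE myid BETWEEN 1 AND 250000;

    -- Session 2 (psql connection #2), and so on for the remaining ranges
    UPDATE my_table
    SET token_count = (
        SELECT count(*)
        FROM (SELECT unnest(regexp_matches(longtext, E'\\w+', 'g'))) s
    )
    WHERE myid BETWEEN 250001 AND 500000;

Because the WHERE clauses touch different rows, the sessions do not block each other the way identical whole-table updates do.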

+4
4 answers

Have you tried using array_length?

 UPDATE my_table mt SET token_count = array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1) 

http://www.postgresql.org/docs/current/static/functions-array.html

    # select array_length(regexp_split_to_array(trim(' some long text '), E'\\W+'), 1);
     array_length
    --------------
                3
    (1 row)
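One caveat worth checking (my own observation, not part of the answer): regexp_split_to_array produces an empty trailing element when the text ends in non-word characters, which inflates the count by one, since trim() only removes whitespace. Stripping leading and trailing non-word characters first avoids that:

    -- Trailing punctuation leaves an empty array element
    SELECT array_length(regexp_split_to_array(trim('A test . ; )'), E'\\W+'), 1);   -- returns 3

    -- Stripping leading/trailing non-word characters first gives the expected 2
    SELECT array_length(
             regexp_split_to_array(
               regexp_replace('A test . ; )', E'(^\\W+|\\W+$)', '', 'g'),
               E'\\W+'),
             1);                                                                    -- returns 2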
+7
 UPDATE my_table SET token_count = array_length(regexp_split_to_array(longtext, E'\\s+'), 1) 

Or your original query without the correlation:

 UPDATE my_table SET token_count = ( select count(*) from (select unnest(regexp_matches(longtext, E'\\w+','g'))) s ); 
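As a quick sanity check on the question's example string (a sketch of my own, not from the answer), the uncorrelated form counts only word tokens:

    -- regexp_matches(..., 'g') yields one row per word, so count(*) returns 2 here
    SELECT count(*)
    FROM (SELECT unnest(regexp_matches('A test . ; )', E'\\w+', 'g'))) s;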
+2

Using tsvector and ts_stat

Get statistics for a tsvector column:

 SELECT * FROM ts_stat($$ SELECT to_tsvector(t.longtext) FROM my_table AS t $$); 

I have no sample data to try it on, but it should work.

Sample data

    CREATE TEMP TABLE my_table AS
    SELECT $$A paragraph (from the Ancient Greek παράγραφος paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.$$::text AS longtext;

    SELECT * FROM ts_stat($$ SELECT to_tsvector(t.longtext) FROM my_table AS t $$);

         word     | ndoc | nentry
    --------------+------+--------
     παράγραφος   |    1 |      1
     written      |    1 |      1
     write        |    1 |      2
     unit         |    1 |      1
     sentenc      |    1 |      1
     self-contain |    1 |      1
     self         |    1 |      1
     point        |    1 |      1
     particular   |    1 |      1
     paragrapho   |    1 |      1
     paragraph    |    1 |      2
     one          |    1 |      1
     idea         |    1 |      1
     greek        |    1 |      1
     discours     |    1 |      1
     deal         |    1 |      1
     contain      |    1 |      1
     consist      |    1 |      1
     besid        |    1 |      2
     ancient      |    1 |      1
    (20 rows)
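Note that ts_stat gives corpus-wide statistics rather than a per-row count. If a per-row value is wanted via text search anyway, one possible variant (an assumption on my part, not something the answer proposes) is length(to_tsvector(...)); keep in mind it counts distinct lexemes, so repeated words collapse into one entry and, with language configurations, stop words are dropped:

    -- Counts distinct lexemes per row; the 'simple' configuration avoids stemming
    -- and stop-word removal, but duplicate words still collapse into one lexeme.
    UPDATE my_table
    SET token_count = length(to_tsvector('simple', longtext));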
+2
  • Make sure myid is indexed, as the first field in the index.

  • Consider doing this outside the database. It's hard to say without benchmarking, but the counting may be more expensive than a select-and-update round trip, so it may be worth it:

    • Use the COPY command (the Postgres equivalent of BCP) to dump the table data to a file efficiently.

    • Run a simple Perl script to do the counting. 1 million rows should take somewhere between a couple of minutes and 1 hour in Perl, depending on how slow your I/O is.

    • Use COPY to load the data back into the database (possibly into a temporary table, then update the main table from it; or, better yet, truncate the main table and COPY straight into it if you can afford the downtime).

  • For both your approach and the last step of my approach #2, update token_count in batches of about 5000 rows (for example, the equivalent of set rowcount 5000, looping the update and adding where token_count IS NULL to the query so already-processed rows are skipped); a sketch follows below.
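set rowcount is a Sybase/SQL Server idiom; a comparable batched update in PostgreSQL might look like the following sketch (the 5000-row batch size follows the answer, while driving the loop from a client or shell script until zero rows are affected is my assumption):

    -- One batch; repeat until UPDATE reports 0 rows affected
    UPDATE my_table
    SET token_count = array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1)
    WHERE myid IN (
        SELECT myid
        FROM my_table
        WHERE token_count IS NULL
        LIMIT 5000
    );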

-1

Source: https://habr.com/ru/post/1487106/

