I am trying to update a large table (about 1M rows) in PostgreSQL, storing the number of words found in a text field. The following query works and sets token_count to the number of words (tokens) in the longtext column of my_table:
UPDATE my_table mt
SET token_count = (
    SELECT count(token)
    FROM (SELECT unnest(regexp_matches(t.longtext, E'\\w+', 'g')) AS token
          FROM my_table AS t
          WHERE mt.myid = t.myid) AS tokens
);
myid is the primary key of the table. The \\w+ is needed because I want to count words while ignoring special characters. For example, "A test . ; )" would return 5 with a space-based count, whereas 2 is the correct value. The problem is that the query is terribly slow: 2 days are not enough to complete it on 1M rows. What would you do to optimize it? Is there any way to avoid the join?
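To make the difference concrete, this is a quick way to check both counts on that example string (regexp_split_to_array and regexp_matches are standard PostgreSQL functions):

-- space-based count vs. word count on the example string
SELECT
    array_length(regexp_split_to_array('A test . ; )', E'\\s+'), 1) AS space_count,   -- 5
    (SELECT count(*) FROM regexp_matches('A test . ; )', E'\\w+', 'g')) AS word_count; -- 2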
How could I break the update into batches, using for example LIMIT and OFFSET?
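Something along these lines is the kind of batched update I have in mind (the myid range bounds are just placeholders, and I am assuming the self-join can be dropped by referencing mt.longtext directly):

-- sketch of one batch; repeat with the next myid range
UPDATE my_table mt
SET token_count = (SELECT count(*)
                   FROM regexp_matches(mt.longtext, E'\\w+', 'g'))
WHERE mt.myid BETWEEN 1 AND 100000;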
Thanks for any advice.
Mulone
UPDATE: I benchmarked the array-split variant (splitting the text into an array and counting its elements) and the update is still slow, so perhaps the solution is to parallelize it. But if I run several queries from psql, only one of them actually runs and the others wait for it to finish. How can I parallelize an UPDATE?
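What I had in mind is running disjoint key ranges from separate psql sessions, something like the sketch below (the ranges are placeholders; I am assuming the two statements would not block each other because they touch different rows):

-- psql session 1
UPDATE my_table mt
SET token_count = (SELECT count(*) FROM regexp_matches(mt.longtext, E'\\w+', 'g'))
WHERE mt.myid BETWEEN 1 AND 500000;

-- psql session 2, run concurrently
UPDATE my_table mt
SET token_count = (SELECT count(*) FROM regexp_matches(mt.longtext, E'\\w+', 'g'))
WHERE mt.myid BETWEEN 500001 AND 1000000;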