Ruby Parallel / Multithreaded programming to read a huge database

I have a Ruby script that reads a huge table (~20M rows), does some processing, and feeds the results to Solr for indexing. This has been a big bottleneck in our process. I plan to speed things up and would like to introduce some sort of parallelism, but I am confused about the multithreaded nature of Ruby. Our servers run ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]. From this blog post and https://stackoverflow.com/a/16830/ it appears that Ruby does not have "real" multithreading. Our servers have several cores, though, so using the parallel gem looks like another option, roughly as sketched below.
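For reference, this is the kind of thing I have in mind with the parallel gem; fetch_rows and index_in_solr are hypothetical stand-ins for our existing code:

 require 'rubygems'
 require 'parallel'

 # Split the ~20M rows into ID ranges and process them in forked workers,
 # one process per core, sidestepping MRI's GIL.
 ranges = (0...200).map { |i| (i * 100_000)...((i + 1) * 100_000) }

 Parallel.each(ranges, :in_processes => 4) do |range|
   rows = fetch_rows(range)     # hypothetical: read this slice from the DB
   index_in_solr(rows)          # hypothetical: feed the batch to Solr
 end

My understanding is that each forked worker would need its own database connection, since connections generally do not survive a fork.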

Which approach should I go with? In addition, any pointers to resources on concurrent database access would be highly appreciated.

+6
2 answers

You can parallelize this at the OS level. Modify the script so that it accepts a range of rows from your input file:

 $ reader_script --lines=10000:20000 mytable.txt 

Then run multiple instances of the script.

 $ reader_script --lines=0:10000 mytable.txt &
 $ reader_script --lines=10000:20000 mytable.txt &
 $ reader_script --lines=20000:30000 mytable.txt &

Unix will automatically distribute them across the different cores.
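A minimal sketch of how reader_script might parse that option and process only its slice (process_row is a hypothetical stand-in for the existing per-row work):

 #!/usr/bin/env ruby
 # reader_script: process only rows START...END of the input file
 require 'optparse'

 from, to = nil, nil
 OptionParser.new do |opts|
   opts.on('--lines RANGE', 'row range as START:END') do |v|
     from, to = v.split(':').map { |s| s.to_i }
   end
 end.parse!

 File.open(ARGV[0]) do |f|
   f.each_with_index do |line, i|
     next  if i < from          # skip rows before this instance's slice
     break if i >= to           # stop once past the slice
     process_row(line)          # hypothetical: existing processing + Solr feed
   end
 end

Since each instance is a separate process, there is no shared state to worry about; just make sure each one opens its own connection to the database and to Solr.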

+4

Is there any chance you could upgrade to Ruby 1.9? It is usually faster than 1.8.7.

It is true that MRI suffers from the GIL, but if multithreading would solve your problem, you could take a look at JRuby, since it supports true native threads.

In addition, you should make sure the CPU really is the bottleneck, because if you are bound by I/O, multithreading may not buy you much.
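For illustration, here is a minimal thread-pool sketch that would actually use multiple cores under JRuby (each_row_batch and index_batch are hypothetical stand-ins for your reader and indexing code):

 require 'thread'

 WORKERS = 4
 queue   = Queue.new

 threads = (1..WORKERS).map do
   Thread.new do
     while (batch = queue.pop)       # a nil in the queue signals shutdown
       index_batch(batch)            # hypothetical: process rows + send to Solr
     end
   end
 end

 each_row_batch do |batch|           # hypothetical reader yielding row batches
   queue << batch
 end

 WORKERS.times { queue << nil }      # one stop signal per worker
 threads.each { |t| t.join }

The same code runs on MRI 1.8, but its green threads plus the GIL mean the workers would never occupy more than one core.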

+1

Source: https://habr.com/ru/post/898183/
