Ruby Parallel / Multithreaded programming to read a huge database

I have a Ruby script that reads a huge table (~20M rows), does some processing, and feeds the results to Solr for indexing. This has been a big bottleneck in our process. I plan to speed things up and would like to introduce some sort of parallelism, but I am confused about the multithreaded nature of Ruby. Our servers run ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]. From this blog post and https://stackoverflow.com/a/16830/ it appears that Ruby does not have "real" multithreading. Our servers have several cores, though, so using the parallel gem looks like another option, roughly as sketched below.
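For reference, this is the kind of thing I have in mind with the parallel gem; fetch_rows and index_in_solr are hypothetical stand-ins for our existing code:

 require 'rubygems'
 require 'parallel'

 # Split the ~20M rows into ID ranges and process them in forked workers,
 # one process per core, sidestepping MRI's GIL.
 ranges = (0...200).map { |i| (i * 100_000)...((i + 1) * 100_000) }

 Parallel.each(ranges, :in_processes => 4) do |range|
   rows = fetch_rows(range)     # hypothetical: read this slice from the DB
   index_in_solr(rows)          # hypothetical: feed the batch to Solr
 end

My understanding is that each forked worker would need its own database connection, since connections generally do not survive a fork.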

Which approach should I go with? In addition, any pointers to resources on concurrent database access would be highly appreciated.

+6
2 answers

You can parallelize this at the OS level. Modify the script so that it accepts a range of rows from your input file:

 $ reader_script --lines=10000:20000 mytable.txt 

Then run multiple instances of the script.

 $ reader_script --lines=0:10000 mytable.txt &
 $ reader_script --lines=10000:20000 mytable.txt &
 $ reader_script --lines=20000:30000 mytable.txt &

Unix will automatically distribute them across the different cores.
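A minimal sketch of how reader_script might parse that option and process only its slice (process_row is a hypothetical stand-in for the existing per-row work):

 #!/usr/bin/env ruby
 # reader_script: process only rows START...END of the input file
 require 'optparse'

 from, to = nil, nil
 OptionParser.new do |opts|
   opts.on('--lines RANGE', 'row range as START:END') do |v|
     from, to = v.split(':').map { |s| s.to_i }
   end
 end.parse!

 File.open(ARGV[0]) do |f|
   f.each_with_index do |line, i|
     next  if i < from          # skip rows before this instance's slice
     break if i >= to           # stop once past the slice
     process_row(line)          # hypothetical: existing processing + Solr feed
   end
 end

Since each instance is a separate process, there is no shared state to worry about; just make sure each one opens its own connection to the database and to Solr.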

+4

Is there any chance you could upgrade to Ruby 1.9? It is usually faster than 1.8.7.

It is true that MRI suffers from the GIL, but if multithreading would solve your problem, you could take a look at JRuby, since it supports true native threads.

In addition, you should make sure the CPU really is the bottleneck, because if you are bound by I/O, multithreading may not buy you much.
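For illustration, here is a minimal thread-pool sketch that would actually use multiple cores under JRuby (each_row_batch and index_batch are hypothetical stand-ins for your reader and indexing code):

 require 'thread'

 WORKERS = 4
 queue   = Queue.new

 threads = (1..WORKERS).map do
   Thread.new do
     while (batch = queue.pop)       # a nil in the queue signals shutdown
       index_batch(batch)            # hypothetical: process rows + send to Solr
     end
   end
 end

 each_row_batch do |batch|           # hypothetical reader yielding row batches
   queue << batch
 end

 WORKERS.times { queue << nil }      # one stop signal per worker
 threads.each { |t| t.join }

The same code runs on MRI 1.8, but its green threads plus the GIL mean the workers would never occupy more than one core.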

+1

Source: https://habr.com/ru/post/898183/
