Why is this Ruby XML parsing slower with GC disabled?

I have a piece of code that processes a 500 MB XML file using the libxml-ruby gem. To my surprise, this code runs slower with the GC disabled, which seems counterintuitive. What could be the reason? I have plenty of available memory, and the system does not swap.

    require 'xml'

    #GC.disable

    @reader = XML::Reader.file('books.xml', :options => XML::Parser::Options::NOBLANKS)
    @reader.read
    @reader.read

    while @reader.name == 'book'
      book_id = @reader.get_attribute('id')
      @reader.read

      until @reader.name == 'book' && @reader.node_type == XML::Reader::TYPE_END_ELEMENT
        case @reader.name
        when 'author'
          author = @reader.read_string
        when 'title'
          title = @reader.read_string
        when 'genre'
          genre = @reader.read_string
        when 'price'
          price = @reader.read_string
        when 'publish_date'
          publish_date = @reader.read_string
        when 'description'
          description = @reader.read_string
        end
        @reader.next
      end

      @reader.read
    end

    @reader.close

Here are the results I got:

    ruby     gc on     gc off
    2.2.0    16.93s    18.81s
    2.1.5    16.22s    18.58s
    2.0.0    17.63s    17.99s

Why turn off the garbage collector? I read in Ruby Performance Optimization that Ruby is slow mostly because programmers don't think about memory consumption, which makes the garbage collector spend a lot of time running. Thus, turning off the GC should instantly speed things up (at the cost of memory usage, of course), at least until the system starts swapping.
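One way to check that assumption before disabling anything is to measure how much of the wall-clock time is actually spent in GC. The following is only a minimal sketch using `GC::Profiler` and `Benchmark` from the standard library; `allocate_strings` and `measure_gc_share` are hypothetical helpers, and the string-building loop is a stand-in for the real parsing loop, not the libxml-ruby code itself.

```ruby
require 'benchmark'

# Hypothetical stand-in workload: builds many short-lived strings,
# loosely imitating a read_string call for every element.
def allocate_strings(count)
  total = 0
  count.times { |i| total += "value-#{i}".length }
  total
end

# Runs the workload with the GC profiler enabled and reports
# wall-clock time next to the time the profiler attributes to GC.
def measure_gc_share(count)
  GC::Profiler.clear
  GC::Profiler.enable
  elapsed = Benchmark.realtime { allocate_strings(count) }
  gc_time = GC::Profiler.total_time # seconds spent in GC runs
  { elapsed: elapsed, gc_time: gc_time }
ensure
  GC::Profiler.disable
end

result = measure_gc_share(200_000)
puts format('wall: %.3fs, GC: %.3fs', result[:elapsed], result[:gc_time])
```

If the GC figure is only a small fraction of the wall-clock time, disabling the GC cannot buy much, whatever else happens.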

I wanted to know if my XML parsing module could be improved, so I started experimenting with it by disabling the GC, which led me to this problem. I expected a significant speedup with the GC disabled, but instead I got the opposite. I know the differences are not huge, but this still seems strange to me.

The libxml-ruby gem uses the native C implementation of LibXML under the hood. Could this be the reason?

The file I used was created by manually replicating the entries of the books.xml sample downloaded from the Microsoft documentation:

    <catalog>
      <book id="bk101">
        <author>John Doe</author>
        <title>XML for dummies</title>
        <genre>Computer</genre>
        <price>44.95</price>
        <publish_date>2000-10-01</publish_date>
        <description>Some description</description>
      </book>
      ....
    </catalog>

My setup: OS X Yosemite, Intel Core i5 2.6 GHz, 16 GB of RAM.

Thanks for any suggestions.

1 answer

You forgot about the operating system: you disabled the GC in your MRI process, but you have no control over the Linux/Unix kernel and how it allocates memory to your MRI application.

In fact, I believe that by disabling the GC you have made your application's behavior significantly worse, making it likely that your program constantly has to request more RAM from the kernel. That most likely incurs some form of overhead in the kernel as it allocates swap or memory to you.

Your source data is a 500 MB XML file that you are reading, node by node, into the memory space of your MRI program. Your MRI process has probably consumed several GB by the time it finishes, and none of the values in your main read loop is discarded after each iteration. They just linger in memory and are only cleaned up when your application exits and the memory is returned to the operating system.

The GC is designed to manage this; it exists so that your application does not request additional memory from the kernel unless it absolutely needs it, and so that your application can run "reasonably well" within the memory it has already been allocated.
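The effect can be observed from inside the interpreter. This is a rough sketch, not a benchmark of the original script: `churn` and `heap_growth` are hypothetical helpers, and the workload is a plain string-allocation loop standing in for the XML reading. It compares how many object slots the Ruby heap gains with the GC running and with it disabled, using `ObjectSpace.count_objects`.

```ruby
# Hypothetical workload: create many strings that become garbage immediately.
def churn(count)
  total = 0
  count.times { |i| total += "value-#{i}".length }
  total
end

# Returns how many object slots the Ruby heap gained while the block ran.
def heap_growth
  before = ObjectSpace.count_objects[:TOTAL]
  yield
  ObjectSpace.count_objects[:TOTAL] - before
end

# With the GC running, dead strings are reclaimed and their slots reused.
GC.start
with_gc = heap_growth { churn(300_000) }

# With the GC disabled, every new object needs a fresh slot,
# so the interpreter has to keep asking the OS for more memory.
GC.disable
begin
  without_gc = heap_growth { churn(300_000) }
ensure
  GC.enable
end

puts "heap slots added with GC on:  #{with_gc}"
puts "heap slots added with GC off: #{without_gc}"
```

The exact numbers depend on the Ruby version and its heap tuning; the point of the comparison is only that the second figure grows with the amount of work done while the first one levels off.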

So I'm not very surprised that you see a slowdown with the GC disabled. What would be telling is the load average and the swap usage of your box during the tests.
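As a starting point for such measurements, the process can sample its own resident memory while it runs. The sketch below is an assumption-laden illustration: `rss_kb` and `sample_memory` are hypothetical helpers, the method reads `/proc/self/status` where it exists (Linux) and otherwise falls back to `ps -o rss= -p <pid>` (available on OS X), and the retained array merely imitates values that are never collected. Load average and swap usage would still have to be watched with the system's own tools.

```ruby
# Resident set size of the current process in kilobytes, or nil if it
# cannot be determined on this platform.
def rss_kb
  status = '/proc/self/status'
  if File.readable?(status)
    line = File.foreach(status).find { |l| l.start_with?('VmRSS:') }
    return line.split[1].to_i if line
  end
  out = `ps -o rss= -p #{Process.pid}`.strip
  out.empty? ? nil : out.to_i
rescue SystemCallError
  nil
end

# Samples RSS before and after allocating objects with the GC disabled.
def sample_memory(count)
  before = rss_kb
  GC.disable
  retained = Array.new(count) { |i| "value-#{i}" }
  after = rss_kb
  { before_kb: before, after_kb: after, objects: retained.size }
ensure
  GC.enable
end

sample = sample_memory(200_000)
puts "RSS before: #{sample[:before_kb].inspect} kB, after: #{sample[:after_kb].inspect} kB"
```

Calling `rss_kb` at intervals inside the real parsing loop, once with `GC.disable` and once without, would show whether the process keeps growing in the way described above.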


Source: https://habr.com/ru/post/1245993/

