Stream-based parsing and writing of JSON

I fetch about 20,000 datasets from a server in batches of 1,000. Each dataset is a JSON object. Persisted, this amounts to about 350 MB of uncompressed plaintext.

I have a memory limit of 1 GB, so I write each batch of 1,000 JSON objects to a raw JSON file as an array, in append mode.

The result is a file containing 20 JSON arrays that need to be aggregated. I have to touch them anyway, because I want to add metadata. With Ruby's Yajl parser this would normally look like:

    raw_file  = File.new(path_to_raw_file, 'r')
    json_file = File.new(path_to_json_file, 'w')

    datasets = []
    parser = Yajl::Parser.new
    parser.on_parse_complete = Proc.new { |o| datasets += o }
    parser.parse(raw_file)

    hash = { date: Time.now, datasets: datasets }
    Yajl::Encoder.encode(hash, json_file)

What is the problem with this solution? The problem is that all of the JSON is parsed into memory, which is exactly what I have to avoid.

What I basically need is a solution that parses JSON from one IO object and encodes it to another IO object at the same time.

I assumed Yajl offered this, but I haven't found a way, and its API gives no hints either, so I guess not. Is there a JSON parser library that supports this? Are there other solutions?


The only solution I can think of is to use IO.seek: write all the dataset arrays one after another, [...][...][...], and after each array seek back and overwrite the ][ boundary with ", ", effectively concatenating the arrays by hand.
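Roughly, a sketch of that workaround might look like the following (fetch_batches and the file name are placeholders for however the batches actually arrive; the point is that overwriting the two bytes ][ with ", " keeps the byte count the same and leaves one big valid array):

    require 'yajl'

    # Sketch only: fetch_batches stands in for whatever yields the 1,000-dataset
    # batches from the server, and 'raw.json' is a placeholder path.
    File.open('raw.json', File::RDWR | File::CREAT) do |file|
      file.seek(0, IO::SEEK_END)
      fetch_batches do |batch|
        boundary = file.pos                 # the previous array's "]" sits at boundary - 1
        file.write(Yajl::Encoder.encode(batch))
        unless boundary.zero?
          file.seek(boundary - 1)
          file.write(', ')                  # turn "][" into ", " (same byte count)
          file.seek(0, IO::SEEK_END)        # back to the end for the next append
        end
      end
    end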

+6
2 answers

Why can't you retrieve a single record at a time from the database, process it as necessary, convert it to JSON, and then emit it with a trailing/delimiting comma?

If you started with a file containing only [, appended each JSON row, left the comma off the final record, and finished with a closing ], you'd have a JSON array of hashes and would only need to process one row's worth at a time.
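For instance, reading such a file back one row at a time could look something like this sketch (the file name is just an example):

    require 'json'

    # The file is assumed to look like:
    # [
    # {...},
    # {...},
    # {...}
    # ]
    # so every data line is a complete JSON object, possibly with a trailing comma.
    File.foreach('datasets.json') do |line|
      line = line.strip.chomp(',')
      next if line.empty? || line == '[' || line == ']'
      dataset = JSON.parse(line)
      # add metadata, re-encode, write to the output IO, etc.
    end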

It'd be a tiny bit slower (maybe), but it won't crush your system. And DB I/O can be very fast if you use blocking/paging to retrieve a reasonable number of records at a time.

For example, here's a combination of some Sequel example code and code to extract the rows as JSON and build a larger JSON structure:

    require 'json'
    require 'sequel'

    DB = Sequel.sqlite # memory database

    DB.create_table :items do
      primary_key :id
      String :name
      Float :price
    end

    items = DB[:items] # Create a dataset

    # Populate the table
    items.insert(:name => 'abc', :price => rand * 100)
    items.insert(:name => 'def', :price => rand * 100)
    items.insert(:name => 'ghi', :price => rand * 100)

    add_comma = false

    puts '['
    items.order(:price).each do |item|
      puts ',' if add_comma
      add_comma ||= true
      print JSON[item]
    end
    puts "\n]"

Which outputs:

 [ {"id":2,"name":"def","price":3.714714089426208}, {"id":3,"name":"ghi","price":27.0179624376119}, {"id":1,"name":"abc","price":52.51248221170203} ] 

Note that the order is now by "price".

Verifying the output is simple:

    require 'json'
    require 'pp'

    pp JSON[<<EOT]
    [
    {"id":2,"name":"def","price":3.714714089426208},
    {"id":3,"name":"ghi","price":27.0179624376119},
    {"id":1,"name":"abc","price":52.51248221170203}
    ]
    EOT

Result:

 [{"id"=>2, "name"=>"def", "price"=>3.714714089426208}, {"id"=>3, "name"=>"ghi", "price"=>27.0179624376119}, {"id"=>1, "name"=>"abc", "price"=>52.51248221170203}] 

That validates the JSON and demonstrates that the original data can be recovered. Each row retrieved from the database should be a minimal, "bite-sized" piece of the overall JSON structure you want to build.

Building on that, here's how to read incoming JSON into a database, manipulate it, and then emit it as a JSON file:

    require 'json'
    require 'sequel'

    DB = Sequel.sqlite # memory database

    DB.create_table :items do
      primary_key :id
      String :json
    end

    items = DB[:items] # Create a dataset

    # Populate the table
    items.insert(:json => JSON[:name => 'abc', :price => rand * 100])
    items.insert(:json => JSON[:name => 'def', :price => rand * 100])
    items.insert(:json => JSON[:name => 'ghi', :price => rand * 100])
    items.insert(:json => JSON[:name => 'jkl', :price => rand * 100])
    items.insert(:json => JSON[:name => 'mno', :price => rand * 100])
    items.insert(:json => JSON[:name => 'pqr', :price => rand * 100])
    items.insert(:json => JSON[:name => 'stu', :price => rand * 100])
    items.insert(:json => JSON[:name => 'vwx', :price => rand * 100])
    items.insert(:json => JSON[:name => 'yz_', :price => rand * 100])

    add_comma = false

    puts '['
    items.each do |item|
      puts ',' if add_comma
      add_comma ||= true
      print JSON[
        JSON[
          item[:json]
        ].merge('foo' => 'bar', 'time' => Time.now.to_f)
      ]
    end
    puts "\n]"

Which generates:

 [ {"name":"abc","price":3.268814929005337,"foo":"bar","time":1379688093.124606}, {"name":"def","price":13.871147312377719,"foo":"bar","time":1379688093.124664}, {"name":"ghi","price":52.720984131655676,"foo":"bar","time":1379688093.124702}, {"name":"jkl","price":53.21477190840114,"foo":"bar","time":1379688093.124732}, {"name":"mno","price":40.99364022416619,"foo":"bar","time":1379688093.124758}, {"name":"pqr","price":5.918738444452265,"foo":"bar","time":1379688093.124803}, {"name":"stu","price":45.09391752439902,"foo":"bar","time":1379688093.124831}, {"name":"vwx","price":63.08947792357426,"foo":"bar","time":1379688093.124862}, {"name":"yz_","price":94.04921035056373,"foo":"bar","time":1379688093.124894} ] 

I added the timestamp so you can see that each row is processed individually, and to give you an idea of how fast the rows are being processed. Granted, this is a tiny in-memory database with no network I/O to contend with, but a normal network connection through a switch to a database on a reasonable DB host should be pretty fast too. Telling the ORM to read the database in chunks can speed up processing, because the DBM can return larger blocks and fill packets more efficiently. You'll have to experiment to determine what chunk size you need, because it will vary with your network, your hosts, and the size of your records.
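With Sequel, for instance, Dataset#paged_each is one way to do that; treat the page size below as a starting point to benchmark, not a recommendation:

    # Sketch: page through the table instead of materializing every row at once.
    # :rows_per_fetch is only a guess; tune it for your network and record size.
    items.order(:id).paged_each(:rows_per_fetch => 500) do |item|
      # process one row at a time, exactly as in the loops above
    end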

Your original design isn't good when dealing with enterprise-sized databases, especially when your hardware resources are limited. Over the years we've learned how to parse BIG databases, which make 20,000-row tables look minuscule. VM slices are common these days and we use them for crunching, so they're often the PCs of yesteryear: a single CPU with a small memory footprint and puny drives. We can't beat them up or they'll become bottlenecks, so we have to break the data into the smallest atomic pieces we can.

One gripe about the DB design: storing JSON in a database is a questionable practice. DBMs these days can emit JSON, YAML, and XML representations of rows, but forcing the DBM to search inside stored JSON, YAML, or XML strings is a major hit to processing speed, so avoid it at all costs unless you also have the equivalent lookup data indexed in separate fields so your searches run at the highest possible speed. If the data is available in separate fields, then doing good old database queries, tweaking the result in the DBM or in your scripting language of choice, and emitting the massaged data becomes a lot easier.
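As a rough sketch of that layout (table and column names are only examples), keep the searchable attributes in their own indexed columns and carry the JSON along as an opaque payload:

    DB.create_table :datasets do
      primary_key :id
      String :name        # searchable attribute
      Float  :price       # searchable attribute
      String :json        # full JSON payload, never searched directly
      index :name
      index :price
    end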

+5

It should be possible with JSON::Stream or Yajl::FFI. You will have to write your own callbacks, though. Some hints on how to do that can be found here and here.
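For instance, JSON::Stream's callback style looks roughly like this (adapted from the gem's README; treat it as a sketch, since reassembling complete datasets from the events is the part you still have to write yourself):

    require 'json/stream'

    # Register low-level parse events, then feed the parser chunks of raw JSON.
    parser = JSON::Stream::Parser.new do
      start_object { puts 'start object' }
      end_object   { puts 'end object' }
      key          { |k| puts "key: #{k}" }
      value        { |v| puts "value: #{v}" }
    end

    File.open('raw.json') do |file|          # hypothetical input file
      while chunk = file.read(8192)
        parser << chunk
      end
    end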

Facing a similar problem, I created the json-streamer gem, which spares you the need to write your own callbacks. It yields each object to you one by one, removing it from memory afterwards. You can then pass them on to another IO object as intended.

Let me know how it works out if you give it a try.

0

Source: https://habr.com/ru/post/954195/

