Reading big data from MongoDB

I have a Java application that needs to read a large amount of data from MongoDB 3.2 and transfer it to Hadoop.

This batch application runs every 4 hours, i.e. six times a day.

Data:

  • Documents: 80,000 at a time (every 4 hours)
  • Size: 3 GB

I am currently using MongoTemplate and Morphia to access MongoDB. However, I get an OOM exception when processing this data using the following:

List<MYClass> datalist = datasource.getCollection("mycollection").find().asList(); 

What is the best way to read this data and populate Hadoop?

  • MongoTemplate.stream() and write to Hadoop one document at a time (a rough sketch of this option follows the list)?
  • batchSize(someLimit) and write each whole batch to Hadoop?
  • Cursor.batch() and write to HDFS one by one?
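
For reference, here is roughly what I have in mind for the stream-based option. This is only a sketch; MYClass is my mapped document type, and I am assuming Spring Data MongoDB's MongoTemplate.stream(Query, Class), which returns a CloseableIterator:

    import org.springframework.data.mongodb.core.MongoTemplate;
    import org.springframework.data.mongodb.core.query.Query;
    import org.springframework.data.util.CloseableIterator;

    // Stream documents one by one instead of materializing the whole result set.
    try (CloseableIterator<MYClass> cursor = mongoTemplate.stream(new Query(), MYClass.class)) {
        while (cursor.hasNext()) {
            MYClass doc = cursor.next();
            // write doc to Hadoop here
        }
    }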
3 answers

Your problem is asList()

This forces the driver to iterate over the entire cursor (80,000 documents, several gigabytes), keeping everything in memory.

batchSize(someLimit) and Cursor.batch() will not help here, because you still traverse the entire cursor regardless of the batch size.

Instead, you can:

1) Loop over the cursor returned by datasource.getCollection("mycollection").find() instead of calling asList()

2) Read documents one at a time and add them to a buffer (say, a list)

3) Every 1000 documents (say), call the Hadoop API, clear the buffer, and start again (see the sketch below).
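
A minimal sketch of steps 1-3 using the MongoDB 3.x Java driver directly. The connection details, the 1000-document buffer size, and the writeToHadoop helper are placeholders, not part of the original answer:

    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoCursor;
    import org.bson.Document;

    import java.util.ArrayList;
    import java.util.List;

    MongoCollection<Document> collection =
            new MongoClient("localhost").getDatabase("mydb").getCollection("mycollection");

    List<Document> buffer = new ArrayList<>();

    // The cursor only keeps one batch of documents in memory at a time.
    try (MongoCursor<Document> cursor = collection.find().batchSize(1000).iterator()) {
        while (cursor.hasNext()) {
            buffer.add(cursor.next());
            if (buffer.size() == 1000) {
                writeToHadoop(buffer);   // hypothetical call into your Hadoop client
                buffer.clear();
            }
        }
    }
    if (!buffer.isEmpty()) {
        writeToHadoop(buffer);           // flush the remainder
    }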


A call to asList() will attempt to load the entire MongoDB collection into memory, building an in-memory list object larger than 3 GB.

Iterating over the collection with a cursor will fix this problem. You can do this with the Datasource class, but I prefer the type-safe abstractions that Morphia offers with its DAO classes:

    class Dao extends BasicDAO<Order, String> {
        Dao(Datastore ds) {
            super(Order.class, ds);
        }
    }

    Datastore ds = morphia.createDatastore(mongoClient, DB_NAME);
    Dao dao = new Dao(ds);

    // fetch() returns a cursor-backed iterator, so documents are streamed
    // rather than loaded into memory all at once
    Iterator<Order> iterator = dao.find().fetch();
    while (iterator.hasNext()) {
        Order order = iterator.next();
        hadoopStrategy.add(order);
    }
Check out Spring Batch (https://spring.io/projects/spring-batch); it is efficient and simple to implement. You can set the size of the data chunks to be processed at a time, in your case 80,000.
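
A rough sketch of the reading side of such a chunk-oriented job, assuming Spring Batch's MongoItemReader; the query, sort key, page size, and the MYClass type are placeholders:

    import java.util.Collections;

    import org.springframework.batch.item.data.MongoItemReader;
    import org.springframework.data.domain.Sort;
    import org.springframework.data.mongodb.core.MongoTemplate;

    public MongoItemReader<MYClass> reader(MongoTemplate mongoTemplate) {
        MongoItemReader<MYClass> reader = new MongoItemReader<>();
        reader.setTemplate(mongoTemplate);
        reader.setCollection("mycollection");                                // source collection
        reader.setTargetType(MYClass.class);                                 // mapped document type
        reader.setQuery("{}");                                               // read all documents
        reader.setSort(Collections.singletonMap("_id", Sort.Direction.ASC)); // paging needs a stable sort
        reader.setPageSize(1000);                                            // documents fetched per page
        return reader;
    }

The step's ItemWriter would then push each chunk to Hadoop, and the chunk/commit interval controls how many documents are held in memory at once.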
