Reading big data from MongoDB

I have a Java application that needs to read a large amount of data from MongoDB 3.2 and transfer it to Hadoop.

This batch application runs every 4 hours, i.e. six times a day.

Data:

  • Documents: 80,000 at a time (every 4 hours)
  • Size: 3 GB

I am currently using MongoTemplate and Morphia to access MongoDB. However, I get an OOM exception when processing this data using the following:

List<MYClass> datalist = datasource.getCollection("mycollection").find().asList(); 

What is the best way to read this data and populate Hadoop?

  • MongoTemplate.stream() and write to Hadoop one document at a time (a rough sketch of this option follows the list)?
  • batchSize(someLimit) and write each whole batch to Hadoop?
  • Cursor.batch() and write to HDFS one by one?
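
For reference, here is roughly what I have in mind for the stream-based option. This is only a sketch; MYClass is my mapped document type, and I am assuming Spring Data MongoDB's MongoTemplate.stream(Query, Class), which returns a CloseableIterator:

    import org.springframework.data.mongodb.core.MongoTemplate;
    import org.springframework.data.mongodb.core.query.Query;
    import org.springframework.data.util.CloseableIterator;

    // Stream documents one by one instead of materializing the whole result set.
    try (CloseableIterator<MYClass> cursor = mongoTemplate.stream(new Query(), MYClass.class)) {
        while (cursor.hasNext()) {
            MYClass doc = cursor.next();
            // write doc to Hadoop here
        }
    }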
3 answers

Your problem is asList()

This forces the driver to iterate over the entire cursor (80,000 documents, several gigabytes), keeping everything in memory.

batchSize(someLimit) and Cursor.batch() will not help here, because you still traverse the entire cursor regardless of the batch size.

Instead, you can:

1) Loop over the cursor returned by datasource.getCollection("mycollection").find() instead of calling asList()

2) Read documents one at a time and add them to a buffer (say, a list)

3) Every 1000 documents (say), call the Hadoop API, clear the buffer, and start again (see the sketch below).
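
A minimal sketch of steps 1-3 using the MongoDB 3.x Java driver directly. The connection details, the 1000-document buffer size, and the writeToHadoop helper are placeholders, not part of the original answer:

    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoCursor;
    import org.bson.Document;

    import java.util.ArrayList;
    import java.util.List;

    MongoCollection<Document> collection =
            new MongoClient("localhost").getDatabase("mydb").getCollection("mycollection");

    List<Document> buffer = new ArrayList<>();

    // The cursor only keeps one batch of documents in memory at a time.
    try (MongoCursor<Document> cursor = collection.find().batchSize(1000).iterator()) {
        while (cursor.hasNext()) {
            buffer.add(cursor.next());
            if (buffer.size() == 1000) {
                writeToHadoop(buffer);   // hypothetical call into your Hadoop client
                buffer.clear();
            }
        }
    }
    if (!buffer.isEmpty()) {
        writeToHadoop(buffer);           // flush the remainder
    }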


A call to asList() will attempt to load the entire MongoDB collection into memory, building an in-memory list object larger than 3 GB.

Iterating over the collection with a cursor will fix this problem. You can do this with the Datasource class, but I prefer the type-safe abstractions that Morphia offers with its DAO classes:

    class Dao extends BasicDAO<Order, String> {
        Dao(Datastore ds) {
            super(Order.class, ds);
        }
    }

    Datastore ds = morphia.createDatastore(mongoClient, DB_NAME);
    Dao dao = new Dao(ds);

    // fetch() returns a cursor-backed iterator, so documents are streamed
    // rather than loaded into memory all at once
    Iterator<Order> iterator = dao.find().fetch();
    while (iterator.hasNext()) {
        Order order = iterator.next();
        hadoopStrategy.add(order);
    }
Check out Spring Batch (https://spring.io/projects/spring-batch); it is efficient and simple to implement. You can set the size of the data chunks to be processed at a time, in your case 80,000.
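
A rough sketch of the reading side of such a chunk-oriented job, assuming Spring Batch's MongoItemReader; the query, sort key, page size, and the MYClass type are placeholders:

    import java.util.Collections;

    import org.springframework.batch.item.data.MongoItemReader;
    import org.springframework.data.domain.Sort;
    import org.springframework.data.mongodb.core.MongoTemplate;

    public MongoItemReader<MYClass> reader(MongoTemplate mongoTemplate) {
        MongoItemReader<MYClass> reader = new MongoItemReader<>();
        reader.setTemplate(mongoTemplate);
        reader.setCollection("mycollection");                                // source collection
        reader.setTargetType(MYClass.class);                                 // mapped document type
        reader.setQuery("{}");                                               // read all documents
        reader.setSort(Collections.singletonMap("_id", Sort.Direction.ASC)); // paging needs a stable sort
        reader.setPageSize(1000);                                            // documents fetched per page
        return reader;
    }

The step's ItemWriter would then push each chunk to Hadoop, and the chunk/commit interval controls how many documents are held in memory at once.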
