Best way to load MongoDB data into a DataFrame using Pandas and PyMongo?

I have a 0.7 GB MongoDB database containing tweets that I am trying to load into a DataFrame. However, I keep getting an error:

    MemoryError:

My code is as follows:

    cursor = tweets.find()  # where tweets is my collection
    tweet_fields = ['id']
    result = DataFrame(list(cursor), columns=tweet_fields)

I tried the methods from the following answers, all of which at some point build a list of every document in the database before loading it.

However, another answer that discusses list() notes that it is only suitable for small data sets, because everything is loaded into memory at once.

In my case, I think this is the source of the error: it is too much data to load into memory. What other method can I use?

+8
4 answers

I changed my code to the following:

    tweet_fields = ['id']
    cursor = tweets.find(projection=tweet_fields)  # the 'fields' keyword in PyMongo 2.x
    result = DataFrame(list(cursor), columns=tweet_fields)

By passing a projection to find() (the fields parameter in PyMongo 2.x), I limited the output: instead of loading every field of each tweet, only the selected fields are pulled into the DataFrame. Now everything works fine.
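If even the projected documents are too many for a single list(), a further option is to build the DataFrame in fixed-size chunks, so that no intermediate list ever spans the whole collection. This is a minimal sketch with a hypothetical cursor_to_df helper, not part of the original answer:

    from itertools import islice

    import pandas as pd

    def cursor_to_df(cursor, chunk_size=10_000):
        """Consume a PyMongo cursor chunk by chunk and concatenate the pieces."""
        frames = []
        while True:
            batch = list(islice(cursor, chunk_size))  # at most chunk_size docs alive at once
            if not batch:
                break
            frames.append(pd.DataFrame(batch))
        return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

The final DataFrame still has to fit in memory; the saving is that the per-document dictionaries are discarded chunk by chunk instead of all existing at the same time.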

+8

The fastest, and probably the most memory-efficient, way to create a DataFrame from a MongoDB query like yours is to use Monary, which loads query results directly into typed NumPy arrays.

This post has a nice and brief explanation.
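In outline, a Monary query looks like this. This is a sketch based on Monary's query(db, collection, spec, fields, types) interface; the host, database, collection, and field names here are placeholders:

    import pandas as pd
    from monary import Monary

    monary = Monary("127.0.0.1")  # host running mongod
    columns = ["id"]
    # One masked NumPy array per requested field comes back, typed up front;
    # no intermediate Python dictionaries are built along the way.
    arrays = monary.query("twitter_db", "tweets", {}, columns, ["int64"])
    df = pd.DataFrame(dict(zip(columns, arrays)))

Skipping PyMongo's per-document decoding into dictionaries is the point: that decoding is where most of the time and memory in the list(cursor) approach goes.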

+5

An elegant way to do this would be as follows:

    import pandas as pd

    def my_transform_logic(x):
        # placeholder: replace with the real per-value transformation
        return x

    def process(cursor):
        df = pd.DataFrame(list(cursor))
        df['result_col'] = df['col_to_be_processed'].apply(my_transform_logic)
        # write the processed rows back as a list of dictionaries
        db.collection_name.insert_many(df.to_dict('records'))
        # to update instead, build one operation per record, e.g. bulk_write with
        # ReplaceOne({'_id': rec['_id']}, rec, upsert=True) for each rec

    # parallel_scan returns a list of cursors that together cover the collection;
    # see the parallel_scan API of PyMongo (deprecated in 3.7, removed in 4.0)
    cursors = mongo_collection.parallel_scan(6)
    for cursor in cursors:
        process(cursor)

I ran the above process on a MongoDB collection with 2.6 million records, using Joblib to parallelize the per-cursor work. It raised no memory errors, and processing finished in about 2 hours.
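For reference, the Joblib step might look like this. This is a sketch, not the author's exact code; the threading backend is chosen because PyMongo cursors hold live sockets and cannot be pickled for worker processes:

    from joblib import Parallel, delayed

    cursors = mongo_collection.parallel_scan(6)
    # threads, not processes: each cursor is consumed by one thread
    Parallel(n_jobs=6, prefer='threads')(delayed(process)(c) for c in cursors)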

+2

The from_records classmethod is probably the best way to do this:

    import pandas as pd
    import pymongo

    client = pymongo.MongoClient()
    data = client.mydb.mycollection.find()
    # or: client.mydb.mycollection.aggregate(pipeline)
    df = pd.DataFrame.from_records(data)
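from_records consumes the cursor directly, so no explicit list() is needed; the finished DataFrame must still fit in memory, though, so combining this with a projection (as in the first answer) is still worthwhile for wide documents.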
0

Source: https://habr.com/ru/post/972853/

