Best way to load MongoDB data into a DataFrame using Pandas and PyMongo?

I have a 0.7 GB MongoDB database containing tweets that I am trying to load into a DataFrame. However, I keep getting an error:

    MemoryError:

My code is as follows:

    cursor = tweets.find()  # where tweets is my collection
    tweet_fields = ['id']
    result = DataFrame(list(cursor), columns=tweet_fields)

I tried the methods from the following answers, all of which at some point build a list of every document in the database before loading it.

However, another answer that discusses list() notes that it is only suitable for small data sets, because everything is loaded into memory at once.

In my case, I think this is the source of the error: it is too much data to load into memory. What other method can I use?

+8
4 answers

I changed my code to the following:

    tweet_fields = ['id']
    cursor = tweets.find(projection=tweet_fields)  # the 'fields' keyword in PyMongo 2.x
    result = DataFrame(list(cursor), columns=tweet_fields)

By passing a projection to find() (the fields parameter in PyMongo 2.x), I limited the output: instead of loading every field of each tweet, only the selected fields are pulled into the DataFrame. Now everything works fine.
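If even the projected documents are too many for a single list(), a further option is to build the DataFrame in fixed-size chunks, so that no intermediate list ever spans the whole collection. This is a minimal sketch with a hypothetical cursor_to_df helper, not part of the original answer:

    from itertools import islice

    import pandas as pd

    def cursor_to_df(cursor, chunk_size=10_000):
        """Consume a PyMongo cursor chunk by chunk and concatenate the pieces."""
        frames = []
        while True:
            batch = list(islice(cursor, chunk_size))  # at most chunk_size docs alive at once
            if not batch:
                break
            frames.append(pd.DataFrame(batch))
        return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

The final DataFrame still has to fit in memory; the saving is that the per-document dictionaries are discarded chunk by chunk instead of all existing at the same time.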

+8

The fastest, and probably the most memory-efficient, way to create a DataFrame from a MongoDB query like yours is to use Monary, which loads query results directly into typed NumPy arrays.

This post has a nice and brief explanation.
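In outline, a Monary query looks like this. This is a sketch based on Monary's query(db, collection, spec, fields, types) interface; the host, database, collection, and field names here are placeholders:

    import pandas as pd
    from monary import Monary

    monary = Monary("127.0.0.1")  # host running mongod
    columns = ["id"]
    # One masked NumPy array per requested field comes back, typed up front;
    # no intermediate Python dictionaries are built along the way.
    arrays = monary.query("twitter_db", "tweets", {}, columns, ["int64"])
    df = pd.DataFrame(dict(zip(columns, arrays)))

Skipping PyMongo's per-document decoding into dictionaries is the point: that decoding is where most of the time and memory in the list(cursor) approach goes.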

+5

An elegant way to do this would be as follows:

    import pandas as pd

    def my_transform_logic(x):
        # placeholder: replace with the real per-value transformation
        return x

    def process(cursor):
        df = pd.DataFrame(list(cursor))
        df['result_col'] = df['col_to_be_processed'].apply(my_transform_logic)
        # write the processed rows back as a list of dictionaries
        db.collection_name.insert_many(df.to_dict('records'))
        # to update instead, build one operation per record, e.g. bulk_write with
        # ReplaceOne({'_id': rec['_id']}, rec, upsert=True) for each rec

    # parallel_scan returns a list of cursors that together cover the collection;
    # see the parallel_scan API of PyMongo (deprecated in 3.7, removed in 4.0)
    cursors = mongo_collection.parallel_scan(6)
    for cursor in cursors:
        process(cursor)

I ran the above process on a MongoDB collection with 2.6 million records, using Joblib to parallelize the per-cursor work. It raised no memory errors, and processing finished in about 2 hours.
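For reference, the Joblib step might look like this. This is a sketch, not the author's exact code; the threading backend is chosen because PyMongo cursors hold live sockets and cannot be pickled for worker processes:

    from joblib import Parallel, delayed

    cursors = mongo_collection.parallel_scan(6)
    # threads, not processes: each cursor is consumed by one thread
    Parallel(n_jobs=6, prefer='threads')(delayed(process)(c) for c in cursors)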

+2

The from_records classmethod is probably the best way to do this:

    import pandas as pd
    import pymongo

    client = pymongo.MongoClient()
    data = client.mydb.mycollection.find()
    # or: client.mydb.mycollection.aggregate(pipeline)
    df = pd.DataFrame.from_records(data)
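from_records consumes the cursor directly, so no explicit list() is needed; the finished DataFrame must still fit in memory, though, so combining this with a projection (as in the first answer) is still worthwhile for wide documents.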
0

Source: https://habr.com/ru/post/972853/

