A Django application that keeps a large Pandas DataFrame in memory, shared across all requests?

I developed a Shiny app. When it starts, it loads some data ONCE, about 4 GB of it. People connecting to the application can then use the interface and play with this data.

This application is good, but has some limitations. That is why I am looking for a different solution.

My idea is to make Pandas and Django work together. That way I could develop an interface and a RESTful API at the same time. But I need every request coming into Django to be able to use the Pandas DataFrames that were loaded once. Imagine if 4 GB of data were loaded for every request... that would be horrible.

I looked everywhere but could not find a way to do this. I found this question, but it has no answers: https://stackoverflow.com/questions/28661255/pandas-sharing-same-dataframe-across-the-request

Why do I need the data in RAM? Because I need the performance to get results quickly. I cannot ask MariaDB to calculate and store this data, for example, because it involves calculations that only R, or a specialized package in Python or another language, can perform.

+5

2 answers

I have a similar use case where I want to load (instantiate) a particular object only once and use it in all requests, since it takes a few seconds to load and I could not afford that lag on every request.

I use the AppConfig.ready() hook, available in Django >= 1.7, to load it only once.

Here is the code:

    # apps.py
    from django.apps import AppConfig
    from sexmachine.detector import Detector

    class APIConfig(AppConfig):
        name = 'api'

        def ready(self):
            # Singleton utility.
            # We load it here to avoid multiple instantiations across
            # other modules, which would take too much time.
            print("Loading gender detector...")
            global gender_detector
            gender_detector = Detector()
            print("ok!")

Then when you want to use it:

    from api.apps import gender_detector

    gender_detector.get_gender('John')

Load your data table in the ready() method, then use it from anywhere. I believe the table will be loaded once per WSGI worker, so be careful.
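Adapted to the question's Pandas scenario, the same pattern might look like the sketch below. The app name api, the module-level shared_df holder, the data.parquet file, and the 'price' column are all illustrative assumptions, not part of the original answer:

    # apps.py -- illustrative sketch adapting the answer's pattern
    import pandas as pd
    from django.apps import AppConfig

    # Module-level holder, populated once per process at startup.
    shared_df = None

    class APIConfig(AppConfig):
        name = 'api'  # hypothetical app name

        def ready(self):
            global shared_df
            if shared_df is None:
                # Runs once per WSGI worker, not once per request.
                shared_df = pd.read_parquet('data.parquet')  # assumed file

    # views.py -- go through the module attribute so we always see
    # the rebound name, even if this module was imported early.
    from django.http import JsonResponse
    from api import apps

    def summary(request):
        # 'price' is a made-up column for the sketch.
        n = int((apps.shared_df['price'] > 100).sum())
        return JsonResponse({'rows_over_100': n})

One caveat worth repeating: with N WSGI workers you hold N copies of the 4 GB frame. Running a single worker with several threads, or moving the frame into a separate service, are the usual ways around that; both are trade-offs beyond what the answer above covers.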

+3

I may be misunderstanding the problem, but to me a 4 GB table kept readily accessible to users should not be too big a problem. Is there anything wrong with just loading the data once, as you described? 4 GB is not that much RAM these days.

Personally, I would recommend simply using the database system instead of loading everything into memory and crunching it with Python. If you structure the data correctly, you can run many thousands of queries in seconds. Pandas is actually written to mimic SQL, so most of the code you use can probably be translated directly into SQL (see the sketch below). Recently I had a situation at work where I set up a large merge operation over a couple of hundred files (~4 GB in total, 600,000 rows per file) using pandas. The total running time was around 72 hours, and it was an operation that needed to run once an hour. A coworker rewrote the same python/pandas code as a fairly simple SQL query that finished in 5 minutes instead of 72 hours.
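To make that last point concrete, here is a small illustration of how a typical pandas aggregation maps almost one-to-one onto SQL. The data, column names, and the "products" table are invented for the sketch:

    import pandas as pd

    # Toy data standing in for a real table.
    df = pd.DataFrame({'category': ['a', 'a', 'b'],
                       'price': [10.0, 20.0, 30.0]})

    # pandas: average price per category.
    print(df.groupby('category')['price'].mean())

    # Roughly the same query in SQL, over a hypothetical "products" table:
    # SELECT category, AVG(price) FROM products GROUP BY category;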

In any case, I would recommend storing your Pandas data frame in an actual database table. Django is built on top of a database (usually MySQL or Postgres), and Pandas can write a data frame straight into a database table via DataFrame.to_sql(). From there, you can write your Django code so that each response makes a single database query, selects the values it needs, and returns the data.
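A minimal sketch of that workflow, assuming a SQLAlchemy engine; the connection string, the "results" table, and the columns are all placeholders. Note that to_sql() takes the table name as its first argument and the connection as its second:

    import pandas as pd
    from sqlalchemy import create_engine, text

    # Placeholder connection string; point it at your MariaDB/Postgres.
    engine = create_engine('mysql+pymysql://user:pass@localhost/mydb')

    # Toy stand-in for the precomputed 4 GB frame.
    df = pd.DataFrame({'id': [1, 2], 'price': [90.0, 150.0]})

    # Write the frame once; first argument is the table name.
    df.to_sql('results', engine, if_exists='replace', index=False)

    # Later, e.g. inside a view, fetch only what the request needs.
    subset = pd.read_sql(
        text('SELECT * FROM results WHERE price > :p'),
        engine,
        params={'p': 100},
    )

The split this gives you is the one the answer argues for: the heavy computation happens once at write time, and each request does only a narrow SELECT.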

0
