Django + Postgres + Big Time Series

I am scoping out a project with large, mostly incompressible time series data, and I am wondering whether the right approach with Django + Postgres is to drop into raw SQL.

I have time series data arriving at roughly 2K records per hour, every hour. That comes to about 2 million rows a year to store, and I would like to: 1) be able to slice off chunks of the data for analysis over a connection, and 2) do elementary overview work on the web, served by Django. My current thinking is to use Django for the objects themselves, but drop down to raw SQL to deal with the large time series data. I see this as a hybrid approach, which may be a red flag, but using the full ORM for a long series of data samples seems like overkill. Is there a better way?
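
A minimal sketch of what I mean by the hybrid (the model, table, and column names here are just placeholders, not my actual schema): the ORM handles the objects and their metadata, and a small helper drops into raw SQL through Django's database connection for the time series table itself.

    # models.py -- hypothetical names, for illustration only
    from django.db import models, connection

    class Sensor(models.Model):
        name = models.CharField(max_length=50)

    def load_readings(sensor_id, start, end):
        # Fetch raw (timestamp, value) rows for one sensor, bypassing the ORM.
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT ts, value FROM readings"
                " WHERE sensor_id = %s AND ts BETWEEN %s AND %s ORDER BY ts",
                [sensor_id, start, end],
            )
            return cursor.fetchall()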

+6
4 answers

If I understand you correctly, you are planning to store the time series in PostgreSQL, one time series record per database row. Do not do that.

On one level, the problem is theoretical. Relational databases (and, I think, most databases) are built on the premise of row independence, whereas the records of a time series are physically ordered. Database indexes do provide some ordering on tables, but that ordering exists to speed up searching or to present results alphabetically or in some other order; it does not imply any natural meaning in that order. No matter how you order them, each customer is independent of the other customers, and each purchase is independent of the customer's other purchases, even though you can retrieve them in chronological order to form the customer's purchase history. The interdependence of time series records is much stronger, and that makes relational databases a poor fit.

In practice, this means that the disk space taken up by the table and its indexes will be huge (perhaps 20 times larger than storing the time series in files), and reading from the database will be very slow, something like an order of magnitude slower than reading from files. It also gives you no important benefit in return. You will probably never want to run the query "give me all time series records whose value is greater than X"; and if you ever do need such a query, you will also need analysis so heavy that the relational database was not designed to run it, so you would end up reading the whole time series into some object anyway.

Therefore, each time series should be stored as a file. It can be either a file on the file system or a blob in the database. Although I have implemented the latter, I believe the former is better; in Django, I would write something like this:

    from django.db import models

    class Timeseries(models.Model):
        name = models.CharField(max_length=50)
        time_step = models.ForeignKey(...)
        other_metadata = models.Whatever(...)
        # The actual series data lives in a file referenced by this field.
        data = models.FileField(...)

Using a FileField will keep your database small and make incremental backups of your system easier. It will also make it easy to get slices by seeking within the file, which is probably impossible or difficult with a blob.

Now, what kind of file? I would advise you to take a look at pandas. It is a Python library for mathematical analysis that supports time series, including a way to store time series in files.
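
To make this concrete, here is a rough sketch of how the FileField approach might be wired up with pandas; the helper names and the choice of pickle as the on-disk format are my own illustration, and pandas supports other formats as well.

    import pandas as pd

    # Hypothetical helpers around the Timeseries model above.
    def save_series(ts_obj, series, path):
        # Write the pandas Series to disk, then record the path in the FileField.
        series.to_pickle(path)
        ts_obj.data.name = path
        ts_obj.save()

    def load_series(ts_obj):
        # Read the whole series back; slicing can then be done in pandas.
        return pd.read_pickle(ts_obj.data.path)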

I linked above to my own library, which I do not recommend you use; for one thing, it does not do what you want (it cannot handle granularity finer than a minute, and it has other shortcomings), and for another, it is obsolete: I wrote it before pandas, and I intend to convert it to use pandas at some point. The author of pandas has written a book, Python for Data Analysis, which I found invaluable.

Update: there is also InfluxDB. I have never used it and therefore have no opinion, but it is definitely something to look into if you are wondering how to store time series.

+14

Time series databases seem to be one of those things that keep getting reinvented, and, as suggested above, relational databases are not a good fit.

I have used a combination of Django and InfluxDB, which is built for time series data. It is fantastic to work with, and the Python client libraries can interoperate with pandas data frames. That means you can either use the InfluxDB query language to work on the data in place, or pull all of it out (aggregated if necessary) for analysis in Python. My application handles data streams at a similar volume to what you need.

I bind the InfluxDB series names to the Django app/model/primary key as needed. Time series data goes into the linked InfluxDB series, while slowly changing or relational data goes into Django ORM fields.
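
Roughly, that binding looks like this; the sketch assumes the legacy influxdb 1.x Python client (DataFrameClient), and the measurement and tag names are my own illustration:

    import pandas as pd
    from influxdb import DataFrameClient

    client = DataFrameClient(host="localhost", port=8086, database="metrics")

    def write_readings(instance, df):
        # df must have a DatetimeIndex; tag each point with the Django model
        # label and primary key so the series maps back to its ORM object.
        client.write_points(df, "readings",
                            tags={"model": instance._meta.label, "pk": str(instance.pk)})

    def read_readings(instance):
        # DataFrameClient.query() returns a dict of measurement -> DataFrame.
        result = client.query("SELECT * FROM readings WHERE pk = '%s'" % instance.pk)
        return result.get("readings", pd.DataFrame())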

+5

It sounds like you want to look into TimescaleDB. I haven't used it myself yet, but it is a PostgreSQL extension, so I would assume full Django support, and it handles, as they say,

100Ks of rows and millions of metrics per second, even with 100 billion rows per node.
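
Since it is an ordinary PostgreSQL extension, the setup could presumably live in a normal Django migration; a rough sketch (the app, table, and column names are illustrative, and the timescaledb extension has to be installed on the server):

    # migrations/000X_timescale.py -- illustrative only
    from django.db import migrations

    class Migration(migrations.Migration):
        dependencies = [("myapp", "0001_initial")]

        operations = [
            migrations.RunSQL("CREATE EXTENSION IF NOT EXISTS timescaledb;"),
            # Turn the plain readings table into a hypertable partitioned on time.
            migrations.RunSQL("SELECT create_hypertable('myapp_reading', 'ts');"),
        ]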

+1

You can also look at the PostGIS postgres extension, which includes support for raster data types (basically large grids of numbers) and has many features for working with them.

However, do not use the ORM in this case; you will want to execute SQL directly on the server. The ORM adds a huge amount of overhead for large numerical datasets. Python itself is also not well suited to handling large matrices; for that you want numpy.
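
For instance, a small sketch of pulling a query result straight into a numpy array instead of ORM instances (the table and column names are made up):

    import numpy as np
    from django.db import connection

    def load_values(sensor_id):
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT value FROM readings WHERE sensor_id = %s ORDER BY ts",
                [sensor_id],
            )
            rows = cursor.fetchall()
        # One flat float64 array; far cheaper than a list of model instances.
        return np.fromiter((r[0] for r in rows), dtype=np.float64, count=len(rows))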

0
