How to properly distribute data collection, processing and visualization in Python?

I am working on a project where I want to perform data acquisition, data processing and GUI visualization (using PyQt with pyqtgraph) in Python. Each of the parts is basically implemented, but the different parts are not well separated, which makes benchmarking and performance improvements difficult. So the question is:

Is there a good way to pass large amounts of data between the different parts of the software?

I am thinking of something like the following scenario:

  • Acquisition: receives data from the device(s) and stores it in a data container that the other parts can access. (This part should be able to run without the processing and visualization parts. It is also time-critical, because I do not want to lose data points!)
  • Processing: receives data from the data container, processes it, and saves the results in another data container. (This part should also work without a GUI, and possibly with a delay after the acquisition, for example to process data that I recorded last week.)
  • GUI / visualization: takes the acquired and processed data from the containers and visualizes it.
  • Saving: I want to be able to store / dump certain pieces of the data to disk.

When I say “large amounts of data,” I mean that I get arrays with about 2 million data points (16-bit) per second, roughly 4 MB/s, that need to be processed and possibly also saved.

Is there any existing infrastructure for Python that I can use to handle this amount of data properly, perhaps in the form of a data server that I can connect to?

1 answer

How much data?

In other words, are you acquiring so much data that you cannot keep all of it in memory for as long as you need it?

For example, some measurements generate so much data that the only way to deal with them is after the fact:

  • Capture the data to storage (usually RAID0; see the sketch after this list)
  • Post-process the data
  • Analyze the results
  • Select and archive subsets
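
As a rough sketch of the capture-then-post-process workflow, assuming NumPy and a made-up file name ("capture.raw"); the random blocks stand in for the real device read-out. At roughly 2 million 16-bit points per second (about 4 MB/s), sequential writes to a flat binary file are well within what an ordinary disk handles:

    import numpy as np

    # Stream raw 16-bit samples straight to disk; the loop is a placeholder
    # for the real device read-out.
    with open("capture.raw", "wb") as f:
        for _ in range(10):
            block = np.random.randint(0, 2**16, size=200_000, dtype=np.uint16)
            block.tofile(f)                    # append this block to the file

    # Post-process later (e.g. next week) without loading everything into RAM:
    data = np.memmap("capture.raw", dtype=np.uint16, mode="r")
    print(data.size, data[:5])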

Small data

If your computer system is able to keep up with the rate of data generation, you can use a separate Python queue between each stage.
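
For example, here is a minimal sketch of that idea in a single process, assuming the pipeline can keep up; the fake sample blocks and the averaging step are placeholders for your real device read-out and processing:

    import queue
    import threading

    raw_q = queue.Queue()       # acquisition -> processing (lossless)
    result_q = queue.Queue()    # processing  -> GUI / storage (lossless)

    def acquisition():
        """Stand-in for the device loop: pushes a few fake sample blocks."""
        for i in range(5):
            raw_q.put(list(range(i, i + 4)))   # pretend this came from hardware
        raw_q.put(None)                        # sentinel: acquisition finished

    def processing():
        """Drains raw blocks, 'processes' them, and forwards the results."""
        while (block := raw_q.get()) is not None:
            result_q.put(sum(block) / len(block))   # placeholder computation
        result_q.put(None)                          # propagate the shutdown

    threading.Thread(target=acquisition).start()
    threading.Thread(target=processing).start()

    # The GUI (or a file writer) drains result_q at its own pace:
    while (result := result_q.get()) is not None:
        print("processed:", result)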

Big data

If your measurements create more data than your system can consume, you should start by defining several levels (maybe just two) of how important your data is:

  • lossless - if a point is missing, then you might as well start over
  • lossy - if a point or a set of data is missing, no big deal, just wait for the next update

One analogy might be a video stream ...

  • lossless - archival gold masters
  • lossy - YouTube, Netflix or Hulu may drop a few frames, but your experience does not suffer much.

From your description, Acquisition and Processing should be lossless, while GUI / visualization can be lossy.

For lossless data, you should use queues. For lossy data, you can use deques.
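
A small illustration of the difference (the maxlen of 3 is arbitrary; in practice it would be however many recent points the display needs):

    from collections import deque
    from queue import Queue

    # Lossless: a Queue never discards items; the producer blocks (or the
    # queue grows) until the consumer has caught up.
    lossless = Queue()
    for point in range(10):
        lossless.put(point)
    print(lossless.qsize())    # 10 - every point is still there

    # Lossy: a bounded deque silently drops the oldest items, which is fine
    # for a display that only needs the most recent data.
    lossy = deque(maxlen=3)
    for point in range(10):
        lossy.append(point)
    print(list(lossy))         # [7, 8, 9] - only the latest points remain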

Design

Regardless of your data container, here are three ways to connect your stages:

  • Producer-Consumer: P-C mimics a FIFO - one actor generates data and another consumes it. You can build a chain of producers / consumers to accomplish your goal.
  • Observer: while P-C is typically one-to-one, the observer pattern can also be one-to-many. If you need multiple actors to react whenever one source changes, the observer pattern gives you that capability.
  • Mediator: mediators are usually many-to-many. If every actor can cause the others to react, then they can all coordinate through a mediator.

It sounds like you just need a 1:1 relationship between each stage, so a producer-consumer design looks like the right fit for your application.
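
Here is a sketch of such a producer-consumer chain across processes (so heavy processing does not compete with acquisition for the GIL), with placeholder stage bodies; in a real application the final loop would live in your pyqtgraph update code:

    import multiprocessing as mp

    def acquisition(out_q):
        """Producer: stands in for the device read-out."""
        for i in range(3):
            out_q.put([i] * 4)        # placeholder block of samples
        out_q.put(None)               # sentinel: no more data

    def processing(in_q, out_q):
        """Consumer of raw data, producer of results."""
        while (block := in_q.get()) is not None:
            out_q.put(max(block))     # placeholder computation
        out_q.put(None)               # pass the shutdown downstream

    if __name__ == "__main__":
        raw_q, result_q = mp.Queue(), mp.Queue()
        mp.Process(target=acquisition, args=(raw_q,)).start()
        mp.Process(target=processing, args=(raw_q, result_q)).start()
        while (result := result_q.get()) is not None:
            print("for the GUI:", result)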


Source: https://habr.com/ru/post/980591/

