Handling large data pools in Python

I am working on a scientific project aimed at studying the behavior of people.

The project will be divided into three parts:

  • A program that reads data from some remote sources and builds a local data pool from it.
  • A program that checks this data pool and ensures its integrity.
  • A web interface that lets people read and manipulate the data.

The data consists of a list of people, each with an ID number and several characteristics: height, weight, age, and so on.

I need to be able to easily group this data (for example, everyone within a given age or height range). The full data set is several TB, but it can be reduced to smaller subsets of 2-3 GB.

I have a strong background in the theoretical material behind the project, but I am not a computer scientist. I know Java, C, and MATLAB, and I am now learning Python.

I would like to use Python, since it seems quite simple and far less verbose than Java. The problem is that I am not sure how to handle the data pool.

I'm not a database expert, but I probably need one here. Which tools do you think I should use?

Keep in mind that the goal is to implement very advanced mathematical functions on the data sets, so we want to keep the source code simple. Speed is not a concern.

3 answers

It sounds like the main functionality you need can be found in PyTables together with SciPy/NumPy.
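
As a minimal sketch of that approach (the file name, field names, and schema below are illustrative assumptions, not part of your project): PyTables stores the records in an HDF5 file on disk, and `read_where` pulls only a matching subset into a NumPy array, so the multi-TB pool never has to fit in memory.

```python
import numpy as np
import tables

# Hypothetical table schema: one row per person.
class Person(tables.IsDescription):
    person_id = tables.Int64Col()
    age       = tables.Int32Col()
    height    = tables.Float64Col()   # cm
    weight    = tables.Float64Col()   # kg

# Create an HDF5 file holding the data pool and fill it with toy data.
with tables.open_file("people.h5", mode="w") as h5:
    table = h5.create_table("/", "people", Person, title="Study participants")
    row = table.row
    for pid in range(1000):
        row["person_id"] = pid
        row["age"] = np.random.randint(18, 90)
        row["height"] = np.random.normal(170, 10)
        row["weight"] = np.random.normal(75, 15)
        row.append()
    table.flush()

# Later: read only the subset you need (here, ages 30-40) as a NumPy record array.
with tables.open_file("people.h5", mode="r") as h5:
    subset = h5.root.people.read_where("(age >= 30) & (age <= 40)")
    print(len(subset), subset["height"].mean())
```

The returned subset is a plain NumPy array, so it plugs directly into SciPy routines for the mathematical analysis.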


Go with a NoSQL database such as MongoDB; in this case it will be easier to work with the data than to learn SQL.
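
A minimal sketch with the `pymongo` driver, assuming a local MongoDB instance and illustrative database/collection/field names:

```python
from pymongo import MongoClient

# Connect to a local MongoDB server; "study" and "people" are placeholder names.
client = MongoClient("localhost", 27017)
people = client["study"]["people"]

# Each person is one document; no schema has to be declared up front.
people.insert_one({"person_id": 1, "age": 34, "height": 172.5, "weight": 70.2})

# Select a subset, e.g. everyone between 30 and 40 years old.
for doc in people.find({"age": {"$gte": 30, "$lte": 40}}):
    print(doc["person_id"], doc["height"])
```

Queries are written as Python dictionaries, which is the part that saves you from learning SQL.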


Since you are not an expert, I recommend using MySQL to store your data. It is easy to learn, you can query your data with SQL, and you can write your data from Python (see the MySQL Python-Mysql guide).
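
A rough sketch using the MySQLdb (MySQL-python) driver mentioned above; the connection parameters, table name, and columns are placeholders you would adapt to your own server:

```python
import MySQLdb

# Placeholder connection settings for a local MySQL server.
db = MySQLdb.connect(host="localhost", user="researcher",
                     passwd="secret", db="study")
cur = db.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS people (
                   person_id BIGINT PRIMARY KEY,
                   age       INT,
                   height    DOUBLE,
                   weight    DOUBLE
               )""")

# Write a record from Python...
cur.execute("INSERT INTO people (person_id, age, height, weight) "
            "VALUES (%s, %s, %s, %s)",
            (1, 34, 172.5, 70.2))
db.commit()

# ...and pull out a subset with plain SQL.
cur.execute("SELECT person_id, height FROM people WHERE age BETWEEN 30 AND 40")
for person_id, height in cur.fetchall():
    print(person_id, height)

db.close()
```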


Source: https://habr.com/ru/post/1346430/

