Handling large data pools in Python

I am working on a scientific project aimed at studying the behavior of people.

The project will be divided into three parts:

  • A program that reads data from some remote sources and builds a local data pool from it.
  • A program that checks this data pool and ensures its integrity.
  • A web interface that lets people read and manipulate the data.

The data consists of a list of people, each with an ID number and several characteristics: height, weight, age, and so on.

I need to be able to easily group this data (for example, everyone within a given age or height range). The full data set is several TB, but it can be reduced to smaller subsets of 2-3 GB.

I have a strong background in the theoretical material behind the project, but I am not a computer scientist. I know Java, C, and MATLAB, and I am now learning Python.

I would like to use Python, since it seems quite simple and far less verbose than Java. The problem is that I am not sure how to handle the data pool.

I'm not a database expert, but I probably need one here. Which tools do you think I should use?

Keep in mind that the goal is to implement very advanced mathematical functions on the data sets, so we want to keep the source code simple. Speed is not a concern.

3 answers

It sounds like the main functionality you need can be found in PyTables together with SciPy/NumPy.
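
As a minimal sketch of that approach (the file name, field names, and schema below are illustrative assumptions, not part of your project): PyTables stores the records in an HDF5 file on disk, and `read_where` pulls only a matching subset into a NumPy array, so the multi-TB pool never has to fit in memory.

```python
import numpy as np
import tables

# Hypothetical table schema: one row per person.
class Person(tables.IsDescription):
    person_id = tables.Int64Col()
    age       = tables.Int32Col()
    height    = tables.Float64Col()   # cm
    weight    = tables.Float64Col()   # kg

# Create an HDF5 file holding the data pool and fill it with toy data.
with tables.open_file("people.h5", mode="w") as h5:
    table = h5.create_table("/", "people", Person, title="Study participants")
    row = table.row
    for pid in range(1000):
        row["person_id"] = pid
        row["age"] = np.random.randint(18, 90)
        row["height"] = np.random.normal(170, 10)
        row["weight"] = np.random.normal(75, 15)
        row.append()
    table.flush()

# Later: read only the subset you need (here, ages 30-40) as a NumPy record array.
with tables.open_file("people.h5", mode="r") as h5:
    subset = h5.root.people.read_where("(age >= 30) & (age <= 40)")
    print(len(subset), subset["height"].mean())
```

The returned subset is a plain NumPy array, so it plugs directly into SciPy routines for the mathematical analysis.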


Go with a NoSQL database such as MongoDB; in this case it will be easier to work with the data than to learn SQL.
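
A minimal sketch with the `pymongo` driver, assuming a local MongoDB instance and illustrative database/collection/field names:

```python
from pymongo import MongoClient

# Connect to a local MongoDB server; "study" and "people" are placeholder names.
client = MongoClient("localhost", 27017)
people = client["study"]["people"]

# Each person is one document; no schema has to be declared up front.
people.insert_one({"person_id": 1, "age": 34, "height": 172.5, "weight": 70.2})

# Select a subset, e.g. everyone between 30 and 40 years old.
for doc in people.find({"age": {"$gte": 30, "$lte": 40}}):
    print(doc["person_id"], doc["height"])
```

Queries are written as Python dictionaries, which is the part that saves you from learning SQL.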


Since you are not an expert, I recommend using MySQL to store your data. It is easy to learn, you can query your data with SQL, and you can write your data from Python (see the MySQL Python-Mysql guide).
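
A rough sketch using the MySQLdb (MySQL-python) driver mentioned above; the connection parameters, table name, and columns are placeholders you would adapt to your own server:

```python
import MySQLdb

# Placeholder connection settings for a local MySQL server.
db = MySQLdb.connect(host="localhost", user="researcher",
                     passwd="secret", db="study")
cur = db.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS people (
                   person_id BIGINT PRIMARY KEY,
                   age       INT,
                   height    DOUBLE,
                   weight    DOUBLE
               )""")

# Write a record from Python...
cur.execute("INSERT INTO people (person_id, age, height, weight) "
            "VALUES (%s, %s, %s, %s)",
            (1, 34, 172.5, 70.2))
db.commit()

# ...and pull out a subset with plain SQL.
cur.execute("SELECT person_id, height FROM people WHERE age BETWEEN 30 AND 40")
for person_id, height in cur.fetchall():
    print(person_id, height)

db.close()
```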


Source: https://habr.com/ru/post/1346430/

