Statistical analysis of a large dataset to be published on the Internet

Question

I have a data logger, not connected to a computer, that collects data in the field. The data is stored as text files, which I manually combine and organize. The current format is one CSV file per year per logger. Each file is about 4,000,000 lines x 7 loggers x 5 years = a lot of data. Some of the data is organized into bins such as item_type, item_class, and item_dimension_class; other fields are more unique, such as item_weight, item_color, date_collected, etc.

I currently do the statistical analysis with a Python / numpy / matplotlib program that I wrote. It works well, but the problem is that I'm the only one who can use it, since both the program and the data live on my computer.

I would like to publish the data on the web using a Postgres database; however, I need to find or implement a statistical tool that can take a large Postgres table and return statistical results within a reasonable time. I am not familiar with Python for the web; however, I am proficient with PHP on the web side and with Python on the standalone side.

Users should be able to create their own histograms and run their own analyses. For example, one user might search for all blue items shipped between week x and week y, while another might sort all items by weight for each hour throughout the year.

I thought of building and indexing my own statistical tools, or automating the process in some way to cover most queries, but that seemed inefficient.

I look forward to your ideas.

Thanks.

+4

python php statistics postgresql

dassouki Apr 19 '10 at 12:58

1 answer

tk. · Accepted Answer · 2010-04-19T17:22:43+0000

I think you can keep using your current combination (Python / numpy / matplotlib) as long as the number of users is not too large. I do similar work, and my data is a little over 10 GB. The data is stored in several sqlite files, and I use numpy for the analysis, PIL / matplotlib to generate chart files (png, gif), cherrypy as the web server, and mako as the template language.
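A minimal sketch of how such a stack can fit together is shown here. It is not my actual code: the `items` table, the function names, and the `/histogram` URL are assumptions made up for illustration, and only the `item_color` and `item_weight` columns come from the question. The data is read from sqlite, numpy holds the values, matplotlib renders a PNG in memory, and a cherrypy handler returns the bytes. The mako templating layer is left out.

```python
import io
import sqlite3

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend: render to memory, no display needed
import matplotlib.pyplot as plt


def load_weights(conn, color):
    """Fetch item_weight values for one item_color as a numpy array.

    The table name `items` is hypothetical.
    """
    rows = conn.execute(
        "SELECT item_weight FROM items WHERE item_color = ?", (color,)
    ).fetchall()
    return np.array([r[0] for r in rows], dtype=float)


def histogram_png(weights, bins=20):
    """Render a histogram of the weights and return it as PNG bytes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(weights, bins=bins)
    ax.set_xlabel("item_weight")
    ax.set_ylabel("count")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # release the figure so a long-running server does not leak memory
    return buf.getvalue()


def serve(db_path):
    """Start a cherrypy server exposing /histogram?color=blue&bins=20."""
    import cherrypy  # imported here so the helpers above work without it

    class Root:
        @cherrypy.expose
        def histogram(self, color="blue", bins="20"):
            # One connection per request: a sqlite3 connection is tied
            # to the thread that created it, and cherrypy is threaded.
            conn = sqlite3.connect(db_path)
            try:
                png = histogram_png(load_weights(conn, color), int(bins))
            finally:
                conn.close()
            cherrypy.response.headers["Content-Type"] = "image/png"
            return png

    cherrypy.quickstart(Root())
```

Calling `serve("loggers.db")` (a hypothetical file name) would start the server, and a PHP or plain HTML page could then embed the chart with an ordinary image tag pointing at `/histogram?color=blue`.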

If you need a larger client/server database, you can move to PostgreSQL, and you can still make full use of your current programs by putting them behind a Python web framework such as cherrypy.
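One way to keep the analysis code usable across that move is to write it against the standard Python DB-API, so that only the connection changes. The sketch below is an illustration under assumptions, not a tested migration path: the `items` table and the precomputed `hour_collected` column are hypothetical, and sqlite3 stands in for the database so the example runs anywhere. It lets the database do the grouping for a per-hour weight summary, so only a few summary rows reach Python instead of millions of raw lines. Note that sqlite3 uses `?` placeholders, while a PostgreSQL driver such as psycopg2 uses `%s`, so the query string would need that one change.

```python
import sqlite3


def hourly_weight_stats(conn, color):
    """Return (hour, count, mean weight) rows for one color.

    The aggregation runs inside the database. `conn` is any DB-API
    connection whose driver accepts `?` placeholders (sqlite3 here).
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT hour_collected, COUNT(*), AVG(item_weight) "
        "FROM items WHERE item_color = ? "
        "GROUP BY hour_collected ORDER BY hour_collected",
        (color,),
    )
    return cur.fetchall()


# Tiny in-memory demo with made-up rows, standing in for the real tables.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE items ("
    "item_color TEXT, item_weight REAL, hour_collected INTEGER)"
)
demo.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("blue", 2.0, 9), ("blue", 4.0, 9), ("blue", 5.0, 14), ("red", 7.0, 9)],
)
print(hourly_weight_stats(demo, "blue"))  # [(9, 2, 3.0), (14, 1, 5.0)]
```

An index on the filtered and grouped columns is the usual first thing to try if such queries are slow on the full table.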
