I am looking for a convenient way to store and query a huge amount of meteorological data (several TB). More details about the data are given in the middle of the question.
I was leaning towards MongoDB (I have used it in many previous projects and feel comfortable with it), but recently I found out about HDF5. Reading about it, I found some similarities with Mongo:
HDF5 simplifies the file structure to include only two major types of object: Datasets, which are multidimensional arrays of a homogeneous type, and Groups, which are container structures that can hold datasets and other groups. This results in a truly hierarchical, filesystem-like data format. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets.
This looks like arrays and embedded documents in Mongo, and it also supports indexes for querying data.
Because it uses B-trees to index table objects, HDF5 works well for time-series data such as stock price series, network monitoring data, and 3D weather data.
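To make the parallel concrete, here is a minimal sketch of how one sensor could map onto HDF5 groups, datasets, and attributes. It uses h5py with an in-memory file; all the names (`sensors`, `sensor_001`, `temperature`, `heights_m`) are hypothetical, not a fixed schema:

```python
import numpy as np
import h5py

# Groups give a filesystem-like hierarchy; metadata lives in attributes;
# the readings themselves are a homogeneous multidimensional dataset.
# driver="core" + backing_store=False keeps the file purely in memory.
with h5py.File("weather.h5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("/sensors/sensor_001")       # nested groups
    grp.attrs["name"] = "sensor_001"                  # metadata as attributes
    grp.attrs["lat"], grp.attrs["lng"] = 55.75, 37.62
    # 2 timestamps x 3 heights of temperature readings
    ds = grp.create_dataset("temperature", data=np.arange(6.0).reshape(2, 3))
    ds.attrs["heights_m"] = [0, 10, 25]
    children = list(f["/sensors"])                    # -> ["sensor_001"]
    first_row = ds[0, :].tolist()                     # -> [0.0, 1.0, 2.0]
```

The group/attribute layout here plays roughly the same role as a Mongo document with embedded fields, while the dataset is the bulk array payload.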
Data:
A specific area is divided into smaller squares, and at each grid intersection there is a sensor point.

This sensor collects the following information every X minutes:
- solar luminosity
- wind direction and speed
- humidity
- etc. (the set of measurements is basically the same for all sensors, though sometimes a sensor does not collect all of them).
It also collects these measurements at different heights (0 m, 10 m, 25 m); the heights are not always the same. In addition, each sensor has some meta-information:
- name
- lat, lng
- whether it is in water, and many others
Given this, I do not expect a single item to be larger than 1 MB. Also, I have enough disk space in one place to store all the data (so, as far as I understand, no sharding is required).
Data operations. There are several ways I will interact with the data:
import a large number of files: several TB of data will be delivered to me at one point in time in netCDF format, and I will need to store it (it's relatively easy to convert to HDF5). Then smaller pieces of data will arrive periodically (about 1 GB per week), and I will have to add them to the store. Just to highlight: I have enough disk space to store all this data on one machine.
query data: there is often a need to query the data in near real time. The most frequent queries: give me the temperature readings from sensors in a specific area over a certain time range; show me the data from a specific sensor over a certain time range; show me the wind for a certain region over a given time range. Aggregation queries (e.g. average temperature over the past two months) are highly unlikely. Here I find Mongo to be a good fit, but HDF5 + PyTables is an alternative.
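On the Mongo side, the two most frequent query shapes could look like the filters below. This is a sketch under assumed field names (`sensor.loc` as a GeoJSON point, `sensor.name`, `ts` as a timestamp), not a definitive schema:

```python
from datetime import datetime, timezone

t0 = datetime(2015, 6, 1, tzinfo=timezone.utc)
t1 = datetime(2015, 6, 2, tzinfo=timezone.utc)

# "sensors in a specific area over a time range": a $geoWithin polygon
# (the bounding box of the region) combined with a range filter on ts.
# Assumes a GeoJSON point field backed by a 2dsphere index.
area_query = {
    "sensor.loc": {"$geoWithin": {"$geometry": {
        "type": "Polygon",
        "coordinates": [[[37.0, 55.0], [38.0, 55.0], [38.0, 56.0],
                         [37.0, 56.0], [37.0, 55.0]]],
    }}},
    "ts": {"$gte": t0, "$lt": t1},
}

# "a specific sensor over a time range": an equality match plus the
# same range filter, served by a compound index on (sensor.name, ts).
sensor_query = {"sensor.name": "sensor_001", "ts": {"$gte": t0, "$lt": t1}}
```

These filter documents would be passed unchanged to e.g. `db.readings.find(area_query)` via pymongo; the collection and index names are hypothetical.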
perform some statistical analysis. I do not yet know exactly what this will involve, but I know it does not need to happen in real time. So I thought using Hadoop with Mongo might be a good idea, but HDF5 with R is a reasonable alternative.
I know that "which is better" questions are discouraged, but I'm looking for advice from experienced users. If you have any questions, I'd be happy to answer them, and I appreciate your help.
P.S. I reviewed some interesting discussions similar to mine: hdf-forum, search in hdf5, storage of meteorological data.