What is the best way to store and query a large meteorological dataset?

I am looking for a convenient way to store and query a huge amount of meteorological data (several TB). More details about the data are given in the middle of the question.

My first thought was MongoDB (I have used it in many of my previous projects and feel comfortable with it), but recently I found out about HDF5. Reading about it, I found some similarities with Mongo:

HDF5 simplifies the file structure to include only two major types of object: Datasets, which are multidimensional arrays of a homogeneous type, and Groups, which are container structures that can hold datasets and other groups. This results in a truly hierarchical, filesystem-like data format. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets.

That looks a lot like arrays and embedded documents in Mongo, and HDF5 also supports indexes for querying the data.

Because it uses B-trees to index table objects, HDF5 works well for time-series data such as stock price series, network monitoring data, and 3D weather data.
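To make that object model concrete, here is a minimal h5py sketch (my own illustration, not from the quoted sources; the file, group, and attribute names are made up): groups act like folders, datasets are homogeneous multidimensional arrays, and metadata lives in named attributes.

```python
import h5py
import numpy as np

# Minimal illustration of the HDF5 object model: groups, datasets, attributes.
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("station_001")        # a group is a container, like a folder
    grp.attrs["name"] = "north-field"          # metadata stored as named attributes
    grp.attrs["lat"] = 55.75
    grp.attrs["lng"] = 37.62
    # a dataset is a multidimensional array of a homogeneous type
    grp.create_dataset("temperature", data=np.random.rand(24, 3))  # 24 hours x 3 heights
```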

Data:

A specific area is divided into smaller squares, and at each grid intersection there is a sensor (a point).


Each sensor collects the following information every X minutes:

  • solar luminosity
  • wind direction and speed
  • humidity
  • etc. (the set of measurements is basically the same for every sensor, although sometimes a sensor does not collect all of it).

Each sensor also collects these measurements at several heights (0 m, 10 m, 25 m), though the heights are not always the same. In addition, each sensor has some meta-information:

  • name
  • lat, lng
  • whether it is located in water, and many others

Given all this, I do not expect a single item to be larger than 1 MB. In addition, I have enough storage on a single machine to hold all the data (so, as far as I understand, no sharding is required).
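For illustration only, one possible layout for this data using PyTables (column names, types, and the compression settings are my assumptions, not a recommendation): a single row-oriented table of readings, chunked and compressed, indexed on the timestamp, with the per-sensor metadata kept separately (for example as attributes or a second small table).

```python
import tables as tb

# One row per (sensor, timestamp, height); names and types are illustrative.
class Reading(tb.IsDescription):
    sensor_id   = tb.UInt32Col()
    timestamp   = tb.Float64Col()   # e.g. Unix time in seconds
    height_m    = tb.Float32Col()
    temperature = tb.Float32Col()
    humidity    = tb.Float32Col()
    wind_speed  = tb.Float32Col()

with tb.open_file("weather.h5", mode="w") as h5:
    table = h5.create_table(
        "/", "readings", Reading, "sensor measurements",
        filters=tb.Filters(complevel=5, complib="blosc"),  # chunked + compressed storage
    )
    table.cols.timestamp.create_csindex()   # index for fast time-range queries
```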

Data Operations. There are several ways to interact with data:

  • ingest a large number of files: several TB of data will be handed to me at one point in time in NetCDF format, and I will need to store them (it is relatively easy to convert NetCDF to HDF5). After that, smaller pieces of data will arrive periodically (about 1 GB per week), and I have to append them to the store (see the sketch after this list). Just to highlight: I have enough storage to keep all this data on one machine.

  • query the data. There is often a need to query the data in real time. The most frequent queries are: give me the temperature readings of the sensors in a specific area for a certain time, show me the data of a specific sensor for a certain time, show me the wind for a certain region over a given time range. Aggregate queries (e.g. the average temperature over the past two months) are highly unlikely. Here I feel Mongo is a good fit, but HDF5 + PyTables is an alternative.

  • perform some statistical analysis. At the moment I do not know exactly what it will involve, but I do know it does not have to run in real time. So I thought that using Hadoop with Mongo might be a good idea, but HDF5 with R is a reasonable alternative.
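A rough sketch of the ingest path under the assumptions above (the NetCDF variable names "time" and "temperature" are guesses about the incoming files, and the table layout is the illustrative one from earlier): read a NetCDF file and append its records to the existing HDF5 table; the weekly ~1 GB deliveries would go through the same function.

```python
import tables as tb
from netCDF4 import Dataset

def append_netcdf(nc_path, h5_path="weather.h5"):
    """Append the contents of one NetCDF file to the PyTables 'readings' table."""
    nc = Dataset(nc_path)                       # opened read-only by default
    times = nc.variables["time"][:]             # assumed variable names
    temps = nc.variables["temperature"][:]
    with tb.open_file(h5_path, mode="a") as h5:
        table = h5.root.readings
        row = table.row
        for t, temp in zip(times, temps):
            row["timestamp"] = t
            row["temperature"] = temp
            # sensor_id, height_m, humidity, wind_speed would be filled the same way
            row.append()
        table.flush()                           # make the appended rows durable
    nc.close()
```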

I know that "which approach is better" questions are not encouraged, but I am looking for advice from experienced users. If you have any questions, I will be happy to answer them, and I appreciate your help.

P.S. I have looked through some interesting discussions similar to mine: hdf-forum, search in hdf5, storage of meteorological data.

2 answers

This is a tricky question and I am not sure if I can give a definitive answer, but I have experience with both HDF5/PyTables and some NoSQL databases.
Here are a few thoughts.

  • HDF5 itself has no concept of an index. It is just a hierarchical storage format that is well suited to multidimensional numeric data. There are layers on top of HDF5 that implement indexing (e.g. PyTables, HDF5 FastQuery).
  • HDF5 (unless you use the MPI version) does not support concurrent write access (concurrent reads are possible).
  • HDF5 supports compression filters which can - contrary to popular belief - actually make data access faster (however, you have to think about the right chunk size, which depends on the way you access the data).
  • HDF5 is not a database. MongoDB has ACID properties, HDF5 does not (this might be important to you).
  • There is a package (SciHadoop) that combines Hadoop and HDF5.
  • HDF5 lends itself relatively easily to out-of-core calculations (i.e. when the data is too large to fit in memory).
  • PyTables supports some fast "in-kernel" calculations directly on HDF5 data using numexpr.
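As an illustration of the last two points, a hedged sketch of an indexed, in-kernel query with PyTables (the file, table, and column names are the illustrative ones from the question, and the time range and sensor id are made up): the condition string is evaluated by numexpr without loading the whole table into memory.

```python
import tables as tb

with tb.open_file("weather.h5", mode="r") as h5:
    table = h5.root.readings
    t0, t1, sensor = 1672531200.0, 1675209600.0, 42   # made-up time range and sensor id
    # read_where evaluates the condition in-kernel (numexpr) and can use the
    # timestamp index created earlier, returning only the matching rows.
    rows = table.read_where(
        "(timestamp >= t0) & (timestamp < t1) & (sensor_id == sensor)"
    )
    print(rows["temperature"].mean())
```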

I think your data is generally a good fit for storing in HDF5. You can also perform the statistical analysis either in R or via NumPy/SciPy.
But you could also think about a hybrid approach: store the raw data in HDF5 and use MongoDB for the metadata or for caching specific values that are used often.
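One possible reading of that hybrid approach, as a sketch (it assumes a local MongoDB instance and the h5py file from the earlier sketch; the collection and field names are made up): MongoDB holds the per-sensor metadata plus a pointer to where the raw series lives in HDF5.

```python
import h5py
from pymongo import MongoClient

client = MongoClient()                 # assumes MongoDB running locally
sensors = client.weather.sensors       # made-up database/collection names

# Metadata and a pointer into the HDF5 file live in MongoDB...
sensors.insert_one({
    "name": "north-field",
    "lat": 55.75, "lng": 37.62,
    "in_water": False,
    "hdf5_path": "/station_001/temperature",
})

# ...while the bulky numeric arrays stay in HDF5.
doc = sensors.find_one({"name": "north-field"})
with h5py.File("example.h5", "r") as f:
    series = f[doc["hdf5_path"]][:]    # fetch the raw series for this sensor
```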


You can try SciDB if loading NetCDF/HDF5 into that array database is not a problem for you. Note that if your dataset is extremely large, the data loading phase will be very time-consuming; I am afraid this is a problem for all databases. In any case, SciDB also provides an R package, which should support the analysis you need.

Alternatively, if you want to run queries without converting HDF5 into something else, you can look at the approach described here: http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf. Moreover, if you want to execute selection queries efficiently, you should use an index; if you want to execute aggregation queries in real time (within seconds), you can consider approximate, sample-based aggregation. Our group has developed some products to support those features.

Regarding statistical analysis, I think the answer depends on the complexity of your analysis. If you only need to compute something like entropy or a correlation coefficient, we have products that can do it in real time. If the analysis is very complex and time-consuming, you could consider SciHadoop or SciMATE, which can process scientific data within a MapReduce framework. However, I am not sure whether SciHadoop can currently support HDF5 directly.

