48,000 CSV files, 1000 rows each. How should we rework the data storage?

This question was put on hold for being too general, so I am revising it to be more specific.

One of the people I help decided to scale a large simulation exercise up to massive proportions. A typical exercise for us has 100 conditions with 1,000 runs each, and the results fit "easily" into a single file or data frame. We do this kind of thing in SAS, R, or Mplus; this project is in R. I should have seen trouble coming when I heard the project was failing for lack of memory. We sometimes see that with Bayesian models, where keeping all the results from every chain in memory becomes too demanding. The fix in those cases was to save batches of iterations in separate files. Skipping over the details, I suggested writing smaller files to disk as the simulation ran, roughly along the lines of the sketch below.
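A minimal sketch of that "write batches to disk" idea. The function `run_condition()`, the condition count, and the file names are hypothetical placeholders, not the project's actual code.

```r
# Hypothetical batch-writing loop: one CSV per simulation condition,
# so nothing ever has to sit in memory all at once.
n_conditions <- 100                      # placeholder for the real design size
out_dir <- "sim_output"
dir.create(out_dir, showWarnings = FALSE)

for (i in seq_len(n_conditions)) {
  res <- run_condition(i)                # hypothetical: returns one data frame of results
  write.csv(res,
            file = file.path(out_dir, sprintf("condition_%05d.csv", i)),
            row.names = FALSE)
  rm(res); gc()                          # release memory before the next batch
}
```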

Later, I realized the scale of my mistake. They produced 48,000 output CSV files, each with 1,000 rows and about 80 columns of real numbers. They write CSV because the researchers are comfortable with data they can actually look at. Again, I was not paying attention when they asked me how to analyze it. Still thinking in small-data terms, I told them to concatenate the CSV files with a shell script. The result is a 40+ GB CSV file, and R has no hope of opening that on the computers we have here.
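For a sense of scale, a quick back-of-envelope calculation using the figures above shows why loading it all into R is hopeless:

```r
# Rough size of the combined data held as doubles in memory
# (48,000 files x 1,000 rows x ~80 numeric columns, 8 bytes each).
rows  <- 48000 * 1000
cols  <- 80
bytes <- rows * cols * 8
bytes / 1024^3   # roughly 29 GiB before any R object overhead
```

The CSV text itself is larger still, which is consistent with the 40+ GB file on disk.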

I believe / hope the analysis will never need all 40 GB in a single regression model :) It is far more likely they will want to summarize smaller subsets. The usual exercise of this kind has 3-5 columns of simulation parameters and then about 10 columns of results. This project's output is much more massive because there are 10 parameter columns, and fully crossing all their combinations is what made the project balloon, roughly as in the illustration below.
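Purely as an illustration of that combinatorial growth: with 10 parameters, even modest numbers of levels per parameter multiply out to tens of thousands of conditions. The level counts here are made up, not the project's real design.

```r
# Hypothetical levels per parameter; the full crossing is their product.
levels_per_param <- c(4, 4, 3, 3, 2, 2, 2, 5, 5, 2)
prod(levels_per_param)                       # 57,600 conditions

# expand.grid() materializes the full design; its row count is that product.
design <- expand.grid(lapply(levels_per_param, seq_len))
nrow(design)
```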

I believe the best plan is to move the data into some kind of database, and I would like advice on which approach to take:

  • MySQL? It no longer feels very open, so I'm not too enthusiastic.

  • PostgreSQL? Seems increasingly popular, but I have never administered a server before.

  • SQLite3? Some administrators give us analysis data in this format, but we have never received anything larger than about 1.5 GB.

  • HDF5 (or maybe netCDF?) Back in the day (around 2005), it seemed these specialized scientific container formats would serve well. But I have not heard them mentioned since I started helping social-science students. When we were getting started with R we did use HDF5, and one of my friends wrote R code to interact with HDF5.

My top priority is fast data retrieval. I figure that if one of our technicians can learn how to pull out a rectangular slice of the data, we can show the researchers how to do the same, for example along the lines of the sketch below.
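A minimal sketch of the SQLite route, assuming the DBI and RSQLite packages. The directory, database file, table name, and parameter column names ("param1", "param2") are placeholders, not the project's real names.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "simulation.sqlite")

# One-time load: append each CSV to a single table, so the full 40 GB
# never has to be in memory at once.
files <- list.files("sim_output", pattern = "\\.csv$", full.names = TRUE)
for (f in files) {
  dbWriteTable(con, "results", read.csv(f), append = TRUE)
}

# Researchers then pull just the rectangular piece they need.
piece <- dbGetQuery(con,
  "SELECT * FROM results WHERE param1 = 2 AND param2 = 0.5")

dbDisconnect(con)
```

Indexing the parameter columns (e.g. with `CREATE INDEX` on the ones used in WHERE clauses) would be the natural next step if those lookups need to be fast.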
