We want to create a local data repository for our laboratories so that we can organize, search, access, catalog, and link to our data. I think CKAN can do all of this; however, I'm not sure how it will handle these tasks for the kind of data we have (I could be wrong, which is why I'm asking).
Our laboratory buys a lot of data for internal use. We would like to catalog and organize this data within our group (perhaps with CKAN?) so that people can put data into the catalog, pull data out, and use it. In some cases we would want ACLs on the data, plus a web interface for searching, viewing, organizing, adding, deleting, and updating datasets. Although CKAN seems very suitable for this, the problem is the data itself (described in more detail below).
We want to catalog many terabytes of images (200k+ images), geospatial data in various formats, Twitter streams (TBs of JSON data), database dump files, binary data, machine learning models, etc. I wouldn't think it would be wise to add 100k 64 MB JSON files as resources to a CKAN dataset, or would it? We understand that we will not be able to search inside these JSON files, images, or geodata, which is fine. But we would like to be able to find out whether data is available (for example, by searching for "twitter 2015-02-03"), a kind of metadata search, if you like. With local file storage in CKAN, what happens if a user requests 200 thousand images? Will the system stop responding to other requests?
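To make the "metadata search" idea concrete, here is a minimal sketch of the kind of record we have in mind: one dataset per day whose resources are links to files that stay on our own storage, rather than uploads. The field names follow the dataset dictionary accepted by CKAN's `package_create` action; the helper name `build_dataset_payload`, the `collection_date` extra, and the storage URL are hypothetical and only for illustration. Nothing is sent to a server here.

```python
import json


def build_dataset_payload(name, title, day, file_urls, fmt="JSON"):
    """Build a package_create-style dict whose resources are links, not uploads.

    Hypothetical helper: it only assembles the dictionary and does not
    contact a CKAN instance.
    """
    return {
        "name": name,    # URL-safe dataset identifier
        "title": title,  # human-readable title shown in search results
        "notes": f"Externally stored data collected on {day}",
        "tags": [{"name": "twitter"}, {"name": day}],
        # "collection_date" is an assumed key name, not a built-in CKAN field
        "extras": [{"key": "collection_date", "value": day}],
        "resources": [
            # Each resource points at a file on our own storage;
            # the resource name is taken from the last path segment.
            {"url": url, "name": url.rsplit("/", 1)[-1], "format": fmt}
            for url in file_urls
        ],
    }


payload = build_dataset_payload(
    "twitter-2015-02-03",
    "Twitter 2015-02-03",
    "2015-02-03",
    # Assumed location of the file on internal storage
    ["http://storage.example/twitter/2015-02-03/part-0000.json"],
)
print(json.dumps(payload, indent=2))
```

A dictionary like this would be sent as the JSON body of a POST to `/api/3/action/package_create` with an API token in the `Authorization` header. Whether CKAN stays usable with hundreds of thousands of such resources per dataset is exactly what I am unsure about.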
I have seen how CKAN is used on datahub.io, and the vast majority of the data there consists of small CSV files, 2-3 MB in size, with no more than 20 or 30 separate files per dataset.
So, is CKAN capable of doing what we want? If not, are there alternatives you would suggest?
Edit: more specific questions instead of a discussion:
I have looked around and googled for information on this topic, but I have not found a deployed system holding this much data.
- Is there a limit on the size of the files that I can upload (for example, a 400 GB database file)?
- Is there a limit on the number of files that I can upload as resources to a dataset in CKAN? (For example, if I create a dataset and upload 250,000 64 MB JSON files, will the system still be usable?)
- , , (, ). //, ?
- . - , CKAN API ?