SQL: how should I store users' own uploaded data?

I'm working on a project that involves time series analytics, and I need to let users upload a file containing their own time series (e.g. numbers with dates), for example as a CSV file. The data from their files must then be available at any time for use in our system.

How could I do this? The ideas I've been considering:

  • Create a new table every time a user uploads a file (and save that table's name somewhere). If I have many users uploading large amounts of data, I could end up with a lot of tables.
  • Create one big monster table with three or four columns: value date; value; dataset name (and/or dataset owner). Everything gets loaded into this table, and when Bob needs his weather data, I just select (date, value) where owner = Bob and file_name = weatherdata.
  • An in-between solution: one table per user, so all of Bob's datasets live in Bob's table.
  • Something else entirely: just save the CSV file and read it whenever it's needed.
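For concreteness, option 2 might look something like this in PostgreSQL (a sketch only — the table and column names here are illustrative, not from the actual project):

```sql
-- One big table holding every data point, tagged with its owner and dataset.
CREATE TABLE datapoints (
    owner_name text      NOT NULL,  -- could instead be a foreign key to a users table
    file_name  text      NOT NULL,
    value_date timestamp NOT NULL,
    value      numeric   NOT NULL
);

-- Reading Bob's weather data back, in chronological order:
SELECT value_date, value
FROM datapoints
WHERE owner_name = 'Bob'
  AND file_name  = 'weatherdata'
ORDER BY value_date;
```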

I keep reading that having a variable number of tables is bad practice (and I believe it). However, my situation is slightly different from the other questions I've seen on this site (most of those people wanted to create one table per user when they should have created one row per user).

Additional Information:

  • the time series data may contain hundreds of thousands of observations, possibly millions
  • a priori, the stored data should not be changed afterwards. However, I think it would be useful to let users append new data to their time series.
  • a priori, I won't need complex SQL select statements. I just want to read Bob's weather data, probably in chronological order — although you never know what tomorrow may bring.
  • I'm using PostgreSQL 9.1, if that matters.

EDIT: Reading some of the answers, I realize I may not have explained myself very well. I should have said that: I am definitely working in a SQL environment; I already have a Users table; when I write "table", I really mean "relation"; all four of my ideas involve foreign keys somewhere; and a normalized RDBMS is my default paradigm unless something else proves better. (None of this means I'm against non-SQL solutions.)

+4
4 answers

I would go with the "big fat monster table" — this is how relational databases are meant to work, though you should normalize it (one table for users, another for datasets, and another for data points). Having multiple tables with identical layouts is a bad idea from every angle: design, management, security, and even querying. Are you sure you will never want to combine information from two datasets?

If you are really sure each dataset will be completely isolated, you could also skip SQL entirely. HDF (Hierarchical Data Format) was built for exactly this purpose: the efficient storage and retrieval of scientific datasets, which are often highly serial data. "Tables" in HDF are literally called datasets; they can share definitions, they can be multidimensional (for example, one dimension for date and one for time), and they are much cheaper than SQL tables.

I don't usually try to push people away from SQL, but unusual situations sometimes call for unusual solutions. If you are facing billions of rows in a SQL table (or more), and you have virtually no other data to store, SQL may not be the best fit for you.

+3

Example T-SQL* for a possible design:

CREATE TABLE dbo.Datasets (
    ID int NOT NULL IDENTITY(1,1),
    OwnerUserID int NOT NULL,
    Loaded datetime NOT NULL,
    CONSTRAINT PK_Datasets PRIMARY KEY ( ID ),  -- primary key so the FK below has a target
    CONSTRAINT FK_Datasets_Users FOREIGN KEY ( OwnerUserID )
        REFERENCES dbo.Users ( ID )
);

CREATE TABLE dbo.DatasetValues (
    DatasetID int NOT NULL,
    Date datetime NOT NULL,
    Value int NOT NULL,
    CONSTRAINT FK_DatasetValues_Datasets FOREIGN KEY ( DatasetID )
        REFERENCES dbo.Datasets ( ID )
);

The design models the two "entities" implied in your question: the time series data points and the datasets they were uploaded in.

* this is for SQL Server; I know you said PostgreSQL 9.1, but I'm sure you can translate it easily.
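For reference, a possible PostgreSQL 9.1 translation of the same design (a sketch only, assuming a users table with an integer id primary key; the lowercase names are my own):

```sql
CREATE TABLE datasets (
    id            serial    PRIMARY KEY,
    owner_user_id integer   NOT NULL REFERENCES users (id),
    loaded        timestamp NOT NULL DEFAULT now()
);

CREATE TABLE dataset_values (
    dataset_id integer   NOT NULL REFERENCES datasets (id),
    date       timestamp NOT NULL,
    value      integer   NOT NULL
);
```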

+2

Your ideas are all pretty good ways to solve the problem (I hope I've read it correctly).

How about a relational layout? For example, one table with the username, upload time, and a unique data id, then link that data id to another table containing the data id as a foreign key plus the raw file data. This keeps the user table small (and you could combine it with another table containing, say, other user data). Having a separate table for users, then another for passwords, another for emails, and then another five for data is probably bad practice, but personally I see nothing wrong with separating the files from the user data.
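A rough sketch of that layout (names are illustrative, not from the question; bytea is PostgreSQL's raw-bytes type):

```sql
-- One row per upload: who, when, and a unique data id.
CREATE TABLE uploads (
    data_id   serial    PRIMARY KEY,
    user_name text      NOT NULL,
    loaded_at timestamp NOT NULL DEFAULT now()
);

-- The raw file itself, kept in a separate table keyed by data id.
CREATE TABLE upload_files (
    data_id  integer NOT NULL REFERENCES uploads (data_id),
    raw_file bytea   NOT NULL  -- the CSV file, stored as-is
);
```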

What language are you using for the data processing? That could also be a deciding factor.

Hope this helps :)

Tom

+2

Well, I think option 2 is best; creating extra tables is just a maintenance nightmare and opens the door to so many errors. Option 4 is somewhat attractive, but I still think the database should be able to handle a task like this.

I think I would structure my tables like this:

User - UserID, name, etc.

Row - one row per data point in your uploaded data (RowID, date, value, etc.)

RowInDataSet - RowID, DataSetID

DataSet - DataSetID, UploadDate, UploadedBy, etc.

This lets you partition your data a little and keeps it easy to maintain. Storing large amounts of data shouldn't be a problem as long as you index these tables properly.
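A sketch of that four-table structure in PostgreSQL (illustrative names and types — the answer above doesn't specify them; "Rows" is renamed DataRows because ROWS is a reserved word; the index assumes you mostly read whole datasets at once):

```sql
CREATE TABLE Users (
    UserID serial PRIMARY KEY,
    Name   text   NOT NULL
);

CREATE TABLE DataSets (
    DataSetID  serial    PRIMARY KEY,
    UploadDate timestamp NOT NULL,
    UploadedBy integer   NOT NULL REFERENCES Users (UserID)
);

CREATE TABLE DataRows (
    RowID     serial    PRIMARY KEY,
    ValueDate timestamp NOT NULL,
    Value     numeric   NOT NULL
);

CREATE TABLE RowInDataSet (
    RowID     integer NOT NULL REFERENCES DataRows (RowID),
    DataSetID integer NOT NULL REFERENCES DataSets (DataSetID)
);

-- Index so that fetching all rows of one dataset stays fast:
CREATE INDEX RowInDataSet_ByDataSet ON RowInDataSet (DataSetID);
```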

+2

Source: https://habr.com/ru/post/1379138/

