This type of integration task is usually solved with an EAV (entity-attribute-value) model in relational systems (for example, the one Ashrafaul demonstrates). The key consideration for an EAV model is supporting an unbounded number of columns, and an EAV data model can certainly be emulated in a CQL system such as Cassandra or ScyllaDB. The EAV model is friendly to writes but harder on reads. You didn't describe your read requirements in much detail: do you need all the columns back, or specific columns for each user?
Files
Having said that, there are still some considerations inherent in Cassandra and ScyllaDB that may point you toward a unified EAV model for the kind of data you describe in your question. Both Cassandra and ScyllaDB lay keyspaces and tables out as files on disk. The number of files is essentially the product of the number of keyspaces and the number of tables, so the more keyspaces, tables, or combinations of the two you have, the more files you will have on disk. This can become a problem with file descriptors and other file-juggling issues. Because of the long tail of access you mentioned, it is possible that every file stays open all the time, which is not desirable, especially when starting from a cold boot.
[Edit for clarity] All things being equal, a single keyspace/table will always produce fewer files than many keyspaces/tables. This has nothing to do with the amount of data stored or the compaction strategy.
Wide rows
But back to the data model. Ashraful's model has one partition key (userid) and one clustering key (column1). Given the number of "records" in each user file (500K-2M), and assuming each record is a row of around 54 columns, you are basically storing 500k-2m * ~60 avg columns under a single partition key, creating very large partitions. Cassandra and Scylla generally do not like very large partitions. Can they handle large partitions? Yes. Do large partitions affect performance in practice? Also yes.
Updates or Versions
You mentioned updates. A basic EAV model will only hold the latest update - there is no versioning. What you could do is add time as an additional clustering key so that the historical values of your columns are preserved over time.
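As a rough sketch of that idea (table and column names here are just placeholders, not part of your schema), the basic EAV layout with an extra time-based clustering key could look like this, with the newest value sorting first:

CREATE TABLE data_history (
    userid     bigint,
    column1    text,       -- attribute / column name
    updated_at timestamp,  -- when this value was written (the "version")
    value      text,
    PRIMARY KEY ( userid, column1, updated_at )
) WITH CLUSTERING ORDER BY (column1 ASC, updated_at DESC);

Reading the latest value for one attribute is then a LIMIT 1 query on that clustering order; older values stay queryable below it.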
Reads
If you want all the columns back, you could just serialize everything into a JSON object and store it in a single column, but I suspect that is not what you want. In a primary-key (partition) based key/value system such as Cassandra and Scylla, you need to know all the components of the key in order to get your data back. If you put column1, the unique row identifier, into your primary key, you will need to know it in advance, and likewise any other column names that end up in the primary key.
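For completeness, a sketch of the serialize-everything variant (the table and column names are illustrative only) would be a single blob column per user:

CREATE TABLE user_blobs (
    userid bigint,
    doc    text,   -- the whole record serialized as a JSON string
    PRIMARY KEY ( userid )
);

This makes "give me everything for userid X" a single-row read, at the cost of rewriting the whole document on every update and losing the ability to select individual columns server-side.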
Partitions and composite partition keys
The number of partitions dictates the parallelism of your cluster. The total number of partitions, i.e. the partition cardinality across your overall corpus, affects how well your cluster hardware is utilized. More partitions = better parallelism and higher resource utilization.
What I would do here is change the PRIMARY KEY to include column1 as part of the partition key. Then I would use column as the clustering key (which not only dictates uniqueness within the partition but also the sort order - so keep that in mind when naming your columns).
In the following table definition, you would need to supply both userid and column1 as equalities in your WHERE clause.
CREATE TABLE data (
    userid  bigint,
    column1 text,
    column  text,
    value   text,
    PRIMARY KEY ( (userid, column1), column )
);
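For example (the literal values below are made up), a read against that table has to pin down the full partition key with equalities, and can then optionally narrow on the clustering column:

-- Fetch every attribute for one row of one user's file.
SELECT column, value
FROM data
WHERE userid = 1234 AND column1 = 'row-00042';

-- Fetch a single attribute by also restricting the clustering key.
SELECT value
FROM data
WHERE userid = 1234 AND column1 = 'row-00042' AND column = 'email';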
I would also keep a separate table, perhaps columns_per_user, that records all of the columns for each userid. Something like:
CREATE TABLE columns_per_user (
    userid       bigint,
    max_columns  int,
    column_names text,
    PRIMARY KEY ( userid )
);
Here max_columns is the total number of columns for that user, and column_names holds the actual column names. You could also add a column for the total number of entries per user, something like user_entries int, which would essentially be the number of rows in each user's csv file.
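As a rough sketch of how the two tables could work together (all values below are invented), you would write the column inventory once per user and read it first, so the application knows which column names it can ask the data table for:

-- Record the column inventory for a user.
INSERT INTO columns_per_user (userid, max_columns, column_names)
VALUES (1234, 60, 'name,email,phone');

-- Read it back before querying the data table.
SELECT max_columns, column_names
FROM columns_per_user
WHERE userid = 1234;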