Best Practice Modeling Data for Cassandra Databases

I am new to Cassandra and I am looking for best practice on how to model data that has the following general structure:

The data is keyed by a "user" (one per client), each of which supplies a large data file of roughly 500 thousand to 2 million records (periodically updated several times a day - sometimes a full update, and sometimes only a delta)

Each data file has certain required data fields (~20 required), but each user can add additional columns as they see fit (up to ~100).

The additional data fields are NOT necessarily the same across different users (neither the field names nor the types of these fields)

Example (CSV format):

user_id_1.csv

| column1 (unique key per user_id) | column2 | column3 | ... | column10 | additionalColumn1 | ... additionalColumn_n |
|----------------------------------|---------|---------|-------|----------|-------------------|------------------------|
| user_id_1_key_1                  | value   | value   | value | value    | ...               | value                  |
| user_id_1_key_2                  | ....    | ....    | ....  | ....     | ...               | ...                    |
| ....                             | ...     | ...     | ...   | ...      | ...               | ...                    |
| user_id_1_key_2Million           | ....    | ....    | ....  | ....     | ...               | ...                    |

user_id_XXX.csv (notice that the first 10 columns are identical to the other users, but the additional columns are different - both in names and in types)

| column1 (unique key per user_id) | column2 | column3 | ... | column10 | additionalColumn1 (different types than user_id_1 and others) | ... additional_column_x |
|----------------------------------|---------|---------|-------|----------|----------------------------------------------------------------|-------------------------|
| user_id_XXX_key_1                | value   | value   | value | value    | ...                                                            | value                   |
| user_id_XXX_key_2                | ....    | ....    | ....  | ....     | ...                                                            | ...                     |
| ....                             | ...     | ...     | ...   | ...      | ...                                                            | ...                     |
| user_id_XXX_key_500_thousand (fewer rows than the other users) | .... | .... | .... | .... | ... | ... |

A few options that I have considered:

Option 1:

  • Create a global keyspace
  • Create a large "data" table containing everything
  • Prepend a user_id column to all the other columns in one large table (including the optional columns). The primary key becomes user_id + column_1 (column_1 is unique per user_id)

[Diagram: a single keyspace containing one Data_Table with many rows and many columns]

A few things I noticed right away:

  • The user_id value repeats itself as many times as there are entries for that user
  • Rows are very sparse in the additional columns (lots of empty/null values), since users do not share those columns.
  • The number of users is relatively small, so the number of additional columns is not huge (at most ~10 thousand columns)
  • I could combine the extra column data for each user into one column named “metadata” shared by all users (see the sketch after this list)
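A minimal sketch of what Option 1 could look like in CQL, assuming the “metadata” idea from the last bullet is used to fold the per-user optional columns into a single map; the table and column names are illustrative, not prescribed above:

    CREATE TABLE user_data (
        user_id   bigint,
        column_1  text,               -- unique key within a user_id
        column_2  text,
        column_3  text,
        -- ... remaining required columns up to column_10 ...
        metadata  map<text, text>,    -- per-user optional columns folded into one map
        PRIMARY KEY ((user_id), column_1)
    );

Folding the optional columns into a map sidesteps the sparse-column problem, at the cost of losing per-column types (everything in the map is text here).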

Option 2:

Create a keyspace for each user_id

Create a data table in each keyspace

[Diagram: many keyspaces (keyspace_user1, keyspace_user2, ..., keyspace_user_n), each holding one table with the layout | column_1 | column_2 | ... | column_n | additional_column_1 | additional_column_n |]

Notes:

  • Many keyspaces (one keyspace per user)
  • Avoids storing the user_id value on every row (the keyspace name can serve as the user ID)
  • Very few tables per keyspace (in this example, only one table per keyspace; see the sketch after this list)
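A minimal sketch of Option 2, assuming one keyspace per user whose name encodes the user ID; the keyspace name, replication settings and columns are illustrative only:

    -- Keyspace per user; the keyspace name carries the user ID
    CREATE KEYSPACE IF NOT EXISTS user_12345
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    CREATE TABLE user_12345.data (
        column_1  text,   -- unique key within this user
        column_2  text,
        -- ... remaining required columns plus this user's own additional columns ...
        PRIMARY KEY (column_1)
    );

Note that with this layout every new user requires a schema change (a new keyspace and table) rather than just new rows.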

Option 3:

1) Create a global keyspace 2) Create one table per user_id (containing the required columns as well as that user's additional columns)

[Diagram: a single keyspace containing one table per user: user_1, user_2, ..., user_n]

Notes

  • Global keyspace
  • One table per user_id ("many" tables)
  • Avoids duplicating the user ID on every row (see the sketch after this list)
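A minimal sketch of Option 3, assuming a single global keyspace (here called global_data, an illustrative name) with one table per user whose additional columns are baked into that user's table definition:

    CREATE TABLE global_data.user_1 (
        column_1             text,   -- unique key within this user
        column_2             text,
        -- ... remaining required columns ...
        additional_column_1  int,    -- this user's own optional columns, with their own types
        additional_column_2  text,
        PRIMARY KEY (column_1)
    );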

Option 4: (Does that make sense?)

Create several keyspaces (say "x" keyspaces), each of which contains a range of tables (one table per user)

[Diagram: "x" keyspaces (keyspace_1 ... keyspace_x), each containing a range of per-user tables, e.g. keyspace_1 holds user_1 ... user_n/x and keyspace_x holds user_n-x ... user_n]

Notes:

  • Multiple keyspaces
  • Multiple tables, one per user
  • A "lookup" is required to find out which keyspace contains the required table (see the sketch after this list)
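A minimal sketch of the lookup that Option 4 would need; the directory keyspace, table and column names are purely illustrative:

    -- Directory table mapping each user to the keyspace/table that holds its data
    CREATE TABLE directory.user_location (
        user_id        bigint,
        keyspace_name  text,
        table_name     text,
        PRIMARY KEY (user_id)
    );

    -- Resolve where user 42's data lives before querying it
    SELECT keyspace_name, table_name FROM directory.user_location WHERE user_id = 42;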

Option 5:

Split the data across multiple tables and multiple keyspaces

Notes:
1. In some cases, "joining" information from several tables would be required
2. It seems more complicated


General notes for all scenarios:

  • Writes are less frequent than reads
  • Many millions of reads per day
  • Traffic varies by user_id: some user_ids have a lot of traffic, others have much less. This needs to be planned for.
  • Some user_ids are updated (written to) more often than others
  • We have several data centers across geographic regions and the data must stay in sync between them (see the sketch after this list)
  • There is a long tail on primary-key access (some keys are accessed many times, while other keys are rarely accessed)
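Since multi-data-center replication is one of the constraints above, here is a minimal sketch of how a keyspace is typically replicated across data centers with NetworkTopologyStrategy; the keyspace and data-center names are illustrative assumptions:

    -- Replicate the keyspace to two hypothetical data centers, 3 replicas each
    CREATE KEYSPACE IF NOT EXISTS global_data
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'dc_us': 3,
            'dc_eu': 3
        };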
2 answers

This type of integration task is usually solved with an EAV (entity-attribute-value) model in relational systems (for example, the one Ashrafaul demonstrates). The key consideration when weighing an EAV model is an unbounded number of columns. An EAV data model can, of course, be mimicked in a CQL system such as Cassandra or ScyllaDB. The EAV model lends itself well to writes but is harder to read from. You did not describe your read patterns in much detail. Do you need all the columns back, or do you need specific columns back per user?

Files

Having said that, there are still some considerations inherent to Cassandra and ScyllaDB that may point you toward the unified EAV model for some of the designs you describe in your question. Both Cassandra and ScyllaDB persist keyspaces and tables as files on disk. The number of files is basically a product of the number of keyspaces multiplied by the number of tables. So the more keyspaces, tables, or combinations of the two you have, the more files you will have on disk. This can become a problem with file descriptors and other file-juggling issues. Due to the long-tail access pattern you mentioned, it may be that every file is open all the time. That is not desirable, especially when starting from a cold boot.

[Edit for clarity] All things being equal, one keyspace/table will always create fewer files than many keyspaces/tables. This has nothing to do with the amount of data stored or the compaction strategy.

Wide rows

But back to the data model. The Ashraful model has a primary key (userid) and another clustering key (key -> column1). Because of the number of "records" in each user file (500K-2M), and given that each record is a row of a few dozen columns, what you are basically doing is creating 500K-2M x ~60 (avg) cells per partition key, i.e. very large partitions. Cassandra and Scylla do not like very large partitions. Sure, they can handle large partitions, but in practice large partitions do hurt performance.
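To put a rough number on it: 2 million records x ~60 cells each is on the order of 120 million cells in a single userid partition; even at a few dozen bytes per cell that partition runs into the gigabytes, far beyond the commonly cited guidance of keeping partitions to roughly 100 MB or less.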

Updates or Versions

You mentioned updates. The basic EAV model will only hold the latest update; there is no versioning. What you can do is add time as a clustering key to ensure that the historical values of your columns are preserved over time.
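A minimal sketch of that idea, assuming the EAV schema shown further down in this answer plus an extra timestamp clustering column (the table name and the updated_at column are illustrative):

    CREATE TABLE data_versioned (
        userid     bigint,
        key        text,
        column     text,
        updated_at timestamp,
        value      text,
        PRIMARY KEY ((userid), key, column, updated_at)
    ) WITH CLUSTERING ORDER BY (key ASC, column ASC, updated_at DESC);

The newest value sorts first within each column thanks to the descending clustering order, and older versions remain queryable.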

Reads

If you want all the columns returned, you could just serialize everything into a JSON object and put it in one column. But I think this is not what you want. In the primary-key (partition) model of a key/value-based system such as Cassandra and Scylla, you need to know all of the key components in order to get data back. If you put the unique row identifier column1 into your primary key, you will need to know it in advance, as well as the other column names if they are also part of the primary key.
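For reference, a minimal sketch of the serialize-everything alternative mentioned above, with illustrative table and column names:

    CREATE TABLE data_json (
        userid  bigint,
        key     text,       -- the per-user unique key (column1)
        payload text,       -- the whole record serialized as a JSON string
        PRIMARY KEY ((userid), key)
    );

This makes whole-record reads cheap but gives up the ability to query or update individual columns server-side.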

Partitions and composite partition keys

The number of partitions dictates the parallelism of your cluster. The total number of partitions across your data set affects how well your cluster hardware is utilized. More partitions = better parallelism and higher resource utilization.

What I would do here is change the PRIMARY KEY to include column1 in the partition key. Then I would use column as the clustering key (which not only dictates uniqueness inside a partition but also the sort order, so keep that in mind in your column naming conventions).

With the following table definition, you would need to provide both userid and column1 as equality conditions in your WHERE clause.

    CREATE TABLE data (
        userid  bigint,
        column1 text,
        column  text,
        value   text,
        PRIMARY KEY ( (userid, column1), column )
    );
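For example (a hedged usage sketch; the literal values are made up), fetching all the dynamic columns of one record would look like:

    SELECT column, value
    FROM data
    WHERE userid = 1 AND column1 = 'user_id_1_key_1';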

I would also have a separate table, perhaps columns_per_user, that records all the columns for each userid. Something like:

    CREATE TABLE columns_per_user (
        userid       bigint,
        max_columns  int,
        column_names text,
        PRIMARY KEY ( userid )
    );

Here max_columns is the total number of columns for that user, and column_names holds the actual column names. You might also add a column for the total number of entries per user, something like user_entries int, which would essentially be the number of rows in each user's CSV file.
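A small hedged usage sketch, assuming the optional user_entries column mentioned above has been added and using made-up values:

    -- Record which columns user 1 supplies, plus the optional row count
    INSERT INTO columns_per_user (userid, max_columns, column_names, user_entries)
    VALUES (1, 60, 'column1,column2,column3,additionalColumn1', 2000000);

    -- Look up user 1's column list before building per-column queries
    SELECT column_names FROM columns_per_user WHERE userid = 1;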


Try the following schema:

    CREATE TABLE data (
        userid bigint,
        key    text,
        column text,
        value  text,
        PRIMARY KEY (userid, key, column)
    );

Here:

    userid -> userid
    key    -> column1
    column -> column name (from column2 onward)
    value  -> column value

Example inserts for the data below:

    | column1 (unique key per user_id) | column2 | column3 |
    |-----------------------------------|---------|---------|
    | key_1                             | value12 | value13 |
    | key_2                             | value22 | value23 |

Insert statements:

    INSERT INTO data (userid, key, column, value) VALUES (1, 'key_1', 'column2', 'value12');
    INSERT INTO data (userid, key, column, value) VALUES (1, 'key_1', 'column3', 'value13');
    INSERT INTO data (userid, key, column, value) VALUES (1, 'key_2', 'column2', 'value22');
    INSERT INTO data (userid, key, column, value) VALUES (1, 'key_2', 'column3', 'value23');
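And a hedged read sketch against this schema, returning every stored column for a single record:

    -- Returns ('column2', 'value12') and ('column3', 'value13')
    SELECT column, value
    FROM data
    WHERE userid = 1 AND key = 'key_1';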

Source: https://habr.com/ru/post/1272087/

