How to copy data from a Cassandra table to another structure for better performance

In several places it is advised to design our Cassandra tables in accordance with the queries we are going to run against them. This DataScale article says the following:

The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you will be searching with. If you plan on searching the data with similar but different criteria, then make it a separate table. There is no disadvantage to the same data being stored in different ways. Duplication of data is your friend in Cassandra.

[...]

If you need to store the same piece of data in 14 different tables, then write it out 14 times. There is no penalty for multiple writes.

I understand this, and now my question is: given that I have an existing table, say

CREATE TABLE invoices (
    id_invoice int PRIMARY KEY,
    year int,
    id_client int,
    type_invoice text
);

but I want to query by year and type instead, so I would like to have something like:

CREATE TABLE invoices_yr (
    id_invoice int,
    year int,
    id_client int,
    type_invoice text,
    PRIMARY KEY (type_invoice, year)
);

With type_invoice as the partition key and year as the clustering key, what is the preferred way to copy the data from one table to the other so that I can run optimized queries later?
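
For context, the kind of query the new table is meant to serve would look something like this (the literal values are only placeholders):

 SELECT * FROM invoices_yr WHERE type_invoice = 'credit' AND year = 2016;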

My version of Cassandra:

user@cqlsh> show version;
 [cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]
3 answers

To echo what was said about the COPY command, it is a great solution for something like this.

However, I disagree with what was said about the Bulk Loader, as it is infinitely harder to use. Specifically, you need to run it on every node, whereas COPY needs to be run on only one node.

To help scale COPY for large datasets, you can use the PAGETIMEOUT and PAGESIZE parameters.

 COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20; 

Using these parameters appropriately, I used COPY to successfully export / import 370 million rows.
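
The export can then be followed by a COPY FROM into the new table. If the import side also needs tuning, cqlsh exposes options such as CHUNKSIZE and INGESTRATE for that direction (the values below are only illustrative):

 COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv' WITH CHUNKSIZE=50 AND INGESTRATE=10000;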

For more information, check out this article: New options and better performance in cqlsh copy.


You can use the cqlsh COPY command. To copy your invoices data to a CSV file, use:

 COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv'; 

And to copy back from the CSV file into the table (in your case invoices_yr), use:

 COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv'; 

If you have a huge amount of data, you can use the SSTable writer to write the data and sstableloader to load it faster. See http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
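
For the loading step, sstableloader is a command-line tool that streams pre-built SSTables into the cluster; a typical invocation (the host and the path to the generated keyspace/table directory are placeholders here) looks like:

 sstableloader -d 127.0.0.1 /path/to/generated_sstables/mykeyspace/invoices_yr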


An alternative to using the COPY command (see the other answers for examples) or Spark to transfer the data is to create a materialized view that does the denormalization for you.

CREATE MATERIALIZED VIEW invoices_yr AS
    SELECT * FROM invoices
    WHERE id_invoice IS NOT NULL AND year IS NOT NULL AND type_invoice IS NOT NULL
    PRIMARY KEY ((type_invoice), year, id_invoice)
    WITH CLUSTERING ORDER BY (year DESC);

(A materialized view's primary key must include every primary key column of the base table, which is why id_invoice appears here, and every column used in the view's primary key has to be restricted with IS NOT NULL.)

Cassandra will then fill in the table for you, so you do not have to migrate the data yourself. With 3.5 it is worth noting that repairs on materialized views do not work well (see CASSANDRA-12888).

Note that materialized views are probably not the best idea to use; they have since been moved to "experimental" status.


Source: https://habr.com/ru/post/1013731/

