How to do a data migration in Cassandra

We have a general requirement (a data migration) to change data in batch, such as a user ID column (changing user ID 001 to 002, user ID 003 to 004, and so on). The user ID is the primary key of table1, so we can handle that case, but in table2 it is not part of the primary key, so we cannot find the affected rows other than by doing SELECT * over the whole table. In other words, we have no way to select all the data to update with a WHERE clause for every table.

So how can we fulfill this requirement?

So far I have come up with two methods:

(1) SELECT * from the table with a paging (fetch) size set, then update the rows. Is this correct?

(2) Use the COPY command to export the table to a CSV file, change the file, and import it again. Would the performance be too slow?

Is it possible to use these methods in production (with millions of records)? Or is there another standard method for this requirement, such as sstableloader or Pig?

Changing one column across an entire existing table seems like a common requirement, so perhaps a standard solution already exists.

Regardless of which method we choose, there is a further problem: once the migration of the old data finishes, how do we deal with the new data that was written during the migration period? In other words, how do we handle the incremental part of the migration?

Looking forward to your reply.

table1: userid (PK), name, sex

table2: phonenumber (PK), userid
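
To make option (1) concrete, here is a rough sketch using the DataStax Python driver with a paged full scan; the keyspace, contact point, and ID mapping are placeholders. Since userid is the primary key of table1, its rows have to be re-inserted under the new key and the old rows deleted, while table2 only needs a regular column update:

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

id_map = {"001": "002", "003": "004"}   # assumed old -> new user IDs

cluster = Cluster(["127.0.0.1"])        # assumed contact point
session = cluster.connect("ks")         # assumed keyspace

# table2: userid is a regular column, so a paged scan plus UPDATE works.
update_t2 = session.prepare("UPDATE table2 SET userid = ? WHERE phonenumber = ?")
scan_t2 = SimpleStatement("SELECT phonenumber, userid FROM table2", fetch_size=1000)
for row in session.execute(scan_t2):
    if row.userid in id_map:
        session.execute(update_t2, (id_map[row.userid], row.phonenumber))

# table1: userid is the primary key and cannot be updated in place;
# re-insert the row under the new key, then delete the old row.
insert_t1 = session.prepare("INSERT INTO table1 (userid, name, sex) VALUES (?, ?, ?)")
delete_t1 = session.prepare("DELETE FROM table1 WHERE userid = ?")
scan_t1 = SimpleStatement("SELECT userid, name, sex FROM table1", fetch_size=1000)
for row in session.execute(scan_t1):
    if row.userid in id_map:
        session.execute(insert_t1, (id_map[row.userid], row.name, row.sex))
        session.execute(delete_t1, (row.userid,))
```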

+5
3 answers

It smells like an anti-pattern.

Primary keys must be stable

Primary keys (especially the partition key) should not be changed, and certainly not globally across a data set.

When the partition key changes, the rows will receive a new token, and the rows will have to move from their current replica nodes to the new replica nodes.

When any part of the primary key changes, the rows have to be rewritten (a delete plus a re-insert), because Cassandra cannot update primary key columns in place.

Changing a primary key is an expensive operation, and, as you are discovering, updating all the references to it in other tables is expensive too.

If the field you chose as the primary key is not stable, use another, more stable field as the primary key. As a last resort, use a synthetic key (uuid or timeuuid).
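
As a rough illustration of the synthetic-key idea (the keyspace, table, and column names here are just placeholders), the mutable user ID becomes an ordinary column that can be updated freely, while a uuid serves as the stable primary key:

```python
import uuid
from cassandra.cluster import Cluster  # DataStax Python driver (pip install cassandra-driver)

cluster = Cluster(["127.0.0.1"])        # assumed contact point
session = cluster.connect("ks")         # assumed keyspace

# Stable synthetic primary key; userid is just a regular, updatable column.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_id (
        id     uuid PRIMARY KEY,
        userid text,
        name   text,
        sex    text
    )
""")

insert = session.prepare("INSERT INTO users_by_id (id, userid, name, sex) VALUES (?, ?, ?, ?)")
session.execute(insert, (uuid.uuid4(), "001", "Alice", "F"))

# Renaming user 001 to 002 is now a plain column update, not a key change.
# (Finding rows by userid still needs an index or a scan; the point is only
#  that the primary key itself never has to change.)
```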

I strongly recommend that you revisit your data model and adjust it to support your migration needs so that a primary key change is not required.

If you provide more detailed information about migration requirements, we can offer a better way to model it.

+4

Depending on the amount of data, you probably have three options:

1) COPY TO in cqlsh, which uses paging and creates a CSV file. You can then process that CSV with the programming language of your choice, produce a new CSV with the updated identifiers, truncate the table (or create a new table), and COPY FROM the new file back in. This will work for several million records; I probably would not try it for several billion. COPY FROM does not need to know all the keys in advance.
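
As a rough sketch of that flow, assuming the table was exported with something like `COPY ks.table2 (phonenumber, userid) TO 'table2.csv'` in cqlsh and that the old-to-new ID mapping fits in memory (file and keyspace names are placeholders):

```python
import csv

# Assumed mapping of old user IDs to new ones; in practice this might be
# loaded from a file or another table.
id_map = {"001": "002", "003": "004"}

# Rewrite the exported CSV, swapping user IDs where the mapping applies.
with open("table2.csv", newline="") as src, \
     open("table2_new.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for phonenumber, userid in reader:
        writer.writerow([phonenumber, id_map.get(userid, userid)])

# Then, back in cqlsh:
#   TRUNCATE ks.table2;
#   COPY ks.table2 (phonenumber, userid) FROM 'table2_new.csv';
```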

2) Use Spark. Jim Mayer has done a reasonable job of explaining the Spark approach. Spark will scale better than the COPY commands in cqlsh, but requires additional setup.

3) Use CQLSSTableWriter, sstableloader, and streaming. Read the rows with a paging-capable driver (e.g. the DataStax Java driver), use CQLSSTableWriter to transform the data and write new sstables, then drop or truncate the old table and use sstableloader to stream the new sstables into the cluster. This works for terabytes of data and can be parallelized if you plan ahead. Yuki Morishita documents this approach well on the DataStax blog. You do not need to know all the keys in advance: you can SELECT DISTINCT to get each partition key, or use COPY TO to create a CSV file.

+4

I don’t quite understand exactly what you are trying to do, but it sounds like you may want to use the spark-cassandra-connector so that you can use Spark for these transformations.

Using the connector, you can read entire tables into Spark RDDs, join and transform fields in those RDDs, and then save the resulting RDDs back to Cassandra. For what you describe, you would roughly do the following steps (a rough sketch follows the list):

  • Read table1 and table2 into RDD1 and RDD2
  • Possibly join RDD1 and RDD2 on the user ID to create RDD3
  • Transform the userid field (and anything else you want to change)
  • Create new tables in Cassandra with the primary key you want
  • Save the transformed RDDs to the new tables in Cassandra
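
Here is a rough sketch of those steps in PySpark, using the DataFrame API of the spark-cassandra-connector rather than raw RDDs; the keyspace, target table names, connector version, and ID mapping are all placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumed connector coordinates and contact point.
spark = (SparkSession.builder
         .appName("userid-migration")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

def read_table(name):
    return (spark.read.format("org.apache.spark.sql.cassandra")
            .options(keyspace="ks", table=name).load())

table1 = read_table("table1")   # userid (PK), name, sex
table2 = read_table("table2")   # phonenumber (PK), userid

# Old-to-new user ID mapping as a small DataFrame.
id_map = spark.createDataFrame([("001", "002"), ("003", "004")],
                               ["old_id", "new_id"])

def remap(df):
    # Replace userid with the new ID where a mapping exists, otherwise keep it.
    return (df.join(id_map, df.userid == id_map.old_id, "left")
            .withColumn("userid", F.coalesce("new_id", "userid"))
            .drop("old_id", "new_id"))

def write_table(df, name):
    # The target tables (e.g. table1_new, table2_new) must already exist in Cassandra.
    (df.write.format("org.apache.spark.sql.cassandra")
     .options(keyspace="ks", table=name).mode("append").save())

write_table(remap(table1), "table1_new")
write_table(remap(table2), "table2_new")
```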

This approach will scale well to millions of records, since Spark is designed to process data in chunks when there is not enough memory to hold everything at once, and it can work in parallel across all the nodes instead of a single CQL client fetching all the records and doing everything on one client machine.

The hard part will be adding Spark to your Cassandra cluster and learning how to write Spark jobs, but if this is something you will do often, it may be worth the trouble.

+3

Source: https://habr.com/ru/post/1237153/

