Spark Cassandra Connector - Request Range for Partition Key

I am evaluating the spark-cassandra-connector and I am struggling to get a range query on the partition key to work.

According to the connector's documentation, it seems that you can do server-side filtering on the partition key using equality or the IN operator, but unfortunately my partition key is a timestamp, so I cannot use it.
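For context, this is the kind of predicate that can be pushed down (a minimal sketch against my schema; the exact timestamp literal format and binding syntax may differ between connector versions):

 import com.datastax.spark.connector._

 // Equality on the partition key is pushed down to Cassandra
 val oneDay = sc.cassandraTable("datastore", "data")
   .where("timestamp = ?", "2013-01-01 00:00:00+0000")

 // An IN list also works, but only for an enumerable set of keys
 val someDays = sc.cassandraTable("datastore", "data")
   .where("timestamp IN ('2013-01-01 00:00:00+0000', '2013-01-02 00:00:00+0000')")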

So I tried using Spark SQL with the following query ("timestamp" is the partition key):

select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z' 
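Here is roughly how I submit that query — a sketch, assuming the CassandraSQLContext that ships with the connector's Spark SQL integration in 1.1:

 import org.apache.spark.sql.cassandra.CassandraSQLContext

 val cc = new CassandraSQLContext(sc)
 val results = cc.sql(
   "SELECT * FROM datastore.data " +
   "WHERE timestamp >= '2013-01-01T00:00:00.000Z' " +
   "AND timestamp < '2013-12-31T00:00:00.000Z'")
 println(results.count()) // spawns the job described below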

Although the job spawns 200 tasks, the query does not return any data.

I can also confirm that there is data to be returned: running the equivalent query in cqlsh (performing the corresponding conversion with the token function) does return data.

I am using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2, and the connector version is the b1.1 branch. The Cassandra driver is the DataStax master branch. The Cassandra cluster is colocated with the Spark cluster, on 3 servers with a replication factor of 1.

Here is the complete job log

Does anyone have any idea?

Update: When I try to filter on the server side based on the partition key (using the CassandraRDD.where method), I get the following exception:

 Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead. 

But, unfortunately, I do not know what that "filter" refers to...
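For reference, a minimal sketch of the call that triggers the exception (the date literal is illustrative):

 import com.datastax.spark.connector._

 // Range predicates on the partition key are rejected by .where
 val rdd = sc.cassandraTable("datastore", "data")
   .where("timestamp >= '2013-01-01 00:00:00+0000'")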

2 answers

There are several options for the kind of query you are looking for.

The most powerful one is to use the Lucene indexes integrated into Stratio's Cassandra distribution, which let you search by any indexed field on the server side. Your write time will increase, but on the other hand you will be able to query any time range. You can find more information about Lucene indexes in Cassandra here. This enhanced version of Cassandra is fully integrated into the Stratio Deep project, so you can take full advantage of the Lucene indexes through it. I would recommend Lucene indexes when you are running a bounded query that retrieves a small result set; if you are going to extract a large portion of your data set, you should use the third option below.

Another approach, depending on how your application works, might be to truncate your timestamp field so that you can look it up with an IN operator. The problem is that, as far as I know, you cannot do this with the spark-cassandra-connector; you would have to use the DataStax Cassandra driver directly, which is not integrated with Spark, or you can look at the Stratio Deep project, where a new feature allowing this will be released soon. Your query would look something like this:

 select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31') 

but, as I said, I don't know if it suits your needs, since you may not be able to truncate your data and group it by date/time.
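If day-level buckets do fit your model, a rough sketch of issuing that query directly through the DataStax Java driver from Scala might look like this (the contact point is a placeholder, and it assumes your partition key is truncated to day granularity):

 import java.text.SimpleDateFormat
 import java.util.{Calendar, TimeZone}
 import com.datastax.driver.core.Cluster

 // Build one 'yyyy-MM-dd' literal per day of 2013
 val utc = TimeZone.getTimeZone("UTC")
 val fmt = new SimpleDateFormat("yyyy-MM-dd")
 fmt.setTimeZone(utc)
 val cal = Calendar.getInstance(utc)
 cal.set(2013, Calendar.JANUARY, 1)
 val days = (1 to 365).map { _ =>
   val day = fmt.format(cal.getTime)
   cal.add(Calendar.DAY_OF_MONTH, 1)
   s"'$day'"
 }.mkString(", ")

 // Query Cassandra directly, outside of Spark
 val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
 val session = cluster.connect()
 val rows = session.execute(s"SELECT * FROM datastore.data WHERE timestamp IN ($days)")
 cluster.close()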

The last, and least efficient, option is to bring the complete data set to your Spark cluster and apply the filter on the RDD.

Disclaimer: I work at Stratio :-) Feel free to contact us if you need help.

Hope this helps!


I think the CassandraRDD error is saying that the query you are trying to run is not allowed in Cassandra, and that you need to load the whole table into a CassandraRDD and then perform a Spark filter operation on that RDD.

So your code (in Scala) should look something like this:

 import java.text.SimpleDateFormat
 import java.util.TimeZone
 import com.datastax.spark.connector._

 val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
 fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
 val start = fmt.parse("2013-01-01T00:00:00.000Z")
 val end = fmt.parse("2013-12-31T00:00:00.000Z")

 // java.util.Date has no >= / < operators in Scala, so use before()
 val cassRDD = sc.cassandraTable("keyspace name", "table name")
   .filter(row => !row.getDate("timestamp").before(start) &&
                   row.getDate("timestamp").before(end))

If you are interested in running this type of query, you may have to look at other Cassandra connectors, such as the ones developed by Stratio.


Source: https://habr.com/ru/post/1207264/

