Get individual partition keys from table C *

Question

Get individual partition keys from table C *

cqlsh does not allow nested queries, so I cannot export the selected data to csv .. I am trying to export the selected data (about 200,000 rows with one column) from cassandra using:

echo "SELECT distinct imei FROM listener.snapshots;" > select.cql bin/cqlsh -f select.cql > output.txt

and it just stuck forever without errors and the file does not grow.

If I use strace on the last line, I got a lot of lines:

 select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)

and --debug give me only:

cqlsh --debug -f select.cql > output.txt

 Using CQL driver: <module 'cassandra' from '/usr/share/dse/resources/cassandra/bin/../lib/cassandra-driver-internal-only-2.5.1.zip/cassandra-driver-2.5.1/cassandra/__init__.py'>

what's wrong? Is there a better way to get individual partition keys from a large C * table?

+5

cassandra cqlsh datastax-enterprise

Rada Dec 24 '15 at 15:10

source share

2 answers

Usually you should use the cqlsh command "COPY ... TO ..." to export the data from the table to the csv file.

See the documentation here .

I'm not sure why you are trying to get stuck. For debugging, I would suggest not using DISTINCT and add a LIMIT clause. Perhaps your table is large, and therefore cqlsh does a swapping of results, but no one can press return to go to the next page of results, so it waits forever.

+1

Jim meyer Dec 26 '15 at 21:46

source share

Rada · Accepted Answer · 2015-12-27T15:16:11+0000

I used capture :

 cqlsh> CAPTURE 'temp.csv' Now capturing query output to 'temp.csv'. cqlsh> SELECT distinct imei FROM listener.snapshots; ---MORE--- ---MORE--- ---MORE--- ---MORE--- . . . cqlsh> cqlsh>

And press enter until you're done.

An even faster option is to use paging :

 cqlsh> PAGING off Disabled Query paging. cqlsh> CAPTURE 'temp.csv' Now capturing query output to 'temp.csv'. cqlsh> SELECT distinct imei FROM listener.snapshots;

It will immediately extract the data to a file (if you get OperationTimedOut, you must edit the timeout settings in cassandra.yaml).

I can't believe this is a quick way ... I know that I can export data using a spark using CassandraSQLContext , but not so fast when I need to create a C * rdd query for a single column from a very large table (rows 2B ~) and print them in a file:

  val conf = new SparkConf().setAppName("ExtractDistinctImeis") val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) val connector = CassandraConnector(conf) val cc = new CassandraSQLContext(sc) val snapshots_imeis = cc.sql("select distinct imei from listener.snapshots").map(row => row(0).toString) val imeis = snapshots_imeis.collect def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit) { val p = new java.io.PrintWriter(f) try { op(p) } finally { p.close() } } printToFile(new File("/path/to/file.txt")) { p => imeis.foreach(p.println) }

It took about 3.5 hours with a spark! With capture, I manage to get the file in 3 min / 3 seconds.

Get individual partition keys from table C *

More articles: