I am trying to make my code more efficient, since I have to process billions of rows of data in Cassandra. Right now I pull the data with a plain Java loop over the DataStax driver's ResultSet and put it into a format I am familiar with (a Guava Multimap) before handing it to Spark for the manipulation. I would like to replace this Multimap loop with direct Spark manipulation of the Cassandra table through the Spark Cassandra Connector, in order to save time and make everything more efficient. I would really appreciate any code suggestions for this. Here is my existing code, followed by a rough sketch of what I think I am after:
Statement stmt = new SimpleStatement(
        "SELECT \"Power\",\"Bandwidth\",\"Start_Frequency\" FROM \"SB1000_49552019\".\"Measured_Value\";");
stmt.setFetchSize(2000000);
ResultSet results = session.execute(stmt);

// collect (channel -> power) pairs; channel_end and increment are constants defined elsewhere
Multimap<Double, Float> data = LinkedListMultimap.create();
for (Row row : results) {
    double start_frequency = row.getDouble("Start_Frequency");
    float power = row.getFloat("Power");
    double bandwidth = row.getDouble("Bandwidth");
    // walk every channel and keep it if it falls inside this row's frequency window
    for (double channel = 1.6000E8; channel <= channel_end; channel += increment) {
        if (channel >= start_frequency && channel <= (start_frequency + bandwidth)) {
            data.put(channel, power);
        }
    }
}
// flatten the multimap into Value beans so Spark can aggregate per channel
List<Value> values = data.asMap().entrySet()
        .stream()
        .flatMap(x -> x.getValue()
                .stream()
                .map(y -> new Value(x.getKey(), y)))
        .collect(Collectors.toList());

sqlContext.createDataFrame(sc.parallelize(values), Value.class)
        .groupBy(col("channel"))
        .agg(min("power"), max("power"), avg("power"))
        .write().mode(SaveMode.Append)
        .option("table", "results")
        .option("keyspace", "model")
        .format("org.apache.spark.sql.cassandra")
        .save();
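For reference, this is the rough, untested shape I have in mind, based on my reading of the connector's DataFrame support: load the table directly into a DataFrame, join it against a small DataFrame of channels on the frequency-window condition, and aggregate per channel before writing back. The Channel bean and the join-based expansion are just my own guesses, so please correct me if there is a better way:

import static org.apache.spark.sql.functions.*;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;

// tiny bean for the channel axis, analogous to my Value class (my own invention)
public static class Channel implements Serializable {
    private double channel;
    public Channel() {}
    public Channel(double channel) { this.channel = channel; }
    public double getChannel() { return channel; }
    public void setChannel(double channel) { this.channel = channel; }
}

// read the Cassandra table straight into a DataFrame via the connector,
// skipping the driver loop and the Multimap entirely
DataFrame measurements = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "SB1000_49552019")
        .option("table", "Measured_Value")
        .load()
        .select(col("Start_Frequency"), col("Bandwidth"), col("Power"));

// build the list of channels once on the driver (channel_end and increment as above)
List<Channel> channelList = new ArrayList<>();
for (double channel = 1.6000E8; channel <= channel_end; channel += increment) {
    channelList.add(new Channel(channel));
}
DataFrame channels = sqlContext.createDataFrame(sc.parallelize(channelList), Channel.class);

// non-equi join: keep (channel, power) pairs where the channel falls inside the
// row's frequency window, then aggregate per channel and write back to Cassandra
measurements
        .join(channels, col("channel").geq(col("Start_Frequency"))
                .and(col("channel").leq(col("Start_Frequency").plus(col("Bandwidth")))))
        .groupBy(col("channel"))
        .agg(min("Power"), max("Power"), avg("Power"))
        .write().mode(SaveMode.Append)
        .option("keyspace", "model")
        .option("table", "results")
        .format("org.apache.spark.sql.cassandra")
        .save();

I realize the range condition means Spark may fall back to a broadcast or cartesian join, but since the channel list is small I am hoping broadcasting it is acceptable. Is this roughly the right direction, or is there a better way to do the channel expansion directly on the Cassandra data?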