My code works as follows.
Step 1. Get the HBase entity data as hBaseRDD.
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc.newAPIHadoopRDD(hbase_conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
Step 2. Convert hBaseRDD to rowPairRDD.
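The Step 2 conversion itself is not shown in this post. As a rough sketch of the per-record part only, assuming every cell holds a UTF-8 string and using the column names that appear in the queries (column1 to column3), the decoding might look like the following. The class name, the `toRowValues` helper, and the `Map<String, byte[]>` stand-in for an HBase `Result` are all hypothetical, and the sketch deliberately uses only the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class RowConversionSketch {
    // Assumed column order; the post only names column1..column3 in its queries.
    static final String[] COLUMNS = {"column1", "column2", "column3"};

    // Decode one record's raw cell bytes into values ordered like COLUMNS.
    // A cell that is missing from the record becomes null.
    static Object[] toRowValues(Map<String, byte[]> cells) {
        Object[] values = new Object[COLUMNS.length];
        for (int i = 0; i < COLUMNS.length; i++) {
            byte[] raw = cells.get(COLUMNS[i]);
            values[i] = (raw == null) ? null : new String(raw, StandardCharsets.UTF_8);
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical record: column2 is absent on purpose.
        Map<String, byte[]> cells = new HashMap<>();
        cells.put("column1", "a".getBytes(StandardCharsets.UTF_8));
        cells.put("column3", "value1".getBytes(StandardCharsets.UTF_8));
        System.out.println(Arrays.toString(toRowValues(cells)));
    }
}
```

In the real job these values would presumably be wrapped into a Spark SQL row (for example with something like `Row.create(...)`) inside the function that maps hBaseRDD to rowPairRDD; that part is an assumption, not taken from the post.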
Step 3. Convert rowPairRDD to schemaRDD.
JavaSchemaRDD schemaRDD = sqlContext.applySchema(rowPairRDD.values(), schema);
schemaRDD.registerTempTable("testentity");
sqlContext.sqlContext().cacheTable("testentity");
Step 4. Use Spark SQL to run the first simple SQL query.
JavaSQLContext sqlContext = new org.apache.spark.sql.api.java.JavaSQLContext(jsc);
JavaSchemaRDD retRDD = sqlContext.sql("SELECT column1, column2 FROM testentity WHERE column3 = 'value1'");
List<org.apache.spark.sql.api.java.Row> rows = retRDD.collect();
Step 5. Use Spark SQL to run a second simple SQL query.
JavaSchemaRDD retRDD = sqlContext.sql("SELECT column1, column2 FROM testentity WHERE column3 = 'value2'");
List<org.apache.spark.sql.api.java.Row> rows = retRDD.collect();
Step 6. Use Spark SQL to run a third simple SQL query.
JavaSchemaRDD retRDD = sqlContext.sql("SELECT column1, column2 FROM testentity WHERE column3 = 'value3'");
List<org.apache.spark.sql.api.java.Row> rows = retRDD.collect();
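How the per-query timings were taken is not stated in this post. Here is a minimal sketch of one way to do it, assuming the measured quantity is wall-clock time around the whole `sql(...)` plus `collect()` call. `QueryTimer`, `Timed`, and `time` are hypothetical names, and the workload in `main` is only a stand-in for a Spark query.

```java
import java.util.function.Supplier;

class QueryTimer {
    // Holds a query result together with the wall-clock time it took.
    static final class Timed<T> {
        final T result;
        final long elapsedMillis;

        Timed(T result, long elapsedMillis) {
            this.result = result;
            this.elapsedMillis = elapsedMillis;
        }
    }

    // Run the given work once and measure elapsed wall-clock time.
    static <T> Timed<T> time(Supplier<T> work) {
        long start = System.nanoTime();
        T result = work.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return new Timed<T>(result, elapsedMillis);
    }

    public static void main(String[] args) {
        // Stand-in workload. In the job above this would be something like
        // () -> sqlContext.sql("SELECT ...").collect(), with collect() inside
        // the timed section because sql(...) alone is evaluated lazily.
        Timed<Long> timed = time(() -> {
            long sum = 0;
            for (int i = 1; i <= 1_000_000; i++) {
                sum += i;
            }
            return sum;
        });
        System.out.println("result=" + timed.result + ", elapsed=" + timed.elapsedMillis + " ms");
    }
}
```

Timing each of the three queries separately this way makes it easy to see whether the second and third queries benefit from the cached table.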
Test results are shown below:
Test Case 1:
When I insert 300,000 records into the HBase entity and then run the code:
- 1st query takes 60407 ms
- 2nd query takes 838 ms
- 3rd query takes 792 ms
If I use the HBase API to run a similar query, it takes only 2000 ms. Clearly, the last 2 Spark SQL queries are much faster than the HBase API query.
I believe the first Spark SQL query spends most of its time loading data from HBase.
Thus, the first query is much slower than the last 2 queries. I think this result is expected.
Test Case 2:
When I insert 400,000 records into the HBase entity and then run the code:
- 1st query takes 87213 ms
- 2nd query takes 83238 ms
- 3rd query takes 82092 ms
If I use the HBase API to run a similar query, it takes only 3500 ms. Clearly, all 3 Spark SQL queries are much slower than the HBase API query.
The last 2 Spark SQL queries are also very slow, with performance similar to the first query. Why? How can I tune the performance?
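To put the two test cases side by side, the reported timings can be reduced to two numbers: the cost per record of the first query, and the second query's time as a fraction of the first. The sketch below only does that arithmetic on the figures reported above; the class and method names are hypothetical.

```java
class TimingRatios {
    // Milliseconds spent per record by the first (loading) query.
    static double msPerRecord(long firstQueryMs, long records) {
        return (double) firstQueryMs / records;
    }

    // Time of a later query expressed as a fraction of the first query's time.
    static double repeatFraction(long laterQueryMs, long firstQueryMs) {
        return (double) laterQueryMs / firstQueryMs;
    }

    public static void main(String[] args) {
        // Test Case 1: 300,000 records, timings in ms as reported above.
        System.out.println("case 1 ms/record:       " + msPerRecord(60407, 300_000));
        System.out.println("case 1 2nd/1st query:   " + repeatFraction(838, 60407));
        // Test Case 2: 400,000 records, timings in ms as reported above.
        System.out.println("case 2 ms/record:       " + msPerRecord(87213, 400_000));
        System.out.println("case 2 2nd/1st query:   " + repeatFraction(83238, 87213));
    }
}
```

The first query costs roughly 0.2 ms per record in both cases, so the initial load appears to scale about linearly. The second query takes about 1.4% of the first query's time in Test Case 1 but about 95% in Test Case 2, which is the gap in question: in Test Case 2 each repeated query costs nearly as much as a full load. The numbers alone do not show the cause.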