Apache Phoenix vs Hive-Spark

Which is faster / easier to move to, given that it must accept SQL scripts as input: Spark SQL, which comes as a speed layer for Hive's high-latency queries, or Phoenix? And if so, how? I need to do a lot of upserts / joins / grouping on top of the data. [HBase]

Is there an alternative on top of Cassandra CQL that supports the above (real-time joins / grouping)?

I am most likely tied to Spark, since I would like to use MLlib. But for data processing, which should be my option?

Thanks kraster

2 answers

http://phoenix-hbase.blogspot.com/ I am quite sure that Phoenix on HBase will be faster.

Here is an example of a query and the hardware used for a test. Query: select count(1) from a table, over 10M and 100M rows. Data: 5 narrow columns. Number of region servers: 4 (HBase heap: 10 GB, processor: 6 cores @ 3.3 GHz Xeon).

Phoenix is faster because it uses the HBase client API to run the entire query, and uses its query engine only to translate the SQL into native HBase scans.


You have several options (as far as I know):

  • Apache Phoenix is a good choice for low-latency access to mid-size tables (1M - 100M rows, but beware of tables with many columns!). A big plus for Phoenix is that it is very easy to get started. My company already had an HBase cluster (with Kerberos). To use Phoenix, all I needed was the HMaster URL, hbase-site.xml, and a keytab. Reads are very fast and writes are decent (writes were slower for me because I needed to do them dynamically, so I had to use the Java API instead of bulk loading).

  • Hive with Spark is great too. I'm not sure how its performance compares to Phoenix. Since Spark does most things in memory, I guess it should be fast. However, I can tell you that if you want to expose SQL access as some kind of API, using Spark becomes quite complicated.

  • Presto is a great product that offers Spark-like processing power through a SQL interface, and it lets you combine data from many sources (Hive, Cassandra, MySQL, etc.).
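For the upsert / grouping workload mentioned in the question, a Phoenix-style statement can be sketched roughly as follows. This is only an illustration, not something taken from the answers above: the table and column names are hypothetical, and the cursor is assumed to be any Python DB-API cursor that accepts `?` placeholders (for example one obtained from the phoenixdb driver, which is an assumption here).

```python
from typing import Any, Iterable, Sequence


def build_upsert(table: str, columns: Sequence[str]) -> str:
    """Build a parameterized Phoenix UPSERT statement with '?' placeholders."""
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    return f"UPSERT INTO {table} ({cols}) VALUES ({marks})"


def build_group_count(table: str, group_col: str) -> str:
    """Build a simple grouping query: row count per value of group_col."""
    return f"SELECT {group_col}, COUNT(1) FROM {table} GROUP BY {group_col}"


def upsert_rows(cursor: Any, table: str, columns: Sequence[str],
                rows: Iterable[Sequence[Any]]) -> int:
    """Send one UPSERT per row through the given cursor; return rows sent.

    This is the row-by-row ("dynamic") path, not bulk loading.
    The cursor is assumed to follow DB-API and to accept '?' parameters.
    """
    sql = build_upsert(table, columns)
    sent = 0
    for row in rows:
        cursor.execute(sql, tuple(row))
        sent += 1
    return sent


# Hypothetical table and columns, for illustration only.
print(build_upsert("events", ["id", "user_id", "amount"]))
print(build_group_count("events", "user_id"))
```

The helpers only build SQL text and hand it to whatever cursor is passed in, so the same statements could also be pasted into a Phoenix SQL script; identifiers are interpolated directly, so they should come from trusted code rather than user input.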

Hope this helps.


Source: https://habr.com/ru/post/984598/

