Apache Phoenix join query performance

I started using Phoenix a couple of months ago. The following is my environment and version information.

Hadoop - Cloudera CDH 5.4.7-1
Phoenix - 4.3 (the Phoenix package that ships with CDH 5.4.7-1)
HBase - 1.0.0
JDK - 1.7.0_67
Cluster - 1 master and 3 region servers

We started a POC to evaluate Apache Phoenix. We have data in 12 different tables in an Oracle DB, and we bring that data into the Hadoop system using Oracle GoldenGate.

There are 12 corresponding Phoenix tables, each containing 40-100 columns and several hundred rows. We run a transformation process and load the result into a final table; this is our main ETL. The transformation goes through several intermediate stages in which we populate intermediate tables, so there are joins between the tables.
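Each intermediate stage is essentially an UPSERT ... SELECT over a join. A minimal sketch of one such stage (the stage table and column names below are placeholders, not our real schema):

    -- Illustrative intermediate-stage load: join two source tables and write the
    -- result into an intermediate table (names are placeholders).
    UPSERT INTO salted.stage_acct_item (acct_id, c_item_id, available_bal)
    SELECT ba.acct_id, mi.c_item_id, ba.available_bal
    FROM salted.b_acct2 ba
    INNER JOIN salted.m_item2 mi ON ba.c_item_id = mi.c_item_id;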

Everything worked perfectly, and we were able to implement the entire ETL process. I was very pleased with the ease of use and implementation.

The problems started when we began performance testing with millions of rows. The problems are listed below.

  • One of the intermediate transformation steps brings down the region servers: a join of two tables, each containing 20 million rows. A secondary index exists on the column being joined on, and both tables are salted with 9 buckets (a hypothetical sketch of the table and index DDL is shown after this list). The join is fine when it produces fewer than roughly 200,000 rows, although even 200,000 rows take more than 10 minutes to complete. When the result is larger, the region servers crash. Test query:

        explain select count(available_bal) from salted.b_acct2 ba
          inner join (select c_item_id from salted.m_item2 mi where s_info_id = 12345) as mi
          on ba.c_item_id = mi.c_item_id;

    +--------------------------------------------------------------------------------------+
    |                                         PLAN                                          |
    +--------------------------------------------------------------------------------------+
    | CLIENT 9-CHUNK PARALLEL 9-WAY FULL SCAN OVER SALTED.S2_BA_CI_IDX                      |
    |     SERVER AGGREGATE INTO SINGLE ROW                                                  |
    |     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)                                          |
    |         CLIENT 9-CHUNK PARALLEL 9-WAY RANGE SCAN OVER SALTED.S_MI_SI_IDX [0,19,266]   |
    |         CLIENT MERGE SORT                                                             |
    |     DYNAMIC SERVER FILTER BY TO_BIGINT("C_ITEM_ID") IN (MI.C_ITEM_ID)                 |
    +--------------------------------------------------------------------------------------+

  • Join of six tables for the final transformation: joining six tables on indexed columns returns results in 10-12 minutes when the result set is under roughly 1 million rows. If the join result is larger than that, it hangs and never returns. I initially got an InsufficientMemoryException, which I resolved by changing the configuration and increasing the available memory. I no longer get the InsufficientMemoryException, but the query has now run for more than 20 minutes without completing. We expect it to finish within seconds.
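For reference, a hypothetical sketch of how the two tables and their secondary indexes are declared (the index names match the ones in the plan above; column types and the omitted columns are guesses):

    -- Sketch only: the real tables have 40-100 columns each.
    CREATE TABLE salted.b_acct2 (
        acct_id       BIGINT NOT NULL PRIMARY KEY,
        c_item_id     BIGINT,
        available_bal DECIMAL(18, 2)
        -- ... remaining columns omitted
    ) SALT_BUCKETS=9;

    CREATE INDEX s2_ba_ci_idx ON salted.b_acct2 (c_item_id);

    CREATE TABLE salted.m_item2 (
        c_item_id BIGINT NOT NULL PRIMARY KEY,
        s_info_id BIGINT
        -- ... remaining columns omitted
    ) SALT_BUCKETS=9;

    CREATE INDEX s_mi_si_idx ON salted.m_item2 (s_info_id);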

These are the relevant configuration settings:

jvm.bootoptions = -Xms512m -Xmx9999M
hbase.regionserver.wal.codec = org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
hbase.rpc.timeout = 600000
phoenix.query.timeoutMs = 360000000
phoenix.query.maxServerCacheBytes = 10737418240
phoenix.coprocessor.maxServerCacheTimeToLiveMs = 900000
hbase.client.scanner.timeout.period = 600000
phoenix.query.maxGlobalMemoryWaitMs = 100000
phoenix.query.maxGlobalMemoryPercentage = 50
hbase.regionserver.handler.count = 50

Summary: the main problems are slow execution of join queries and, ultimately, region server failures once the data goes beyond 1 million rows. Is there a performance limit? Please help me solve these problems; we are still in the evaluation process and I do not want to give up on Phoenix. If I can get the above queries to run quickly, I will use Phoenix without hesitation.

Regards, Shivamohan

2 answers

By default, Phoenix uses hash joins, which require one side of the join to fit in memory. If you run into problems with very large tables, you can either increase the memory allocated to Phoenix (via the configuration settings) or add a hint to the query (i.e. SELECT /*+ USE_SORT_MERGE_JOIN */ ... FROM ...) to use a sort-merge join, which does not have that requirement. The Phoenix team plans to choose the ideal join algorithm automatically in a future release. Also note that Phoenix currently supports only a subset of join operations.
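Applied to the query from the question, the hint would look roughly like this (a sketch only; whether it is faster than the hash join depends on your data):

    -- Same query as in the question, but forcing a sort-merge join instead of
    -- the default broadcast hash join.
    SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(available_bal)
    FROM salted.b_acct2 ba
    INNER JOIN (SELECT c_item_id FROM salted.m_item2 WHERE s_info_id = 12345) mi
        ON ba.c_item_id = mi.c_item_id;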


Have you tried the LHS and RHS optimization described in the Phoenix join documentation (http://phoenix.apache.org/joins.html)? For an inner join, the RHS of the join is built as a hash table in the region server cache, so make sure your smaller table forms the RHS of the inner join. Also, are the columns you select in the query part of the secondary index you created? If you have tried the above and still see latencies measured in minutes, check the memory on your HBase region servers and verify that it is sufficient to serve your query.
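To illustrate the covering-index point with the tables from the question (the index name below is made up): if available_bal is not included in the index on c_item_id, every matching row forces a lookup back to the data table. A covered index along these lines avoids that:

    -- Hypothetical covered index: c_item_id is the indexed join column and
    -- available_bal is carried in the index, so count(available_bal) can be
    -- served from the index alone.
    CREATE INDEX ba_ci_bal_idx ON salted.b_acct2 (c_item_id) INCLUDE (available_bal);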


Source: https://habr.com/ru/post/1241203/

