I started using Phoenix a couple of months ago. Below is the environment and version information.
Hadoop - Cloudera CDH 5.4.7-1. Phoenix - 4.3 (the Phoenix that ships as a package with CDH 5.4.7-1). HBase - 1.0.0. JDK - 1.7.0_67. 1 master and 3 region servers.
We started a POC to evaluate Apache Phoenix. We have data in 12 different tables in an Oracle DB, and we bring that data into the Hadoop system using Oracle GoldenGate.
There are 12 corresponding Phoenix tables, each with 40-100 columns and several hundred rows. We run a transformation process and load the result into a final table; this is our main ETL. The transformation goes through several intermediate stages in which we populate intermediate tables, so there are joins between the tables.
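To make the pattern concrete, here is a rough sketch in Phoenix SQL of what one such intermediate stage could look like. The original description above does not include any DDL or DML, so the staging table name, the acct_id column, and the column types below are hypothetical:

-- Hypothetical intermediate (staging) table; name, acct_id and types are assumed.
CREATE TABLE IF NOT EXISTS salted.stg_acct_item (
    acct_id    BIGINT NOT NULL,
    c_item_id  BIGINT NOT NULL,
    avail_bal  DECIMAL(18, 2),
    CONSTRAINT pk PRIMARY KEY (acct_id, c_item_id)
) SALT_BUCKETS = 9;

-- Populate the staging table by joining two upstream tables.
UPSERT INTO salted.stg_acct_item (acct_id, c_item_id, avail_bal)
SELECT ba.acct_id, mi.c_item_id, ba.available_bal
FROM salted.b_acct2 ba
JOIN salted.m_item2 mi ON ba.c_item_id = mi.c_item_id;

Each stage of the ETL is essentially an UPSERT SELECT of this shape, reading from one or more already-populated tables and writing into the next intermediate table.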
Everything worked perfectly, and we were able to implement the entire ETL process. I was very pleased with the ease of use and implementation.
The problems started when we began performance testing with millions of rows. The problems are described below.
The intermediate transformation joins crash the region servers: this is a join of two tables, each with 20 million rows. A secondary index is created on the column being joined on, and both tables are salted with 9 buckets. It works fine as long as the join result is fewer than roughly 200 thousand rows, but even 200 thousand rows take more than 10 minutes to complete. If the result is larger than that, the region servers crash. Test query:

explain select count(available_bal) from salted.b_acct2 ba inner join (select c_item_id from salted.m_item2 mi where s_info_id = 12345) as mi on ba.c_item_id = mi.c_item_id;
+------------------------------------------+
| PLAN                                     |
+------------------------------------------+
| CLIENT 9-CHUNK PARALLEL 9-WAY FULL SCAN OVER SALTED.S2_BA_CI_IDX |
|     SERVER AGGREGATE INTO SINGLE ROW |
|     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE) |
|         CLIENT 9-CHUNK PARALLEL 9-WAY RANGE SCAN OVER SALTED.S_MI_SI_IDX [0,19,266] |
|         CLIENT MERGE SORT |
|     DYNAMIC SERVER FILTER BY TO_BIGINT("C_ITEM_ID") IN (MI.C_ITEM_ID) |
+------------------------------------------+
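For reference, the setup described above (two salted tables with 9 buckets and secondary indexes on the join/filter columns) corresponds to DDL roughly along the following lines. This is only a sketch: the column types and the acct_id key are assumptions, and only the table and index names come from the query and the EXPLAIN plan:

-- Sketch only: column types and the acct_id primary key are assumed.
CREATE TABLE salted.b_acct2 (
    acct_id        BIGINT NOT NULL PRIMARY KEY,
    c_item_id      BIGINT,
    available_bal  DECIMAL(18, 2)
) SALT_BUCKETS = 9;

CREATE TABLE salted.m_item2 (
    c_item_id  BIGINT NOT NULL PRIMARY KEY,
    s_info_id  BIGINT
) SALT_BUCKETS = 9;

-- Secondary indexes matching the names shown in the EXPLAIN plan.
CREATE INDEX s2_ba_ci_idx ON salted.b_acct2 (c_item_id) INCLUDE (available_bal);
CREATE INDEX s_mi_si_idx  ON salted.m_item2 (s_info_id) INCLUDE (c_item_id);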
Final transformation joining 6 tables: joining the six tables on indexed columns returns data when the result is under roughly 1 M rows, and that takes 10-12 minutes. If the join result is larger than roughly 1 M rows, the query hangs and never returns. I initially got an InsufficientMemoryException, which I resolved by changing the configuration to increase the available memory. I no longer get an InsufficientMemoryException, but the query still does not finish after more than 20 minutes. We expect it to complete within a few seconds.
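One detail worth noting (this is an assumption about the execution strategy, not something measured above): Phoenix runs joins as hash joins by default, building one side of the join into the server-side cache that phoenix.query.maxServerCacheBytes and phoenix.query.maxGlobalMemoryPercentage bound, which is consistent with large join results pushing the region servers into memory trouble. A sort-merge strategy can be requested with a hint instead; a sketch, assuming the installed Phoenix version supports the USE_SORT_MERGE_JOIN hint:

-- Sketch: force a sort-merge join instead of the default hash join,
-- so the join does not have to fit one side into the server-side cache.
SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(ba.available_bal)
FROM salted.b_acct2 ba
INNER JOIN (SELECT c_item_id FROM salted.m_item2 WHERE s_info_id = 12345) mi
  ON ba.c_item_id = mi.c_item_id;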
The relevant configuration settings are as follows:
jvm.bootoptions: -Xms512m -Xmx9999M
hbase.regionserver.wal.codec: org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
hbase.rpc.timeout: 600000
phoenix.query.timeoutMs: 360000000
phoenix.query.maxServerCacheBytes: 10737418240
phoenix.coprocessor.maxServerCacheTimeToLiveMs: 900000
hbase.client.scanner.timeout.period: 600000
phoenix.query.maxGlobalMemoryWaitMs: 100000
phoenix.query.maxGlobalMemoryPercentage: 50
hbase.regionserver.handler.count: 50
Summary: the main problems are slow execution of join queries and, ultimately, region server crashes once the data goes beyond 1 million rows. Is there a performance limit? Please help me resolve these issues; we are in the middle of our evaluation and I do not want to give up on Phoenix. If the queries above can run quickly, I will adopt Phoenix without hesitation.
Regards, Shivamohan