Optimizing several GB of data in Hadoop Hive

I use Hive quite actively, and I was wondering if there is a way to improve the following workflow.

Every night, an uncompressed tab-delimited text file is dumped from our Oracle cluster into HDFS and then processed with Hive.

I load the table as follows:

CREATE EXTERNAL TABLE ACCOUNTINGTABLE (
    ts STRING, duid STRING, owner STRING, hidden STRING, lgroup STRING,
    nbfiles INT, length BIGINT, replicas INT, provenance STRING, state STRING,
    campaign STRING, rlength BIGINT, rnbfiles INT, rowner STRING, rgroup STRING,
    rarchived STRING, rsuspicious STRING, name STRING, ami STRING, site STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/user/accounting/dump';

LOAD DATA INPATH '/user/accounting/dump_REPLACEWITHTIMESTAMP.lst' INTO TABLE ACCOUNTINGTABLE;

and then run several summaries like the following to create text output for postprocessing:

set hive.exec.reducers.max=90;

CREATE EXTERNAL TABLE ACCOUNTINGTABLE_site_project_datatype_tag (
    ts STRING, project STRING, datatype STRING, tag STRING, site STRING,
    duids INT, replicas INT, nbfiles INT, rnbfiles INT,
    length BIGINT, rlength BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/user/accounting/summary/REPLACEWITHTIMESTAMP/site_project_datatype_tag';

INSERT OVERWRITE TABLE ACCOUNTINGTABLE_site_project_datatype_tag
SELECT
    'REPLACEWITHTIMESTAMP',
    regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){1}', 1),
    regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){5}', 1),
    split(regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){6}', 1), '_tid')[0],
    site,
    count(distinct duid),
    sum(replicas),
    sum(nbfiles),
    sum(rnbfiles),
    sum(length),
    sum(rlength)
FROM ACCOUNTINGTABLE
WHERE ( ami='project.datasetnumber.physicsshort.prodstep.datatype.version'
     or ami='project.runnumber.streamname.prodstep.datatype.version' )
GROUP BY
    'REPLACEWITHTIMESTAMP',
    regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){1}', 1),
    regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){5}', 1),
    split(regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){6}', 1), '_tid')[0],
    site;

DROP TABLE ACCOUNTINGTABLE_site_project_datatype_tag;

Now:

The average Oracle dump file is around 5 GB (not that much, actually), with approximately 250 million lines. The summary output amounts to no more than 1-2 MB.

An average Hive job as described above takes about an hour. The map phase progresses nicely and reaches 100% after about 15 minutes, but the reduce phase takes the remaining ~45 minutes, most of that time while already showing 100%. We are gradually adding more and more different summaries, and we will soon hit the magic 24-hour limit for the summary processing. Our infrastructure monitoring also shows that node utilization is low (CPU ~30-40%, I/O ~10%).

I tried playing with io.sort.mb, io.sort.factor, etc., but that almost always made things worse, so I am now running with the Hadoop defaults (Cloudera distribution, btw). The cluster consists of 12 nodes (8 cores each) with 24 GB of RAM and 2 TB of disk, each configured for 8 mappers and 8 reducers (6/6 on the namenode).
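
For reference, that tuning was done per session in the Hive CLI along these lines (the values shown are illustrative assumptions, not the exact ones tried):

set io.sort.mb=256;           -- map-side sort buffer size (illustrative value)
set io.sort.factor=25;        -- number of spill streams merged at once (illustrative value)
set mapred.reduce.tasks=48;   -- override Hive's reducer estimate (illustrative value)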

I also tried creating an intermediate table stored as a compressed SequenceFile and populating it with INSERT ... SELECT, but that INSERT itself took too long ...
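
That attempt looked roughly like the sketch below; the codec choice and table name are assumptions, not the exact statements used:

-- hypothetical compressed SequenceFile staging copy of the dump
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

CREATE TABLE ACCOUNTINGTABLE_SEQ
STORED AS SEQUENCEFILE
AS SELECT * FROM ACCOUNTINGTABLE;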

I have a suspicion that there may be something wrong with the workflow itself, not just the cluster / configuration.

Any advice is appreciated.

2 answers

Have you tried partitions or indexes?

Partitioning can significantly speed up GROUP BY work, since the data is physically split across nodes by partition, so the reduce operations become much smaller and in some cases finish almost instantly.
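
A minimal sketch of a partitioned variant, assuming site is chosen as the partition key (that choice and the table name are assumptions based on the GROUP BY columns in the question):

-- partitioned copy of the dump; 'site' as partition key is an assumption
CREATE TABLE ACCOUNTINGTABLE_PART (
    ts STRING, duid STRING, owner STRING, hidden STRING, lgroup STRING,
    nbfiles INT, length BIGINT, replicas INT, provenance STRING, state STRING,
    campaign STRING, rlength BIGINT, rnbfiles INT, rowner STRING, rgroup STRING,
    rarchived STRING, rsuspicious STRING, name STRING, ami STRING)
PARTITIONED BY (site STRING);

-- load once per dump using dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE ACCOUNTINGTABLE_PART PARTITION (site)
SELECT ts, duid, owner, hidden, lgroup, nbfiles, length, replicas, provenance,
       state, campaign, rlength, rnbfiles, rowner, rgroup, rarchived,
       rsuspicious, name, ami, site
FROM ACCOUNTINGTABLE;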


Along with partitioning, you can use bucketing, and you can specify SORT BY within it.
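
A hedged sketch of what that could look like; the bucket column, sort column, and bucket count are illustrative assumptions:

-- bucketed and sorted variant of the dump table (names and values are illustrative)
CREATE TABLE ACCOUNTINGTABLE_BUCKETED (
    ts STRING, duid STRING, name STRING, ami STRING, site STRING,
    nbfiles INT, rnbfiles INT, length BIGINT, rlength BIGINT, replicas INT)
CLUSTERED BY (duid) SORTED BY (duid ASC) INTO 64 BUCKETS;

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

INSERT OVERWRITE TABLE ACCOUNTINGTABLE_BUCKETED
SELECT ts, duid, name, ami, site, nbfiles, rnbfiles, length, rlength, replicas
FROM ACCOUNTINGTABLE;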

Also look at creating views; they may help even for a table without partitions.
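
For instance, a view could factor out the repeated regexp_extract parsing so each summary query selects from it instead of repeating the expressions (a sketch; the view and column names are assumptions):

-- hypothetical view that pre-computes the parsed fields used by the summaries
CREATE VIEW ACCOUNTING_PARSED AS
SELECT
    regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){1}', 1) AS project,
    regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){5}', 1) AS datatype,
    split(regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){6}', 1), '_tid')[0] AS tag,
    site, duid, replicas, nbfiles, rnbfiles, length, rlength, ami
FROM ACCOUNTINGTABLE;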

