HBase MapReduce for Multiple Scan Objects

I'm evaluating HBase for some of the data analysis work we do.

HBase will contain event data. The row key will be eventId + time. We want to analyze several types of events (4-5) within a date range. The total number of event types is about 1000.

The problem with running a MapReduce job over the HBase table is that initTableMapperJob (see below) takes just one Scan object. For performance reasons, we only want to scan the data for the 4-5 relevant event types in the date range, not for all 1000 event types. With the method below, I don't think we have that choice, because it accepts only one Scan object.

  public static void initTableMapperJob(String table, Scan scan,
    Class<? extends TableMapper> mapper,
    Class<? extends WritableComparable> outputKeyClass,
    Class<? extends Writable> outputValueClass,
    org.apache.hadoop.mapreduce.Job job) throws IOException
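For context, a minimal sketch of that single-Scan setup (the table name "events", the EventMapper class, and the key values are placeholders, not from the original post). Because the row key is eventId + time, one Scan covers only one contiguous slice of the key space, i.e. one event type within one date range:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.mapreduce.Job;

  public class SingleScanJob {

    // Placeholder identity mapper: emits each row key and Result unchanged.
    public static class EventMapper extends TableMapper<ImmutableBytesWritable, Result> {
      @Override
      protected void map(ImmutableBytesWritable key, Result value, Context context)
          throws java.io.IOException, InterruptedException {
        context.write(key, value);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "single-event-scan");

      // One contiguous range: eventId prefix + date bounds.
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("event42" + "20100101"));
      scan.setStopRow(Bytes.toBytes("event42" + "20100201")); // stop row is exclusive

      TableMapReduceUtil.initTableMapperJob("events", scan,
          EventMapper.class, ImmutableBytesWritable.class, Result.class, job);
      job.waitForCompletion(true);
    }
  }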

Can the MapReduce job be launched with a list of Scan objects? Any workaround?

Thanks!

+3
3 answers

TableMapReduceUtil.initTableMapperJob sets up your job to use TableInputFormat which, as you noticed, takes a single Scan.

It sounds like you want to scan multiple segments of the table. To do that, you will need to create your own InputFormat, something like a MultiSegmentTableInputFormat. Extend TableInputFormatBase and override the getSplits method so that it calls super.getSplits once for each start/stop row segment of the table (the easiest way is to call TableInputFormatBase.scan.setStartRow() for each segment). Aggregate the InputSplit instances returned into a single list.

Note that you will then need to configure the job to use this custom MultiSegmentTableInputFormat.
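For illustration, here is a rough sketch of what such a MultiSegmentTableInputFormat could look like. The answer above only outlines the approach, so the class body below is an assumption: the event ids and dates are hard-coded stand-ins for values that real code would read from the job Configuration, and it extends TableInputFormat (a subclass of TableInputFormatBase) so that the table and base Scan are still wired up from the job configuration by initTableMapperJob:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;

  public class MultiSegmentTableInputFormat extends TableInputFormat {

    // One (startRow, stopRow) segment per event type; the date range is
    // appended to the eventId prefix because the row key is eventId + time.
    // These values are illustrative only.
    private static final String[] EVENT_IDS = { "event01", "event07", "event42" };
    private static final String START_DATE = "20100101";
    private static final String STOP_DATE  = "20100201";

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
      List<InputSplit> splits = new ArrayList<InputSplit>();
      Scan scan = getScan();
      for (String eventId : EVENT_IDS) {
        // Restrict the scan to this event type's slice of the key space,
        // then let the base class compute the region-aligned splits for it.
        scan.setStartRow(Bytes.toBytes(eventId + START_DATE));
        scan.setStopRow(Bytes.toBytes(eventId + STOP_DATE));
        setScan(scan);
        splits.addAll(super.getSplits(context));
      }
      return splits;
    }
  }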

+9

Take a look at:

  org/apache/hadoop/hbase/filter/FilterList.java

A FilterList represents an ordered list of filters that is evaluated with either the AND or the OR operator. You could OR together one filter per event type and set the list on a single Scan.
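A minimal sketch of that idea, again assuming row keys of the form eventId + time (the helper class name and event ids are illustrative):

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.FilterList;
  import org.apache.hadoop.hbase.filter.PrefixFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class EventTypeScan {
    public static Scan forEventTypes(String... eventIds) {
      // MUST_PASS_ONE gives OR semantics: a row passes if any
      // event-id prefix matches its key.
      FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ONE);
      for (String eventId : eventIds) {
        filters.addFilter(new PrefixFilter(Bytes.toBytes(eventId)));
      }
      Scan scan = new Scan();
      scan.setFilter(filters);
      return scan;
    }
  }

One caveat: the FilterList is evaluated server side, but on its own it does not narrow the scanned key range, so for the performance goal in the question it is still worth combining it with start/stop rows or with the multi-segment InputFormat above.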

0

I tried Dave L's approach and it works beautifully.

To configure the map task, you can use the function

  TableMapReduceUtil.initTableMapperJob(byte[] table, Scan scan,
  Class<? extends TableMapper> mapper,
  Class<? extends WritableComparable> outputKeyClass,
  Class<? extends Writable> outputValueClass, Job job,
  boolean addDependencyJars, Class<? extends InputFormat> inputFormatClass)

where inputFormatClass refers to the MultiSegmentTableInputFormat mentioned in Dave L's answer.
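Putting the pieces together, a hedged wiring sketch (the table name and classes are the same placeholders used in the earlier sketches, and this fragment would sit inside a main method like the first one):

  Configuration conf = HBaseConfiguration.create();
  Job job = new Job(conf, "multi-event-scan");
  TableMapReduceUtil.initTableMapperJob(
      Bytes.toBytes("events"),         // table
      new Scan(),                      // base scan; per-segment bounds are set by the InputFormat
      EventMapper.class,               // TableMapper subclass from the first sketch
      ImmutableBytesWritable.class,    // outputKeyClass
      Result.class,                    // outputValueClass
      job,
      true,                            // addDependencyJars
      MultiSegmentTableInputFormat.class);
  job.waitForCompletion(true);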

0
