Can we get all column names from the HBase table?

Setup:

I have an HBase table with 100M+ rows and 1 million+ columns. Each row has data for only 2-5 columns. There is only one column family.

Problem:

I want to find all the distinct qualifiers (columns) in this column family. Is there a quick way to do this?

I can think of scanning the whole table, getting the familyMap for each row, and adding each qualifier to a Set<>. But that will be terribly slow, as there are 100M+ rows.

Can we do better?

3 answers

You can use MapReduce for this. Unlike the coprocessor approach, it does not require deploying custom libraries to HBase. The code for the MapReduce job is below.

Job setup

    Job job = Job.getInstance(config);
    job.setJobName("Distinct columns");

    Scan scan = new Scan();
    scan.setBatch(500);
    scan.addFamily(YOU_COLUMN_FAMILY_NAME);
    // Return only the key part of each KeyValue (row, column family, qualifier), not the values
    scan.setFilter(new KeyOnlyFilter());
    scan.setCacheBlocks(false); // don't set to true for MR jobs

    TableMapReduceUtil.initTableMapperJob(
        YOU_TABLE_NAME,
        scan,
        OnlyColumnNameMapper.class, // mapper
        Text.class,                 // mapper output key
        Text.class,                 // mapper output value
        job);

    // A single reducer sees every qualifier, so each one is written exactly once
    job.setNumReduceTasks(1);
    job.setReducerClass(OnlyColumnNameReducer.class);
    // With the default file output format, an output path must also be set before submitting

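Mapper

    import java.io.IOException;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellScanner;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    public class OnlyColumnNameMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, final Context context)
                throws IOException, InterruptedException {
            CellScanner cellScanner = value.cellScanner();
            while (cellScanner.advance()) {
                Cell cell = cellScanner.current();
                // Copy just the qualifier bytes out of the cell's backing array
                byte[] q = Bytes.copy(cell.getQualifierArray(),
                                      cell.getQualifierOffset(),
                                      cell.getQualifierLength());
                // Emit the qualifier as the key; the value is unused
                context.write(new Text(q), new Text());
            }
        }
    }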

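Reducer

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class OnlyColumnNameReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Each distinct qualifier arrives here once as a key; write it once
            context.write(new Text(key), new Text());
        }
    }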

HBase can be viewed as a distributed NavigableMap<byte[], NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>>> (row key -> column family -> qualifier -> timestamp -> value).

There is no "metadata" (say, something stored centrally on the master node) that lists all the qualifiers present across all region servers.

So, if this is a one-time use case, the only way is to scan the whole table and add the qualifier names to a Set<>, as you mentioned.
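To make the map view concrete, here is a minimal in-memory sketch in plain Java. It is a toy model, not the HBase client API: the class and method names are made up for illustration, and String keys replace byte[] for readability. It shows why collecting the distinct qualifiers means visiting every row: the qualifiers exist only inside each row's own map.

```java
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy in-memory model of the nested-map layout; this is not the HBase client API.
class QualifierModelSketch {

    // Store one cell: row -> family -> qualifier -> timestamp -> value.
    static void put(
            NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> table,
            String row, String family, String qualifier, long timestamp, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(qualifier, q -> new TreeMap<>())
             .put(timestamp, value);
    }

    // Walk every row and collect the qualifiers found under the given family.
    static NavigableSet<String> distinctQualifiers(
            NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> table,
            String family) {
        NavigableSet<String> result = new TreeSet<>();
        for (NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families : table.values()) {
            NavigableMap<String, NavigableMap<Long, String>> columns = families.get(family);
            if (columns != null) {
                result.addAll(columns.keySet());
            }
        }
        return result;
    }

    // Small sample: three rows, sparse columns, one row in an unrelated family.
    static NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> sampleTable() {
        NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> table =
                new TreeMap<>();
        put(table, "row1", "cf", "a", 1L, "v1");
        put(table, "row1", "cf", "b", 1L, "v2");
        put(table, "row2", "cf", "b", 2L, "v3");
        put(table, "row2", "cf", "c", 2L, "v4");
        put(table, "row3", "other", "z", 3L, "v5");
        return table;
    }

    public static void main(String[] args) {
        System.out.println(distinctQualifiers(sampleTable(), "cf"));
    }
}
```

In the real table the outer map is split across region servers, so the same walk becomes a full table scan.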

If this is a recurring use case (and you are able to add components to your technology stack), you might want to add Redis. The set of qualifiers can then be maintained in a distributed manner using a Redis Set.
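The idea can be sketched as follows, assuming the application controls every write to the table. This is only an illustration in plain Java: the class and the recordWrite hook are hypothetical, and an in-memory concurrent set stands in for the Redis Set (with Redis, adding would correspond to the SADD command and listing to SMEMBERS).

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry of qualifiers, updated on the write path.
class QualifierRegistrySketch {

    // Stand-in for a Redis Set; an assumption for illustration, not a Redis client.
    private final Set<String> qualifiers = ConcurrentHashMap.newKeySet();

    // Hypothetical hook: call this wherever the application writes a cell to HBase.
    void recordWrite(String rowKey, String qualifier) {
        // Adding an existing member is a no-op, just as with SADD.
        qualifiers.add(qualifier);
    }

    // Sorted copy of every qualifier seen so far.
    Set<String> snapshot() {
        return new TreeSet<>(qualifiers);
    }

    public static void main(String[] args) {
        QualifierRegistrySketch registry = new QualifierRegistrySketch();
        registry.recordWrite("row1", "a");
        registry.recordWrite("row2", "a");
        registry.recordWrite("row2", "b");
        System.out.println(registry.snapshot());
    }
}
```

Rows written before the registry existed would still need a one-time full scan to backfill the set.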


HBase coprocessors can be used for this scenario. You can write a custom Endpoint implementation, which works like a stored procedure in an RDBMS. It runs your code on the server side and collects the distinct columns for each region. The client then merges the per-region results to get the distinct columns across all regions.

Performance benefit: not every cell is sent to the client, only the distinct column names per region, which reduces network traffic.
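The shape of that computation can be sketched in plain Java. This is a simulation only, with no coprocessor API involved: each "region" is modeled as a list of rows, each row as the list of its qualifiers, and all names are hypothetical. A real Endpoint would run the per-region step inside the region server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simulation of per-region aggregation followed by a client-side merge.
class RegionAggregationSketch {

    // "Server side": the distinct qualifiers found within a single region.
    static Set<String> distinctInRegion(List<List<String>> rowsInRegion) {
        Set<String> distinct = new TreeSet<>();
        for (List<String> rowQualifiers : rowsInRegion) {
            distinct.addAll(rowQualifiers);
        }
        return distinct;
    }

    // "Client side": union of the small per-region result sets.
    static Set<String> mergeOnClient(List<Set<String>> perRegionResults) {
        Set<String> merged = new TreeSet<>();
        for (Set<String> regionResult : perRegionResults) {
            merged.addAll(regionResult);
        }
        return merged;
    }

    // Two sample regions with overlapping qualifiers.
    static Set<String> demo() {
        List<List<String>> region1 = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("b"));
        List<List<String>> region2 = Arrays.asList(Arrays.asList("c"), Arrays.asList("a", "d"));
        List<Set<String>> perRegion = new ArrayList<>();
        perRegion.add(distinctInRegion(region1));
        perRegion.add(distinctInRegion(region2));
        return mergeOnClient(perRegion);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the per-region sets cross the network, which is where the saving over a client-side scan comes from.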


Source: https://habr.com/ru/post/1234067/
