How to handle BigTable Scan InvalidChunk exceptions?

I am trying to scan BigTable data where some rows are dirty, but the scan fails, depending on its parameters, with (serialization?) InvalidChunk exceptions. The code is as follows:

    from google.cloud import bigtable
    from google.cloud import happybase

    client = bigtable.Client(project=project_id, admin=True)
    instance = client.instance(instance_id)
    connection = happybase.Connection(instance=instance)
    table = connection.table(table_name)

    for key, row in table.scan(limit=5000):  # BOOM!
        pass

Leaving out some columns, restricting the scan to fewer rows, or defining start and stop keys allows the scan to complete. I cannot tell from the stack trace which values are problematic - it varies across columns - the scan just fails. This makes cleaning up the data at the source difficult.
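For example, narrower scans like the following do complete (the column name and keys below are placeholders for my actual data):

    # Narrower scans that do not raise InvalidChunk
    # (family:qualifier and keys are placeholders for my data)

    # 1. Only a subset of columns
    for key, row in table.scan(columns=['cf1:col1'], limit=5000):
        pass

    # 2. Fewer rows
    for key, row in table.scan(limit=100):
        pass

    # 3. An explicit key range
    for key, row in table.scan(row_start=b'key-000', row_stop=b'key-100'):
        pass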

When I step through with the Python debugger, I see that the chunk (which is of type google.bigtable.v2.bigtable_pb2.CellChunk) has no value (it is NULL / undefined):

    ipdb> pp chunk.value
    b''
    ipdb> chunk.value_size
    0

I can confirm this with the HBase shell, using the row key I got from self._row.row_key.

So the question becomes: how can a BigTable scan filter out columns that have an undefined / empty / null value?
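Something along these lines is what I was hoping for - a server-side value filter so the empty cells never reach the client. This is only a sketch: I am assuming ValueRegexFilter from google.cloud.bigtable.row_filters and a "one or more of any byte" regex are the right way to express "non-empty value", and I do not know whether such a filter is even applied before the chunk validation that raises.

    from google.cloud.bigtable import row_filters

    # Assumption: ValueRegexFilter with an RE2 regex over the raw cell bytes;
    # '[\s\S]+' is my guess at "at least one byte".
    non_empty = row_filters.ValueRegexFilter(b'[\\s\\S]+')

    # Low-level Bigtable table (not the HappyBase wrapper), same names as above.
    bt_table = instance.table(table_name)
    partial_rows = bt_table.read_rows(filter_=non_empty, limit=5000)
    partial_rows.consume_all()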

I get the same problem from both Google Cloud APIs that return generators which internally stream the data as chunks over gRPC (the second call path is sketched after the list):

  • google.cloud.happybase.table.Table#scan()
  • google.cloud.bigtable.table.Table#read_rows().consume_all()
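For reference, the second call path looks roughly like this (same project / instance / table names as in the HappyBase example above):

    # Second call path that fails the same way: the low-level Bigtable
    # table API instead of the HappyBase wrapper.
    bt_table = instance.table(table_name)
    partial_rows = bt_table.read_rows(limit=5000)
    partial_rows.consume_all()   # raises InvalidChunk here as well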

The abbreviated stacktrace is as follows:

    ---------------------------------------------------------------------------
    InvalidChunk                              Traceback (most recent call last)
    <ipython-input-48-922c8127f43b> in <module>()
          1 row_gen = table.scan(limit=n)
          2 rows = []
    ----> 3 for kvp in row_gen:
          4     pass

    .../site-packages/google/cloud/happybase/table.py in scan(self, row_start, row_stop, row_prefix, columns, timestamp, include_timestamp, limit, **kwargs)
        391             while True:
        392                 try:
    --> 393                     partial_rows_data.consume_next()
        394                     for row_key in sorted(rows_dict):
        395                         curr_row_data = rows_dict.pop(row_key)

    .../site-packages/google/cloud/bigtable/row_data.py in consume_next(self)
        273         for chunk in response.chunks:
        274
    --> 275             self._validate_chunk(chunk)
        276
        277             if chunk.reset_row:

    .../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk(self, chunk)
        388             self._validate_chunk_new_row(chunk)
        389         if self.state == self.ROW_IN_PROGRESS:
    --> 390             self._validate_chunk_row_in_progress(chunk)
        391         if self.state == self.CELL_IN_PROGRESS:
        392             self._validate_chunk_cell_in_progress(chunk)

    .../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk_row_in_progress(self, chunk)
        368         self._validate_chunk_status(chunk)
        369         if not chunk.HasField('commit_row') and not chunk.reset_row:
    --> 370             _raise_if(not chunk.timestamp_micros or not chunk.value)
        371             _raise_if(chunk.row_key and
        372                       chunk.row_key != self._row.row_key)

    .../site-packages/google/cloud/bigtable/row_data.py in _raise_if(predicate, *args)
        439     """Helper for validation methods."""
        440     if predicate:
    --> 441         raise InvalidChunk(*args)

    InvalidChunk:

Can you show me how to scan BigTable from Python, ignoring / logging the dirty rows that raise InvalidChunk? (try ... except around the generator does not work, because the exception comes from inside the Google Cloud API's row_data PartialRowsData.)
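For completeness, this is the kind of wrapper I tried (a sketch); it does not get me past the bad row, because the exception is raised inside PartialRowsData and the generator is already broken at that point:

    from google.cloud.bigtable.row_data import InvalidChunk

    # Attempt: pull from the generator manually so I can catch the exception
    # per row. This does NOT work - once InvalidChunk is raised inside
    # PartialRowsData, the scan cannot simply continue with the next row.
    scanner = table.scan(limit=5000)
    while True:
        try:
            key, row = next(scanner)
        except StopIteration:
            break
        except InvalidChunk:
            # I would like to log the bad row key here and keep going,
            # but the generator is unusable after the exception.
            continue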

Also, can you show me how to stream a table scan in batches? The HappyBase batch_size and scan_batching arguments do not seem to be supported.
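In case it helps clarify what I mean by a batched / streaming scan, this is roughly the shape I am after (a sketch with a hypothetical batch size; it still dies with InvalidChunk whenever a dirty row falls inside the current window):

    # Rough shape of the batched scan I want: repeatedly scan a small window
    # and restart from just past the last key seen. BATCH is a hypothetical
    # batch size.
    BATCH = 1000
    row_start = None
    while True:
        last_key = None
        for key, row in table.scan(row_start=row_start, limit=BATCH):
            last_key = key
            # process(key, row)
        if last_key is None:
            break  # no more rows
        # restart just after the last key we saw (append a zero byte,
        # since row_start is inclusive)
        row_start = last_key + b'\x00'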
