Dataflow Mixing Integer and Long Types

In my Dataflow pipeline, I set the impressions_raw field to a Long in the com.google.api.services.bigquery.model.TableRow object.

Later in the pipeline I read the TableRow back, but instead of a Long I get an Integer.

However, if I explicitly set the value to a Long greater than Integer.MAX_VALUE, for example 3 billion, then I get a Long back!

It seems that the Dataflow SDK is doing some kind of type optimization under the hood.

So, without doing an ugly type check, how should I deal with this programmatically? (Maybe I missed something obvious.)

1 answer

Thanks for the report. Unfortunately, this problem is fundamental to using TableRow. We highly recommend solution 1 below: convert away from TableRow as early as possible in your pipeline.

The TableRow object in which you store these values is serialized and deserialized by Jackson, inside the TableRowJsonCoder. Jackson has exactly the behavior you are describing, i.e., for this class:

 class MyClass { Object v; } 

it will serialize an instance with v = Long.valueOf(<number>) as {v: 30} or {v: 3000000000}. However, during deserialization, it will choose the type of the object based on the number of bits needed to represent the value. See this SO post.
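To make the narrowing rule concrete, here is a minimal sketch that mimics it in plain Java. It does not call Jackson; the class NarrowingSketch and the method parseNarrowest are hypothetical names used only for illustration, and the sketch assumes the decoder picks the narrowest of Integer, Long and BigInteger that can hold the integer literal.

```java
import java.math.BigInteger;

// Hypothetical stand-in that mimics the narrowing rule; it does not call Jackson.
class NarrowingSketch {
    // Returns the narrowest of Integer, Long, BigInteger that can hold the literal.
    static Number parseNarrowest(String literal) {
        BigInteger value = new BigInteger(literal);
        if (value.bitLength() <= 31) {   // fits in a signed 32-bit int
            return value.intValue();
        }
        if (value.bitLength() <= 63) {   // fits in a signed 64-bit long
            return value.longValue();
        }
        return value;
    }

    public static void main(String[] args) {
        // A Long that was written out as 30 comes back as an Integer...
        System.out.println(parseNarrowest("30").getClass().getSimpleName());         // Integer
        // ...while 3 billion does not fit in an int, so it comes back as a Long.
        System.out.println(parseNarrowest("3000000000").getClass().getSimpleName()); // Long
    }
}
```

Once the value has been written as JSON, the original Java type is gone, which is why the type that comes back depends only on the magnitude of the number.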

Two possible solutions come to mind; solution 1 is highly recommended:

  • Do not use TableRow as an intermediate value. In other words, convert to a POJO as soon as possible. The main reason for this type mixup is that TableRow is essentially a Map<String, Object>, and Jackson (or other coders) cannot know that you want a Long back. With a POJO, the types are unambiguous.

    Another advantage of dropping TableRow is that you get an efficient coder, say AvroCoder. Since TableRow is encoded to and decoded from JSON, the encoding is both verbose and slow, so shuffling TableRow objects will be both CPU- and I/O-intensive. I expect you will see much better performance with Avro-coded POJOs than if you pass around TableRow objects.

    For an example, see LaneInfo in TrafficMaxLaneFlow .

  • Write code that can handle both:

     // First option: return a primitive long.
     long numberToLong(@Nonnull Number n) {
       return n.longValue();
     }
     long x = numberToLong((Number) row.get("field"));

     // Second option: return a boxed Long.
     Long numberToLong(@Nonnull Number n) {
       if (n instanceof Long) {
         // avoid a copy
         return (Long) n;
       }
       return Long.valueOf(n.longValue());
     }
     Long x = numberToLong((Number) row.get("field"));

    You may need additional checks in the second option if n can be null .
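As a rough illustration of combining the two ideas, the sketch below converts a row into a POJO at the pipeline boundary and reads the field through Number so that either Integer or Long is accepted. The Impression class and its fromRow method are hypothetical; only the impressions_raw field name comes from the question. A plain Map<String, Object> stands in for TableRow here so the example runs without the BigQuery client library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical POJO; only the impressions_raw field name comes from the question.
class Impression {
    final long impressionsRaw;

    Impression(long impressionsRaw) {
        this.impressionsRaw = impressionsRaw;
    }

    // `row` stands in for a TableRow, which is itself a Map<String, Object>.
    static Impression fromRow(Map<String, Object> row) {
        Object raw = row.get("impressions_raw");
        // Rejects both a missing (null) value and a non-numeric one.
        if (!(raw instanceof Number)) {
            throw new IllegalArgumentException(
                "impressions_raw is missing or not numeric: " + raw);
        }
        // Works whether the decoder produced an Integer or a Long.
        return new Impression(((Number) raw).longValue());
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("impressions_raw", 30);             // small value, boxed as Integer
        System.out.println(fromRow(row).impressionsRaw);
        row.put("impressions_raw", 3_000_000_000L); // large value, boxed as Long
        System.out.println(fromRow(row).impressionsRaw);
    }
}
```

Doing the conversion once, right after the read, keeps the Integer-versus-Long question out of every later step, since downstream code only ever sees a primitive long.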


Source: https://habr.com/ru/post/1235591/

