Apache Pig: namespace prefix (::) after a group operation

A common data-processing pattern is to group by some set of columns, apply a filter, then flatten again. For instance:

    my_data_grouped = group my_data by some_column;
    my_data_grouped = filter my_data_grouped by <some expression>;
    my_data = foreach my_data_grouped generate flatten(my_data);

The problem is that if my_data starts with a schema like (c1, c2, c3), then after this operation it will have a schema like (my_data::c1, my_data::c2, my_data::c3). Is there a way to easily strip the "my_data::" prefix if the column names are unique?

I know I can do something like this:

 my_data = foreach my_data generate c1 as c1, c2 as c2, c3 as c3; 

However, this becomes inconvenient and difficult to maintain for datasets with a large number of columns and is not possible for datasets with variable columns.

2 answers

If all fields in the schema share the same prefix (for example, group1::id, group1::amount, etc.), you can ignore the prefix when referencing specific fields and simply refer to them as id, amount, etc.

Alternatively, if you still want to strip one level of prefix from the schema, you can use a UDF like this:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public class RemoveGroupFromTupleSchema extends EvalFunc<Tuple> {

        // The data passes through unchanged; only the schema is rewritten.
        @Override
        public Tuple exec(Tuple input) throws IOException {
            return input;
        }

        @Override
        public Schema outputSchema(Schema input) {
            if (input.size() != 1) {
                throw new RuntimeException("Expected input (tuple) but input does not have 1 field");
            }

            List<Schema.FieldSchema> inputSchema = input.getFields();
            List<Schema.FieldSchema> outputSchema = new ArrayList<Schema.FieldSchema>(inputSchema);
            for (int i = 0; i < inputSchema.size(); i++) {
                Schema.FieldSchema thisInputFieldSchema = inputSchema.get(i);
                String inputFieldName = thisInputFieldSchema.alias;
                byte dataType = thisInputFieldSchema.type;

                // Drop everything up to and including the first "::" (one prefix level).
                String outputFieldName;
                int findLoc = inputFieldName.indexOf("::");
                if (findLoc == -1) {
                    outputFieldName = inputFieldName;
                } else {
                    outputFieldName = inputFieldName.substring(findLoc + 2);
                }

                // Only the alias and type are copied; nested schemas of complex fields are not.
                Schema.FieldSchema thisOutputFieldSchema = new Schema.FieldSchema(outputFieldName, dataType);
                outputSchema.set(i, thisOutputFieldSchema);
            }
            return new Schema(outputSchema);
        }
    }
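
The renaming step in that UDF can be tried on plain strings, without a Pig installation. The sketch below is only an illustration: `PrefixStripper`, `stripOnePrefix`, and `stripAll` are hypothetical names, not part of Pig. It also adds a duplicate check, since the question assumes the column names stay unique once the prefix is gone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Pig: mirrors the alias handling of the UDF above.
final class PrefixStripper {

    // Drops everything up to and including the first "::", i.e. one prefix level.
    static String stripOnePrefix(String alias) {
        int findLoc = alias.indexOf("::");
        return findLoc == -1 ? alias : alias.substring(findLoc + 2);
    }

    // Strips one prefix level from every alias and fails if two aliases collide.
    static List<String> stripAll(List<String> aliases) {
        Set<String> seen = new LinkedHashSet<String>();
        for (String alias : aliases) {
            String stripped = stripOnePrefix(alias);
            if (!seen.add(stripped)) {
                throw new IllegalArgumentException("Duplicate column after stripping: " + stripped);
            }
        }
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        // prints [c1, c2, c3]
        System.out.println(stripAll(Arrays.asList("my_data::c1", "my_data::c2", "c3")));
    }
}
```

Note that an alias with two prefix levels, such as a::b::c, keeps its second level (b::c), which matches the "one level of prefix" behaviour described above.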

You can put the "AS" clause on the same line as the "foreach".

i.e.

    my_data_grouped = group my_data by some_column;
    my_data_grouped = filter my_data_grouped by <some expression>;
    my_data = FOREACH my_data_grouped GENERATE FLATTEN(my_data) AS (c1, c2, c3);

However, this is the same as doing it on two lines, and it does not solve your problem for datasets with variable columns.
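
For wide schemas, the explicit rename list could be generated rather than typed by hand. The following is a minimal sketch under the assumption that the prefixed column names are known when the script text is produced; `RenameProjection` and `build` are hypothetical names, not a Pig API, and the output is just text to paste or template into a script.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical generator, not part of Pig: builds the explicit rename statement as text.
final class RenameProjection {

    // Produces "<relation> = foreach <relation> generate <alias> as <short name>, ...;"
    static String build(String relation, List<String> prefixedAliases) {
        StringJoiner cols = new StringJoiner(", ");
        for (String alias : prefixedAliases) {
            // The short name is whatever follows the last "::", or the alias itself.
            int sep = alias.lastIndexOf("::");
            String shortName = sep == -1 ? alias : alias.substring(sep + 2);
            cols.add(alias + " as " + shortName);
        }
        return relation + " = foreach " + relation + " generate " + cols + ";";
    }

    public static void main(String[] args) {
        System.out.println(build("my_data",
                Arrays.asList("my_data::c1", "my_data::c2", "my_data::c3")));
    }
}
```

This only removes the typing burden; it does not help when the column set is unknown until the job runs.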


Source: https://habr.com/ru/post/917832/

