Using Pig, how do I parse a mixed-format line into tuples and a bag of tuples?

I'm new to Pig, and I'm having a problem parsing my input and getting it into a format that I can use. The input file contains lines that have both fixed fields and KV pairs, as follows:

    FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn

My goal here is to count the number of unique fixed field combinations for each of the KV pairs. Therefore, consider the following input lines:

    1|2|3|4|key1=value1|key2=value2
    2|3|4|5|key1=value7|key2=value2|key3=value3

When I'm finished, I would like to be able to generate the following results (the output format doesn't matter at the moment; I'm just showing what I would like the results to be):

    key1=value1 : 1
    key1=value7 : 1
    key2=value2 : 2
    key3=value3 : 1

It seems like I should be able to do this by grouping on the fixed fields and flattening a bag of the KV pairs to generate the cross-product.

I tried reading this with something like:

    data = load 'myfile' using PigStorage('|');
    A = foreach data generate $0 as ff1:chararray, $1 as ff2:long, $2 as ff3:chararray, $3 as ff4:chararray, TOBAG($4..) as kvpairs:bag{kvpair:tuple()};
    B = foreach A {
        sorted = order A by ff2;
        lim = limit sorted 1;
        generate group.ff1, group.ff4, flatten( lim.kvpairs );
    };
    C = filter B by ff3 matches 'somevalue';
    D = foreach C generate ff1, ff4, flatten( kvpairs ) as kvpair;
    E = group D by (ff1, ff4, kvpair);
    F = foreach E generate group, COUNT(E);

This generates schema entries as follows:

A: {date: long, hms: long, id: long, ff1: chararray, ff2: long, ff3: chararray, ff4: chararray, kvpairs: {kvpair: (NULL)}}

While this gets me the schema I want, there are several problems that I cannot solve:

  • When using TOBAG with .., no schema can be applied to my kvpairs, so I cannot filter on kvpair, and I seem to be unable to apply one at any later point, so this is all or nothing.
  • The filter in statement C does not seem to return any data no matter what value I use, even if I use something like ".*" or ".+". I do not know whether this is because there is no schema, or whether it is actually a bug in Pig. If I dump some data from statement B, I definitely see data there that would match these expressions.

So, I tried to approach the problem differently by loading data using:

    data = load 'myfile' using PigStorage('\n') as (line:chararray);
    init_parse = foreach data generate FLATTEN( STRSPLIT( line, '\\|', 4 ) ) as (ff1:chararray, ff2:chararray, ff3:chararray, ff4:chararray, kvpairsStr:chararray);
    A = foreach mc_bk_data generate ff1, ff2, ff3, ff4, TOBAG( STRSPLIT( kvpairsStr, '\\|', 500 ) ) as kvpairs:bag{t:(kvpair:chararray)};

The problem is that TOBAG( STRSPLIT( ... ) ) gives a bag with one tuple, with each of the kvpairs being a field in that tuple. What I really need is a bag that holds each individual kvpair as its own single-field tuple, so that when I flatten the bag, I get the cross-product of the bag and the group that interests me.

I am open to other ways of attacking this problem, but I cannot find a good way to turn my tuple of several fields into a bag of tuples, each tuple having one field.

I am using Apache Pig version 0.11.1.1.3.0.0-107

Thanks in advance.

1 answer

Your second approach is on the right track. Unfortunately, you will need a UDF to convert the tuple into a bag; as far as I know, there is no built-in for this. It is easy to write, however.

You do not want to group on the fixed fields, but on the key-value pairs themselves. Therefore, you only need to keep the tuple of key-value pairs; you can ignore the fixed fields completely.

The UDF is pretty simple. In Java, you can just do something like this in your exec method:

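    // Body of exec(Tuple input); the types used here live in org.apache.pig.data
    DataBag b = new DefaultDataBag();
    Tuple t = (Tuple) input.get(0);  // the tuple produced by STRSPLIT
    for (int i = 0; i < t.size(); i++) {
        Object o = t.get(i);
        // Wrap each field in its own single-field tuple
        Tuple e = TupleFactory.getInstance().newTuple(o);
        b.add(e);
    }
    return b;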

After that, use the UDF to turn the tuple from STRSPLIT into a bag, flatten it, and then do the grouping and counting.
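
To check what that pipeline should produce, here is a rough plain-Java model of it. This is not Pig code and uses no Pig classes: `KvPairCount`, `toBag`, `countPairs` and `FIXED_FIELDS` are names made up for illustration, lists of strings stand in for tuples and bags, and it assumes the first four fields of every line are the fixed fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the pipeline; none of these names come from Pig.
class KvPairCount {

    // Assumption: the first four '|'-separated fields are the fixed fields.
    static final int FIXED_FIELDS = 4;

    // Models STRSPLIT plus the tuple-to-bag UDF: one input line becomes a
    // "bag" (outer list) of single-field "tuples" (inner lists).
    static List<List<String>> toBag(String line) {
        String[] fields = line.split("\\|");
        List<List<String>> bag = new ArrayList<>();
        for (int i = FIXED_FIELDS; i < fields.length; i++) {
            bag.add(Collections.singletonList(fields[i]));
        }
        return bag;
    }

    // Models flatten + group by kvpair + COUNT over all lines.
    static Map<String, Long> countPairs(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (List<String> tuple : toBag(line)) {
                counts.merge(tuple.get(0), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("1|2|3|4|key1=value1|key2=value2");
        lines.add("2|3|4|5|key1=value7|key2=value2|key3=value3");
        countPairs(lines).forEach((pair, n) -> System.out.println(pair + " : " + n));
    }
}
```

Run on the two sample lines from the question, this gives the four counts listed there. Note that it counts occurrences of each pair and ignores the fixed fields, as suggested above; if the same fixed-field combination can appear on more than one line, a distinct step before the count would probably be needed.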


Source: https://habr.com/ru/post/954802/

