I'm new to Pig, and I'm having trouble parsing my input and getting it into a format I can use. The input file contains lines that have both fixed fields and key-value (KV) pairs, as follows:
FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn
My goal is to count the number of unique fixed-field combinations for each KV pair. For example, consider the following input lines:
1|2|3|4|key1=value1|key2=value2
2|3|4|5|key1=value7|key2=value2|key3=value3
When I'm finished, I would like to be able to generate the following results (the output format doesn't matter at the moment; this just shows what I would like the results to be):
key1=value1 : 1
key1=value7 : 1
key2=value2 : 2
key3=value3 : 1
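To make the goal concrete, here is a minimal sketch of the computation in plain Python rather than Pig. It is only an illustration of the counting I have in mind, not a proposed solution; the function name `count_combos_per_kv` and the `n_fixed` parameter are made up for this example, and it assumes that all four fixed fields form the combination.

```python
from collections import defaultdict


def count_combos_per_kv(lines, n_fixed=4):
    """Count the distinct fixed-field combinations seen with each KV pair."""
    combos = defaultdict(set)
    for line in lines:
        fields = line.strip().split("|")
        fixed = tuple(fields[:n_fixed])   # the fixed-field combination
        for kv in fields[n_fixed:]:       # every remaining field is one KV pair
            combos[kv].add(fixed)
    return {kv: len(seen) for kv, seen in combos.items()}


sample = [
    "1|2|3|4|key1=value1|key2=value2",
    "2|3|4|5|key1=value7|key2=value2|key3=value3",
]
counts = count_combos_per_kv(sample)
for kv in sorted(counts):
    print(kv, ":", counts[kv])
```

Run on the two sample lines, this prints the four counts listed above.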
It seems like I should do this by grouping on the fixed fields and flattening a bag of KV pairs to generate the cross-product.
I tried reading this with something like:
data = load 'myfile' using PigStorage('|');
A = foreach data generate
    $0 as ff1:chararray,
    $1 as ff2:long,
    $2 as ff3:chararray,
    $3 as ff4:chararray,
    TOBAG($4..) as kvpairs:bag{kvpair:tuple()};
B = foreach A {
    sorted = order A by ff2;
    lim = limit sorted 1;
    generate group.ff1, group.ff4, flatten( lim.kvpairs );
};
C = filter B by ff3 matches 'somevalue';
D = foreach C generate ff1, ff4, flatten( kvpairs ) as kvpair;
E = group D by (ff1, ff4, kvpair);
F = foreach E generate group, COUNT(E);
This generates schema entries as follows:
A: {date: long, hms: long, id: long, ff1: chararray, ff2: long, ff3: chararray, ff4: chararray, kvpairs: {kvpair: (NULL)}}
While this gets me the schema I want, there are several problems that I cannot solve:
- Using TOBAG with .., no schema can be applied to my kvpairs, so I cannot filter on kvpair. I also don't seem to be able to apply one at any later point, so it is all or nothing.
- The filter in statement C does not seem to return any data no matter what pattern I use, even something like '.*' or '.+'. I do not know whether this is because there is no schema or whether it is actually a bug in Pig. If I dump some of the data from statement B, I definitely see values there that should match these expressions.
So, I tried to approach the problem differently by loading data using:
data = load 'myfile' using PigStorage('\n') as (line:chararray);
init_parse = foreach data generate
    FLATTEN( STRSPLIT( line, '\\|', 4 ) )
    as (ff1:chararray, ff2:chararray, ff3:chararray, ff4:chararray, kvpairsStr:chararray);
A = foreach init_parse generate
    ff1, ff2, ff3, ff4,
    TOBAG( STRSPLIT( kvpairsStr, '\\|', 500 ) ) as kvpairs:bag{t:(kvpair:chararray)};
The problem is that TOBAG(STRSPLIT(...)) gives a bag containing a single tuple, with each kvpair being a field in that tuple. What I really need is a bag in which each individual kvpair is its own one-field tuple, so that when I flatten the bag I get the cross-product of the bag and the group I'm interested in.
I am open to other ways to attack this problem, but I can't find a good way to turn my tuple of several fields into a bag of tuples, each tuple having one field.
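The difference between the two bag shapes can be modelled in plain Python (again not Pig, just an illustration). Here a bag is a list of tuples, and the hypothetical helper `flatten` mimics what I understand FLATTEN to do with a bag: it emits one output row per tuple in the bag, with that tuple's fields appended to the other fields. The values in `fixed` are stand-ins for (ff1, ff4).

```python
kvpairs_str = "key1=value7|key2=value2|key3=value3"
fixed = ("2", "5")  # stand-ins for (ff1, ff4)

parts = tuple(kvpairs_str.split("|"))

# Shape I currently get: one tuple whose fields are the kvpairs.
bag_one_tuple = [parts]

# Shape I want: one single-field tuple per kvpair.
bag_of_tuples = [(p,) for p in parts]


def flatten(fixed_fields, bag):
    """Model of flattening a bag: one row per tuple in the bag."""
    return [fixed_fields + t for t in bag]


rows_current = flatten(fixed, bag_one_tuple)   # a single wide row
rows_wanted = flatten(fixed, bag_of_tuples)    # one row per kvpair

print(len(rows_current), len(rows_wanted))
for row in rows_wanted:
    print(row)
```

With the first shape, flattening yields a single row that carries all the kvpairs as separate columns; with the second, it yields one (ff1, ff4, kvpair) row per kvpair, which is what the later group-and-count step needs.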
I am using Apache Pig version 0.11.1.1.3.0.0-107
Thanks in advance.