Expanding a delimited chararray into multiple tuples

One of the columns in my relation contains delimited values (e.g. CSV), and I want to split them so that the relation has one entry per value, each combined with the other columns, which hold atomic values. For example, if I had the following data:

    SomeID|Age|CommaSeperatedNames
    1     |23 |Steve,Joe,Bob
    2     |26 |Dan,Mike,Tom

I would like the resulting relation to contain:

    SomeID|Age|Names
    1     |23 |Steve
    1     |23 |Joe
    1     |23 |Bob
    2     |26 |Dan
    2     |26 |Mike
    2     |26 |Tom

Can this be done using only Pig Latin and the built-in / Piggybank UDFs? Note: I already have a hacky solution using a UDF that I wrote; I would like to know whether this is possible with Pig alone.

1 answer

TOKENIZE will split your names into a bag. If you then FLATTEN that bag, each element is emitted as its own row, paired with the other fields of the tuple. If TOKENIZE does not tokenize the way you want (it should work fine with commas), you will probably have to write a UDF that outputs a bag.

    A = LOAD ... USING PigStorage('|') AS (SomeID, Age, Names);
    B = FOREACH A GENERATE SomeID, Age, FLATTEN(TOKENIZE(Names)) AS Name;
    STORE B INTO ...;
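
One caveat with the snippet above: when called with a single argument, TOKENIZE splits on its default delimiters, which include whitespace as well as commas, so a value such as "Mary Ann" would come out as two names. The variant below is only a sketch that restricts splitting to commas by passing the optional delimiter argument; it assumes a Pig release whose TOKENIZE accepts that second argument, and the input path 'people.txt' is a hypothetical placeholder rather than anything from the question.

    -- 'people.txt' is a hypothetical input path; substitute your own location.
    A = LOAD 'people.txt' USING PigStorage('|') AS (SomeID, Age, Names:chararray);
    -- Split on commas only, then emit one row per element of the resulting bag.
    B = FOREACH A GENERATE SomeID, Age, FLATTEN(TOKENIZE(Names, ',')) AS Name;
    -- Print the result; replace with STORE once the output looks right.
    DUMP B;

Declaring Names as chararray up front avoids relying on an implicit cast, since TOKENIZE operates on strings.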

Source: https://habr.com/ru/post/1368885/

