How to load TSV containing a JSON field in Pig Latin?

I am trying to load a file whose schema is mostly TSV (tab-separated values), but one of the fields is a JSON value. It seems that Pig Latin has a TextLoader for tab- (or otherwise) separated values, and a JsonLoader for JSON ...

In particular, each row of data looks like this:

date\tevent_name\tevent_details\n 

where event_details is a JSON-formatted string. The rest are just chararrays.

What is the easiest way to load this data?

Note: I am using Pig version 0.11.1.

2 answers

(After doing some research, here is the answer:)

Download the libraries needed by the register commands from http://mvnrepository.com/.

The Pig script is as follows:

    register 'libs/elephant-bird-core-4.1.jar';
    register 'libs/elephant-bird-pig-4.1.jar';
    register 'libs/guava-14.0.1.jar';
    register 'libs/json-simple-1.1.1.jar';
    register 'libs/piggybank.jar';

    define decode_json com.twitter.elephantbird.pig.piggybank.JsonStringToMap();

    -- PigStorage() splits on tabs by default.
    e1 = load '$filename' using PigStorage() as (
        date: chararray,
        event_name: chararray,
        event_details_str: chararray
    );

    -- Remove the header row:
    e2 = filter e1 by not date matches '.*DATE';

    -- Convert the event_details from a JSON string to a map:
    events = foreach e2 generate *, decode_json(event_details_str) as event_details;
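
To check that the JSON column was actually decoded, a few lines like the following could be appended to the script. This is only a sketch: first_rows is a hypothetical alias, and the limit of 5 is arbitrary.

    -- Print the schema; event_details should show up as a map.
    DESCRIBE events;

    -- Look at a handful of rows instead of dumping the whole file.
    first_rows = LIMIT events 5;
    DUMP first_rows;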

mbells' answer works fine; the one thing I struggled with was how to get the map values. Here is an example of retrieving key1 and key2 from the event_details map:

    fields = FOREACH events GENERATE event_details#'key1', event_details#'key2';
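
One related point: as far as I can tell, JsonStringToMap produces a map whose values are all chararrays and only decodes one level deep, so numeric values probably need an explicit cast. A minimal sketch, assuming the JSON contains the hypothetical keys user_id and duration_ms (neither comes from the question):

    -- user_id and duration_ms are made-up keys, for illustration only.
    typed = FOREACH events GENERATE
        date,
        event_name,
        event_details#'user_id' AS user_id,
        (long) event_details#'duration_ms' AS duration_ms;

A key that is missing from a given row should simply come back as null rather than failing the job.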

Source: https://habr.com/ru/post/1495843/

