Loading raw JSON in Pig

I have a file where each line is a JSON object (in fact, it is a Stack Overflow dump). I would like to load it into Apache Pig as simply as possible, but I am having trouble figuring out how to tell Pig what the input format is. Here is an example entry,

{ "_id" : { "$oid" : "506492073401d91fa7fdffbe" }, "Body" : "....", "ViewCount" : 7351, "LastEditorDisplayName" : "Rich B", "Title" : ".....", "LastEditorUserId" : 140328, "LastActivityDate" : { "$date" : 1314819738077 }, "LastEditDate" : { "$date" : 1313882544213 }, "AnswerCount" : 12, "CommentCount" : 19, "AcceptedAnswerId" : 7, "Score" : 83, "PostTypeId" : "question", "OwnerUserId" : 8, "Tags" : [ "c#", "winforms" ], "CreationDate" : { "$date" : 1217540572667 }, "FavoriteCount" : 13, "Id" : 4, "ForumName" : "stackoverflow.com" } 

Is there a way to load a file where each line is one of the above into Pig without specifying the schema manually? Or perhaps a way to generate a schema automatically from the (possibly nested) keys observed across all objects? If I do need to specify the schema manually, what would the schema string look like?

Thanks!

1 answer

The quick and easy way: use Twitter's elephantbird project. It includes a loader named com.twitter.elephantbird.pig.load.JsonLoader. When used directly, like so,

 A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
 B = FOREACH A GENERATE json#'fieldName' AS field_name;

nested elements will not be loaded. However, you can easily change that (if desired) by passing the -nestedLoad option,

 A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

Including elephantbird is easy: pull the project "elephantbird" from the organization "com.twitter.elephantbird" using Maven (or an equivalent dependency manager), and then issue the usual Pig register command,

 register 'lib/elephantbird.jar'; 
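
Putting the pieces together, a complete script might look roughly like the sketch below. The jar path and data path are the placeholders used above, the relation names (posts, fields) are only illustrative, and the keys Title, ViewCount and Tags are taken from the sample record in the question. Adjust all of them to your setup.

```pig
-- Jar path as in the answer above; it depends on how you pulled in the dependency.
REGISTER 'lib/elephantbird.jar';

-- Each input line becomes one tuple holding a single map.
-- '-nestedLoad' asks the loader to keep nested objects and arrays.
posts = LOAD '/path/to/data.json'
        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
        AS (json:map[]);

-- Pull individual keys out of the map with the # operator.
fields = FOREACH posts GENERATE
    json#'Title'     AS title,
    json#'ViewCount' AS view_count,
    json#'Tags'      AS tags;

DUMP fields;
```

Because the map is declared as map[] without a value type, Pig treats the projected values as untyped, so you may need an explicit cast before doing arithmetic or comparisons on them.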

Source: https://habr.com/ru/post/1436863/

