I have a file in which each line is a JSON object (it is actually a Stack Overflow dump). I would like to load it into Apache Pig as simply as possible, but I am having trouble understanding how to tell Pig what the input format is. Here is an example entry:
{ "_id" : { "$oid" : "506492073401d91fa7fdffbe" }, "Body" : "....", "ViewCount" : 7351, "LastEditorDisplayName" : "Rich B", "Title" : ".....", "LastEditorUserId" : 140328, "LastActivityDate" : { "$date" : 1314819738077 }, "LastEditDate" : { "$date" : 1313882544213 }, "AnswerCount" : 12, "CommentCount" : 19, "AcceptedAnswerId" : 7, "Score" : 83, "PostTypeId" : "question", "OwnerUserId" : 8, "Tags" : [ "c#", "winforms" ], "CreationDate" : { "$date" : 1217540572667 }, "FavoriteCount" : 13, "Id" : 4, "ForumName" : "stackoverflow.com" }
Is there a way to load a file in which each line looks like the one above into Pig without having to specify the schema manually? Or perhaps a way to generate the schema automatically from the (possibly nested) keys observed across all objects? If I do have to specify the schema manually, what would the schema string look like?
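To make the "generate the schema from observed keys" idea concrete, here is a minimal Python sketch (not Pig) of what I have in mind. The function name `observed_key_paths` and the dotted-path notation are purely my own illustration. It only reports which key paths and value types occur in the file; it does not produce a Pig schema string.

```python
import json
from collections import defaultdict


def observed_key_paths(lines):
    """Map each dotted key path to the sorted list of value types seen for it.

    Hypothetical helper for illustration: nested objects are flattened into
    paths such as "_id.$oid", and array elements are recorded under "path[]".
    """
    paths = defaultdict(set)

    def walk(value, prefix):
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{prefix}.{key}" if prefix else key)
        elif isinstance(value, list):
            # Record the array itself, then the types of its elements.
            paths[prefix].add("array")
            for item in value:
                walk(item, prefix + "[]")
        else:
            # Scalar leaf: record the Python type name (str, int, float, ...).
            paths[prefix].add(type(value).__name__)

    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            walk(json.loads(line), "")

    return {path: sorted(types) for path, types in paths.items()}


if __name__ == "__main__":
    # A shortened version of the example entry above.
    sample = [
        '{ "_id" : { "$oid" : "506492073401d91fa7fdffbe" }, "ViewCount" : 7351, '
        '"Tags" : [ "c#", "winforms" ], "CreationDate" : { "$date" : 1217540572667 } }'
    ]
    for path, types in sorted(observed_key_paths(sample).items()):
        print(path, types)
```

Running something like this over the whole dump would show every key that appears at least once, including the nested `$oid` and `$date` wrappers, which is the information a hand-written schema would have to cover.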
Thanks!