Hadoop Pig - removing the csv header

Question

My CSV files have a header in the first line. Loading them into Pig breaks any subsequent function (e.g. SUM). So far, I have been applying a filter to the loaded data to remove the rows containing headers:

    affaires = load 'affaires.csv' using PigStorage(',') as (NU_AFFA:chararray, date:chararray);
    affaires = filter affaires by date matches '../../..';

This method seems clumsy to me, and I wonder if there is a way to tell Pig not to load the first line of the CSV, such as a boolean parameter like "as_header" in the load function. I do not see anything like this in the documentation. What would be a better approach? How do you usually deal with this?

csv hadoop apache-pig

romain jouin Mar 29 '15 at 10:24

2 answers

Another simple option for Pig 0.9, which does not rely on the SKIP_INPUT_HEADER option, is to filter out the header row after loading:

Input file (input.csv)

    Name,Age,Location
    a,10,chennai
    b,20,banglore

PigScript: (without using the SKIP_INPUT_HEADER option, since this option is not available in Pig 0.9)

    register '<Your location>/piggybank.jar';
    d_with_headers = LOAD 'input.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() AS (name:chararray, age:long, location:chararray);
    -- drop the header row, whose first field is the literal column title
    d = FILTER d_with_headers BY name != 'Name';
    dump d;

Output:

    (a,10,chennai)
    (b,20,banglore)

served_raw Feb 07 '18 at 21:04

Sivasakthi jayaraman · Accepted Answer · 2015-03-29T22:51:07+0000

The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. Download piggybank.jar and try this option.

Example

input.csv

    Name,Age,Location
    a,10,chennai
    b,20,banglore

PigScript: (with SKIP_INPUT_HEADER capability)

    REGISTER '/tmp/piggybank.jar';
    A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
    DUMP A;

Output:

    (a,10,chennai)
    (b,20,banglore)
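Since the question mentions SUM, here is a minimal sketch of how the header skip might be combined with a typed schema and an aggregate. The AS schema and the aliases G and S are assumptions added for illustration; the piggybank path and input file are the ones used in the script above.

```pig
REGISTER '/tmp/piggybank.jar';

-- Load the CSV, skipping the header row, and declare column types
-- (the schema below is an assumption based on the sample input)
A = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (name:chararray, age:long, location:chararray);

-- Put every row into a single group so the whole column can be aggregated
G = GROUP A ALL;

-- Sum the age column across all rows
S = FOREACH G GENERATE SUM(A.age);

DUMP S;
```

With the header skipped at load time, the age column contains only numeric values, so the aggregate does not need a separate filtering step.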

Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
