Hadoop Pig - removing the CSV header

My CSV files have a header in the first line. Loading them into Pig makes a mess of any subsequent functions (e.g. SUM). So far, I have been applying a filter to the loaded data to remove the rows containing headers:

affaires = load 'affaires.csv' using PigStorage(',') as (NU_AFFA:chararray, date:chararray);
affaires = filter affaires by date matches '../../..';

I think this is a rather clumsy method, and I wonder whether there is a way to tell Pig not to load the first line of the CSV, such as a boolean "as_header" parameter to the load function. I do not see anything like this in the documentation. What would be a better approach? How do you usually deal with this?

2 answers

The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. Download piggybank.jar and try this option.

Example

input.csv

 Name,Age,Location
 a,10,chennai
 b,20,banglore

PigScript: (with SKIP_INPUT_HEADER capability)

 REGISTER '/tmp/piggybank.jar';
 A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
 DUMP A;

Output:

 (a,10,chennai)
 (b,20,banglore)
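Since the question mentions SUM, here is a minimal sketch of how the same load could feed an aggregate once the header row is skipped. The AS schema and the aliases G and S are assumptions added for illustration; they are not part of the original answer. The jar path is the one used in the script above.

```pig
REGISTER '/tmp/piggybank.jar';

-- Same loader options as above; the AS schema is an assumption that matches input.csv.
A = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (name:chararray, age:long, location:chararray);

-- Collect all rows into a single group so SUM runs over the whole relation.
G = GROUP A ALL;
S = FOREACH G GENERATE SUM(A.age) AS total_age;

DUMP S;
```

With the two data rows of input.csv this should print (30). Because the header is never loaded, no "Age" string reaches the long column.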

Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html


On Pig 0.9, where the SKIP_INPUT_HEADER option is not available, another simple option is to load the header as an ordinary row and then filter it out:

input.csv

 Name,Age,Location
 a,10,chennai
 b,20,banglore

PigScript: (without using the SKIP_INPUT_HEADER option, since this option is not available in Pig 0.9)

 register '<Your location>/piggybank.jar';
 d_with_headers = LOAD 'input.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() AS (name:chararray, age:long, location:chararray);
 d = FILTER d_with_headers BY name != 'Name';
 dump d;

Output:

 (a,10,chennai)
 (b,20,banglore)

Source: https://habr.com/ru/post/984403/

