How can I ignore " (double quotes) when loading a file in Pig?

I have the following data in a file

"a","b","1","2"
"a","b","4","3"
"a","b","3","1"

I am reading this file using the command

File1 = LOAD '/path' using PigStorage (',') as (f1:chararray,f2:chararray,f3:int,f4:int)

But Pig then ignores the data in fields 3 and 4.

How can I read this file, or make Pig skip the '"' characters in some way?

Additional information: I am using Apache Pig version 0.10.0.

5 answers

You can use the REPLACE function (though it cannot be done in a single pass):

file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0 as (f1:chararray), $1 as (f2:chararray), REPLACE($2, '\\"', '') as (f3:int), REPLACE($3, '\\"', '') as (f4:int);

You can also use regular expressions with REGEX_EXTRACT:

file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0, $1, REGEX_EXTRACT($2, '([0-9]+)', 1), REGEX_EXTRACT($3, '([0-9]+)', 1);

Of course, you can strip the " from f1 and f2 in the same way.
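To see what the '([0-9]+)' pattern picks out of a quoted field, here is a small sketch in plain Python rather than Pig. It assumes Python's re.search behaves closely enough to REGEX_EXTRACT for this simple pattern, and extract_digits is a hypothetical helper name used only for illustration.

```python
import re

def extract_digits(field):
    """Return the first run of digits in field, or None if there is none."""
    match = re.search(r'([0-9]+)', field)
    return match.group(1) if match else None

# Fields as they look after splitting one sample line on ','
fields = '"a","b","1","2"'.split(',')
print([extract_digits(f) for f in fields])
```

A digit-only pattern fits the numeric columns only; for f1 and f2 you would need a different pattern or REPLACE.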


Try the following:

using org.apache.pig.piggybank.storage.CSVExcelStorage() 

If you have Jython available, you can use a UDF to do this.

python UDF

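#!/usr/bin/env python

'''
udf.py
'''

# outputSchema is supplied by Pig when the script is registered USING jython.
@outputSchema("out:chararray")
def formatter(item):
    # Null fields arrive as None; pass them through unchanged.
    if item is None:
        return None
    # Strip the surrounding double quotes and always return a string,
    # which matches the declared chararray schema. Cast to int in Pig if needed.
    return item.strip('"')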

script

REGISTER 'udf.py' USING jython as udf;
data = load 'file' USING PigStorage(',') AS (col1:chararray, col2:chararray,
       col3:chararray, col4:chararray);
out = foreach data generate udf.formatter(col1) as a, udf.formatter(col3) as b;
dump out;

(a,1)
(a,4)
(a,3)
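If you want to check the quote-stripping logic without starting Pig, a minimal sketch like the one below can be run with plain Python. The outputSchema stub is a stand-in I am assuming here only so the decorator resolves locally (Pig provides the real one for Jython UDFs), and this formatter is a simplified variant that only strips the quotes.

```python
def outputSchema(schema):
    """Local stand-in for the decorator Pig supplies to Jython UDFs (assumption)."""
    def wrap(func):
        func.output_schema = schema  # keep the schema string for inspection
        return func
    return wrap

@outputSchema("out:chararray")
def formatter(item):
    # Simplified variant: strip surrounding double quotes, pass None through.
    if item is None:
        return None
    return item.strip('"')

for raw in ['"a"', '"1"', None]:
    print(repr(raw), '->', repr(formatter(raw)))
```

Remove the stub before registering the file in Pig, since Pig supplies its own decorator.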

Why not use REPLACE? Something like this:

data = LOAD 'YOUR_DATA' Using PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray) ;

new_data = foreach data generate 
   REPLACE(a, '"', '') AS a,
   REPLACE(b, '"', '') AS b, 
   (int)REPLACE(c, '"', '') AS c:int, 
   (int)REPLACE(d, '"', '') AS d:int;

: csv, Excel, .


You can use the CSVExcelStorage loader in Pig. It is part of Piggy-bank, so you need to register the Piggy-bank jar before using it.

Register ${jar_location}/piggybank-0.15.0.jar;

load_data = load '${data_location}' using 
org.apache.pig.piggybank.storage.CSVExcelStorage(',');
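As a rough illustration of what a quote-aware CSV reader does with the sample data, here is a sketch using Python's standard csv module. This is only an analogy in plain Python, not how CSVExcelStorage is implemented, and parse_rows is a hypothetical helper name.

```python
import csv
import io

# The three sample lines from the question
sample = '"a","b","1","2"\n"a","b","4","3"\n"a","b","3","1"\n'

def parse_rows(text):
    """Parse quoted CSV text into (str, str, int, int) tuples."""
    rows = []
    for f1, f2, f3, f4 in csv.reader(io.StringIO(text)):
        # The reader has already removed the quotes, so int() succeeds.
        rows.append((f1, f2, int(f3), int(f4)))
    return rows

print(parse_rows(sample))
```

The point is that the quotes are handled at parse time, so no REPLACE step is needed afterwards.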


Source: https://habr.com/ru/post/1547012/

