Using Pig with Hadoop, how do I regex-match parts of text with an unknown number of groups?

Question

I am using Amazon Elastic MapReduce.

I have log files that look something like this.

   random text foo="1" more random text foo="2"
   more text notamatch="5" noise foo="1"
   blah blah blah foo="1" blah blah foo="3" blah blah foo="4" ...

How can I write a Pig expression to extract all the numbers in the "foo" expressions?

I would like tuples that look something like this:

(1,2)
(1)
(1,3,4)

I tried the following:

TUPLES = foreach LINES generate FLATTEN(EXTRACT(line,'foo="([0-9]+)"'));

But this returns only the first match on each line:

(1)
(1)
(1)

amazon-web-services mapreduce hadoop apache-pig

lmonson Dec 30 '10 at 4:50

2 answers

Donald Miner · Answer 1 · 2010-12-30T14:49:10+0000

You can use STRSPLIT: http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#STRSPLIT

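The regex to split on would be something like [^0-9]+ (i.e., runs of non-digit characters), so that only the numbers are left.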
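To see what splitting on [^0-9]+ does to the sample lines, here is a minimal sketch in plain Python rather than Pig. Python's re.split is only a stand-in for STRSPLIT, and the helper name split_on_non_digits is made up for this illustration. Note that this approach keeps every number on the line, so the 5 from notamatch="5" comes through as well.

```python
import re

def split_on_non_digits(line):
    # Hypothetical helper: split wherever one or more non-digit characters
    # occur, then drop the empty strings left at the start or end of the line.
    return [piece for piece in re.split(r'[^0-9]+', line) if piece]

print(split_on_non_digits('random text foo="1" more random text foo="2"'))
# ['1', '2']
print(split_on_non_digits('more text notamatch="5" noise foo="1"'))
# ['5', '1']  (the 5 does not belong to a foo expression)
```

If the logs can contain other numbers, the split result would need further filtering, or a pattern anchored on foo= would be a better fit.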

Another option is to write a Pig UDF.
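As a rough idea of what the core of such a UDF could look like, here is a sketch in plain Python. The function name extract_foo_numbers is hypothetical, and the code is not wired into Pig: registering it as a Python (Jython) UDF and declaring its output schema are left out and would depend on the Pig version in use. It collects every match of the pattern from the question instead of only the first.

```python
import re

# Same pattern as in the question: capture the digits inside foo="...".
FOO_PATTERN = re.compile(r'foo="([0-9]+)"')

def extract_foo_numbers(line):
    # Hypothetical UDF body: return all captured numbers, in order of
    # appearance, as a tuple. A missing line yields an empty tuple.
    if line is None:
        return ()
    return tuple(int(n) for n in FOO_PATTERN.findall(line))

print(extract_foo_numbers('random text foo="1" more random text foo="2"'))
# (1, 2)
print(extract_foo_numbers('more text notamatch="5" noise foo="1"'))
# (1,)
```

Because the pattern requires the literal foo= prefix, the number in notamatch="5" is skipped.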

user3922840 · Answer 2 · 2014-09-04T07:13:30+0000

You can use REGEX_EXTRACT:

REGEX_EXTRACT (, 'foo = (. *)', 2) AS-;
