Retrieving Non-Matching Entries Between Files in Pig Latin

Question

I am just starting to learn Pig Latin. I need to extract records from a file. I have created two files, T1 and T2. Some tuples are common to both files. I need to extract the tuples that are present only in T1 and omit the tuples that T1 and T2 have in common. Can someone please help me ...

thanks

hadoop apache-pig

user2639239 Jul 31 '13 at 18:31

2 answers

mr2ert · Answer 1 · 2013-07-31T20:02:11+0000

First off, you'll want to take a look at this Venn Diagram. What you want is everything except the middle bit. So first you need to do a full outer JOIN on the data. Then, since nulls are created in an outer JOIN when the key is not shared, you will need to filter the result of the JOIN so that it only contains rows that have one null (the disjoint part of the Venn diagram).

Here's what it would look like in a Pig script:

    -- T1 and T2 are the two sets of tuples you are using, their schemas are:
    -- T1: {t: (num1: int, num2: int)}
    -- T2: {t: (num1: int, num2: int)}
    -- Yours will be different, but the principle is the same

    B = JOIN T1 BY t FULL, T2 BY t ;
    C = FILTER B BY T1::t is null OR T2::t is null ;
    D = FOREACH C GENERATE (T1::t is not null ? T1::t : T2::t) ;

Walking through the steps using this sample input:

    T1:      T2:
    (1,2)    (4,5)
    (3,4)    (1,2)

B does the full outer JOIN, resulting in:

    B: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
    ((1,2),(1,2))
    (,(4,5))
    ((3,4),)

T1 is the left tuple, and T2 is the right tuple. We have to use :: to identify which t we mean, since they have the same name.

Now C filters B so that only the rows containing a null are kept. Resulting in:

    C: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
    (,(4,5))
    ((3,4),)

This is what you want, but it's a little messy to use. D uses a bincond (the ? :) to remove the null. So the final output will be:

    D: {T1::t: (num1: int,num2: int)}
    ((4,5))
    ((3,4))
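For readers who want to check the logic outside Pig, the three steps can be modelled in plain Python. This is only a sketch of the semantics, not Pig code: `full_outer_join` is a hypothetical helper written for this illustration, `None` stands in for Pig's null, and the whole tuple is used as the join key, as in the script above.

```python
def full_outer_join(left, right):
    """Pair equal tuples; pad unmatched tuples with None (Pig's null)."""
    rows = []
    for l in left:
        matches = [r for r in right if r == l]
        if matches:
            rows.extend((l, r) for r in matches)
        else:
            rows.append((l, None))
    # Right-side tuples that have no partner on the left.
    rows.extend((None, r) for r in right if r not in left)
    return rows


T1 = [(1, 2), (3, 4)]
T2 = [(4, 5), (1, 2)]

B = full_outer_join(T1, T2)
# FILTER: keep only rows where one side is null (the non-shared tuples).
C = [(l, r) for l, r in B if l is None or r is None]
# Bincond equivalent: take whichever side is not null.
D = [l if l is not None else r for l, r in C]

print(D)  # [(3, 4), (4, 5)]
```

The row order differs from the Pig output shown above; neither the model nor a Pig JOIN promises a particular order, so only the set of rows should be compared.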

Update:
If you want to keep only the left (T1) side of the join (or the right (T2) side, if you switch things around), you can do it like this:

    -- B is the same

    -- We only want to keep tuples where the T2 tuple is null
    C = FILTER B BY T2::t is null ;
    -- Generate T1::t to get rid of the null T2::t
    D = FOREACH C GENERATE T1::t ;

However, looking back at the original Venn diagram, a full JOIN is not necessary. If you look at this different Venn Diagram, you will see that it covers the set you want without any extra operations. Therefore, you should change B to:

 B = JOIN T1 BY t LEFT, T2 BY t ;
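With the LEFT join in place, the filter and projection from the update yield exactly the tuples that occur only in T1. A small Python model of that pipeline follows, offered as a sketch of the semantics rather than as Pig code: `left_outer_join` is a hypothetical helper, and `None` stands in for Pig's null.

```python
def left_outer_join(left, right):
    """Keep every left tuple; pair it with its matches, or with None."""
    rows = []
    for l in left:
        matches = [r for r in right if r == l]
        if matches:
            rows.extend((l, r) for r in matches)
        else:
            rows.append((l, None))
    return rows


T1 = [(1, 2), (3, 4)]
T2 = [(4, 5), (1, 2)]

B = left_outer_join(T1, T2)
# FILTER B BY T2::t is null: keep rows with no partner in T2.
C = [(l, r) for l, r in B if r is None]
# GENERATE T1::t: drop the null right-hand side.
D = [l for l, _ in C]

print(D)  # [(3, 4)]
```

Because the left join never produces rows whose T1 side is null, the (4,5) tuple from T2 does not show up at all, which is why no extra clean-up step is needed.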

Ran locar · Answer 2 · 2015-12-19T14:46:35+0000

I believe there is a more efficient way to do this, especially if T1 and T2 are very large. I am working on a data set with several billion lines per file, and I'm only interested in T2 lines that are not in T1. Both files have the same layout and similar size.

    T1 = load '/path/to/file1' using PigStorage() as (
        f1,
        f2,
        f3);
    T1 = foreach T1 generate
        $0.., -- all fields
        1 as day1,
        0 as day2;

    T2 = load '/path/to/file2' using PigStorage() as (
        f1,
        f2,
        f3);
    T2 = foreach T2 generate
        $0.., -- all fields
        0 as day1,
        1 as day2;

    T3 = union T1, T2;
    -- assuming f1 is your join field
    T3grouped = foreach (group T3 by f1) generate
        flatten(T3),
        SUM(T3.day1) as day1,
        SUM(T3.day2) as day2;
    T3result = filter T3grouped by day1 == 0;

This returns the rows whose f1 did not appear on day1. It is equivalent to

    T3 = join T2 by f1 LEFT OUTER, T1 by f1;
    T3result = filter T3 by T1::f1 is null;

but much faster. The UNION version finishes in about 10 minutes, while the JOIN version has been running for more than 2 hours (and still has not finished). Looking at the counters, the UNION version generates more I/O (especially around the mappers), but uses only 50% of the CPU.
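To see why the flag-and-sum trick selects the same rows as the outer join, here is a minimal Python model of the UNION version. It is a sketch under stated assumptions, not Pig code: the sample rows in `file1` and `file2` are made up for illustration, and f1 is taken to be the first field.

```python
from collections import defaultdict

# Hypothetical sample rows (f1, f2, f3); f1 is the join field.
file1 = [("a", 1, 1), ("b", 2, 2)]
file2 = [("b", 2, 2), ("c", 3, 3)]

# Tag each row with its origin, as the day1/day2 columns do.
T1 = [row + (1, 0) for row in file1]
T2 = [row + (0, 1) for row in file2]
T3 = T1 + T2  # UNION

# GROUP T3 BY f1.
groups = defaultdict(list)
for row in T3:
    groups[row[0]].append(row)

T3result = []
for rows in groups.values():
    day1 = sum(r[3] for r in rows)  # SUM(T3.day1)
    day2 = sum(r[4] for r in rows)  # SUM(T3.day2)
    if day1 == 0:  # this f1 never appeared in file1
        # FLATTEN(T3): emit each row of the group plus the group totals.
        T3result.extend(r + (day1, day2) for r in rows)

print(T3result)  # [('c', 3, 3, 0, 1, 0, 1)]
```

Only the key "c" has a day1 total of 0, so only the file2 row for "c" survives, which is the row a left outer join of T2 against T1 would keep after the null filter.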
