First off, you'll want to take a look at this Venn diagram. What you want is everything except the middle bit. So first you need to do a full outer JOIN on the data. Then, since nulls are created in an outer JOIN when the key is not shared, you will want to filter the result of the JOIN so that it contains only rows that have one null (the non-intersecting part of the Venn diagram).
Here's what that would look like in a Pig script:
    -- T1 and T2 are the two sets of tuples you are using, their schemas are:
    -- T1: {t: (num1: int, num2: int)}
    -- T2: {t: (num1: int, num2: int)}
    -- Yours will be different, but the principle is the same
    B = JOIN T1 BY t FULL, T2 BY t ;
    C = FILTER B BY T1::t is null OR T2::t is null ;
    D = FOREACH C GENERATE (T1::t is not null ? T1::t : T2::t) ;
Walking through the steps using this input:
    T1:      T2:
    (1,2)    (4,5)
    (3,4)    (1,2)
B executes a full outer JOIN, resulting in:
    B: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
    ((1,2),(1,2))
    (,(4,5))
    ((3,4),)
T1 is the left tuple, and T2 is the right tuple. We have to use :: to identify which t is meant, since they have the same name.
Now C filters B so that only rows containing a null are kept. Result:
    C: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
    (,(4,5))
    ((3,4),)
This is what you want, but it's a little messy. D uses a bincond (the ? :) to remove the null. So the final output will be:
    D: {T1::t: (num1: int,num2: int)}
    ((4,5))
    ((3,4))
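If you want to reproduce the walkthrough end to end, something along these lines should work. Treat it as a sketch rather than a drop-in script: the file names `t1.txt` and `t2.txt` are made up, and it assumes each file holds one tuple per line in Pig's text form, e.g. `(1,2)`.

```pig
-- Hypothetical input files (names are assumptions), one tuple per line:
--   t1.txt: (1,2) and (3,4)
--   t2.txt: (4,5) and (1,2)
T1 = LOAD 't1.txt' AS (t: tuple(num1: int, num2: int));
T2 = LOAD 't2.txt' AS (t: tuple(num1: int, num2: int));

-- Full outer JOIN: unmatched rows get a null on the missing side
B = JOIN T1 BY t FULL, T2 BY t;

-- Keep only the rows where exactly one side matched
C = FILTER B BY T1::t is null OR T2::t is null;

-- Collapse each row to whichever side is not null
D = FOREACH C GENERATE (T1::t is not null ? T1::t : T2::t);

DUMP D;
```

Run in local mode (`pig -x local`) against the sample input, this should dump the same two tuples shown for D, although the order of the rows is not guaranteed.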
Update:
If you want to keep only the left (T1) side of the join (or the right (T2) side, if you switch things around), you can do this:
    -- B is the same
    -- We only want to keep tuples where the T2 tuple is null
    C = FILTER B BY T2::t is null ;
    -- Generate T1::t to get rid of the null T2::t
    D = FOREACH C GENERATE T1::t ;
However, looking back at the original Venn diagram, using a full JOIN is unnecessary here. If you look at this different Venn diagram, you can see that it covers the set you want without any extra operations. Therefore, you should change B to:
    B = JOIN T1 BY t LEFT, T2 BY t ;
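Putting the update together, the left-only version is just the statements above in order. This assumes the same T1 and T2 relations and schemas as before:

```pig
-- Left outer JOIN: every T1 row is kept; T2::t is null where T2 has no match
B = JOIN T1 BY t LEFT, T2 BY t;

-- Keep only the T1 rows that found no partner in T2
C = FILTER B BY T2::t is null;

-- Drop the (always null) T2 column
D = FOREACH C GENERATE T1::t;
```

With the sample input from the walkthrough, only (3,4) exists in T1 without a match in T2, so D should contain just that one tuple.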