Bag subtraction

I am using Pig to analyze application logs and find out which public methods a user has called that the same user had not called before last month.

I managed to get the methods called, grouped by user, both before last month and after last month:

Sample of the BEFORE last month relation

    u1 {(m1),(m2)}
    u2 {(m3),(m4)}

Sample of the AFTER last month relation

    u1 {(m1),(m3)}
    u2 {(m1),(m4)}

What I want is to find, for each user, which methods are in AFTER that are not in BEFORE, i.e.

Expected result, NEWLY_CALLED

    u1 {(m3)}
    u2 {(m1)}

Question: How can I do this in Pig? Can bags be subtracted?

I tried the DIFF function, but it does not perform the expected subtraction.
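
To illustrate the mismatch: Pig's documentation describes DIFF as returning the tuples that are in one bag but not in the other, which is a symmetric difference rather than a subtraction. The following is a minimal sketch with plain Java sets, not Pig's implementation; the class and method names are hypothetical, and the DIFF-like behaviour is an assumption based on that description. It uses the u1 sample from above.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration with plain Java sets; this is not Pig's implementation.
class DiffVersusSubtract {

    // Elements of a that are not in b (the subtraction the question asks for).
    static Set<String> subtract(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    // Elements that are in exactly one of the two sets (assumed DIFF-like behaviour).
    static Set<String> symmetricDifference(Set<String> a, Set<String> b) {
        Set<String> result = subtract(a, b);
        result.addAll(subtract(b, a));
        return result;
    }

    public static void main(String[] args) {
        Set<String> before = new TreeSet<>(Arrays.asList("m1", "m2")); // u1 in BEFORE
        Set<String> after = new TreeSet<>(Arrays.asList("m1", "m3"));  // u1 in AFTER

        System.out.println(symmetricDifference(after, before)); // [m2, m3]
        System.out.println(subtract(after, before));            // [m3]
    }
}
```

Under that assumption, a DIFF-like result for u1 also contains m2, a method that was dropped rather than newly called, while the subtraction keeps only m3.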

Regards,

Joel

2 answers

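I think you need to write a UDF; inside it you can use set subtraction:

    Set<T> setA = ...;
    Set<T> setB = ...;
    // java.util.Set has no subtract(); copy setA and remove setB's elements
    Set<T> setAminusB = new HashSet<T>(setA);
    setAminusB.removeAll(setB);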
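
The set-subtraction idea can be tried on the sample data from the question without Pig on the classpath. This is a minimal sketch under stated assumptions: plain `java.util` maps and sets stand in for the BEFORE and AFTER relations, and the class `NewlyCalledSketch` and its methods `minus` and `newlyCalled` are hypothetical names, not part of Pig.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: plain collections stand in for Pig relations and bags.
class NewlyCalledSketch {

    // Returns a new set with the elements of a that are not in b; a and b are left unchanged.
    static <T> Set<T> minus(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }

    // For every user in 'after', keep the methods that the same user does not have in 'before'.
    static Map<String, Set<String>> newlyCalled(Map<String, Set<String>> before,
                                                Map<String, Set<String>> after) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : after.entrySet()) {
            Set<String> previous = before.get(entry.getKey());
            if (previous == null) {
                previous = new LinkedHashSet<>(); // user not seen before: every method is new
            }
            result.put(entry.getKey(), minus(entry.getValue(), previous));
        }
        return result;
    }

    static Set<String> setOf(String... methods) {
        return new LinkedHashSet<>(Arrays.asList(methods));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> before = new LinkedHashMap<>();
        before.put("u1", setOf("m1", "m2"));
        before.put("u2", setOf("m3", "m4"));

        Map<String, Set<String>> after = new LinkedHashMap<>();
        after.put("u1", setOf("m1", "m3"));
        after.put("u2", setOf("m1", "m4"));

        System.out.println(newlyCalled(before, after)); // {u1=[m3], u2=[m1]}
    }
}
```

Copying the first set before calling `removeAll` keeps the inputs intact, which matters if the same collection is reused later in the UDF.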

For those who may be interested, here is the Subtract function that I wrote and submitted to Pig (PIG-2881):

    import static java.lang.String.format;

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    /**
     * Subtract takes two bags as arguments and returns a new bag composed of the tuples of the
     * first bag that are not in the second bag. Null bag arguments are replaced by empty bags.
     * <p>
     * The implementation assumes that both bags being passed to this function will fit entirely
     * into memory simultaneously. If that is not the case the UDF will still function, but it
     * will be <strong>very</strong> slow.
     */
    public class Subtract extends EvalFunc<DataBag> {

        /**
         * Compares the two bag fields from the input Tuple and returns a new bag composed of
         * the elements of the first bag that are not in the second bag.
         *
         * @param input a tuple with exactly two bag fields.
         * @throws IOException if there are not exactly two fields in the tuple or if they are
         *         not {@link DataBag}.
         */
        @Override
        public DataBag exec(Tuple input) throws IOException {
            if (input.size() != 2) {
                throw new ExecException("Subtract expected two inputs but received "
                        + input.size() + " inputs.");
            }
            DataBag bag1 = toDataBag(input.get(0));
            DataBag bag2 = toDataBag(input.get(1));
            return subtract(bag1, bag2);
        }

        private static String classNameOf(Object o) {
            return o == null ? "null" : o.getClass().getSimpleName();
        }

        // A null field is treated as an empty bag; anything else must already be a DataBag.
        private static DataBag toDataBag(Object o) throws ExecException {
            if (o == null) {
                return BagFactory.getInstance().newDefaultBag();
            }
            if (o instanceof DataBag) {
                return (DataBag) o;
            }
            throw new ExecException(
                    format("Expecting input to be DataBag only but was '%s'", classNameOf(o)));
        }

        private static DataBag subtract(DataBag bag1, DataBag bag2) {
            DataBag subtractBag2FromBag1 = BagFactory.getInstance().newDefaultBag();
            // Convert bag1 to a Set; this assumes the set will fit in memory.
            // Duplicate tuples in bag1 are collapsed by the Set.
            Set<Tuple> set1 = toSet(bag1);
            // Remove the elements of bag2 from set1.
            Iterator<Tuple> bag2Iterator = bag2.iterator();
            while (bag2Iterator.hasNext()) {
                set1.remove(bag2Iterator.next());
            }
            // set1 now contains all elements of bag1 not in bag2, so build the resulting DataBag.
            for (Tuple tuple : set1) {
                subtractBag2FromBag1.add(tuple);
            }
            return subtractBag2FromBag1;
        }

        private static Set<Tuple> toSet(DataBag bag) {
            Set<Tuple> set = new HashSet<Tuple>();
            Iterator<Tuple> iterator = bag.iterator();
            while (iterator.hasNext()) {
                set.add(iterator.next());
            }
            return set;
        }
    }

Source: https://habr.com/ru/post/1494557/

