How to generate line numbers in Pig?

I use Pig for data preparation and I have a problem that seems easy, but I cannot solve it:

For example, I have a name column:

    name
    ------
    Alicia
    Ana
    Benita
    Berta
    Bertha

Then how can I add a line number for each name? The result should look like this:

    name   | id
    ----------------
    Alicia | 1
    Ana    | 2
    Benita | 3
    Berta  | 4
    Bertha | 5

Thanks for reading this question!

4 answers

Unfortunately, there is no way to enumerate rows in Pig Latin. At least, I could not find an easy way. One solution is to implement a separate MapReduce job with a single Reduce task that does the actual enumeration. More precisely:

Map phase: assign all rows to the same key.
Single Reduce task: receives the one key together with an iterator over all rows. Since the reduce task runs on only one physical machine and the "reduce" function is called only once, a local counter inside the function solves the problem.

If the data is too large to be processed on a single reducer machine, then the default MapReduce Counters at the master node can be used.
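The single-reducer idea above can be sketched in plain Python. This is only a simulation of the logic, not Hadoop or Pig code: the function names (`map_phase`, `reduce_phase`, `enumerate_rows`) and the constant key `"all"` are made up for illustration, and a stable sort stands in for the shuffle step. In a real job the order in which values reach the reducer is not guaranteed unless you arrange for it.

```python
from itertools import groupby


def map_phase(rows):
    # Emit every row under the same constant key so that
    # all rows end up at a single reducer.
    for row in rows:
        yield ("all", row)


def reduce_phase(key, values):
    # Called once for the single key; a local counter numbers the rows.
    counter = 0
    for value in values:
        counter += 1
        yield (value, counter)


def enumerate_rows(rows):
    # A stable sort by key stands in for the shuffle (assumption: input
    # order is preserved, which a real cluster does not promise).
    pairs = sorted(map_phase(rows), key=lambda kv: kv[0])
    result = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        result.extend(reduce_phase(key, (value for _, value in group)))
    return result


if __name__ == "__main__":
    for name, row_id in enumerate_rows(["Alicia", "Ana", "Benita", "Berta", "Bertha"]):
        print(name, row_id)
```

The whole numbering happens inside one `reduce_phase` call, which is exactly why this approach does not scale past what one reducer can handle.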


Pig did not have a mechanism for this when you asked this question. However, Pig 0.11 introduced the RANK operator, which can be used for this purpose.


A sketch of the idea, assuming that the "name" column we want to order by is numeric rather than a string, and also assuming a good, disjoint distribution:

  • WITH_GROUPS = foreach TABLE generate name, name / 100 as group_id;
  • group WITH_GROUPS by group_id;
  • PER_GROUP = generate group, COUNT(*);
  • ACCUM_PER_GROUP = cross join PER_GROUP with itself, computing the accumulated count per group;
  • cogroup ACCUM_PER_GROUP with WITH_GROUPS by group_id;
  • a UDF runs in the reducer and assigns an id to each row, starting from that group's accumulated count.
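To make the steps above concrete, here is a minimal plain-Python simulation of the same logic. It is not Pig code and not the UDF itself: `assign_ids` and `bucket_width` are hypothetical names, the bucket width of 100 comes from the `name / 100` step, and sorting within each bucket is an assumption about what the UDF would do.

```python
from collections import Counter, defaultdict
from itertools import accumulate


def assign_ids(values, bucket_width=100):
    # Steps 1-3: bucket each value (value // 100) and count rows per bucket.
    counts = Counter(v // bucket_width for v in values)
    buckets = sorted(counts)

    # Step 4: accumulated count per group, i.e. how many rows live in
    # all smaller buckets. This is the starting offset of each bucket.
    running = [0] + list(accumulate(counts[b] for b in buckets))
    offsets = dict(zip(buckets, running))

    # Step 5: bring the rows of each bucket together with its offset.
    grouped = defaultdict(list)
    for v in values:
        grouped[v // bucket_width].append(v)

    # Step 6: within each bucket, number the rows starting after the offset.
    result = []
    for b in buckets:
        for position, v in enumerate(sorted(grouped[b]), start=1):
            result.append((v, offsets[b] + position))
    return result


if __name__ == "__main__":
    print(assign_ids([250, 10, 120, 30, 205]))
```

Because each bucket only needs its own rows plus one offset, the buckets can be numbered independently, which is what lets this approach spread over several reducers.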

@cabad

On the surface, it might seem that the RANK operator would work, but you are not guaranteed a strictly increasing row identifier unless your data satisfies some constraints.

The problem is that any rows that compare as equal on the ranked fields receive the same rank. If you can guarantee that no two rows have equal ranked fields, then this approach may work, but I would call it a square-peg-in-a-round-hole approach.

See this example from the docs, http://pig.apache.org/docs/r0.11.0/basic.html#rank (note ranks 2, 6, and 10):

    C = rank A by f1 DESC, f2 ASC;
    dump C;
    (1,Tete,2,N)
    (2,Ranjit,3,M)
    (2,Ranjit,3,P)
    (4,Michael,8,T)
    (5,Jose,10,V)
    (6,Jillian,8,Q)
    (6,Jillian,8,Q)
    (8,JaePak,7,Q)
    (9,David,1,N)
    (10,David,4,Q)
    (10,David,4,Q)
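The tie behaviour in that output (equal rows share a rank, and the next rank skips ahead) can be reproduced with a small plain-Python sketch. This only imitates the numbering pattern seen above; it is not how Pig implements RANK, and `competition_rank` and `has_unique_ranks` are names invented for this illustration.

```python
def competition_rank(rows, key):
    # Sort the rows, then give equal keys the same rank; the rank after
    # a tie jumps to the row's 1-based position (1, 2, 2, 4, ...).
    ordered = sorted(rows, key=key)
    ranked = []
    for index, row in enumerate(ordered, start=1):
        if ranked and key(ranked[-1][1]) == key(row):
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = index
        ranked.append((rank, row))
    return ranked


def has_unique_ranks(ranked):
    # True only if the ranks could serve as row identifiers.
    ranks = [rank for rank, _ in ranked]
    return len(ranks) == len(set(ranks))


if __name__ == "__main__":
    rows = [("Ranjit", 3), ("Tete", 2), ("Ranjit", 3), ("Michael", 8)]
    ranked = competition_rank(rows, key=lambda r: r)
    for rank, row in ranked:
        print(rank, row)
    print("usable as ids:", has_unique_ranks(ranked))
```

With the duplicated ("Ranjit", 3) row the ranks are not unique, so they cannot be used directly as line numbers; with distinct keys they would be.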

Source: https://habr.com/ru/post/1399113/

