You cannot do it nicely, I'm afraid. Think about how it works under the hood: the framework splits the data to be processed into pieces and sends them to different processes; each process counts its own piece, and a single reducer adds the counts up at the end. While each process is counting, it does not know the total size, so it cannot add the field. The only way is to go back and attach the total to the data once it is known, i.e. with a join.
If each group fits in memory (and you can configure how much memory is available), you can:
Tsv(args("input"), ('id1, 'id2)) .groupBy('id2)(_.size.toList[(String, String)](('id1, 'id2) -> 'list)) .flatMapTo[(Iterable[(String, String)], Int), (String, String, Int)](('list, 'size) -> ('id1, 'id2, 'size)) { case (list, size) => list.map(record => (record._1, record._2, size)) } .write(Tsv(args("output")))
But if your system does not have enough memory, you will have to do an expensive join.
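As a rough sketch of that join-based route (the job name AddGroupSizeJob and the temporary field 'id2_ are my own choices, not from the question): first compute one row per id2 with its group size, then join it back onto the original records:

import com.twitter.scalding._

class AddGroupSizeJob(args: Args) extends Job(args) {
  val records = Tsv(args("input"), ('id1, 'id2)).read

  // one row per id2 holding the size of that group
  val sizes = records
    .groupBy('id2) { _.size('size) }
    .rename('id2 -> 'id2_)                 // keep field names distinct across the two sides of the join

  records
    .joinWithSmaller('id2 -> 'id2_, sizes) // sizes has one row per group, so it is the smaller side
    .discard('id2_)
    .write(Tsv(args("output")))
}

This pays for an extra shuffle, which is the cost mentioned above, but it never needs to hold a whole group in memory.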
Note: you can use Tsv instead of TextLine followed by a mapTo and a split.
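For comparison, the TextLine version of the same source would read raw lines and split them itself; a minimal sketch, assuming tab-separated input:

TextLine(args("input"))
  .read
  // parse each raw line into the two id fields
  .mapTo('line -> ('id1, 'id2)) { line: String =>
    val parts = line.split("\t")
    (parts(0), parts(1))
  }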