Scalding / Cascading TsvCompressed Output Compression

People have been having problems compressing the output of Scalding jobs, myself included. After googling, I found a faint whiff of an answer in some obscure forum, but nothing suited to copy-and-paste needs.

I would like an output like Tsv, but one that writes compressed output.

+2
scala compression hadoop cascading scalding
May 29 '14 at 17:42
2 answers

Anyway, after a fair amount of fiddling, I managed to write a TsvCompressed output which seems to do the job. You still need to set the Hadoop job configuration properties, i.e. set compress to true and set the codec to something sensible, otherwise it defaults to crappy deflate.

    import com.twitter.scalding._
    import cascading.tuple.Fields
    import cascading.scheme.local
    import cascading.scheme.hadoop.{TextLine, TextDelimited}
    import cascading.scheme.Scheme
    import org.apache.hadoop.mapred.{OutputCollector, RecordReader, JobConf}

    // Drop-in replacement for Tsv that enables sink compression on Hadoop.
    case class TsvCompressed(p: String) extends FixedPathSource(p) with DelimitedSchemeCompressed

    trait DelimitedSchemeCompressed extends Source {
      val types: Array[Class[_]] = null

      // Local mode: plain tab-delimited text.
      override def localScheme = new local.TextDelimited(Fields.ALL, false, false, "\t", types)

      // Hadoop mode: tab-delimited text with sink compression enabled.
      override def hdfsScheme = {
        val temp = new TextDelimited(Fields.ALL, false, false, "\t", types)
        temp.setSinkCompression(TextLine.Compress.ENABLE)
        temp.asInstanceOf[Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _]]
      }
    }
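The answer mentions that the Hadoop configuration properties still have to be set. A minimal sketch of that step follows, assuming the Hadoop 1.x property names `mapred.output.compress` and `mapred.output.compression.codec`, and gzip as an example codec. The `CompressionConf` object and the `withOutputCompression` helper are hypothetical names introduced here for illustration and are not part of Scalding.

```scala
// Hypothetical helper (not part of Scalding): adds the output-compression
// properties to a job configuration map.
object CompressionConf {
  // Hadoop 1.x (mapred API) property names; newer Hadoop versions use
  // mapreduce.output.fileoutputformat.compress and
  // mapreduce.output.fileoutputformat.compress.codec instead.
  val CompressKey = "mapred.output.compress"
  val CodecKey    = "mapred.output.compression.codec"

  // Gzip is only an example; substitute whichever codec your cluster supports.
  val DefaultCodec = "org.apache.hadoop.io.compress.GzipCodec"

  // Returns a copy of `base` with compression switched on and the codec set.
  def withOutputCompression(
      base: Map[AnyRef, AnyRef],
      codec: String = DefaultCodec): Map[AnyRef, AnyRef] =
    base ++ Map[AnyRef, AnyRef](CompressKey -> "true", CodecKey -> codec)
}
```

In a Scalding job this could be wired in by overriding the job's configuration, for example `override def config: Map[AnyRef, AnyRef] = CompressionConf.withOutputCompression(super.config)`. The exact signature of `config` differs between Scalding versions, so treat that line as an assumption to check against the version you use. The same two properties can also be passed on the command line as `-D` options.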
+3
May 29 '14 at 17:42

I also have a small project showing how to get compressed output from Tsv: WordCount-Compressed.

Scalding passes null for a parameter of Cascading's TextDelimited, which disables compression.

+1
Jun 18


