Spark processing of records larger than 3 GB

I get the exception below when a single record is larger than 3 GB:

 java.lang.IllegalArgumentException
     at java.nio.CharBuffer.allocate(CharBuffer.java:330)
     at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
     at org.apache.hadoop.io.Text.decode(Text.java:412)
     at org.apache.hadoop.io.Text.decode(Text.java:389)
     at org.apache.hadoop.io.Text.toString(Text.java:280)
     at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$createBaseRdd$1.apply(JsonFileFormat.scala:135)
     at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$createBaseRdd$1.apply(JsonFileFormat.scala:135)

How can I increase the buffer size for a single record?

1 answer

You probably have one huge line in your file containing that array. You get this exception because you are trying to build a CharBuffer that is too large: the requested size, computed as an int, overflows and becomes negative. The maximum size of an array or string in Java is 2^31 - 1 (Integer.MAX_VALUE), see this thread. You say you have a 3 GB record; at one byte per char that is about 3 billion characters, which is more than 2^31 (roughly 2.1 billion).
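A quick back-of-the-envelope check makes the overflow visible (a Scala REPL sketch; the exact byte count is only illustrative):

 // ~3 GB worth of data, assuming each byte decodes to one char
 val recordChars = 3L * 1024 * 1024 * 1024   // 3221225472
 println(recordChars > Int.MaxValue)         // true: larger than 2^31 - 1
 println(recordChars.toInt)                  // -1073741824: wrapped negative, so CharBuffer.allocate throws IllegalArgumentException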

What you could do is a bit of a hack, but since you only have one key with a huge array, it can work. Your JSON file probably looks like this:

 { "key" : ["v0", "v1", "v2"... ] } 

or like this, but I think in your case it is the first form:

 { "key" : [ "v0", "v1", "v2", ... ] } 

So you can try changing the record delimiter used by Hadoop to ",", as described here. Basically, it is done like this:

 import org.apache.hadoop.io.LongWritable
 import org.apache.hadoop.io.Text
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

 def nlFile(path: String) = {
   val conf = new Configuration
   // split records on "," instead of newlines
   conf.set("textinputformat.record.delimiter", ",")
   sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
     .map(_._2.toString)
 }
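With the sample document above, the chunks this returns look roughly like the following (the opening and closing JSON wrappers stay attached to the first and last chunk; the path here is hypothetical, just to inspect a few records):

 nlFile("/path/to/file.json").take(3).foreach(println)
 // { "key" : ["v0"
 //  "v1"
 //  "v2"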

Then you can read your array and just have to remove the JSON brackets yourself with something like this:

 nlFile("...")
   .map(_.replaceAll("^.*\\[", "").replaceAll("\\].*$", ""))

Note that you will need to be more careful if your values can contain the characters "[" or "]", but that's the idea.
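If that can happen, one option (a minimal sketch, assuming the file really starts with `{ "key" : [` and ends with `] }`, give or take whitespace) is to anchor the regexes to that exact prefix and suffix, so a "[" or "]" inside a value is left alone, and then to unquote each element:

 val values = nlFile("...")
   .map(_.replaceAll("""^\s*\{\s*"key"\s*:\s*\[""", ""))  // drops the opening `{ "key" : [` (only the first chunk matches)
   .map(_.replaceAll("""\s*\]\s*\}\s*$""", ""))           // drops the closing `] }` (only the last chunk matches)
   .map(_.trim.stripPrefix("\"").stripSuffix("\""))       // unquote each element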

