You probably have one huge line in your file containing the whole array. The exception occurs because you are trying to build a CharBuffer that is too large (most likely the requested size overflows into a negative integer). The maximum size of an array or String in Java is roughly Integer.MAX_VALUE, i.e. 2^31 - 1 (see this thread). You say you have a 3 GB record; at one byte per character that is about 3 billion characters, which is well over 2^31, roughly 2.1 billion.
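As a quick back-of-the-envelope check (a sketch, assuming roughly one byte per character in the source file):

// Rough sanity check: a ~3 GB file read into a single char sequence
// exceeds the maximum JVM array length, so one String/CharBuffer cannot hold it.
val maxArrayLen = Int.MaxValue                // 2147483647, about 2.1 billion
val approxChars = 3L * 1024 * 1024 * 1024     // ~3.2 billion characters for a 3 GB file
println(approxChars > maxArrayLen)            // true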
What you could do is a bit of a hack, but since you only have one key with a large array, it can work. Your JSON file probably looks like this:
{ "key" : ["v0", "v1", "v2"... ] }
or like this, though in your case I think it is the former:
{ "key" : [ "v0", "v1", "v2", ... ] }
So you can try changing the record delimiter used by Hadoop to ",", as described here. Basically, it is done like this:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

def nlFile(path: String) = {
  val conf = new Configuration
  conf.set("textinputformat.record.delimiter", ",")
  sc.newAPIHadoopFile(path, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)
}
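A hedged usage sketch (the path is a placeholder): with the delimiter set to ",", each comma-separated chunk of the single-line file becomes its own record, so no single record has to hold the entire 3 GB array:

// Read the big file as ","-delimited records instead of newline-delimited lines.
val records = nlFile("/path/to/big.json")   // RDD[String]
records.take(3).foreach(println)
// For the single-line example above this would print something like:
//   { "key" : ["v0"
//    "v1"
//    "v2"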
Then you can read your array; you just have to remove the JSON brackets yourself, with something like this:
nlFile("...") .map(_.replaceAll("^.*\\[", "").replaceAll("\\].*$",""))
Note that you need to be more careful if your records can contain the characters "[" or "]", but this is the general idea.
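If the values are plain strings as in the example above, a minimal extension of the same idea (a sketch, assuming no "[", "]" or "," ever appears inside the values) can also strip the surrounding whitespace and quotes so each record becomes a bare value:

val values = nlFile("...")
  .map(_.replaceAll("^.*\\[", "").replaceAll("\\].*$", ""))  // strip everything up to "[" (first record) and from "]" on (last record)
  .map(_.trim.stripPrefix("\"").stripSuffix("\""))           // " \"v1\"" -> "v1"
// values.take(3) would then give something like Array(v0, v1, v2) for the example file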