I have a list of data where each value is essentially a BSON document (think JSON); each JSON varies from 5k to 20k in size. The value can either stay in the form of a BSON object or be converted directly to JSON. Each entry is a pair:

    Key, Value
I expect groupByKey to produce:
    K1, (JSON1, JSON2)
    K2, (JSON3, JSON4)
so that when I execute:

    val data = [...].map(x => (x.Key, x.Value))
    val groupedData = data.groupByKey()
    groupedData.foreachRDD { rdd =>
      ...
    }
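For reference, here is a minimal, self-contained sketch of the grouping I expect; the keys and JSON strings are made-up stand-ins for my real data:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("groupByKey-demo").setMaster("local[*]"))

    // Stand-in data: (index key, JSON document) pairs.
    val pairs = sc.parallelize(Seq(
      ("K1", """{"id": 1}"""),
      ("K1", """{"id": 2}"""),
      ("K2", """{"id": 3}"""),
      ("K2", """{"id": 4}""")))

    // groupByKey yields one (key, Iterable[value]) pair per distinct key.
    pairs.groupByKey().collect().foreach { case (k, docs) =>
      println(s"$k -> ${docs.mkString(", ")}")
    }
    // Prints (ordering not guaranteed):
    // K1 -> {"id": 1}, {"id": 2}
    // K2 -> {"id": 3}, {"id": 4}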
I am quite confused about the behavior of RDDs. I have read many articles online, including the official Spark programming guide (https://spark.apache.org/docs/0.9.1/scala-programming-guide.html), but I am still not able to achieve what I want.
-------- UPDATED ---------------------
Basically, I really need the data grouped by key: the key is the Elasticsearch index each document should be written to, so that I can run a batch write per key through Elasticsearch for Hadoop:
    EsSpark.saveToEs(rdd)
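Conceptually, what I want is one bulk write per key, along the lines of the sketch below (assuming `data` is the (Key, Value) pair RDD from above; saveJsonToEs is the EsSpark variant for values that are already JSON strings, and "doc" is a placeholder type name):

    import org.elasticsearch.spark.rdd.EsSpark

    // One bulk write per distinct key, each going to its own index.
    // Assumes the set of distinct keys is small enough to collect to the driver.
    val keys = data.keys.distinct().collect()
    keys.foreach { k =>
      val docsForKey = data.filter(_._1 == k).values
      // The resource is "index/type"; here the index is the key itself.
      EsSpark.saveJsonToEs(docsForKey, s"$k/doc")
    }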
I cannot just loop over each group and save it, because the Elasticsearch connector accepts only an RDD. I tried sc.makeRDD and sc.parallelize; both fail because the task is not serializable.
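For reference, this is roughly the shape of the attempt that fails (a sketch, not my exact code; `groupedData` is the grouped stream from above):

    groupedData.foreachRDD { rdd =>
      rdd.foreach { case (k, docs) =>
        // Fails with "Task not serializable": this closure runs on the
        // executors, while sc (the SparkContext) exists only on the driver.
        EsSpark.saveJsonToEs(sc.makeRDD(docs.toSeq), s"$k/doc")
      }
    }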
I tried using:
    EsSpark.saveToEs(rdd, Map(
      "es.resource.write" -> "{TheKeyFromTheObjectAbove}",
      "es.batch.size.bytes" -> "5000000"))
The configuration documentation is here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
But this is VERY slow compared with not using the configuration to determine the index dynamically from a field of each single document; I suspect the connector has to parse every JSON to extract the value.
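For completeness, this is the dynamic-index variant I benchmarked (a sketch; `jsonDocs` stands for an RDD[String] of my documents, and `indexKey` is a placeholder for the JSON field that holds the key):

    import org.elasticsearch.spark.rdd.EsSpark

    // Multi-resource write: the connector substitutes {indexKey} with that
    // field's value for every document, which presumably forces it to parse
    // each JSON string it writes.
    EsSpark.saveJsonToEs(jsonDocs, "{indexKey}/doc",
      Map("es.batch.size.bytes" -> "5000000"))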