Spark Java Map function runs twice

Below is the code from my Spark driver. When I execute the program, it works correctly and saves the necessary data as a Parquet file.

String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
  @Override
  public String call(String patientId) throws Exception {
    return "json array as string";
  }
});

//1. Read json string array into a Dataframe (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
//2. Save dataframe as parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");

But I noticed that the mapping function on the RDD indexData is executed twice: first when I read jsonStringRDD as a DataFrame using the SQLContext, and second when I write dataSchemaDF to the Parquet file.

Can you help me avoid this repeated execution? Is there a better way to convert a JSON string to a DataFrame?

1 answer

This happens because you do not provide a schema when reading the JSON:

sqlContext.read().json(jsonStringRDD);

Without a schema, Spark has to eagerly scan the data to infer the structure of the DataFrame. Since the RDD is not cached, your map function is evaluated once for that inference pass and again when you write the Parquet file.

To avoid the extra pass, define a StructType that describes your JSON documents:

StructType schema;
...
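
For illustration only, here is a minimal sketch of building such a schema with the Java API; the field names below (patientId, name) are assumptions, since the question does not show the actual JSON layout:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema: adjust the fields to match your actual JSON documents.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("patientId", DataTypes.StringType, true),
    DataTypes.createStructField("name", DataTypes.StringType, true)
});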

and pass it to the reader when creating the DataFrame:

DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
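
With an explicit schema, the reader no longer needs to scan the RDD to infer the structure, so the map function should run only once, during the Parquet write.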
