Below is the code of my Spark driver. When I execute the program it works properly, saving the necessary data as a Parquet file.
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();

// Map each line of the index file to a JSON array represented as a string.
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string"; // simplified; the real code builds the JSON here
    }
});

// Infer the schema from the JSON strings and write the result as Parquet.
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
dataSchemaDF.write().parquet("md.parquet");
But I noticed that my mapping function on the RDD indexData is executed twice: first when I read jsonStringRDD as a DataFrame using the SQLContext, and second when I write dataSchemaDF to the Parquet file.
Can you help me with this? How can I avoid the repeated execution? Is there a better way to convert a JSON string to a DataFrame?
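For reference, here is a variant I have been experimenting with. It is only a sketch based on my assumption that persisting the mapped RDD (rather than the input indexData) would let the JSON schema-inference pass and the Parquet write reuse the same data instead of re-running the map function; I am not sure this is the right approach.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;

// Same flow as above, but the mapped RDD is cached so that the read and the
// write both work from the persisted data instead of recomputing the map.
JavaRDD<String> jsonStringRDD = sc.textFile("index.txt")
        .map(new Function<String, String>() {
            @Override
            public String call(String patientId) throws Exception {
                return "json array as string"; // placeholder, as in my real code
            }
        })
        .cache(); // cache the mapped RDD instead of indexData

DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
dataSchemaDF.write().parquet("md.parquet");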