We run our Spark Java jobs locally on a single AWS EC2 instance, with the master set to "local[*]".
However, profiling with New Relic and a simple "top" shows that only one CPU core of our 16-core machine is used while any of the three Java Spark jobs we wrote is running (we also tried different AWS instance types, but still only one core was used).
Runtime.getRuntime().availableProcessors() reports 16 processors, and
sparkContext.defaultParallelism() also reports 16.
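Those checks are just the obvious ones (sc is the JavaSparkContext from the snippet above):

```java
// Both of these print 16 on the 16-core instance:
System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
System.out.println("defaultParallelism  = " + sc.defaultParallelism());
```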
I have looked at various Stack Overflow questions about local-mode issues, but none of them seemed to resolve this.
Any advice is greatly appreciated.
Thanks.
EDIT: Process (a rough code sketch of these steps follows the list)

1) Use sqlContext to read gzipped CSV file 1 with com.databricks.spark.csv from disk (S3) into DataFrame DF1.
2) Use sqlContext to read gzipped CSV file 2 with com.databricks.spark.csv from disk (S3) into DataFrame DF2.
3) Use DF1.toJavaRDD().mapToPair(new pair function that returns a Tuple2 of key and value) to get RDD1.
4) Use DF2.toJavaRDD().mapToPair(new pair function that returns a Tuple2 of key and value) to get RDD2.
5) Call join on RDD1 and RDD2.
6) Call reduceByKey() on the combined RDDs to "merge by key", so we end up with a Tuple2 holding only one instance of a particular key (since the same key appears in both RDD1 and RDD2).
7) Call .values().map(new mapping function) to turn each merged value into a List of domain objects.
8) Call .flatMap() on that to get an RDD of the domain class.
9) Use sqlContext to create a DataFrame from that RDD of the domain class.
10) Call DF.coalesce(1).write() to write the DF as a CSV to S3.
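To make that concrete, here is a minimal sketch of the pipeline in Spark 1.x Java, assuming the sqlContext from the first snippet and hypothetical domain classes MyRecord and MyOutputRecord; the S3 paths, key extraction, and merge logic are placeholders, and I have sketched steps 5-6 as a union followed by reduceByKey since that matches the "merge by key" description:

```java
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

import scala.Tuple2;

// 1) + 2) read the two gzipped CSVs from S3 via spark-csv
DataFrame df1 = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("s3n://bucket/file1.csv.gz");       // placeholder path
DataFrame df2 = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("s3n://bucket/file2.csv.gz");       // placeholder path

// 3) + 4) turn each DataFrame into a pair RDD keyed for the merge
JavaPairRDD<String, MyRecord> rdd1 = df1.toJavaRDD()
        .mapToPair(row -> new Tuple2<>(row.getString(0), MyRecord.fromRow(row)));  // placeholder key/value
JavaPairRDD<String, MyRecord> rdd2 = df2.toJavaRDD()
        .mapToPair(row -> new Tuple2<>(row.getString(0), MyRecord.fromRow(row)));

// 5) + 6) combine the two RDDs and merge records that share a key
JavaPairRDD<String, MyRecord> merged = rdd1.union(rdd2)
        .reduceByKey((a, b) -> a.mergeWith(b));   // placeholder merge logic

// 7) map each merged record to a List of output domain objects
JavaRDD<List<MyOutputRecord>> listsPerKey = merged.values()
        .map(rec -> rec.toOutputRecords());       // placeholder conversion

// 8) flatten the lists (Spark 1.x FlatMapFunction returns an Iterable)
JavaRDD<MyOutputRecord> output = listsPerKey.flatMap(list -> list);

// 9) + 10) back to a DataFrame and out to S3 as a single CSV
DataFrame resultDf = sqlContext.createDataFrame(output, MyOutputRecord.class);
resultDf.coalesce(1)                              // coalesce(1) writes out a single partition
        .write()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save("s3n://bucket/output");             // placeholder path
```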