We run our Spark Java jobs locally on a single AWS EC2 instance, with the master set to "local[*]".
However, profiling with New Relic and a simple "top" shows that only one CPU core of our 16-core machine is used while any of the three Java Spark jobs we wrote is running (we also tried different AWS instance types, but still only one core was used).
Runtime.getRuntime().availableProcessors() reports 16 processors, and
sparkContext.defaultParallelism() also reports 16.
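Those checks are just the obvious ones (sc is the JavaSparkContext from the snippet above):

```java
// Both of these print 16 on the 16-core instance:
System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
System.out.println("defaultParallelism  = " + sc.defaultParallelism());
```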
I have looked at various Stack Overflow questions about local-mode issues, but none of them seemed to resolve this.
Any advice is greatly appreciated.
Thanks.
EDIT: Process (a rough code sketch of these steps follows the list)

1) Use sqlContext to read gzipped CSV file 1 with com.databricks.spark.csv from disk (S3) into DataFrame DF1.
2) Use sqlContext to read gzipped CSV file 2 with com.databricks.spark.csv from disk (S3) into DataFrame DF2.
3) Use DF1.toJavaRDD().mapToPair(new pair function that returns a Tuple2 of key and value) to get RDD1.
4) Use DF2.toJavaRDD().mapToPair(new pair function that returns a Tuple2 of key and value) to get RDD2.
5) Call join on RDD1 and RDD2.
6) Call reduceByKey() on the combined RDDs to "merge by key", so we end up with a Tuple2 holding only one instance of a particular key (since the same key appears in both RDD1 and RDD2).
7) Call .values().map(new mapping function) to turn each merged value into a List of domain objects.
8) Call .flatMap() on that to get an RDD of the domain class.
9) Use sqlContext to create a DataFrame from that RDD of the domain class.
10) Call DF.coalesce(1).write() to write the DF as a CSV to S3.
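To make that concrete, here is a minimal sketch of the pipeline in Spark 1.x Java, assuming the sqlContext from the first snippet and hypothetical domain classes MyRecord and MyOutputRecord; the S3 paths, key extraction, and merge logic are placeholders, and I have sketched steps 5-6 as a union followed by reduceByKey since that matches the "merge by key" description:

```java
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

import scala.Tuple2;

// 1) + 2) read the two gzipped CSVs from S3 via spark-csv
DataFrame df1 = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("s3n://bucket/file1.csv.gz");       // placeholder path
DataFrame df2 = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("s3n://bucket/file2.csv.gz");       // placeholder path

// 3) + 4) turn each DataFrame into a pair RDD keyed for the merge
JavaPairRDD<String, MyRecord> rdd1 = df1.toJavaRDD()
        .mapToPair(row -> new Tuple2<>(row.getString(0), MyRecord.fromRow(row)));  // placeholder key/value
JavaPairRDD<String, MyRecord> rdd2 = df2.toJavaRDD()
        .mapToPair(row -> new Tuple2<>(row.getString(0), MyRecord.fromRow(row)));

// 5) + 6) combine the two RDDs and merge records that share a key
JavaPairRDD<String, MyRecord> merged = rdd1.union(rdd2)
        .reduceByKey((a, b) -> a.mergeWith(b));   // placeholder merge logic

// 7) map each merged record to a List of output domain objects
JavaRDD<List<MyOutputRecord>> listsPerKey = merged.values()
        .map(rec -> rec.toOutputRecords());       // placeholder conversion

// 8) flatten the lists (Spark 1.x FlatMapFunction returns an Iterable)
JavaRDD<MyOutputRecord> output = listsPerKey.flatMap(list -> list);

// 9) + 10) back to a DataFrame and out to S3 as a single CSV
DataFrame resultDf = sqlContext.createDataFrame(output, MyOutputRecord.class);
resultDf.coalesce(1)                              // coalesce(1) writes out a single partition
        .write()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save("s3n://bucket/output");             // placeholder path
```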