What are the "run at ThreadPoolExecutor.java:1142" jobs in the web UI?

I run several Spark jobs using Spark SQL 1.6.1. Looking at the Spark web UI, I see that some jobs have the description "run at ThreadPoolExecutor.java:1142".

An example of one of these jobs:

I was wondering why some jobs get this description?

1 answer

After some investigation, I found that the Spark jobs described as "run at ThreadPoolExecutor.java:1142" are related to queries with join operators.

    scala> spark.version
    res16: String = 2.1.0-SNAPSHOT

    scala> val left = spark.range(1)
    left: org.apache.spark.sql.Dataset[Long] = [id: bigint]

    scala> val right = spark.range(1)
    right: org.apache.spark.sql.Dataset[Long] = [id: bigint]

    scala> left.join(right, Seq("id")).show
    +---+
    | id|
    +---+
    |  0|
    +---+
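The description text itself is nothing mysterious: it is the call site from which the Spark job was submitted. When a job is kicked off from a worker thread of a `java.util.concurrent.ThreadPoolExecutor` (rather than directly from your action), the nearest user-visible frame is `ThreadPoolExecutor.runWorker`, whose body sat at line 1142 of `ThreadPoolExecutor.java` in that JDK build. A minimal pure-JDK sketch (the class name `PoolCallSite` is my own) showing that frame in the stack of a pooled task:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolCallSite {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // Submit a task that inspects its own stack trace and reports
        // the ThreadPoolExecutor frame that invoked it.
        Future<String> where = pool.submit(() -> {
            for (StackTraceElement e : Thread.currentThread().getStackTrace()) {
                if (e.getMethodName().equals("runWorker")) {
                    return e.getFileName() + ":" + e.getMethodName();
                }
            }
            return "not found";
        });
        // Prints something like "ThreadPoolExecutor.java:runWorker";
        // the exact line number (1142 in the question) depends on the JDK build.
        System.out.println(where.get());
        pool.shutdown();
    }
}
```

Spark records such a call site as the job description, which is why the job is labeled with the thread pool's source line instead of your own code.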

When you switch to the SQL tab, you will see the completed queries and their jobs (on the right).

SQL tab in the web interface with completed queries

In my case, the Spark jobs running at "run at ThreadPoolExecutor.java:1142" were the ones with ids 12 and 16.


Both of them correspond to join queries.

If you are wondering "it makes sense that one of my joins causes these jobs, but as far as I know a join is a shuffle, not an action, so why is the job described with ThreadPoolExecutor and not with my action (as is the case with my other jobs)?", then my answer usually goes along these lines:

Spark SQL is an extension of Spark with its own abstractions (Dataset, to name just the one that comes to mind first) that have their own operators to execute. One "simple" SQL operation can run one or more Spark jobs. It is at the discretion of Spark SQL how many Spark jobs are run or submitted (they use RDDs under the covers), and you do not need to know low-level details like that, given that you work at the much higher level of Spark SQL's SQL or Query DSL.


Source: https://habr.com/ru/post/1260468/

