The paper "Making Sense of Performance in Data Analytics Frameworks," published at NSDI 2015, concludes that the CPU (not disk I/O or the network) is Spark's performance bottleneck. Kay ran several experiments on Spark, covering BDBench, TPC-DS, and a production workload (only Spark SQL is used in the paper). I wonder whether this conclusion also holds for frameworks built on top of Spark. For example, with Spark Streaming a continuous data stream is received over the network, so both network I/O and the disk would be under high pressure.