I have run into a strange problem. As I understand it, a Spark DAG is only executed when an action is performed. However, I see that the reduceByKey() operation (which is a transformation) starts executing the DAG.
Steps to reproduce. Consider the following piece of code:
SparkConf conf = new SparkConf().setMaster("local").setAppName("Test");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> textFile = context.textFile("any non-existing path");
JavaRDD<String> flatMap = textFile.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
JavaPairRDD<String, Integer> mapToPair = flatMap.mapToPair(x -> new Tuple2<>(x, 1));
Note: the file path must not point to an existing file; in other words, the file must not exist.
If you execute this code, nothing happens, which is expected. However, if you add the following line to the program and run it:
mapToPair.reduceByKey((x, y) -> x + y);
then it throws the following exception:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
This means that Spark has started executing the DAG. Since reduceByKey() is a transformation, this should not happen until an action such as collect() or take() is performed.
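For reference, here are the snippets above combined into a single self-contained class (the class name and the import list are my additions; everything else is the code from the question). Running its main method against a missing input path reproduces the exception at the reduceByKey() call, before any action is invoked:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class ReduceByKeyTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Test");
        JavaSparkContext context = new JavaSparkContext(conf);

        // Deliberately point at a path that does not exist.
        JavaRDD<String> textFile = context.textFile("any non-existing path");

        // Transformations only -- nothing should be evaluated yet.
        JavaRDD<String> flatMap = textFile.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
        JavaPairRDD<String, Integer> mapToPair = flatMap.mapToPair(x -> new Tuple2<>(x, 1));

        // Still no action called, yet this line already throws
        // org.apache.hadoop.mapred.InvalidInputException.
        mapToPair.reduceByKey((x, y) -> x + y);

        context.stop();
    }
}
```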
Spark version: 2.0.0.