How to print the elements of a specific RDD partition in Spark?

How do I print the elements of a particular partition, for example the 5th?

val distData = sc.parallelize(1 to 50, 10) 
3 answers

Using Spark / Scala:

  val data = 1 to 50
  val distData = sc.parallelize(data, 10)
  distData.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) =>
    it.toList.map(x => if (index == 5) println(x)).iterator
  ).collect

gives:

  26
  27
  28
  29
  30
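
If you want the elements of that partition back on the driver instead of printing them on the executors, a minimal variant of the same idea (a sketch, assuming the same distData and target index 5) is:

  // Sketch: keep only the iterator of partition 5 and collect it;
  // all other partitions contribute nothing.
  val fifth = distData
    .mapPartitionsWithIndex((index, it) => if (index == 5) it else Iterator.empty)
    .collect()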

You can use a counter together with the foreachPartition() API to achieve this.

Here is a Java program that prints the contents of each partition:

  import java.util.Arrays;
  import java.util.Iterator;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.api.java.function.VoidFunction;

  JavaSparkContext context = new JavaSparkContext(conf);

  JavaRDD<Integer> myArray =
      context.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
  JavaRDD<Integer> partitionedArray = myArray.repartition(2);

  System.out.println("partitioned array size is " + partitionedArray.count());

  partitionedArray.foreachPartition(new VoidFunction<Iterator<Integer>>() {
      public void call(Iterator<Integer> arg0) throws Exception {
          while (arg0.hasNext()) {
              System.out.println(arg0.next());
          }
      }
  });
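
If you only want one partition rather than all of them, a sketch in Scala (assuming the distData RDD from the question and 0-based index 5; TaskContext is a standard Spark class) checks the task's partition id inside foreachPartition. Note that println runs on the executors, so the output only shows up on your console in local mode:

  import org.apache.spark.TaskContext

  // Sketch: print only the partition whose id matches the target index.
  distData.foreachPartition { it =>
    if (TaskContext.get.partitionId == 5) it.foreach(println)
  }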

Assuming you are doing this for testing purposes only, use glom(). See the Spark documentation: https://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.RDD.glom

  >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
  >>> rdd.glom().collect()
  [[1, 2], [3, 4]]
  >>> rdd.glom().collect()[1]
  [3, 4]

Edit: Example in Scala:

  scala> val distData = sc.parallelize(1 to 50, 10)
  scala> distData.glom().collect()(4)
  res2: Array[Int] = Array(21, 22, 23, 24, 25)
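
Note that glom().collect() ships every partition to the driver just to index into one of them. For larger RDDs, a lighter sketch uses SparkContext.runJob, which can run over selected partitions only (index 4 as above):

  // Sketch: run a job over partition 4 only; the driver never
  // receives the other nine partitions.
  val part4: Array[Int] =
    sc.runJob(distData, (it: Iterator[Int]) => it.toArray, Seq(4)).head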

Source: https://habr.com/ru/post/986654/

