The main idea is to take all the lines within each partition and join them into one long string. Then we replace " " with "_" and call sliding on that string to create the trigrams for each partition in parallel.
Note: The resulting trigrams may not be 100% complete, since we miss the few trigrams that span the boundary between the end of one partition and the beginning of the next. Given that each partition can contain several million characters, the loss in accuracy should be negligible. The main advantage here is that each partition can be processed in parallel.
Here is a toy example. Everything below can be run in any Spark REPL:
scala> val data = sc.parallelize(Seq("Hello World, it","is a nice day"))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[12]
scala> val trigrams = data.mapPartitions(_.toList.mkString(" ").replace(" ","_").sliding(3))
trigrams: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14]
scala> val asCollected = trigrams.collect
asCollected: Array[String] = Array(Hel, ell, llo, lo_, o_W, _Wo, Wor, orl, rld, ld,, d,_, ,_i, _it, is_, s_a, _a_, a_n, _ni, nic, ice, ce_, e_d, _da, day)
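Notice that the output jumps straight from _it to is_. To see exactly which trigrams fall into that gap, the same logic can be reproduced in plain Scala without Spark. This is only a minimal sketch: it assumes the two input lines ended up in two separate partitions, and the names trigramsOf, partitions, perPartition, singlePass and lost are made up for illustration.

```scala
// Plain Scala, no Spark required.
// Same per-partition logic as above: join lines, replace spaces, slide a window of 3.
def trigramsOf(lines: Seq[String]): List[String] =
  lines.mkString(" ").replace(" ", "_").sliding(3).toList

// Assumption: each input line sits in its own partition.
val partitions = Seq(Seq("Hello World, it"), Seq("is a nice day"))

// What the mapPartitions version yields: every partition is handled on its own.
val perPartition = partitions.flatMap(trigramsOf).toList

// What a single pass over the whole text would yield.
val singlePass = trigramsOf(partitions.flatten)

// Trigrams that straddle the partition boundary and are therefore missing.
val lost = singlePass.diff(perPartition)
println(lost) // List(it_, t_i, _is)
```

Only the trigrams that span the boundary between the two partitions are missing (three here, because joining the lines adds one separator character), which is why the loss stays negligible when each partition holds millions of characters.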