How to remove parentheses around records when saveAsTextFile on RDD [(String, Int)]?

Question

How can I remove the parentheses "(" and ")" from the output of the Spark job below?

They cause a problem when I read the Spark output with a Pig script.

My code is:

scala> val words = Array("HI","HOW","ARE")
words: Array[String] = Array(HI, HOW, ARE)

scala> val wordsRDD = sc.parallelize(words)
wordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> val keyvalueRDD = wordsRDD.map(elem => (elem,1))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:25

scala> val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
wordcountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:27

scala> wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles")

The output produced by the code above is:

$ hadoop dfs -cat /user/cloudera/outputfiles/part*

(HOW,1)
(ARE,1)
(HI,1)

But I want the Spark output to be stored as below, without parentheses:

HOW,1
ARE,1
HI,1

Now I want to read the above output using a Pig script.

The LOAD statement in the Pig script reads "(HOW" as the first atom and "1)" as the second atom.

Is there any way to get rid of the parentheses in the Spark code itself? I do not want to apply the fix in the Pig script.

Pig script:

records = LOAD '/user/cloudera/outputfiles' USING PigStorage(',') AS (word:chararray);
dump records;

Pig output:

 ((HOW)
 ((ARE)
 ((HI)

hadoop apache-spark apache-pig

Surureder raja, Dec 30 '16 at 12:55

Map each Tuple to a string before saving:

val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
                              // here we set custom format
                              .map(x => x._1 + "," + x._2)
wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles")
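
If the records ever have more than two fields, concatenating x._1, x._2, ... by hand gets tedious. As a rough sketch (DelimitedLine and toLine are hypothetical names, not part of Spark), every Scala tuple is a Product, so its fields can be joined generically:

```scala
// Hypothetical helper, not a Spark API: turns any tuple (or case class)
// into one delimited line by joining its fields.
object DelimitedLine {
  // productIterator walks the fields of the record in order;
  // mkString joins their toString values with the delimiter.
  def toLine(record: Product, delimiter: String = ","): String =
    record.productIterator.mkString(delimiter)

  def main(args: Array[String]): Unit = {
    println(toLine(("HOW", 1)))      // a (String, Int) pair like the ones in the question
    println(toLine(("HI", 1), "\t")) // same idea with a tab delimiter
  }
}
```

Assuming the helper is in scope in the shell, something like .map(DelimitedLine.toLine(_)) in place of the hand-written concatenation should give the same lines before saveAsTextFile. The helper itself relies only on plain Scala.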

T. Gawęda, Dec 30 '16 at 12:58

Jacek Laskowski · Accepted Answer · 2016-12-30T12:59:20+0000

Use map to format the records before writing them to outputfiles, e.g.:

wordcountRDD.map { case (k, v) => s"$k,$v" }.saveAsTextFile("/user/cloudera/outputfiles")
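
For context on where the parentheses come from: saveAsTextFile writes the toString of each element, and Scala's Tuple2.toString wraps the fields in parentheses. A minimal sketch on a plain Scala collection (no Spark needed; the object and value names are made up for illustration) shows the difference:

```scala
// Illustration only: a local Seq stands in for the RDD[(String, Int)].
object TupleFormatting {
  // Same shape as the records in wordcountRDD.
  val records: Seq[(String, Int)] = Seq(("HOW", 1), ("ARE", 1), ("HI", 1))

  // Roughly what ends up in the part files when tuples are saved directly:
  // the default toString of each pair, parentheses included.
  val asWritten: Seq[String] = records.map(_.toString)

  // Formatting each pair explicitly, with no space after the comma,
  // so the lines match the HOW,1 form asked for in the question.
  val formatted: Seq[String] = records.map { case (k, v) => s"$k,$v" }

  def main(args: Array[String]): Unit = {
    asWritten.foreach(println)
    formatted.foreach(println)
  }
}
```

The same map call works on the RDD, since the function passed to map is ordinary Scala either way.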

Alternatively, you could use Datasets:

scala> words.toSeq.toDS.groupBy("value").count().show
+-----+-----+
|value|count|
+-----+-----+
|  HOW|    1|
|  ARE|    1|
|   HI|    1|
+-----+-----+

scala> words.toSeq.toDS.groupBy("value").count.write.csv("outputfiles")

$ cat outputfiles/part-00199-aa752576-2f65-481b-b4dd-813262abb6c2-c000.csv
HI,1

See the Spark SQL, DataFrames and Datasets Guide.
