How do I remove the parentheses "(" and ")" from the output of the Spark job below?
The parentheses cause a problem when I try to read the Spark output with a Pig script.
My code is:
scala> val words = Array("HI","HOW","ARE")
words: Array[String] = Array(HI, HOW, ARE)
scala> val wordsRDD = sc.parallelize(words)
wordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> val keyvalueRDD = wordsRDD.map(elem => (elem,1))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:25
scala> val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
wordcountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:27
scala> wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles")
The output of the code above is:
hadoop dfs -cat /user/cloudera/outputfiles/part*
(HOW,1)
(ARE,1)
(HI,1)
But I want the Spark output to look like the following, without parentheses:
HOW,1
ARE,1
HI,1
Now I want to read the above output using a Pig script.
Because the file is split on ",", the LOAD operation in the Pig script treats the text before the comma (for example "(HOW") as the first atom and "1)" as the second atom.
Is there any way to get rid of the parentheses in the Spark code itself? I do not want to apply the fix for this in the Pig script.
Pig script:
records = LOAD '/user/cloudera/outputfiles' USING PigStorage(',') AS (word:chararray);
dump records;
Pig output:
((HOW)
((ARE)
((HI)
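One possible Spark-side approach, as a rough sketch: saveAsTextFile writes each element using its toString, and a Scala tuple prints as "(HOW,1)", so mapping each pair to a plain "word,count" string before saving should avoid the parentheses. The helper name formatPair and the output path ending in _csv are hypothetical, chosen only for illustration. The snippet can be pasted into spark-shell or a plain Scala REPL; the Spark call is left as a comment because it needs the wordcountRDD defined above.

```scala
// Sketch: turn each (word, count) pair into the string "word,count".
// formatPair is a hypothetical helper name, not part of the Spark API.
def formatPair(pair: (String, Int)): String = s"${pair._1},${pair._2}"

// Local check on a plain Scala collection (no cluster needed).
val sample = Seq(("HOW", 1), ("ARE", 1), ("HI", 1))
val lines = sample.map(formatPair)
lines.foreach(println)

// In spark-shell the same function would be applied before saving.
// The output directory is an assumed, not-yet-existing path:
// wordcountRDD.map(formatPair).saveAsTextFile("/user/cloudera/outputfiles_csv")
```

With lines in this form, PigStorage(',') should see HOW and 1 as separate fields without any stray parentheses.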