I have an input file foo.txt with the following contents:
c1|c2|c3|c4|c5|c6|c7|c8|
00| |1.0|1.0|9|27.0|0||
01|2|3.0|4.0|1|10.0|1|1|
I want to convert it to a DataFrame so that I can run some SQL queries on it:
var text = sc.textFile("foo.txt")
var header = text.first()
var rdd = text.filter(row => row != header)
case class Data(c1: String, c2: String, c3: String, c4: String, c5: String, c6: String, c7: String, c8: String)
Up to this point everything works. The problem arises with the following statement:
var df = rdd.map(_.split("\\|")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If I try to print df with df.show, I get an error:
scala> df.show()
java.lang.ArrayIndexOutOfBoundsException: 7
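To narrow this down, here is a minimal sketch in plain Scala (no Spark needed) of what split seems to be doing with the first data row. As far as I can tell, String.split without a limit drops trailing empty strings, so a row ending in || yields fewer than 8 fields, while a negative limit keeps them. The names firstRow, fields and allFields are only for illustration.

```scala
// First data row of foo.txt; note the trailing "||" (empty fields at the end).
val firstRow = "00| |1.0|1.0|9|27.0|0||"

// Default split: trailing empty strings are removed from the result.
val fields = firstRow.split("\\|")
println(fields.length)     // 7, so fields(7) is out of bounds

// A negative limit keeps the trailing empty strings.
val allFields = firstRow.split("\\|", -1)
println(allFields.length)  // 9: the eight columns plus the empty piece after the final "|"
```

This would explain why the exception reports index 7: the array for that row only has indices 0 to 6.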
I know that the error may be caused by the delimiter I pass to split. I also tried splitting foo.txt with the following syntax:
var df = rdd.map(_.split("""|""")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
And then I get something like this:
scala> df.show()
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
|    c1|       c2|        c3|         c4|   c5|         c6|              c7|              c8|
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
|     0|        0|         ||           |    ||          1|               .|               0|
|     0|        1|         ||          2|    ||          3|               .|               0|
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
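The one-character-per-column output can be reproduced outside Spark. The sketch below assumes Java 8 or later, where a zero-width match at the start of the string does not produce a leading empty element. In a triple-quoted string the pipe is not escaped, so the regex is a bare |, an alternation of two empty patterns that matches between every pair of characters. The names secondRow, chars and escaped are only for illustration.

```scala
// Second data row of foo.txt.
val secondRow = "01|2|3.0|4.0|1|10.0|1|1|"

// Unescaped pipe: the regex matches the empty string, so the row is cut into single characters.
val chars = secondRow.split("""|""")
println(chars.take(8).mkString(","))  // 0,1,|,2,|,3,.,0

// Escaped pipe inside a triple-quoted string: splits on the literal "|" character.
val escaped = secondRow.split("""\|""", -1)
println(escaped.length)               // 9
```

The first eight characters printed here are the same values that show up in the second row of the table above.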
So my question is: how do I correctly load this file into a DataFrame?
EDIT: - || . .