I have an XML document with mixed content, and I use a custom schema in the DataFrame to parse it. My problem is that the schema only picks up the text of "Measure".
The XML looks like this:

    <QData>
      <Measure> some text here
        <Answer>Answer1</Answer>
        <Question>Question1</Question>
      </Measure>
      <Measure> some text here
        <Answer>Answer1</Answer>
        <Question>Question1</Question>
      </Measure>
    </QData>
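My schema is as follows:

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // QData contains Measure, which contains the Answer and Question strings
    def getCustomSchema(): StructType =
      StructType(Array(
        StructField("QData", StructType(Array(
          StructField("Measure", StructType(Array(
            StructField("Answer", StringType, true),
            StructField("Question", StringType, true)
          )), true)
        )), true)
      ))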
When I try to access the data in Measure, I get "some text here", and it fails when I try to get information from "Answer". I also get only one Measure.
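To make the expected result concrete, here is a minimal sketch outside Spark that reads the same sample with plain scala.xml. It assumes the scala-xml module is on the classpath; the names `MixedContentSketch`, `sample`, and `measures` are hypothetical and only serve the illustration. It is not a fix for the schema, just a picture of the three values each Measure should yield: its own text, its Answer, and its Question.

```scala
import scala.xml.{Text, XML}

object MixedContentSketch {
  // The sample document from the question, with both Measure elements closed.
  val sample: String =
    """<QData>
      |  <Measure> some text here <Answer>Answer1</Answer> <Question>Question1</Question> </Measure>
      |  <Measure> some text here <Answer>Answer1</Answer> <Question>Question1</Question> </Measure>
      |</QData>""".stripMargin

  // Returns one (ownText, answer, question) tuple per Measure element.
  def measures(xml: String): Seq[(String, String, String)] = {
    val root = XML.loadString(xml)
    (root \ "Measure").map { m =>
      // Keep only the text nodes that are direct children of Measure,
      // so the text of Answer and Question is not mixed in.
      val ownText = m.child.collect { case Text(t) => t }.mkString.trim
      (ownText, (m \ "Answer").text, (m \ "Question").text)
    }
  }

  def main(args: Array[String]): Unit =
    measures(sample).foreach(println)
}
```

In other words, each Measure carries a text value alongside its child elements, which the schema above (with only Answer and Question under Measure) has no field for.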
EDIT: This is how I try to access the data:

    val result = sc.read.format("com.databricks.spark.xml")
      .option("attributePrefix", "attr_")
      .schema(getCustomSchema)
      .load(filename.toString)

    val qDfTemp = result.mapPartitions(partition => {
      val mapper = new QDMapper()
      partition.map(row => { mapper(row) }).flatMap(list => list)
    }).toDF()

    case class QDMapper() {
      def apply(row: Row): List[QData] = {
        val qDList = new ListBuffer[QData]()
        val qualData = row.getAs[Row]("QData") // When I print as list I get the first Measure text and that is it
        val measure = qualData.getAs[Row]("Measure") // This fails
      }
    }