Mixed Content XML Parsing Using a DataFrame

I have an XML document with mixed content, and I parse it into a DataFrame using my own schema. The problem is that the schema only picks up the text of "Measure".

The XML looks like this:

 <QData>
   <Measure>
     some text here
     <Answer>Answer1</Answer>
     <Question>Question1</Question>
   </Measure>
   <Measure>
     some text here
     <Answer>Answer1</Answer>
     <Question>Question1</Question>
   </Measure>
 </QData>
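
To make the structure concrete, here is a small sketch that lists what each Measure contains. It uses the scala-xml library rather than Spark, the helper name `describe` is made up for illustration, and the last closing tag is written as `</Measure>` so that the sample parses.

```scala
import scala.xml.{Elem, Node, Text, XML}

// The sample document from the question, with well-formed closing tags.
val doc: Elem = XML.loadString(
  """<QData>
    |  <Measure> some text here <Answer>Answer1</Answer> <Question>Question1</Question> </Measure>
    |  <Measure> some text here <Answer>Answer1</Answer> <Question>Question1</Question> </Measure>
    |</QData>""".stripMargin)

// Both Measure elements are direct children of QData.
val measures = doc \ "Measure"

// Hypothetical helper: describe each child of a Measure as free text or a named element.
def describe(measure: Node): Seq[String] =
  measure.child.collect {
    case Text(t) if t.trim.nonEmpty => s"text: ${t.trim}" // free text mixed in between elements
    case e: Elem                    => s"element ${e.label}: ${e.text}"
  }.toSeq

measures.foreach(m => println(describe(m).mkString("; ")))
```

Each Measure carries a free-text node next to its Answer and Question children, and Measure itself repeats inside QData. Those are the two things a schema for this document has to account for.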

My schema is as follows:

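 import org.apache.spark.sql.types.{StringType, StructField, StructType}

 // Custom schema: QData holds a single Measure struct with Answer and Question.
 // The top-level field is wrapped in a StructType so that the declared return type matches.
 def getCustomSchema(): StructType = StructType(Array(
   StructField("QData", StructType(Array(
     StructField("Measure", StructType(Array(
       StructField("Answer", StringType, true),
       StructField("Question", StringType, true)
     )), true)
   )), true)
 ))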

When I try to access the data in "Measure", I get only "some text here", and it fails when I try to read "Answer". I also get only one Measure, although the document contains two.
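
One direction that might be worth trying is sketched below; it has not been verified against this setup. It declares Measure as an array, since the element repeats, and adds a string field for the free text. The field name `_VALUE` is the default of spark-xml's `valueTag` option; whether the parser puts mixed-content text there depends on the spark-xml version, so treat that part as an assumption. The function name `getMixedContentSchema` is hypothetical.

```scala
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical variant of getCustomSchema for the sample XML.
def getMixedContentSchema(): StructType = {
  // One Measure entry. "_VALUE" is assumed to receive the free text
  // (default name of spark-xml's valueTag option; verify for your version).
  val measureType = StructType(Array(
    StructField("_VALUE", StringType, nullable = true),
    StructField("Answer", StringType, nullable = true),
    StructField("Question", StringType, nullable = true)
  ))

  StructType(Array(
    StructField("QData", StructType(Array(
      // ArrayType because Measure occurs more than once inside QData.
      StructField("Measure", ArrayType(measureType, containsNull = true), nullable = true)
    )), nullable = true)
  ))
}
```

With a schema of this shape, the Measure column would be read as a sequence, for example `qualData.getAs[Seq[Row]]("Measure")`, rather than as a single `Row`.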

EDIT: This is how I try to access the data:

 val result = sc.read.format("com.databricks.spark.xml")
   .option("attributePrefix", "attr_")
   .schema(getCustomSchema)
   .load(filename.toString)

 val qDfTemp = result.mapPartitions(partition => {
   val mapper = new QDMapper()
   partition.map(row => mapper(row)).flatMap(list => list)
 }).toDF()

 case class QDMapper() {
   def apply(row: Row): List[QData] = {
     val qDList = new ListBuffer[QData]()
     val qualData = row.getAs[Row]("QData") //When I print as list I get the first Measure text and that is it
     val measure = qualData.getAs[Row]("Measure") //This fails
   }
 }

Source: https://habr.com/ru/post/1274444/

