Import all XML fields (and subfields) as dataframe

Question

For some analysis, I want to import XML into a data frame using R and the XML package. Example XML file:

    <watchers shop_name="TEST" created_at="September 14, 2012 05:44">
      <watcher channel="Site Name">
        <code>123456</code>
        <search_key>TestKey</search_key>
        <date>September 14, 2012 04:15</date>
        <result>Found</result>
        <link>http://www.test.com/fakeurl</link>
        <price>100.0</price>
        <shipping>0.0</shipping>
        <origposition>0</origposition>
        <name>Name Test</name>
        <results>
          <result position="1">
            <c_name>CTest1</c_name>
            <c_price>599.49</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>599.49</c_total_price>
            <c_rating>8.3</c_rating>
            <c_delivery/>
          </result>
          <result position="2">
            <c_name>CTest2</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>9.8</c_rating>
            <c_delivery/>
          </result>
          <result position="3">
            <c_name>CTest3</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>8.8</c_rating>
            <c_delivery/>
          </result>
        </results>
      </watcher>
    </watchers>

I want to have data frame rows containing the following fields:

 shop_name created_at code search_key date result link price shipping origposition name position c_name c_price c_shipping c_total_price c_rating c_delivery

This means the child nodes must also be taken into account, so this example produces a data frame with three rows (since results contains 3 positions). The fields

 shop_name created_at code search_key date result link price shipping origposition name

are the same for each of these lines.

I can parse the XML file, but I cannot get a data frame with the fields I want. When I convert the XML to a data frame, I get the following fields:

 "code" "search_key" "date" "result" "link" "price" "shipping" "origposition" "name" "results"

Here the fields

 shop_name created_at

are missing at the beginning, and all the results are collapsed into a single "results" column.

It should be possible to get the required data frame, but I don't know how to do it.

UPDATE

The solution provided by @MvG works fine on the test XML file above. However, the result field can also have the value "Not found". Records with this value lack certain fields (always the same ones) and therefore trigger a "numbers of columns of arguments do not match" error when the solution runs. I would like these records to be included in the data frame as well, with the missing fields left blank. I do not understand how to handle this scenario.

test.xml

    <watchers shop_name="TEST" created_at="September 14, 2012 05:44">
      <watcher channel="Site Name">
        <code>123456</code>
        <search_key>TestKey</search_key>
        <date>September 14, 2012 04:15</date>
        <result>Found</result>
        <link>http://www.test.com/fakeurl</link>
        <price>100.0</price>
        <shipping>0.0</shipping>
        <origposition>0</origposition>
        <name>Name Test</name>
        <results>
          <result position="1">
            <c_name>CTest1</c_name>
            <c_price>599.49</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>599.49</c_total_price>
            <c_rating>8.3</c_rating>
            <c_delivery/>
          </result>
          <result position="2">
            <c_name>CTest2</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>9.8</c_rating>
            <c_delivery/>
          </result>
          <result position="3">
            <c_name>CTest3</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>8.8</c_rating>
            <c_delivery/>
          </result>
        </results>
      </watcher>
      <watcher channel="Shopping">
        <code>12804</code>
        <search_key></search_key>
        <date></date>
        <result>Not found</result>
        <link>https://www.test.com/testing1323p</link>
        <price>0.0</price>
        <shipping>0.0</shipping>
        <origposition>0</origposition>
        <name>MOOVM6002020</name>
        <results> </results>
      </watcher>
    </watchers>

xml import r

Max van der heijden Sep 14 '12 at 9:26

2 answers

Here is one possibility:

    library(XML)
    doc <- xmlParse("test.xml")

    # Turn the attributes of a node into a one-row data frame
    attr2df <- function(n) do.call(data.frame, as.list(xmlAttrs(n)))

    cbind(
      attr2df(xmlRoot(doc)),
      do.call(rbind, xpathApply(doc, "//watcher", function(w) {
        x <- xmlToDataFrame(nodes = list(w))
        x$results <- NULL
        cbind(attr2df(w), x,
              xmlToDataFrame(nodes = getNodeSet(w, "results/result")))
      }))
    )

Iterate over all watcher nodes. For each watcher, read its subtree into a data frame x and read its result nodes into a second data frame. Drop the results column from x, then bind the columns of both frames together with the watcher's attributes. The xpathApply call yields one data.frame per watcher, and the outer rbind call combines them into a single data frame. The outermost cbind adds the attributes of the root node.
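As a rough illustration of the attribute handling described above, the sketch below runs attr2df on a cut-down inline copy of the example. The xml_text string is a shortened stand-in, not the full file. The last step, reading the position attribute of each result node with xmlGetAttr, is an addition for illustration and not part of the answer's code.

```r
library(XML)

# Shortened stand-in for test.xml (assumption: same structure, fewer fields)
xml_text <- '<watchers shop_name="TEST" created_at="September 14, 2012 05:44">
  <watcher channel="Site Name">
    <code>123456</code>
    <results>
      <result position="1"><c_name>CTest1</c_name></result>
      <result position="2"><c_name>CTest2</c_name></result>
    </results>
  </watcher>
</watchers>'
doc <- xmlParse(xml_text, asText = TRUE)

# Same helper as in the answer: node attributes as a one-row data frame
attr2df <- function(n) do.call(data.frame, as.list(xmlAttrs(n)))

watcher <- getNodeSet(doc, "//watcher")[[1]]

# Root attributes and watcher attributes side by side
both <- cbind(attr2df(xmlRoot(doc)), attr2df(watcher))

# Illustrative extra: collect the position attribute of every result node
positions <- sapply(getNodeSet(watcher, "results/result"),
                    xmlGetAttr, "position")
```

Here both has the columns shop_name, created_at and channel, and positions holds one entry per result node, which could be bound to the per-result data frame as an extra column.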

The result will have the following names:

     [1] "shop_name"     "created_at"    "channel"       "code"
     [5] "search_key"    "date"          "result"        "link"
     [9] "price"         "shipping"      "origposition"  "name"
    [13] "c_name"        "c_price"       "c_shipping"    "c_total_price"
    [17] "c_rating"      "c_delivery"

Mvg Sep 14 '12 at 10:19

Mvg · Accepted Answer · 2012-09-16T08:43:12+0000

Here is a more general approach. Each node is classified as one of three cases:

If the node is of kind rows, the data frames from its child nodes become separate rows of the result.
If the node is of kind cols, the data frames from its child nodes become separate columns of the result.
If the node is of kind value, a data frame with a single value is created, using the node name as the column name and the node value as the column value.
For all three cases, the node attributes will be added to the data frame.
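The three kinds can be mimicked with plain data frames, without any XML involved. The sketch below uses made-up one-cell frames as stand-ins for value nodes. It assumes, as in the code further down, that cols children are combined with a cross join (merge with by = NULL) and rows children with rbind.

```r
# "value" nodes: one row, one column, named after the node (sample values)
code  <- data.frame(code = "123456", stringsAsFactors = FALSE)
price <- data.frame(price = "100.0", stringsAsFactors = FALSE)

# "cols" node: children are placed side by side via a cross join
watcher_fields <- merge(code, price, by = NULL)

# "rows" node: children are stacked on top of each other
results <- rbind(
  data.frame(c_name = "CTest1", stringsAsFactors = FALSE),
  data.frame(c_name = "CTest2", stringsAsFactors = FALSE)
)

# A "cols" parent with a multi-row child: the parent's own fields
# are repeated once per row of that child
watcher <- merge(watcher_fields, results, by = NULL)
```

The cross join is what repeats code and price for every result row, which matches the requirement in the question that the watcher-level fields be the same on each line.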

A call for your use case is given at the bottom.

    library(XML)

    # A data frame with exactly one row and no columns
    zeroColSingleRow <- function() {
      res <- data.frame(dummy=NA)
      res$dummy <- NULL
      stopifnot(nrow(res) == 1, ncol(res) == 0)
      return (res)
    }

    xml2df <- function(node, classifier) {
      # Non-element nodes (text, comments) yield an empty single-row frame
      if (! inherits(node, c("XMLInternalElementNode", "XMLElementNode"))) {
        return (zeroColSingleRow())
      }
      kind <- classifier(node)
      if (kind == "rows") {
        # Stack the children's data frames, padding missing columns with NA
        cdf <- lapply(xmlChildren(node), xml2df, classifier)
        if (length(cdf) == 0) {
          res <- zeroColSingleRow()
        } else {
          names <- unique(unlist(lapply(cdf, colnames)))
          cdf <- lapply(cdf, function(i) {
            missing <- setdiff(names, colnames(i))
            if (length(missing) > 0) {
              i[missing] <- NA
            }
            return (i)
          })
          res <- do.call(rbind, cdf)
        }
      } else if (kind == "cols") {
        # Combine the children's data frames side by side (cross join)
        cdf <- lapply(xmlChildren(node), xml2df, classifier)
        if (length(cdf) == 0) {
          res <- zeroColSingleRow()
        } else {
          res <- cdf[[1]]
          if (length(cdf) > 1) {
            for (i in 2:length(cdf)) {
              res <- merge(res, cdf[[i]], by=NULL)
            }
          }
        }
      } else {
        # Leaf: a single cell named after the node
        stopifnot(kind == "value")
        res <- data.frame(xmlValue(node))
        names(res) <- xmlName(node)
      }
      if (ncol(res) == 0) {
        res <- zeroColSingleRow()
      }
      # Prepend the node's attributes as columns
      attr <- xmlAttrs(node)
      if (length(attr) > 0) {
        attr <- do.call(data.frame, as.list(attr))
        res <- merge(attr, res, by=NULL)
      }
      rownames(res) <- NULL
      return(res)
    }

    doc <- xmlParse("test.xml")
    xml2df(xmlRoot(doc), function(node) {
      name <- xmlName(node)
      if (name %in% c("watchers", "results"))
        return("rows")
      # make sure to treat results/result different from watcher/result
      if (name %in% c("watcher", "result") &&
          xmlName(xmlParent(node)) == paste0(name, "s"))
        return("cols")
      return("value")
    })
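The rows branch above is what lets records without results (such as a "Not found" watcher) coexist with complete ones: each child frame is padded with NA for the columns it lacks before rbind is called. Isolated from the XML handling, that step might look like the following. The name rbind_fill and the two sample frames are hypothetical, made up for this sketch.

```r
# Hypothetical helper: rbind data frames whose column sets differ,
# filling the columns a frame lacks with NA.
rbind_fill <- function(frames) {
  all_cols <- unique(unlist(lapply(frames, colnames)))
  padded <- lapply(frames, function(df) {
    missing <- setdiff(all_cols, colnames(df))
    if (length(missing) > 0) {
      df[missing] <- NA
    }
    df[all_cols]  # same column order in every frame
  })
  do.call(rbind, padded)
}

# Sample frames: a complete record and one without result columns
found <- data.frame(code = "123456", result = "Found", c_name = "CTest1",
                    stringsAsFactors = FALSE)
not_found <- data.frame(code = "12804", result = "Not found",
                        stringsAsFactors = FALSE)

combined <- rbind_fill(list(found, not_found))
```

Calling plain rbind on found and not_found would fail because their column counts differ; after padding, the second row simply carries NA in c_name.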
