I have a bunch of XML files and an R script that reads their contents into a data frame. However, some of the files declare a namespace on the root element, and that declaration prevents my usual XPath expressions from selecting their values.
The XML files are as follows:
xml_nons.xml
<?xml version="1.0" encoding="UTF-8"?>
<XML>
  <Node>
    <Name>Name 1</Name>
    <Title>Title 1</Title>
    <Date>2015</Date>
  </Node>
</XML>
And the other:
xml_ns.xml
<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
  <Node>
    <Name>Name 2</Name>
    <Title>Title 2</Title>
    <Date>2014</Date>
  </Node>
</XML>
The URL that xmlns points to does not exist.
The R code I use looks like this:
library(XML)

# Collect every .xml file below the working directory.
# `pattern` is a regular expression, so the dot is escaped.
xmlfiles <- list.files(path = ".", pattern = "\\.xml$",
                       full.names = TRUE, recursive = TRUE)
n <- length(xmlfiles)
dat <- vector("list", n)

for (i in seq_len(n)) {
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  nodes <- getNodeSet(doc, "//XML")
  # Build one data frame per matched root element.
  x <- lapply(nodes, function(x) {
    data.frame(
      Filename = xmlfiles[i],
      Name     = xpathSApply(x, ".//Node/Name",  xmlValue),
      Title    = xpathSApply(x, ".//Node/Title", xmlValue),
      Date     = xpathSApply(x, ".//Node/Date",  xmlValue)
    )
  })
  dat[[i]] <- do.call("rbind", x)
}

xml <- do.call("rbind", dat)
xml
However, the result contains only the file without a namespace:
Filename        Name    Title    Date
./xml_nons.xml  Name 1  Title 1  2015
If I remove the namespace declaration from the second file, I get:
Filename          Name    Title    Date
./xml_nons_1.xml  Name 1  Title 1  2015
./xml_ns_1.xml    Name 2  Title 2  2014
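The difference can be reproduced without any files on disk. The following is a minimal sketch that parses an inline copy of xml_ns.xml and compares an unprefixed path with a prefixed one. It assumes the `namespaces` argument of `getNodeSet` in the XML package; the prefix `d` is an arbitrary name chosen for illustration and does not appear in the files.

```r
library(XML)

# Inline copy of xml_ns.xml, parsed from a string instead of from disk.
ns_text <- '<XML xmlns="http://www.nonexistingsite.com">
  <Node><Name>Name 2</Name><Title>Title 2</Title><Date>2014</Date></Node>
</XML>'
ns_doc <- xmlParse(ns_text, asText = TRUE)

# Unprefixed path, as in the script above. In XPath 1.0 an unprefixed
# name only matches elements that are in no namespace.
plain_hits <- getNodeSet(ns_doc, "//XML")

# Same element, with the namespace URI bound to a prefix.
# "d" is a hypothetical prefix, not something defined in the file.
prefixed_hits <- getNodeSet(
  ns_doc, "//d:XML",
  namespaces = c(d = "http://www.nonexistingsite.com")
)

length(plain_hits)
length(prefixed_hits)
```

If the first count is zero and the second is one, the empty result comes from the default namespace on the root element rather than from the file contents.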
Of course, I could use XSLT to strip these namespaces from the XML source files, but I would prefer a solution that works inside R. Is there a way to tell R to simply ignore the namespace declaration on the root element?
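One direction that stays inside R, sketched here as an assumption rather than a confirmed answer, is to match elements by their local name with the XPath function `local-name()`, so the same expression applies to files with and without a namespace. The helper `node_values` and the `Source` column are hypothetical names introduced only for this example, and the two documents are inline copies of the files shown above.

```r
library(XML)

# Hypothetical helper: text of every <Node>/<name> element, matched by
# local name only, so any namespace on the element is ignored.
node_values <- function(doc, name) {
  path <- sprintf("//*[local-name()='Node']/*[local-name()='%s']", name)
  xpathSApply(doc, path, xmlValue)
}

# Inline copies of xml_nons.xml and xml_ns.xml.
nons_doc <- xmlParse(
  '<XML><Node><Name>Name 1</Name><Title>Title 1</Title><Date>2015</Date></Node></XML>',
  asText = TRUE
)
ns_doc <- xmlParse(
  '<XML xmlns="http://www.nonexistingsite.com"><Node><Name>Name 2</Name><Title>Title 2</Title><Date>2014</Date></Node></XML>',
  asText = TRUE
)

docs <- list(nons = nons_doc, ns = ns_doc)

# One row per document, built with the same expressions for both.
result <- do.call("rbind", lapply(names(docs), function(id) {
  data.frame(
    Source = id,
    Name   = node_values(docs[[id]], "Name"),
    Title  = node_values(docs[[id]], "Title"),
    Date   = node_values(docs[[id]], "Date"),
    stringsAsFactors = FALSE
  )
}))
result
```

The trade-off is that `local-name()` ignores namespaces entirely, so elements with the same name in different namespaces would no longer be distinguished. Where that matters, binding the URI to a prefix through the `namespaces` argument is the stricter alternative.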