XML parsing in R: bad namespaces

I have a bunch of XML files and an R script that reads their contents into a data frame. However, some of the files declare a namespace, and that declaration prevents my usual XPath expressions from selecting their values.

The XML files are as follows:

xml_nons.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <XML>
      <Node>
        <Name>Name 1</Name>
        <Title>Title 1</Title>
        <Date>2015</Date>
      </Node>
    </XML>

And the other:

xml_ns.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <XML xmlns="http://www.nonexistingsite.com">
      <Node>
        <Name>Name 2</Name>
        <Title>Title 2</Title>
        <Date>2014</Date>
      </Node>
    </XML>

The URL xmlns points to does not exist.

The R code I use looks like this:

    library(XML)

    xmlfiles <- list.files(path = ".", pattern = "*.xml$", full.names = TRUE, recursive = TRUE)
    n <- length(xmlfiles)
    dat <- vector("list", n)

    for (i in 1:n) {
      doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
      nodes <- getNodeSet(doc, "//XML")
      x <- lapply(nodes, function(x) {
        data.frame(
          Filename = xmlfiles[i],
          Name = xpathSApply(x, ".//Node/Name", xmlValue),
          Title = xpathSApply(x, ".//Node/Title", xmlValue),
          Date = xpathSApply(x, ".//Node/Date", xmlValue)
        )
      })
      dat[[i]] <- do.call("rbind", x)
    }

    xml <- do.call("rbind", dat)
    xml

However, as a result, I get:

    Filename        Name    Title    Date
    ./xml_nons.xml  Name 1  Title 1  2015

If I remove the namespace declaration from the second file, I get:

    Filename          Name    Title    Date
    ./xml_nons_1.xml  Name 1  Title 1  2015
    ./xml_ns_1.xml    Name 2  Title 2  2014

Of course, I could use XSL to strip these namespaces from the XML source files, but I would prefer a solution that works inside R. Is there a way to tell R to simply ignore everything in the XML declaration?

1 answer

I think there is no easy way to ignore namespaces. The best way is to learn how to live with them. This answer uses the newer xml2 package, but the same approach applies to the XML package.
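For the XML package used in the question, a minimal sketch of the same idea is to pass a prefix-to-URI mapping through the `namespaces` argument and use that prefix on every element in the path. The prefix `ns` and the variable names below are arbitrary choices for illustration; the document is an inline copy of xml_ns.xml so that the snippet runs on its own.

```r
library(XML)

# Inline copy of xml_ns.xml from the question
txt <- '<XML xmlns="http://www.nonexistingsite.com">
  <Node>
    <Name>Name 2</Name>
    <Title>Title 2</Title>
    <Date>2014</Date>
  </Node>
</XML>'

doc <- xmlParse(txt, asText = TRUE)

# Map an arbitrary prefix ("ns", an assumption) to the namespace URI of the file
ns <- c(ns = "http://www.nonexistingsite.com")

# Every element step in the path carries the prefix
name_xml  <- xpathSApply(doc, "//ns:Node/ns:Name",  xmlValue, namespaces = ns)
title_xml <- xpathSApply(doc, "//ns:Node/ns:Title", xmlValue, namespaces = ns)
date_xml  <- xpathSApply(doc, "//ns:Node/ns:Date",  xmlValue, namespaces = ns)

data.frame(Name = name_xml, Title = title_xml, Date = date_xml)
```

Files without a namespace would still need the unprefixed paths, so a loop over mixed files has to pick the path per file.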

Using xml2:

    library(xml2)
    fname <- 'myfile.xml'
    doc <- read_xml(fname)
    # peek at the namespaces
    xml_ns(doc)

xml2 assigns the first default (unprefixed) namespace the prefix d1. If an XPath expression does not find what you expect, the most likely cause is a namespace problem: the element names in the path need that prefix.

    xpath <- "//d1:FormDef"
    ns <- xml_find_all(doc, xpath, xml_ns(doc))
    ns
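Applied to the namespaced file from the question, the same pattern might look like the sketch below. It assumes the document is read from an inline string rather than from disk, and the variable names are illustrative only.

```r
library(xml2)

# Inline copy of xml_ns.xml from the question
txt <- '<XML xmlns="http://www.nonexistingsite.com">
  <Node>
    <Name>Name 2</Name>
    <Title>Title 2</Title>
    <Date>2014</Date>
  </Node>
</XML>'

doc <- read_xml(txt)
ns <- xml_ns(doc)  # the default namespace is listed under the prefix d1

# Each element step in the path needs the d1 prefix
name_x2  <- xml_text(xml_find_all(doc, "//d1:Node/d1:Name",  ns))
title_x2 <- xml_text(xml_find_all(doc, "//d1:Node/d1:Title", ns))
date_x2  <- xml_text(xml_find_all(doc, "//d1:Node/d1:Date",  ns))

data.frame(Name = name_x2, Title = title_x2, Date = date_x2)
```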

You also have to add the prefix to every element in the path. To save typing, you can do:

    > library(stringr)
    > xpath <- "/ODM/Study"
    > (xpath <- str_replace_all(xpath, '/', '/d1:'))
    [1] "/d1:ODM/d1:Study"
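Note that replacing every single slash turns a path such as `//Node` into `/d1:/d1:Node`, which is not valid XPath. A slightly more careful variant in base R, offered only as a sketch, inserts the prefix after each run of slashes that is followed by an element name. The helper name `add_prefix` is hypothetical, and the function does not handle node tests such as `text()` or XPath function calls.

```r
# Hypothetical helper: prefix each element step in a simple XPath expression.
# A run of one or more slashes followed by a name-start character gets the
# prefix inserted between them; steps such as "@id" or "." are left alone.
add_prefix <- function(xpath, prefix = "d1") {
  gsub("(/+)([A-Za-z_])", paste0("\\1", prefix, ":\\2"), xpath)
}

add_prefix("/ODM/Study")
add_prefix("//Node/Name")
add_prefix(".//Node/@id")
```

The three calls give `/d1:ODM/d1:Study`, `//d1:Node/d1:Name` and `.//d1:Node/@id` respectively.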

Source: https://habr.com/ru/post/984053/

