
Editing XML Files in R

I have an XML document with the following element:

<sequence id = "ancestralSequence"> <taxon id="test"> </taxon> ACAGTTGACACCCTT </sequence> 

and would like to parse a new sequence of characters into it, inside the taxon tags. I have started reading the documentation for the XML packages, but so far I cannot find a simple solution. My code is:

    # load packages
    require("XML")

    # create a new sequence
    newSeq <- "TGTCAATGGAACCTG"

    # read the xml
    secondPartXml <- xmlTreeParse("generateSequences_secondPart.xml")
2 answers

I would read it using xmlParse and then get the bit I want with XPath expressions. For example, in your test data, here's how to get the text value in a sequence tag:

    x = xmlParse("test.xml")
    xmlValue(xpathApply(x, "//sequence")[[1]])
    ## [1] "\n \n ACAGTTGACACCCTT\n"

- two line breaks, some spaces, and then the bases.

To get text in a taxon tag:

    xmlValue(xpathApply(x, "//sequence/taxon")[[1]])
    ## [1] "\n "

- empty, just whitespace (a line break and a space).

Now, to replace one string with another, you just need to find the "text node", which is a bit of XML with some invisible magic around it that makes it look like plain text even though it is not, and set its value to something.
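To make the "looks like text, but is not" point concrete, here is a minimal sketch. It assumes the XML package is installed and parses a shortened, whitespace-free copy of the question's element from a string instead of from a file; the variable name node is only illustrative.

```r
library(XML)

# A compact copy of the element from the question, parsed from a string
x <- xmlParse('<sequence id="ancestralSequence"><taxon id="test"/>ACAGTTGACACCCTT</sequence>',
              asText = TRUE)

# text() selects the text node that sits directly under <sequence>
node <- getNodeSet(x, "//sequence/text()")[[1]]

# The node is an object pointing into the parsed tree, not a character string
is.character(node)
inherits(node, "XMLInternalTextNode")

# xmlValue() extracts the string the node holds
xmlValue(node)
```

The string only appears once you ask for it with xmlValue(); everything else operates on the node object itself.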

Given some data with several sequences in it, suppose you want to wrap each sequence with CCCCC at the beginning and GGGGGGG at the end:

    <data>
      <sequence id = "ancestralSequence">
        <taxon id="test">Taxon </taxon>
        ACAGTTGACACCCTT
      </sequence>
      <sequence id = "someotherSequence">
        <taxon id="thing">Taxoff </taxon>
        GGCGGCGCGGGGGGG
      </sequence>
    </data>

Here is the code:

    library(XML)

    # read in to a tree:
    x = xmlParse("test.xml")

    # this returns a *list* of text nodes under sequence
    # and NOT the text nodes under taxon
    nodeSet = xpathApply(x, "//sequence/text()")

    # now we loop over the list returned, and get and modify the node value:
    sapply(nodeSet, function(G){
      text = paste("CCCCC", xmlValue(G), "GGGGGGG", sep="")
      # drop line breaks, spaces and anything else that is not an uppercase letter
      text = gsub("[^A-Z]", "", text)
      xmlValue(G) = text
    })

Note that this is done by reference, which is odd in R. After all this, the object x has changed, although we never assigned anything to it directly. The nodes we play with in the loop are references, pointers to the data stored in the object x.
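The by-reference behaviour can be checked with a small sketch. It assumes the XML package is installed and uses a shortened, whitespace-free version of the test data, parsed from a string; the names node and after are only illustrative.

```r
library(XML)

# Parse a small document from a string (asText = TRUE) instead of a file
x <- xmlParse('<data><sequence id="s1"><taxon id="t1">Taxon</taxon>ACAGTTGACACCCTT</sequence></data>',
              asText = TRUE)

# Grab the text node directly under <sequence>
node <- xpathApply(x, "//sequence/text()")[[1]]

# Change the node; note that x itself is never assigned to
xmlValue(node) <- "TGTCAATGGAACCTG"

# Query x again: the document now holds the new value
after <- xmlValue(xpathApply(x, "//sequence/text()")[[1]])
after
```

Only node is ever on the left-hand side of an assignment, yet the second query against x returns the new sequence.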

In any case, this should do you. Note that "parsing" does not mean replacing at all; it means analysing the syntax of an expression, in this case picking out the tags, attributes, and contents of the XML document.


You can try using replaceNodes and either create a new node, which may be easier to work with, or replace only the text.

    # doc is assumed to be the document parsed with xmlParse(),
    # and newSeq the new sequence from the question

    # new node name
    # invisible(replaceNodes(doc[["//sequence/text()"]], newXMLNode("new", newSeq)))

    # new text only
    invisible(replaceNodes(doc[["//sequence/text()"]], newXMLTextNode(newSeq)))
    doc
    ## <?xml version="1.0"?>
    ## <sequence id="ancestralSequence"><taxon id="test"> </taxon>TGTCAATGGAACCTG</sequence>
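Neither answer writes the result back to disk, so here is a hedged, self-contained sketch of the whole round trip. It assumes the XML package is installed, parses the question's element from a string, and saves to a temporary file; outFile and textNodes are illustrative names, and the normalize-space() predicate is an addition that skips any whitespace-only text node sitting in front of the taxon element.

```r
library(XML)

newSeq <- "TGTCAATGGAACCTG"

# Parse the element from the question from a string rather than a file
doc <- xmlParse('<sequence id = "ancestralSequence"> <taxon id="test"> </taxon> ACAGTTGACACCCTT </sequence>',
                asText = TRUE)

# Select only the text nodes under <sequence> that contain non-whitespace
# characters, i.e. the one holding the bases
textNodes <- getNodeSet(doc, "//sequence/text()[normalize-space()]")

# Swap that text node for a new one carrying the new sequence
invisible(replaceNodes(textNodes[[1]], newXMLTextNode(newSeq)))

# Write the modified document to a temporary file (a stand-in for your real path)
outFile <- tempfile(fileext = ".xml")
saveXML(doc, file = outFile)
```

Reading outFile back in with xmlParse should show the new sequence after the taxon element, with the id attributes left untouched.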

Source: https://habr.com/ru/post/1399558/

