Using xpathSApply to scrape XML attributes in R

I am scraping XML in R using xpathSApply (from the XML package) and am having trouble pulling out attributes.

First, a relevant snippet of the XML:

<div class="offer-name"> <a href="http://www.somesite.com" itemprop="name">Fancy Product</a> </div> 

I have successfully pulled out "Fancy Product" (i.e. the element's text?) using:

 Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue) 

It took a while (I'm a n00b), but the documentation is good and there are several answered questions here that I was able to use. What I can't figure out is how to pull out "http://www.somesite.com" (the attribute?). I assume it involves changing the third argument from "xmlValue" to "xmlGetAttr", but I could be completely off.

FYI: (1) there are two more parent <div> elements above the snippet I pasted, and (2) here is the abbreviated full-ish code (which I don't think is relevant, but I'm including it for completeness):

    library(XML)
    library(httr)
    content2 = paste(readLines(file.choose()), collapse = "\n") # User will select file.
    parsedHTML = htmlParse(content2, asText = TRUE)
    Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue)
2 answers

href is an attribute. You can select the appropriate node with //div/a and apply the xmlGetAttr function with name = "href":

    '<div class="offer-name"> <a href="http://www.somesite.com" itemprop="name">Fancy Product</a> </div>' -> xData
    library(XML)
    parsedHTML <- xmlParse(xData)
    Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue)
    hrefs <- xpathSApply(parsedHTML, "//div/a", xmlGetAttr, 'href')
    hrefs
    # [1] "http://www.somesite.com"
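One case worth keeping in mind: if some of the matched <a> nodes have no href, xmlGetAttr returns NULL for them by default, and the result may come back as a list rather than a character vector. A minimal sketch, assuming a made-up input in which the second link lacks an href, uses the default argument of xmlGetAttr to get NA instead:

```r
library(XML)

# Hypothetical input (not from the question): the second <a> has no href.
xData <- paste0(
  '<div>',
  '<div class="offer-name"> <a href="http://www.somesite.com">Fancy Product</a> </div>',
  '<div class="offer-name"> <a>No Link Product</a> </div>',
  '</div>'
)
parsedHTML <- xmlParse(xData, asText = TRUE)

# Extra arguments after the function are passed on to xmlGetAttr;
# default = NA_character_ replaces the NULL returned for a missing attribute.
hrefs <- xpathSApply(parsedHTML, "//div[@class='offer-name']/a",
                     xmlGetAttr, "href", default = NA_character_)
print(hrefs)
```

This keeps hrefs the same length as the set of matched nodes, which matters if you later line it up with the product names.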

You can also do this directly with XPath, without using xpathSApply(...):

    xData <- '<div class="offer-name"> <a href="http://www.somesite.com" itemprop="name">Fancy Product</a> </div>'
    library(XML)
    parsedHTML <- xmlParse(xData)
    hrefs <- unlist(parsedHTML["//div[@class='offer-name']/a/@href"])
    hrefs
    #                      href
    # "http://www.somesite.com"
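To tie this back to the question's setup (htmlParse on a full page, with extra parent <div> elements), here is a rough sketch that collects the product name and the link from the same <a> nodes. The sample page, the second offer, and the names links and offers are made up for illustration:

```r
library(XML)

# Hypothetical sample page: two offers nested inside two extra parent <div>s.
html <- paste0(
  '<html><body><div id="outer"><div id="inner">',
  '<div class="offer-name"> <a href="http://www.somesite.com" itemprop="name">Fancy Product</a> </div>',
  '<div class="offer-name"> <a href="http://www.example.com" itemprop="name">Plain Product</a> </div>',
  '</div></div></body></html>'
)

parsedHTML <- htmlParse(html, asText = TRUE)

# Select the <a> children once, then read the text and the attribute
# from the same nodes so names and links stay aligned.
links <- getNodeSet(parsedHTML, "//div[@class='offer-name']/a")
offers <- data.frame(
  name = sapply(links, xmlValue),
  href = sapply(links, xmlGetAttr, "href"),
  stringsAsFactors = FALSE
)
print(offers)
```

Because the XPath starts with //, the extra parent elements do not matter. Taking xmlValue from the <a> node rather than the enclosing <div> also avoids the surrounding whitespace.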

Source: https://habr.com/ru/post/973834/

