Extract data using XPathSApply for more than one attribute

Question


The following URL has a number of tables, and I would like to read the first two columns of one of them. The xpathSApply command works fine, but I need to match on more than one attribute, and I can't figure out how.

 url = "http://floodobservatory.colorado.edu/SiteDisplays/1544data.htm"
 doc = htmlTreeParse(url, useInternalNodes = TRUE)

Sample of the parsed data:

 <tr height="20" style="height:15.0pt">
  <td height="20" class="xl6521398" align="right" style="height:15.0pt">11-Oct-13</td>
  <td class="xl7321398">1853</td>
  <td class="xl7321398"></td>
  <td class="xl8121398">0.80</td>
  <td class="xl7221398" align="right">4.87</td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl7421398"></td>
  <td class="xl7421398"></td>
  <td class="xl7421398"></td>
  <td class="xl7421398"></td>
  <td class="xl9621398"></td>
  <td class="xl7421398"></td>
  <td class="xl8121398"></td>
 </tr>

I need to read values from two cells: one holds the date and the other holds the flow rate. They have the following attributes:

 <td height="20" class="xl6521398" ...> and <td class="xl7321398" ...>

From the sample above, I need to capture 11-Oct-13 and 1853.

I used the following commands to get the "dates" and the "streamflow discharge":

 dates = xpathSApply(doc, "//td[@class='xl6521398']", xmlValue)
 streamflowdischarge = xpathSApply(doc, "//td[@class='xl7321398']", xmlValue)

They do retrieve the information, but the results also include values from other tables/cells, so the "dates" and "streamflow discharge" vectors do not line up.

 > dates[1:10]
  [1] "1-Jan-98"  "2-Jan-98"  "3-Jan-98"  "31-Mar-98" "4-Jan-98"  "30-Apr-98" "5-Jan-98"
  [8] "31-May-98" "6-Jan-98"  "30-Jun-98"

"31-Mar-98" appears between "3-Jan-98" and "4-Jan-98", which is not intended.

 > streamflowdischarge[1:10]
 [1] "3108" "3076" "3051" "3111" "3064" "3043" "3007" "3066" "378" ""

"3108" does not correspond to "1-Jan-98", which can be checked at the URL.

It looks like there are other tables/cells with the same class attribute that I don't want to capture. So it seems I need to match on the full set of attributes, i.e.

 <td height="20" class="xl6521398" align="right" style="height:15.0pt">

in order to get the "date", and somehow I have to make sure that the "streamflow discharge" is retrieved from the same table.

Any suggestions or alternative approaches would be greatly appreciated.

I also tried readHTMLTable, but got an "index out of bounds" error.

Thanks, Satish


r xml-parsing html-parsing

Satishr Nov 19 '14 at 22:15


2 answers

You can use the "and" and "|" operators in XPath:

 path_xp <- '//td[@class="xl6521398" and @height="20"]|//td[@class="xl7321398"]'
 res <- xpathSApply(doc, path_xp, xmlValue)
 res
 [1] "11-Oct-13" "1853" ""

Note that you get 3 elements because there are 2 cells with the class attribute xl7321398. You may need to refine your query, or simply remove the empty third element:

 res[nzchar(res)]
 [1] "11-Oct-13" "1853"
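As a small self-contained illustration of the two operators, the sketch below parses a hypothetical two-row fragment modelled on the sample row in the question (the second row and the class xl9921398 are made up for the example, not taken from the real page). It assumes the XML package is installed.

```r
library(XML)

# Hypothetical fragment: the first row mimics the sample row from the question,
# the second row reuses the date class but has no height attribute.
html_txt <- paste0(
  '<table>',
  '<tr><td height="20" class="xl6521398">11-Oct-13</td>',
  '<td class="xl7321398">1853</td><td class="xl7321398"></td></tr>',
  '<tr><td class="xl6521398">31-Mar-98</td><td class="xl9921398">3111</td></tr>',
  '</table>'
)
doc <- htmlParse(html_txt, asText = TRUE)

# "and": both attribute tests must hold on the same <td>
dates <- xpathSApply(doc, '//td[@class="xl6521398" and @height="20"]', xmlValue)

# "|": union of two node sets, returned in document order
both <- xpathSApply(
  doc,
  '//td[@class="xl6521398" and @height="20"] | //td[@class="xl7321398"]',
  xmlValue
)

# Drop the empty cell that shares the flow class
both <- both[nzchar(both)]
```

Here the height test excludes the second row's date cell, while the union still picks up every cell carrying the flow class, which is why the empty string has to be filtered out afterwards.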


agstudy Nov 19 '14 at 23:15


Martin Morgan · Accepted Answer · 2014-11-20T02:23:13+0000

I read in the data

 url = "http://floodobservatory.colorado.edu/SiteDisplays/1544data.htm"
 html = htmlParse(url)

then queried for the table rows that contain both cell classes of interest, taking the first or second cell of each row

 query = "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]/td[1]"
 dates = xpathSApply(html, query, xmlValue)
 query = "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]/td[2]"
 flows = xpathSApply(html, query, xmlValue)

I think you want

 > df = data.frame(dates=as.Date(dates, "%e-%b-%y"), flows=as.integer(flows))
 > nrow(df)
 [1] 5808
 > head(df, 3)
      dates flows
 1 1-Jan-98  1258
 2 2-Jan-98  1584
 3 3-Jan-98  1272
 > tail(df, 3)
          dates flows
 5806 23-Nov-13  2878
 5807 24-Nov-13  2852
 5808 25-Nov-13  2738
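To see why filtering on the row keeps the two columns aligned, here is a minimal sketch on a made-up three-row fragment (the values are illustrative, and the third row stands in for a hypothetical summary row that reuses the date class). It assumes the XML package is installed. It uses "%d" rather than "%e" for the day, and month abbreviations via "%b" depend on the current locale.

```r
library(XML)

html_txt <- paste0(
  '<table>',
  # data rows: date cell and flow cell in the same <tr>
  '<tr><td class="xl6521398">1-Jan-98</td><td class="xl7321398">1258</td></tr>',
  '<tr><td class="xl6521398">2-Jan-98</td><td class="xl7321398">1584</td></tr>',
  # hypothetical summary row: reuses the date class but has no flow cell
  '<tr><td class="xl6521398">31-Mar-98</td><td class="xl8121398">0.80</td></tr>',
  '</table>'
)
html <- htmlParse(html_txt, asText = TRUE)

# Keep only rows that contain BOTH classes, then take cell 1 and cell 2
row_pred <- "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]"
dates <- xpathSApply(html, paste0(row_pred, "/td[1]"), xmlValue)
flows <- xpathSApply(html, paste0(row_pred, "/td[2]"), xmlValue)

# For comparison: matching on the class alone also returns the summary row
naive_dates <- xpathSApply(html, "//td[@class='xl6521398']", xmlValue)

# Dates become NA if the locale's month abbreviations are not English
df <- data.frame(dates = as.Date(dates, "%d-%b-%y"), flows = as.integer(flows))
```

The class-only query returns three dates while the row-filtered queries return two dates and two flows, so the row predicate is what removes the stray "31-Mar-98" style entries.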

I guess the secret was to select the rows that contain both columns of interest, so that the data stay grouped (though perhaps these classes were generated by the spreadsheet used to create the web page and have nothing to do with the semantic meaning of the data?). A more "complete" scrape might create a node set of rows and then query each row for the (sometimes several) columns marked with the class of interest, for example

 query = "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]"
 nodes = getNodeSet(html, query)
 date = lapply(nodes, xpathSApply, "./td[@class='xl6521398']", xmlValue)
 flow = lapply(nodes, xpathSApply, "./td[@class='xl7321398']", xmlValue)

The date and flow elements are aligned row by row, but there can be several flow measurements per day.

 > head(flow, 3)
 [[1]]
 [1] "1258" ""     "1799" "2621" "1258"
 [[2]]
 [1] "1584" ""     "1550" "2033" "978"
 [[3]]
 [1] "1272" ""     "1104" "3515" "233"
 > table(sapply(flow, length))
    2    3    5
 5577   15  216
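One way to reduce the per-row lists to a single value per day is sketched below on a made-up two-row fragment. The choice of "first non-empty flow cell" is only an assumption for illustration, since the answer does not say which of the several measurements is the one wanted. It assumes the XML package is installed.

```r
library(XML)

# Made-up fragment: rows carry a date cell and a varying number of flow cells
html_txt <- paste0(
  '<table>',
  '<tr><td class="xl6521398">1-Jan-98</td><td class="xl7321398">1258</td>',
  '<td class="xl7321398"></td><td class="xl7321398">1799</td></tr>',
  '<tr><td class="xl6521398">2-Jan-98</td><td class="xl7321398">1584</td>',
  '<td class="xl7321398"></td></tr>',
  '</table>'
)
html <- htmlParse(html_txt, asText = TRUE)

query <- "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]"
nodes <- getNodeSet(html, query)

# Query each row node separately so date and flow stay paired
date <- sapply(nodes, function(row) {
  xpathSApply(row, "./td[@class='xl6521398']", xmlValue)[1]
})
flow <- lapply(nodes, xpathSApply, "./td[@class='xl7321398']", xmlValue)

# Assumption: the first non-empty flow cell in a row is the value of interest
first_flow <- sapply(flow, function(x) {
  x <- x[nzchar(x)]
  if (length(x) > 0) x[1] else NA_character_
})

# Number of flow cells found per row, as in table(sapply(flow, length))
n_cells <- sapply(flow, length)
```

Because every query is relative to one row node, the vectors date and first_flow have one entry per row and can be combined directly into a data frame.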

So, I think this is for the Blue Nile, in Sudan; neat.

 url = "http://floodobservatory.colorado.edu/SiteDisplays/Summary5.htm"
 sites = htmlParse(url)
 > sites["//tr[./td[1] = '1544']"]
 [[1]]
 <tr height="17" style="height:12.75pt"><td height="17" class="xl7226158" style="height:12.75pt">1544</td>&#13;
 <td class="xl6926158"/>&#13;
 <td class="xl7026158">13.0940</td>&#13;
 <td class="xl7026158">33.9750</td>&#13;
 <td class="xl6926158">5070</td>&#13;
 <td class="xl6926158">Blue Nile</td>&#13;
 <td class="xl6926158">Sudan</td>&#13;
 <td class="xl6926158">2</td>&#13;
 <td class="xl6926158">2</td>&#13;
 <td class="xl7926158">173%</td>&#13;
 <td class="xl8226158">15.88</td>&#13;
 <td class="xl7126158">19-Nov-14</td>&#13;
 <td class="xl7126158"/>&#13;
 </tr>
 attr(,"class")
 [1] "XMLNodeSet"
