Scraping a complex HTML table into a data.frame in R

I am trying to load Wikipedia data on the justices of the US Supreme Court into R:

    library(rvest)
    html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
    judges = html_table(html_nodes(html, "table")[[2]])
    head(judges[,2])
    [1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"
    [3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
    [5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"

The problem is that the data is distorted. Instead of the name appearing the way I see it in the rendered HTML table (“James Wilson”), it appears twice: once as “Last name, First name” and then again as “First name Last name”.

The reason is that each cell actually contains an invisible span:

    <td style="text-align:left;" class="">
      <span style="display:none" class="">Wilson, James</span>
      <a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a>
    </td>

The same is true for columns with numeric data. I assume this extra markup is needed to make the HTML table sortable. However, I do not understand how to remove these spans when creating a data.frame from the table in R.

2 answers

Maybe this: remove the hidden span nodes before calling html_table.

    library(XML)
    library(rvest)
    html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
    judges = html_table(html_nodes(html, "table")[[2]])
    head(judges[,2])
    # [1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"
    # [3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
    # [5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"

    # drop the hidden sort-key spans from the second column, then rebuild the table
    removeNodes(getNodeSet(html, "//table/tr/td[2]/span"))
    judges = html_table(html_nodes(html, "table")[[2]])
    head(judges[,2])
    # [1] "James Wilson"    "John Jay†"       "William Cushing"
    # [4] "John Blair, Jr." "John Rutledge"   "James Iredell"
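The same idea can be tried offline on a small snippet. The sketch below assumes a recent rvest built on xml2, where `read_html()` stands in for `html()` and `xml2::xml_remove()` plays the role of `removeNodes()`. The inline table is made-up sample data that only mimics the Wikipedia markup, and the XPath that matches on `display:none` is one possible way to find the hidden spans, not the only one.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the Wikipedia table (same span + link pattern)
page <- read_html('
<table>
  <tr><th>Name</th><th>State</th></tr>
  <tr><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td><td>PA</td></tr>
  <tr><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td><td>NY</td></tr>
</table>')

# Locate the hidden sort-key spans and remove them from the parsed document
hidden <- xml_find_all(page, "//td/span[contains(@style, 'display:none')]")
xml_remove(hidden)

# Convert the cleaned table; only the visible names remain
judges <- html_table(html_nodes(page, "table")[[1]])
judges$Name
```

Because the spans are removed from the parsed document itself, every column that carries a hidden sort key is cleaned in one pass, not just the name column.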

You can use rvest with a CSS selector:

    library(rvest)
    html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States") %>%
      html_nodes("span+ a") %>%
      html_text()

This is not ideal, so you may need to refine the CSS selector, but it gets pretty close.
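One possible refinement, sketched on made-up inline HTML rather than the live page: scoping the selector to direct children of table cells (`td > span + a`) skips span-plus-link pairs elsewhere in the document. The snippet and the stray "See also" link are hypothetical, `read_html()` is assumed in place of `html()`, and the real page may need a tighter selector still.

```r
library(rvest)

# Hypothetical page: an unrelated span + link pair, then the table of interest
page <- read_html('
<p><span>See also</span><a href="/wiki/Other">Other page</a></p>
<table>
  <tr><th>Name</th></tr>
  <tr><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td></tr>
  <tr><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td></tr>
</table>')

# The broad selector also picks up the link outside the table
all_links <- page %>% html_nodes("span + a") %>% html_text()

# Restricting to spans that are direct children of a cell keeps only the names
names_only <- page %>% html_nodes("td > span + a") %>% html_text()
```

Note that this returns a character vector of names rather than a data.frame, so it suits cases where only one column is needed.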


Source: https://habr.com/ru/post/980713/

