I am trying to load Wikipedia data on the justices of the US Supreme Court into R:
    library(rvest)

    # read_html() replaces the html() function from older rvest releases
    html <- read_html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
    judges <- html_table(html_nodes(html, "table")[[2]])
    # [[2]] returns the second column as a plain vector
    head(judges[[2]])
    [1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"
    [3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
    [5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"
The problem is that the data is distorted. Instead of the name appearing the way I see it in the rendered HTML table (“James Wilson”), it appears twice: once as “Last name, First name” and then again as “First name Last name”.
The reason is that each cell actually contains an invisible span:
    <td style="text-align:left;" class="">
      <span style="display:none" class="">Wilson, James</span>
      <a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a>
    </td>
The same is true for the columns with numeric data. I assume this extra markup is needed to sort the HTML table. However, I do not understand how to remove these spans when creating a data.frame from the table in R.
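To make the goal concrete, here is a minimal sketch of one possible approach: delete the hidden spans from the parsed document before calling `html_table()`. It assumes the xml2 package is installed alongside rvest and that the hidden spans can be matched by a `style` attribute containing `display:none`. The inline `snippet` is a made-up stand-in for the Wikipedia table so the example runs offline, and the names `snippet`, `page`, and `hidden` are hypothetical.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the Wikipedia table, mirroring the cell markup above.
snippet <- '
<table>
  <tr><th>Name</th></tr>
  <tr><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td></tr>
  <tr><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td></tr>
</table>'

page <- read_html(snippet)

# Assumption: every sort-key span carries style="display:none" exactly as in the
# excerpt; a page that writes "display: none" (with a space) would need a different test.
hidden <- xml_find_all(page, "//span[contains(@style, 'display:none')]")

# xml_remove() modifies the parsed document in place, so the spans are gone
# before the table is converted.
xml_remove(hidden)

judges <- html_table(html_nodes(page, "table")[[1]])
judges$Name
```

On the real page the same two steps (`xml_find_all()` followed by `xml_remove()`) would be applied to the document returned by `read_html()` on the Wikipedia URL. Whether every hidden span there uses this exact style string is something I have not verified.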