Scraping an HTML table with a span using rvest

Question

I use rvest to retrieve the table on the following page:

https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin

The following code works:

    library(rvest)

    URL <- 'https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin'

    table <- URL %>%
      read_html %>%
      html_nodes("table") %>%
      .[[2]] %>%
      html_table(trim=TRUE)

but the margin columns and the names of the presidents contain some strange values. The reason is that the page source contains the following:

 <td><span style="display:none">00.001</span>−10.44%</td>

so instead of getting -10.44% I get 00.001 ~ 10.44%

How can I fix this?

html-table r web-scraping rvest

user2246905 Mar 01 '16 at 18:30

1 answer

Jota · Accepted Answer · 2016-03-01T21:59:18+0000

One option is to target and replace the problem columns individually.

The margin columns can be targeted with xpath:

    # get the html
    html <- URL %>%
      read_html()

    # Example using the first margin column (column # 6)
    html %>%
      html_nodes(xpath = '//table[2]') %>%       # get table 2
      html_nodes(xpath = '//td[6]/text()') %>%   # get column 6 using text()
      iconv("UTF-8", "UTF-8")                    # to convert "âˆ'" to "-"
    # [1] "−10.44%" "−3.00%"  "−0.83%"  "−0.51%"  "0.09%"   "0.17%"   "0.57%"
    # [8] "0.70%"   "1.45%"   "2.06%"   "2.46%"   "3.01%"   "3.12%"   "3.86%"
    #[15] "4.31%"   "4.48%"   "4.79%"   "5.32%"   "5.56%"   "6.05%"   "6.12%"
    #[22] "6.95%"   "7.27%"   "7.50%"   "7.72%"   "8.51%"   "8.53%"   "9.74%"
    #[29] "9.96%"   "10.08%"  "10.13%"  "10.85%"  "11.80%"  "12.20%"  "12.25%"
    #[36] "14.20%"  "14.44%"  "15.40%"  "17.41%"  "17.76%"  "17.81%"  "18.21%"
    #[43] "18.83%"  "22.58%"  "23.15%"  "24.26%"  "25.22%"  "26.17%"

Do the same for the other margin column. I used iconv to convert the garbled âˆ' to a minus sign, since this is an encoding problem, but you could use a replacement-based solution instead (e.g. using sub).
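The replacement-based alternative could look roughly like the sketch below. It assumes the scraped values arrive with a proper Unicode minus sign (U+2212), as in the output above; the sample vector and the variable names are made up for illustration, and the conversion to numeric is an extra step that the answer itself does not perform.

```r
# Sample values (assumed; a few entries copied from the output shown above)
margins <- c("\u221210.44%", "\u22123.00%", "0.09%", "26.17%")

# Replace the Unicode minus sign (U+2212) with an ASCII hyphen-minus
margins_ascii <- sub("\u2212", "-", margins, fixed = TRUE)

# Drop the percent sign and convert to numeric
margins_num <- as.numeric(sub("%", "", margins_ascii, fixed = TRUE))
```

With an ASCII hyphen in place, as.numeric() can parse the negative values; with the Unicode minus sign left in, it would return NA for them.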

To target the column with the names of the presidents, you can use xpath again:

    html %>%
      html_nodes(xpath = '//table[2]') %>%
      html_nodes(xpath = '//td[3]/a/text()') %>%
      html_text()
    # [1] "John Quincy Adams"      "Rutherford Hayes"       "Benjamin Harrison"
    # [4] "George W. Bush"         "James Garfield"         "John Kennedy"
    # [7] "Grover Cleveland"       "Richard Nixon"          "James Polk"
    #[10] "Jimmy Carter"           "George W. Bush"         "Grover Cleveland"
    #[13] "Woodrow Wilson"         "Barack Obama"           "William McKinley"
    #[16] "Harry Truman"           "Zachary Taylor"         "Ulysses Grant"
    #[19] "Bill Clinton"           "William Henry Harrison" "William McKinley"
    #[22] "Franklin Pierce"        "Barack Obama"           "Franklin Roosevelt"
    #[25] "George HW Bush"         "Bill Clinton"           "William Taft"
    #[28] "Ronald Reagan"          "Franklin Roosevelt"     "Abraham Lincoln"
    #[31] "Abraham Lincoln"        "Dwight Eisenhower"      "Ulysses Grant"
    #[34] "James Buchanan"         "Andrew Jackson"         "Martin Van Buren"
    #[37] "Woodrow Wilson"         "Dwight Eisenhower"      "Herbert Hoover"
    #[40] "Franklin Roosevelt"     "Andrew Jackson"         "Ronald Reagan"
    #[43] "Theodore Roosevelt"     "Lyndon Johnson"         "Richard Nixon"
    #[46] "Franklin Roosevelt"     "Calvin Coolidge"        "Warren Harding"
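A different approach, not the one used in this answer, is to delete the hidden spans from the parsed document before calling html_table, so that the whole table can be converted in one go. The sketch below is a minimal illustration under stated assumptions: the two-row table is hypothetical and only mirrors the markup quoted in the question, an ASCII hyphen stands in for the page's minus sign to keep encoding out of the picture, and it assumes the hidden spans can be recognised by an inline style containing display:none.

```r
library(rvest)
library(xml2)

# Hypothetical table mirroring the markup quoted in the question
page <- read_html(paste0(
  '<table>',
  '<tr><th>President</th><th>Margin</th></tr>',
  '<tr><td>John Quincy Adams</td>',
  '<td><span style="display:none">00.001</span>-10.44%</td></tr>',
  '</table>'
))

# Cell text before cleanup: the hidden sort key is glued onto the value
before <- html_text(html_nodes(page, xpath = "//td[2]"))

# Find every span hidden via an inline style and remove it from the document
hidden <- xml_find_all(page, "//span[contains(@style, 'display:none')]")
xml_remove(hidden)

# html_table() now sees only the visible text
cleaned <- html_table(html_nodes(page, "table")[[1]])
```

Note that xml_remove modifies the parsed document in place, so re-read the page if the original markup is still needed afterwards.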
