Reading an HTML table into a data frame

I'm trying to scrape tables from Wikipedia and I'm stuck. As an example, I'm using the teams of the 2014 World Cup. I want to extract the list of participating countries from the table of contents of the 2014 FIFA World Cup squads page and save it as a vector. Here is how far I got:

library(tidyverse)
library(rvest)
library(XML)
library(RCurl)

(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>% 
  html_node(xpath = '//*[@id="toc"]/ul') %>% 
  htmlTreeParse() %>%
  xmlRoot())

This prints a bunch of HTML that I won't copy and paste here. Specifically, I want to extract the text of every element tagged <span class="toctext">, such as Group A, Brazil, Cameroon, etc., and save it as a vector. What function can do this?

1 answer

First, read the text of the node with html_text():

library(rvest)

url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
toc <- url %>%
    read_html() %>%
    html_node(xpath = '//*[@id="toc"]') %>%
    html_text()

This returns a single string. Split it on \n and drop the empty elements:

contents <- strsplit(toc, "\n")[[1]]

contents[contents != ""]

# [1] "Contents"                                   "1 Group A"                                  "1.1 Brazil"                                
# [4] "1.2 Cameroon"                               "1.3 Croatia"                                "1.4 Mexico"                                
# [7] "2 Group B"                                  "2.1 Australia"                              "2.2 Chile"                                 
# [10] "2.3 Netherlands"                            "2.4 Spain"                                  "3 Group C"                                 
# [13] "3.1 Colombia"                               "3.2 Greece"                                 "3.3 Ivory Coast"                           
# [16] "3.4 Japan"                                  "4 Group D"                                  "4.1 Costa Rica"                            
# [19] "4.2 England"                                "4.3 Italy"                                  "4.4 Uruguay"                               
# ---
# etc
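The vector above still carries the section numbers and the "Contents" heading. As an alternative sketch (not the approach shown above), the <span class="toctext"> elements named in the question can be selected directly with a CSS selector. The inline HTML below is a hypothetical stand-in for the page's table of contents, so the example runs without a network connection; the real page's markup may differ.

```r
library(rvest)

# Hypothetical stand-in for the page's table of contents, modelled on the
# <span class="toctext"> elements mentioned in the question.
toc_html <- '
<div id="toc">
  <ul>
    <li><a href="#Group_A"><span class="tocnumber">1</span> <span class="toctext">Group A</span></a>
      <ul>
        <li><a href="#Brazil"><span class="tocnumber">1.1</span> <span class="toctext">Brazil</span></a></li>
        <li><a href="#Cameroon"><span class="tocnumber">1.2</span> <span class="toctext">Cameroon</span></a></li>
      </ul>
    </li>
  </ul>
</div>'

entries <- read_html(toc_html) %>%
  html_nodes("#toc .toctext") %>%  # one node per table-of-contents entry
  html_text()                      # character vector of the entry labels

# Drop the "Group X" headings so that only the country names remain
countries <- entries[!grepl("^Group ", entries)]
```

Selecting on the class avoids the string splitting step, at the cost of depending on the class name staying the same on the live page.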

By the way, you can read all the HTML tables on the page with html_table():

url %>% 
    read_html() %>%
    html_table()
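Called on a whole document, html_table() returns a list with one element per table, so a single table still has to be picked out by position. A minimal sketch, using a made-up inline table (the column names and values are illustrative only, not taken from the Wikipedia page):

```r
library(rvest)

# Hypothetical two-row table; the header row uses <th> cells
page <- read_html('
<table>
  <tr><th>Player</th><th>Caps</th></tr>
  <tr><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>25</td></tr>
</table>')

tables <- html_table(page)  # list of tables found in the document
first  <- tables[[1]]       # pick one table by position
```

On a real page it is usually worth checking length(tables) and inspecting a few elements before settling on an index.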

Source: https://habr.com/ru/post/1682368/

