I'm trying to scrape tables from Wikipedia and I'm stuck. As an example, I'm using the teams of the 2014 World Cup. I want to extract the list of participating countries from the table of contents of the 2014 FIFA World Cup squads page and save them as a vector. Here is how far I got:
library(tidyverse)
library(rvest)
library(XML)
library(RCurl)
(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>%
  html_node(xpath = '//*[@id="toc"]/ul') %>%
  htmlTreeParse() %>%
  xmlRoot())
This prints a bunch of HTML that I won’t copy and paste here. Specifically, I want to extract the text of every <span class="toctext"> element, such as Group A, Brazil, Cameroon, etc., and save it as a vector. What function can do that?
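For reference, here is a minimal sketch of the kind of result I am after. It uses only rvest and runs on a small inline HTML snippet instead of the live page. The snippet `toc_html` and the names `entries` and `countries` are made up for illustration, and it assumes the page's table of contents really uses `id="toc"` and `<span class="toctext">` as described above.

```r
library(rvest)

# Hypothetical stand-in for the table of contents, trimmed for illustration.
# The real markup on the Wikipedia page may differ.
toc_html <- '<div id="toc"><ul>
  <li><a href="#Group_A"><span class="tocnumber">1</span> <span class="toctext">Group A</span></a>
    <ul>
      <li><a href="#Brazil"><span class="tocnumber">1.1</span> <span class="toctext">Brazil</span></a></li>
      <li><a href="#Cameroon"><span class="tocnumber">1.2</span> <span class="toctext">Cameroon</span></a></li>
    </ul>
  </li>
</ul></div>'

# read_html() accepts a literal HTML string as well as a URL.
page <- read_html(toc_html)

# Select every <span class="toctext"> under the TOC and pull out its text.
entries <- page %>%
  html_nodes(xpath = '//*[@id="toc"]//span[@class="toctext"]') %>%
  html_text()

# Drop the "Group X" headings so only country names remain.
countries <- entries[!grepl("^Group ", entries)]

print(entries)    # "Group A" "Brazil" "Cameroon"
print(countries)  # "Brazil" "Cameroon"
```

Against the real page, `toc_html` would presumably be replaced by the URL from my attempt above, and any non-country entries other than the group headings would need to be filtered out as well.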