R and readLines of webpage text

Question


I want to create a single data frame from the following website: http://www.arrs.net/MaraList/ML_2014.htm

Unfortunately, I'm not sure how to use what look like tab delimiters to create data columns. The code below reads the page and produces a few lines of characters, but I can't work out how to keep names that contain several words together in one column, as shown on the site.

    library(XML)
    url<-"http://www.arrs.net/MaraList/ML_2014.htm"
    data<-readLines(url)
    data<-sub("</FONT></b><FONT SIZE=\"2\" <FONT COLOR=\"#00000\" FACE=\"Courier New, Courier\">","",data)
    data<-sub("<B><FONT COLOR=\"#0066FF\" FACE=\"Arial\">","",data)
    data<-read.table(textConnection(data),stringsAsFactors=FALSE)
    data<-data[11:40000,1]

I'm not sure whether any of my current code gets me there. Any pointers or links to previous posts would be appreciated.


r readlines

Lebeauski Nov 26 '14 at 19:11


1 answer

Tyler Rinker · Accepted Answer · 2014-11-26T20:09:42+0000

Here's one approach to reading this in, using two packages I maintain (qdapTools and qdapRegex) and the awesome splitstackshape package. You will need the development version of qdapTools.

    devtools::install_github("trinker/qdapTools")
    library(qdapTools); library(qdapRegex); library(splitstackshape)

    url <- "http://www.arrs.net/MaraList/ML_2014.htm"
    m <- readLines(url)[-c(1:7, 2760:2767)]

    ## Split into lists by country
    x <- loc_split(m, unique(grep("<B><FONT", m)))

    ## Clean up country names
    nms <- rm_angle(sapply(x, `[`, 1))

    ## Remove html country name from data and convert to a data.frame
    dat <- list2df(setNames(lapply(x, `[`, -1), nms), "dats", "Country")[, 2:1]

    ## Use hand parsing technique to locate widths
    ## I added a # before each column in row one of data
    ## gregexpr tells us the location of the # characters
    det <- "AAR #26#Jan #King George Island # #27+25 #White Continent #4:03:30 #Steve Hibbs (USA) #4:13:02 #Suzy Seeley (54,TX/USA) "
    widths <- gregexpr("#", det)[[1]]

    ## Replace those positions with a # character, as it does not occur anywhere else in the data set
    for (i in widths){
        substring(dat[["dats"]], i, i) <- "#"
    }

    ## Split columns on # character
    out <- cSplit(dat, 2, sep="#")
    out
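The hand-marking idea can also be tried in base R alone. The following is only a minimal sketch of that technique on made-up data: the two toy lines, the `marked` template, and the column names are all assumptions for illustration and are not taken from the website. One line is copied and the blank just before each column is replaced with `#`; `gregexpr` then reports where to cut, and `substring` cuts every line at those same positions.

```r
# Toy fixed-width lines (hypothetical; not real rows from the site)
lines <- c(
  "AAA 01 Jan Springfield    2:10:05",
  "BBB 15 Feb Shelbyville    2:11:40"
)

# Copy of the first line with the blank before each column replaced by "#"
marked <- "AAA#01#Jan#Springfield   #2:10:05"

# Positions of the markers, i.e. the separator column before each field
breaks <- as.integer(gregexpr("#", marked, fixed = TRUE)[[1]])

# Each field runs from just after one marker to just before the next
starts <- c(1, breaks + 1)
ends   <- c(breaks - 1, max(nchar(lines)))

# Cut every line at the same positions and trim the padding
fields <- lapply(lines, function(ln) trimws(substring(ln, starts, ends)))

# Bind the pieces into a data frame; the column names are illustrative
out <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(out) <- c("code", "day", "month", "place", "time")
out
```

Because the cuts are made by position rather than on whitespace, a multi-word value stays in a single column as long as every line follows the same layout.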
