Get a directory listing from a website as an R vector using RCurl

I am trying to get a list of the files in a directory on a website. Is there a way to do this similarly to dir() or list.files() for a local directory? I can connect to the site using RCurl (which I need because the connection is SSL over HTTPS):

library(RCurl)
text = getURL(*some https website*, ssl.verifypeer = FALSE, dirlistonly = TRUE)

But this returns the raw HTML of the page, with images, hyperlinks, etc. wrapped around the file list, whereas I just need a character vector in R like the one dir() gives. Is that possible? Or will I need to parse the HTML to extract the file names? That sounds like a convoluted approach for a simple problem.

Thanks,

EDIT: if you can get it to work with http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/ then you will see what I mean.
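For reference, a minimal sketch of the HTML-parsing route mentioned above, using the URL from the EDIT. It assumes the XML package is installed, and the final filter (dropping sub-directory, sort and absolute links) is only a heuristic for Apache-style index pages, not a general solution:

 library(RCurl)
 library(XML)

 url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/"
 txt <- getURL(url, ssl.verifypeer = FALSE)

 # Parse the returned index page and pull out all href attributes
 doc   <- htmlParse(txt, asText = TRUE)
 links <- getHTMLLinks(doc)

 # Heuristic: drop sub-directories ("name/"), sort links ("?C=N;O=D") and
 # absolute links ("/goldenPath/..."), keeping plain file names only
 files <- links[!grepl("/$|^\\?|^/", links)]
 head(files)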

3 answers

This is the last example in the help file for getURL (with updated URL):

 url <- 'ftp://speedtest.tele2.net/'
 filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
 # Deal with newlines as \n or \r\n. (BDR)
 # Or alternatively, instruct libcurl to change \n to \r\n for us with crlf = TRUE
 # filenames = getURL(url, ftp.use.epsv = FALSE, ftplistonly = TRUE, crlf = TRUE)
 filenames = paste(url, strsplit(filenames, "\r*\n")[[1]], sep = "")
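Wrapped into a dir()-style helper, that could look roughly like this (a sketch only; the name ftp_dir is made up here, and it only applies to FTP listings):

 library(RCurl)

 ftp_dir <- function(url) {
   listing <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
   strsplit(listing, "\r*\n")[[1]]   # one file name per element, like dir()
 }

 ftp_dir('ftp://speedtest.tele2.net/')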

Does this solve your problem?


Try the following:

  library(RCurl)

  dir_list <- read.table(
    textConnection(getURLContent("ftp://[...]/")),
    sep = "", strip.white = TRUE)

The resulting table splits the date across three text columns, but it is a good start, and you can pull the file names out of it.
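Pulling just the names out of such a table might look like this (a sketch; it assumes a Unix-style listing where every row has the same number of columns, the file name is the last column, and names contain no spaces, and it borrows the test FTP server from the previous answer since the URL above is elided):

  library(RCurl)

  listing <- getURLContent("ftp://speedtest.tele2.net/")
  dir_tab <- read.table(textConnection(listing), sep = "", strip.white = TRUE,
                        stringsAsFactors = FALSE)
  filenames <- dir_tab[[ncol(dir_tab)]]   # last column holds the names
  filenames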


I was reading the RCurl documentation and came across the following piece of code:

 stockReader = function() {
   values <- numeric()  # to which the data is appended when received

   # Function that appends the values to the centrally stored vector
   read = function(chunk) {
     con = textConnection(chunk)
     on.exit(close(con))
     tmp = scan(con)
     values <<- c(values, tmp)
   }

   list(read = read,
        values = function() values  # accessor to get result on completion
   )
 }

followed by

 reader = stockReader()
 getURL('http://www.omegahat.org/RCurl/stockExample.dat', write = reader$read)
 reader$values()

It says "numeric" in the example, but of course this sample code can be adapted. Read the linked document; I am sure you will find what you are looking for.

It also says

The basic use of getURL(), getForm() and postForm() returns the contents of the requested document as a single block of text. It is accumulated by libcurl and merged into a single string. We then typically traverse the contents of the document to extract the information into regular data, e.g. vectors and data frames. For example, suppose the document we requested is a simple stream of numbers, such as prices of a particular stock at different points in time. We would download the contents of the file and then read it into a vector in R so that we can analyze the values. Unfortunately, this means there are essentially two copies of the data in memory at the same time. This may be prohibitive, or at least undesirable, for large datasets.

An alternative approach is to process the data in chunks as it is received by libcurl. If we can be notified each time libcurl receives data from the response, and do something meaningful with that data, then we do not need to accumulate the chunks. The most additional memory we would need at any one time is for the largest chunk. In our example, we can take each chunk and pass it to the scan() function to turn the values into a vector. We can then concatenate this with the vector from the previously processed chunks.
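Adapted to text rather than numbers, the same pattern might look roughly like this (a sketch; textReader is a made-up name, and note that a chunk boundary can fall in the middle of a line, so real code would need to buffer partial lines):

 library(RCurl)

 textReader = function() {
   lines <- character()   # accumulated lines of text (e.g. file names)
   read = function(chunk) {
     con = textConnection(chunk)
     on.exit(close(con))
     lines <<- c(lines, readLines(con))
   }
   list(read = read,
        lines = function() lines)   # accessor to get the result
 }

 reader = textReader()
 getURL('http://www.omegahat.org/RCurl/stockExample.dat', write = reader$read)
 reader$lines()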



