List and description of all packages in CRAN from R

Question

I can get a list of all available packages using the function:

ap <- available.packages()

But how can I get a description of these packages from R, so that I end up with a data.frame with two columns: package and description?

+7

r

adam.888 Jul 19 '12 at 12:25

3 answers

Dirk provided an answer that is terrific. I only saw it after finishing my own solution, and I debated for a while whether to post mine, for fear of looking stupid. But I decided to post it anyway for two reasons:

It is informative for beginner scrapers like me.
It took me a while, so why not :)

I approached this thinking I would need to do some web scraping, and I chose crantastic as the site to scrape from. I will first provide the code, and then two scraping resources that have been very useful to me while learning:

    library(RCurl)
    library(XML)

    URL <- "http://cran.r-project.org/web/checks/check_summary.html#summary_by_package"
    packs <- na.omit(XML::readHTMLTable(doc = URL, which = 2, header = T,
        strip.white = T, as.is = FALSE, sep = ",",
        na.strings = c("999", "NA", " "))[, 1])

    Trim <- function(x) {
        gsub("^\\s+|\\s+$", "", x)
    }

    packs <- unique(Trim(packs))

    u1 <- "http://crantastic.org/packages/"
    len.samps <- 10  # for demo purpose; use:
    # len.samps <- length(packs)  # for all of them
    URL2 <- paste0(u1, packs[seq_len(len.samps)])

    scraper <- function(urls) {  # function to grab description
        doc <- htmlTreeParse(urls, useInternalNodes = TRUE)
        nodes <- getNodeSet(doc, "//p")[[3]]
        return(nodes)
    }

    info <- sapply(seq_along(URL2), function(i) try(scraper(URL2[i]), TRUE))

    info2 <- sapply(info, function(x) {  # replace errors with NA
        if (class(x)[1] != "XMLInternalElementNode") {
            NA
        } else {
            Trim(gsub("\\s+", " ", xmlValue(x)))
        }
    })

    # make a dataframe of it all
    pack_n_desc <- data.frame(package = packs[seq_len(len.samps)],
        description = info2)

Resources

+7

Tyler Rinker Jul 19 '12 at 15:24

I wanted to try doing this with an HTML scraper (rvest) as an exercise, since available.packages() in the OP does not include package descriptions.

    library('rvest')

    url <- 'https://cloud.r-project.org/web/packages/available_packages_by_name.html'
    webpage <- read_html(url)
    data_html <- html_nodes(webpage, 'tr td')
    length(data_html)

    P1 <- html_nodes(webpage, 'td:nth-child(1)') %>% html_text(trim=TRUE)  # XML: The Package Name
    P2 <- html_nodes(webpage, 'td:nth-child(2)') %>% html_text(trim=TRUE)  # XML: The Description

    P1 <- P1[lengths(P1) > 0 & P1 != ""]  # Remove NULL and empty ("") items
    length(P1); length(P2);

    mdf <- data.frame(P1, P2, row.names=NULL)
    colnames(mdf) <- c("PackageName", "Description")

    # This is the problem! It lists large sets column-by-column,
    # instead of row-by-row. Try with the full list to see what happens.
    print(mdf, right=FALSE, row.names=FALSE)

    # PackageName Description
    # A3          Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels
    # abbyyR      Access to Abbyy Optical Character Recognition (OCR) API
    # abc         Tools for Approximate Bayesian Computation (ABC)
    # abc.data    Data Only: Tools for Approximate Bayesian Computation (ABC)
    # ABC.RAP     Array Based CpG Region Analysis Pipeline
    # ABCanalysis Computed ABC Analysis

    # For small sets we can use either:
    # mdf[1:6,] #or# head(mdf, 6)

However, although this works quite well for a small subset of the data, I ran into a display problem with the full list, where the data is printed column-by-column instead of row-by-row. It would be great to have it paged and correctly formatted in a new window. I tried using page(), but I could not get it to work very well.

EDIT: The recommended method is not the one above; instead, use Dirk's suggestion (from the comments below):

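    db <- tools::CRAN_package_db()
    colnames(db)
    # Select by name rather than by position (originally db[,1] and db[,52]),
    # since column positions are not guaranteed to stay the same
    mdf <- data.frame(db[, "Package"], db[, "Description"])
    colnames(mdf) <- c("Package", "Description")
    print(mdf, right=FALSE, row.names=FALSE)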

However, it still suffers from display issues ...
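One possible way to tame the display is to collapse the line breaks embedded in each description and truncate it to a fixed width before printing. The following is only a sketch on a small hand-made data frame, not the real CRAN database: the `toy` data and the helper name `tidy_for_display` are made up for illustration, and the width of 30 is an arbitrary choice.

```r
# Hypothetical stand-in for the Package/Description columns of the CRAN database
toy <- data.frame(
  Package     = c("abc", "abind"),
  Description = c("Tools for Approximate\nBayesian   Computation (ABC)",
                  "Combine multi-dimensional arrays"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: squeeze runs of whitespace (including newlines) into a
# single space, trim both ends, and cut each description to `width` characters
tidy_for_display <- function(df, width = 30L) {
  desc <- gsub("\\s+", " ", df$Description)
  desc <- trimws(desc)
  df$Description <- strtrim(desc, width)
  df
}

short <- tidy_for_display(toy, width = 30L)
print(short, right = FALSE, row.names = FALSE)
```

For browsing the full table, writing it to a file with write.csv() or opening it with View() in an interactive session may be easier than print().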

+1

not2qubit Sep 7 '18 at 17:42

Dirk Eddelbuettel · Accepted Answer · 2012-07-19T13:21:26+0000

I actually think you want "Package" and "Title", since "Description" can run to several lines. So here is the former; just put "Description" in the final subset if you really want the description:

    R> ## from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted
    R>
    R> require("tools")
    R>
    R> getPackagesWithTitle <- function() {
    +     contrib.url(getOption("repos")["CRAN"], "source")
    +     description <- sprintf("%s/web/packages/packages.rds",
    +                            getOption("repos")["CRAN"])
    +     con <- if(substring(description, 1L, 7L) == "file://") {
    +         file(description, "rb")
    +     } else {
    +         url(description, "rb")
    +     }
    +     on.exit(close(con))
    +     db <- readRDS(gzcon(con))
    +     rownames(db) <- NULL
    +
    +     db[, c("Package", "Title")]
    + }
    R>
    R>
    R> head(getPackagesWithTitle())               # I shortened one Title here...
         Package              Title
    [1,] "abc"                "Tools for Approximate Bayesian Computation (ABC)"
    [2,] "abcdeFBA"           "ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux ..."
    [3,] "abd"                "The Analysis of Biological Data"
    [4,] "abind"              "Combine multi-dimensional arrays"
    [5,] "abn"                "Data Modelling with Additive Bayesian Networks"
    [6,] "AcceptanceSampling" "Creation and evaluation of Acceptance Sampling Plans"
    R>
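Since the question asks for a data.frame, and the printed result above (with its [1,] row labels) is a character matrix, the result can be converted with as.data.frame(). A minimal sketch, using a hand-made matrix as a stand-in for the real result so that it runs without a network connection (the object `m` and its two rows are made up for illustration):

```r
# Hypothetical stand-in for the matrix returned by getPackagesWithTitle()
m <- matrix(
  c("abc",   "Tools for Approximate Bayesian Computation (ABC)",
    "abind", "Combine multi-dimensional arrays"),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("Package", "Title"))
)

# Convert to a data.frame, keeping both columns as character vectors
pkgs <- as.data.frame(m, stringsAsFactors = FALSE)

# Rename the columns to the two names asked for in the question
names(pkgs) <- c("package", "description")

str(pkgs)
```

With the real function, the same conversion would be applied to the value of getPackagesWithTitle() instead of `m`.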
