Improving a function for retrieving news from Google in R

I wrote a function to fetch and parse news from Google Finance for a given stock symbol, but I'm sure there are ways to improve it. First, my function returns an object in a hardcoded time zone rather than the user's current time zone, and it fails if a number greater than 299 is passed (probably because Google returns only 300 stories per symbol). This is somewhat in response to my own question on Stack Overflow and draws heavily on this blog post.

tl;dr: how can I improve this function?

    getNews <- function(symbol, number){
      # Warn about length
      if (number > 300) {
        warning("May only get 300 stories from Google")
      }
      # load libraries
      require(XML); require(plyr); require(stringr)
      require(lubridate); require(xts); require(RDSTK)
      # construct url to news feed rss and encode it correctly
      url.b1 = 'http://www.google.com/finance/company_news?q='
      url = paste(url.b1, symbol, '&output=rss', "&start=", 1,
                  "&num=", number, sep = '')
      url = URLencode(url)
      # parse xml tree, get item nodes, extract data and return data frame
      doc = xmlTreeParse(url, useInternalNodes = TRUE)
      nodes = getNodeSet(doc, "//item")
      mydf = ldply(nodes, as.data.frame(xmlToList))
      # clean up names of data frame
      names(mydf) = str_replace_all(names(mydf), "value\\.", "")
      # convert pubDate to date-time object and convert time zone
      # (note: the Olson name is 'America/New_York', not 'America/New_york')
      pubDate = strptime(mydf$pubDate, format = '%a, %d %b %Y %H:%M:%S',
                         tz = 'GMT')
      pubDate = with_tz(pubDate, tz = 'America/New_York')
      mydf$pubDate = NULL
      # parse the description field
      mydf$description <- as.character(mydf$description)
      parseDescription <- function(x) {
        out <- html2text(x)$text
        out <- strsplit(out, '\n|--')[[1]]
        # find lead (the longest text fragment)
        TextLength <- sapply(out, nchar)
        Lead <- out[TextLength == max(TextLength)]
        # find site
        Site <- out[3]
        # return cleaned fields
        out <- c(Site, Lead)
        names(out) <- c('Site', 'Lead')
        out
      }
      description <- lapply(mydf$description, parseDescription)
      description <- do.call(rbind, description)
      mydf <- cbind(mydf, description)
      # format as an xts object ordered by publication time
      mydf = xts(mydf, order.by = pubDate)
      # drop extra attributes that we don't use yet
      mydf$guid.text = mydf$guid..attrs = mydf$description = mydf$link = NULL
      return(mydf)
    }
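One idea for the hardcoded time zone (a sketch of my own, not tested against the live feed): `Sys.timezone()` reports the system's Olson zone name, so the conversion can follow the user instead of always using Eastern time. The fallback to `"UTC"` is my assumption for platforms where `Sys.timezone()` returns `NA`.

```r
library(lubridate)

# Resolve the user's local Olson time-zone name; fall back to UTC when the
# platform cannot report one (Sys.timezone() may return NA or "").
local_tz <- Sys.timezone()
if (is.na(local_tz) || !nzchar(local_tz)) local_tz <- "UTC"

# Parse an RSS-style timestamp in GMT, then shift the display zone without
# changing the underlying instant.
pubDate   <- strptime("Mon, 06 Feb 2012 14:30:00",
                      format = "%a, %d %b %Y %H:%M:%S", tz = "GMT")
localDate <- with_tz(pubDate, tz = local_tz)
```

Inside `getNews`, this would mean replacing `tz = 'America/New_York'` in the `with_tz` call with `tz = local_tz`.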
1 answer

Here is a shorter (and probably more efficient) version of your getNews function:

    getNews2 <- function(symbol, number){
      # load libraries
      require(XML); require(plyr); require(stringr); require(lubridate)
      # construct url to news feed rss and encode it correctly
      url.b1 = 'http://www.google.com/finance/company_news?q='
      url = paste(url.b1, symbol, '&output=rss', "&start=", 1,
                  "&num=", number, sep = '')
      url = URLencode(url)
      # parse xml tree, get item nodes, extract data and return data frame
      doc = xmlTreeParse(url, useInternalNodes = TRUE)
      nodes = getNodeSet(doc, "//item")
      mydf = ldply(nodes, as.data.frame(xmlToList))
      # clean up names of data frame
      names(mydf) = str_replace_all(names(mydf), "value\\.", "")
      # convert pubDate to date-time object and convert time zone
      mydf$pubDate = strptime(mydf$pubDate, format = '%a, %d %b %Y %H:%M:%S',
                              tz = 'GMT')
      mydf$pubDate = with_tz(mydf$pubDate, tz = 'America/New_York')
      # drop guid.text and guid..attrs
      mydf$guid.text = mydf$guid..attrs = NULL
      return(mydf)
    }
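Neither version gets around the ~300-story cap in a single request. A possible workaround, purely a sketch and untested against the live feed, is to page through results with the `start` parameter; this assumes the feed actually honors `start` offsets, which the original post does not confirm. `fetchChunk` and `getNewsPaged` are hypothetical helpers mirroring the URL construction above.

```r
library(XML)
library(plyr)

# Compute 1-based start offsets for paging through `total` items in chunks.
chunkStarts <- function(total, chunk) seq(1, total, by = chunk)

# Fetch one page of the feed; mirrors the URL construction used above.
fetchChunk <- function(symbol, start, num) {
  url <- URLencode(paste0("http://www.google.com/finance/company_news?q=",
                          symbol, "&output=rss&start=", start, "&num=", num))
  doc <- xmlTreeParse(url, useInternalNodes = TRUE)
  ldply(getNodeSet(doc, "//item"), as.data.frame(xmlToList))
}

# Request `total` stories in pages of `chunk` and row-bind the results,
# letting rbind.fill pad any columns missing from a given page.
getNewsPaged <- function(symbol, total, chunk = 100) {
  pages <- lapply(chunkStarts(total, chunk), function(s)
    fetchChunk(symbol, s, min(chunk, total - s + 1)))
  do.call(rbind.fill, pages)
}
```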

Also, there might be an error in your code: when I tried it with symbol = 'WMT', it returned an error. I think getNews2 works fine for WMT. Check it out and let me know if this works for you.

P.S. The description column still contains HTML code, but it is easy to extract the text from it. I will post an update when I find the time.
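For reference, a minimal sketch of that extraction using a naive tag-stripping regex in base R; it is adequate for the simple markup in the feed but is not a general HTML parser, and the entity handling covers only a couple of common cases.

```r
# Naive HTML-to-text: drop anything between angle brackets and decode a few
# common entities. Good enough for simple feed markup, not arbitrary HTML.
stripHtml <- function(x) {
  out <- gsub("<[^>]+>", "", x)
  out <- gsub("&amp;", "&", out, fixed = TRUE)
  out <- gsub("&nbsp;", " ", out, fixed = TRUE)
  trimws(out)
}

# Usage: mydf$description <- vapply(mydf$description, stripHtml, character(1))
```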


Source: https://habr.com/ru/post/886513/

