Creating a table from web-scraping results using a loop

I'm trying to web-scrape tax-rates.org to get the average property tax percentage for each county in Texas. I have a list of 255 counties in a CSV file that I import as "TX_counties", a one-column table. I need to build a URL for each county as a string, so I set d1 from the first cell using [i, 1], concatenate it into a URL string, run the scrape, and then add +1 to [i], which moves on to the second cell for the next county name, and so on.

The problem is that I can't figure out how to save the scrape results into a "growing list" that I can turn into a table and write out as a CSV file at the end. As it stands, I can only scrape one county at a time, and each result overwrites the previous one.

Any thoughts? (I'm quite new to R and to scraping in general.)

library(XML)         # htmlTreeParse, getNodeSet, xmlValue
library(data.table)

i <- 1               # redundant: the for loop assigns i itself
for (i in 1:255) {

  d1 <- as.character(TX_counties[i,1])

  uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')

  html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)

  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)

  # t1 is overwritten on every pass, so only the last county survives
  t1 <- data.table(d1,avg_taxrate)

  i <- i+1           # redundant: for loops advance i automatically

}

write.csv(t1,"2015_TX_PropertyTaxes.csv")
3 answers

This uses rvest, provides a progress bar, and exploits the fact that the URLs already exist for you on the page:

library(rvest)
library(pbapply)

pg <- read_html("http://www.tax-rates.org/texas/property-tax")

# get all the county tax table links
ctys <- html_nodes(pg, "table.propertyTaxTable > tr > td > a[href*='county_property']")

# match your lowercased names
county_name <- tolower(gsub(" County", "", html_text(ctys)))

# spider each page and return the rate %
county_rate <- pbsapply(html_attr(ctys, "href"), function(URL) {
  cty_pg <- read_html(URL)
  html_text(html_nodes(cty_pg, xpath="//div[@class='box']/div/div[1]/i[1]"))
}, USE.NAMES=FALSE)

tax_table <- data.frame(county_name, county_rate, stringsAsFactors=FALSE)

tax_table
##   county_name              county_rate
## 1    anderson Avg. 1.24% of home value
## 2     andrews Avg. 0.88% of home value
## 3    angelina Avg. 1.35% of home value
## 4     aransas Avg. 1.29% of home value

write.csv(tax_table, "2015_TX_PropertyTaxes.csv")

NOTE 1: I limited the scrape to 4 counties so as not to hammer the bandwidth of a site that gives its data away for free.
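(The limiting step isn't shown in the code above; a hypothetical one-liner placed before the pbsapply() call could be:)

ctys <- ctys[1:4]  # keep only the first 4 county links for a test run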

NOTE 2: Expect 254 results rather than 255; Texas has 254 counties.

An alternative with RCurl and XML, building each county's result inside sapply():
library(RCurl)
library(XML)
tx_c <- c("anderson", "andrews")

res <- sapply(1:2, function(x){
    d1 <- as.character(tx_c[x])
    uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')
    html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)
    avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
    return(c(d1, avg_taxrate))   # one column per county: name, then rate
})

res.df <- data.frame(t(res), stringsAsFactors = FALSE)  # transpose so counties become rows
names(res.df) <- c("county", "property")
res.df
#    county                 property
# 1 anderson Avg. 1.24% of home value
# 2  andrews Avg. 0.88% of home value
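
To run over the full county list instead of the two-county sample, the same pattern scales directly. A minimal sketch, assuming TX_counties is the one-column table from the question (the tolower() call is an assumption about how the names are cased):

county_names <- tolower(as.character(TX_counties[[1]]))  # assumes column 1 holds county names
res <- sapply(seq_along(county_names), function(x){
    d1 <- county_names[x]
    uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')
    html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)
    avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
    c(d1, avg_taxrate)
})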

You must first initialize a list to hold the data scraped on each pass; be sure to initialize it before entering the loop.

Then, on each iteration, append the new result to that list before moving to the next one. See my answer here:

Web scraper in R with loop from data.frame
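
A minimal sketch of that pattern, reusing the XML-based scraping from the question (the column names and the do.call(rbind, ...) step are illustrative choices, not part of the original answer):

library(XML)

results <- vector("list", nrow(TX_counties))  # initialize BEFORE the loop

for (i in 1:nrow(TX_counties)) {
  d1 <- as.character(TX_counties[i, 1])
  uri.seed <- paste0('http://www.tax-rates.org/texas/', d1, '_county_property_tax')
  html <- htmlTreeParse(file = uri.seed, isURL = TRUE, useInternalNodes = TRUE)
  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)

  # store this iteration's result instead of overwriting it
  results[[i]] <- data.frame(county = d1, rate = avg_taxrate, stringsAsFactors = FALSE)
}

tax_table <- do.call(rbind, results)  # combine the list into a single table
write.csv(tax_table, "2015_TX_PropertyTaxes.csv")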
