Scraping a data table that does not exist in the page source

Question

I want to scrape the data table on this website.

I checked the page source, and the table does not exist in it.

Then I checked the network activity while refreshing the page. It seems the data table is retrieved by sending a POST request to this URL:

http://datacenter.mep.gov.cn:8099/ths-report/report!list.action

I then tried sending a POST request myself, but got nothing back except a status of 500.

Is there a way to scrape this table with R?

Thanks.

post r web-crawler web scraping dynamic-websites

rankthefirst Oct 6 '17 at 11:06

1 answer

hrbrmstr · Accepted Answer · 2017-10-06T13:48:21+0000

Good sleuthing!

A GET request worked for me, and that seems to do the trick. The code below also tries to pull out the report URLs so you can pick the target you want:

library(httr)
library(rvest)
library(stringi)

pg <- read_html("http://datacenter.mep.gov.cn/index!MenuAction.action?name=259206fe260c4cf7882462520e1e3ada")

html_nodes(pg, "div[onclick]") %>% 
  html_attr("onclick") %>% 
  stri_replace_first_fixed('load("', "") %>% 
  stri_replace_last_regex('",".*$', "") -> report_urls

head(report_urls)
## [1] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462849093743"
## [2] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462764947052"
## [3] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1465594312346"
## [4] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462844293531"
## [5] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462844935563"
## [6] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462845592195"

rpt_pg <- read_html(report_urls[1])
html_table(rpt_pg)[[2]]
# SO won't let me paste the table
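
To see what the onclick-stripping step above is doing without hitting the live site, here is a minimal offline sketch. The HTML snippet, the example.com URLs, and the load("url","label") handler format are assumptions made up for illustration; the real page may structure its handlers differently.

```r
library(rvest)
library(stringi)

# Hypothetical stand-in for the menu page. It assumes each clickable div
# carries an onclick handler of the form load("<url>","<label>").
sample_html <- '
<html><body>
  <div onclick=\'load("http://example.com/report!list.action?xmlname=111","Report A")\'>Report A</div>
  <div onclick=\'load("http://example.com/report!list.action?xmlname=222","Report B")\'>Report B</div>
  <div>No handler here</div>
</body></html>'

pg <- read_html(sample_html)

# Same pipeline as in the answer: keep only divs that have an onclick
# attribute, then strip the leading load(" and everything from "," onward.
html_nodes(pg, "div[onclick]") %>%
  html_attr("onclick") %>%
  stri_replace_first_fixed('load("', "") %>%
  stri_replace_last_regex('",".*$', "") -> urls

print(urls)
```

The div without an onclick attribute is skipped by the CSS selector, so only the two URLs remain. If the real handlers use a different call name or quoting, the two replacement patterns would need adjusting.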
