Get comments from a site that uses Disqus

I would like to write a scraper script to extract comments from CNN articles. For example, this article: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1

I understand that CNN uses Disqus for its comment threads. Since the comments are not split across separate web pages (i.e. previous page, next page) but are loaded dynamically (i.e. you need to click "load the next 25"), I have no idea how to get all 5000+ comments for this article.

Any ideas or suggestions?

Thank you very much!

2 answers

One option for scraping (other than fetching the page directly), which may be less robust (depending on your needs) but will solve the problem you have, is to use some kind of wrapper around a full-fledged web browser, literally script the usage pattern, and extract the relevant data. Since you did not indicate which programming language you know, I will give 3 examples: 1) Watir - Ruby, 2) WatiN - IE and Firefox via .NET, 3) Selenium - IE via C# / Java / Perl / PHP / Ruby / Python

I will give a small example using WatiN and C#:

    using WatiN.Core;

    IE browser = new IE();
    browser.GoTo("YOUR CNN URL");

    // the list element that holds the currently visible comments
    List visibleComments = browser.List(Find.ById("dsq-comments"));
    // do your scraping thing

    // click the link that loads the next batch of comments
    Link moreComments = browser.Link(Find.ByClass("dsq-paginate-append-text"));
    moreComments.Click();

    // wait until the ajax call has ended by searching for some indicator
    browser.WaitUntilContainsText("SOME TEXT");
    // do your scraping thing

Note: I am not familiar with Disqus, but it is probably best to make all the comments appear first by looping over the link-clicking part of the code I posted until every comment is visible, and then scrape the dsq-comments list element.
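
The loop described in the note could look roughly like the sketch below, written in JavaScript to match the other answer. The `page` object is a hypothetical adapter around whichever browser driver you use; its three methods (`hasMoreLink`, `clickMore`, `readComments`) are assumptions made for illustration and do not belong to any real library.

```javascript
// Sketch: keep clicking "load more" until the link is gone, then return
// everything that is visible. `page` is a hypothetical adapter:
//   hasMoreLink()  -> Promise<boolean>  is the "load more" link still there?
//   clickMore()    -> Promise<void>     click it and wait for the ajax call
//   readComments() -> Promise<Array>    all comments currently in the list
async function loadAllComments(page, maxClicks = 1000) {
  let comments = await page.readComments();
  for (let clicks = 0; clicks < maxClicks && (await page.hasMoreLink()); clicks++) {
    await page.clickMore();
    const updated = await page.readComments();
    // Stop if the click produced nothing new, so a stuck page cannot loop forever.
    if (updated.length <= comments.length) break;
    comments = updated;
  }
  return comments;
}
```

The `maxClicks` cap and the "nothing new arrived" check are there only as safety nets; with 5000+ comments at 25 per click, expect a couple of hundred iterations.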


I needed to get the comments from a page whose Disqus comments are loaded via ajax. Since they were not rendered on the server, I had to call the Disqus API. From the page source you will need the identifier code:

    // take note of this from the page source;
    // this is the thread:ident URL query param in the following JS request
    var identifier = "456643";

Also look in the page's JS source code to get the public key and the forum name. Put them in the URL where necessary.
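
If you want to pull those values out of the downloaded page source automatically, a regular expression is usually enough. The helper below is a minimal sketch: `extractJsString` is a hypothetical name, and it assumes the value appears as a quoted string assigned to a known variable or property name, as in the `identifier` line above. The actual names differ from site to site, so check the page source first.

```javascript
// Sketch: find `name = "value"` or `name: 'value'` in a chunk of page source
// and return the quoted value, or null when the name is not found.
function extractJsString(source, name) {
  // Escape regex metacharacters in the name before building the pattern.
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp("\\b" + escaped + "\\s*[=:]\\s*([\"'])(.*?)\\1");
  const match = pattern.exec(source);
  return match ? match[2] : null;
}

// Usage with the line shown above:
const sample = 'var identifier = "456643" // take note of this';
console.log(extractJsString(sample, "identifier")); // prints 456643
```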

I used JavaScript (Node.js) to verify the API call, i.e.:

    var request = require("request");

    var publicKey = "pILMw27bsbJsdfsdQDh9Eh0MzAgFL6xx0hYdsdsdfaIfBHRvLGqFFQ09st";
    var disqusUri = "https://disqus.com/api/3.0/threads/listPosts.json?api_key=" + publicKey +
        "&thread:ident=456643&forum=nameOfForumFromSource";

    // request calls back with (error, response, body)
    request(disqusUri, function (err, res, body) {
        if (err) {
            console.log("ERR: " + err);
            return;
        }
        console.log(body);
    });
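
A single call returns only one batch of posts, so getting all 5000+ comments means repeating the request. The sketch below assumes, from my recollection of the Disqus API and worth checking against its documentation, that each response carries a `cursor` object with `hasNext` and `next` fields, that `cursor` and `limit` are accepted as query parameters, and that 100 is the largest allowed `limit`. `buildListPostsUrl` and `fetchAllPosts` are hypothetical helper names, and the default fetcher relies on the global `fetch` available in Node 18+.

```javascript
// Sketch: build the listPosts URL, optionally continuing from a cursor.
function buildListPostsUrl({ apiKey, forum, ident, cursor, limit = 100 }) {
  let url = "https://disqus.com/api/3.0/threads/listPosts.json" +
    "?api_key=" + encodeURIComponent(apiKey) +
    "&forum=" + encodeURIComponent(forum) +
    "&thread:ident=" + encodeURIComponent(ident) +
    "&limit=" + limit;
  if (cursor) url += "&cursor=" + encodeURIComponent(cursor);
  return url;
}

// Sketch: follow the cursor until the API reports there is no next page.
// `fetchJson` takes a URL and resolves to the parsed JSON body.
async function fetchAllPosts(params, fetchJson = (url) => fetch(url).then((r) => r.json())) {
  const posts = [];
  let cursor = null;
  for (;;) {
    const data = await fetchJson(buildListPostsUrl({ ...params, cursor }));
    if (!Array.isArray(data.response)) {
      // Error replies carry a message instead of a list of posts.
      throw new Error("Unexpected Disqus response: " + JSON.stringify(data.response));
    }
    posts.push(...data.response);
    if (!data.cursor || !data.cursor.hasNext) break;
    cursor = data.cursor.next;
  }
  return posts;
}
```

Calling `fetchAllPosts({ apiKey: publicKey, forum: "nameOfForumFromSource", ident: "456643" })` would then resolve to one array with every post in the thread. Keep in mind that the API is rate limited, so a very large thread may need a pause between requests.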

Source: https://habr.com/ru/post/1391988/

