How do you scrape AJAX pages?

Please advise how to scrape AJAX pages.

+51
ajax web-scraping
Nov 04 '08 at 1:25
10 answers

Overview:

All screen scraping first requires a manual review of the page you want to extract resources from. When dealing with AJAX, you usually need to analyze a bit more than just the HTML.

With AJAX, the value you want is not in the initial HTML document that you requested; instead, JavaScript is executed that asks the server for the extra information.

So you can usually just analyze the JavaScript, see which request it makes, and call that URL directly from the start.




Example:

As an example, assume the page you want to scrape has the following script:

    <script type="text/javascript">
    function ajaxFunction()
    {
      var xmlHttp;
      try
      {
        // Firefox, Opera 8.0+, Safari
        xmlHttp = new XMLHttpRequest();
      }
      catch (e)
      {
        // Internet Explorer
        try
        {
          xmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
        }
        catch (e)
        {
          try
          {
            xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
          }
          catch (e)
          {
            alert("Your browser does not support AJAX!");
            return false;
          }
        }
      }
      xmlHttp.onreadystatechange = function()
      {
        if (xmlHttp.readyState == 4)
        {
          document.myForm.time.value = xmlHttp.responseText;
        }
      }
      xmlHttp.open("GET", "time.asp", true);
      xmlHttp.send(null);
    }
    </script>

Then all you need to do is send an HTTP request to time.asp on the same server yourself. Example from w3schools.
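As a rough illustration of that step, here is a minimal Node.js sketch that looks for `XMLHttpRequest.open()` calls in a page's script text and resolves the relative URL against the page address. The function name `findAjaxUrls`, the regular expression, and the page address `http://www.example.com/ajax/demo.asp` are assumptions made for this example; real pages often build their URLs dynamically, in which case a simple pattern like this will not find them.

```javascript
// Find the URLs passed to XMLHttpRequest.open() in a page's script text
// and resolve them against the page URL, so they can be requested directly.
function findAjaxUrls(pageSource, pageUrl) {
  // Matches .open("GET", "time.asp", ...) with single or double quotes.
  const openCall = /\.open\(\s*["'](GET|POST)["']\s*,\s*["']([^"']+)["']/gi;
  const found = [];
  for (const match of pageSource.matchAll(openCall)) {
    found.push({
      method: match[1].toUpperCase(),
      url: new URL(match[2], pageUrl).href, // relative -> absolute
    });
  }
  return found;
}

// Hypothetical page address; only the script fragment comes from the example.
const pageUrl = 'http://www.example.com/ajax/demo.asp';
const pageSource = 'xmlHttp.open("GET","time.asp",true); xmlHttp.send(null);';

const requests = findAjaxUrls(pageSource, pageUrl);
console.log(requests);
```

Each resulting URL can then be requested with any HTTP client, and the response parsed instead of the original HTML page.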




Advanced scraping with C++:

For complex usage, and if you are using C++, you could also consider using SpiderMonkey, the Firefox JavaScript engine, to execute the JavaScript on a page.

Advanced scraping with Java:

For complex usage, and if you are using Java, you could also consider using Rhino, the Firefox JavaScript engine for Java.

Advanced scraping with .NET:

For complex usage, and if you are using .NET, you could also consider using the Microsoft.Vsa assembly. It has recently been replaced by ICodeCompiler / CodeDOM.

+53
Nov 04 '08 at 2:24

In my opinion, the simplest solution is to use CasperJS, a framework based on PhantomJS, the headless WebKit browser.

The whole page is loaded, and it's very easy to scrape any AJAX-related data. You can check out this basic tutorial to learn Automating & Scraping with PhantomJS and CasperJS.

You can also have a look at this sample code on how to scrape Google's suggested keywords:

    /*global casper:true*/
    var casper = require('casper').create();
    var suggestions = [];
    var word = casper.cli.get(0);

    if (!word) {
        casper.echo('please provide a word').exit(1);
    }

    casper.start('http://www.google.com/', function() {
        this.sendKeys('input[name=q]', word);
    });

    casper.waitFor(function() {
        return this.fetchText('.gsq_a table span').indexOf(word) === 0;
    }, function() {
        suggestions = this.evaluate(function() {
            var nodes = document.querySelectorAll('.gsq_a table span');
            return [].map.call(nodes, function(node) {
                return node.textContent;
            });
        });
    });

    casper.run(function() {
        this.echo(suggestions.join('\n')).exit();
    });
+8
Feb 09 '14 at 0:25

If you can get at it, try examining the DOM tree. Selenium does this as part of testing a page. It also has functions to click buttons and follow links, which may be useful.

+7
Nov 04 '08 at 1:31

The best way to scrape web pages that use Ajax, or pages that rely on JavaScript in general, is with a real browser or a headless browser (a browser without a GUI). PhantomJS is currently a well-promoted headless browser based on WebKit. An alternative that I have used successfully is HtmlUnit (in Java, or in .NET via IKVM), which is a simulated browser. Another well-known option is a web automation tool such as Selenium.

I have written many articles on this topic, such as web scraping of Ajax and JavaScript sites and automated browserless OAuth authentication for Twitter. At the end of the first article there are many additional resources that I have been collecting since 2011.

+4
May 09 '13 at 18:21

It depends on the Ajax page. The first part of screen scraping is determining how the page works. Is there some kind of variable that you can iterate over to request all the data from the page? Personally, I have used Web Scraper Plus for many screen scraping tasks because it is cheap, it is not hard to get started with, and non-programmers can get it working relatively quickly.

Side note: the terms of use are probably something you want to check before doing this. Depending on the site, iterating through everything may raise some flags.

+2
Nov 04 '08 at 1:31

I like PhearJS, but that might be partly because I built it.

That said, it is a service that you run in the background that speaks HTTP(S) and renders pages as JSON for you, including any metadata you may need.

+2
Oct 16 '15 at 15:27

As a low-cost solution, you can also try SWExplorerAutomation (SWEA). The program creates an automation API for any web application developed with HTML, DHTML or AJAX.

+1
Apr 11 2018-11-11T00:

I think Brian R. Bondy's answer is useful when the source code is easy to read. I prefer an easier way: use a tool like Wireshark or HttpAnalyzer to capture the packet, then build the URL from the "Host" and "GET" fields.

For example, I capture a packet like the following:

    GET /hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330 HTTP/1.1
    Accept: */*
    Referer: http://quote.hexun.com/stock/default.aspx
    Accept-Language: zh-cn
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
    Host: quote.tool.hexun.com
    Connection: Keep-Alive

Then the URL is:

 http://quote.tool.hexun.com/hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330 
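
The same reconstruction can be done programmatically. The following is only a sketch in Node.js: the function name `urlFromCapturedRequest` and the default `http` scheme are assumptions for this example (a capture of plain HTTP traffic does not state the scheme explicitly), and the request text is a shortened copy of the capture above.

```javascript
// Rebuild the request URL from a captured HTTP request: the path comes from
// the request line ("GET <path> HTTP/1.1") and the host from the Host header.
function urlFromCapturedRequest(rawRequest, scheme = 'http') {
  const lines = rawRequest
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);

  const requestLine = /^(\S+)\s+(\S+)\s+HTTP\/\d(?:\.\d)?$/.exec(lines[0] || '');
  if (!requestLine) throw new Error('No HTTP request line found');

  const hostLine = lines.slice(1).find((line) => /^host\s*:/i.test(line));
  if (!hostLine) throw new Error('No Host header found');
  const host = hostLine.slice(hostLine.indexOf(':') + 1).trim();

  return `${scheme}://${host}${requestLine[2]}`;
}

const captured = [
  'GET /hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330 HTTP/1.1',
  'Accept: */*',
  'Referer: http://quote.hexun.com/stock/default.aspx',
  'Host: quote.tool.hexun.com',
  'Connection: Keep-Alive',
].join('\r\n');

console.log(urlFromCapturedRequest(captured));
```

Some sites also check headers such as Referer or User-Agent, so it may be necessary to replay those from the capture along with the URL.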
+1
Jul 14 '13 at 9:09

Selenium WebDriver is a good solution: you program a browser and automate what needs to be done in it. Browsers (Chrome, Firefox, etc.) provide their own drivers that work with Selenium. Since it works as an automated REAL browser, pages (including JavaScript and Ajax) load just as they do for a person using that browser.

The downside is that it is slow (since you will most likely want to wait for all images and scripts to load before you start scraping that single page).

+1
Dec 07 '17 at 13:17

I had previously linked to MIT's Solvent and EnvJS as my answers for scraping Ajax pages. These projects no longer seem to be accessible.

Out of sheer necessity, I came up with another way to actually scrape Ajax pages, and it has worked on tough sites like findthecompany, which have methods to detect headless JavaScript engines and then show no data.

The technique is to use Chrome extensions to do the scraping. Chrome extensions are the best place to scrape Ajax pages because they actually give us access to the JavaScript-modified DOM. I will definitely open-source the code at some point. The technique is as follows. Create a Chrome extension (assuming you know how to create one, as well as its architecture and capabilities; this is easy to learn and practice, as there are plenty of samples), then:

  1. Use content scripts to access the DOM using XPath. Essentially, get the entire list, table, or other dynamically rendered content via XPath into a variable as a string of HTML nodes. (Only content scripts can access the DOM, but they cannot contact a URL using XMLHttpRequest.)
  2. From the content script, using message passing, send the entire stripped DOM as a string to a background script. (Background scripts can talk to URLs but cannot touch the DOM.) Message passing is how we get the two to talk to each other.
  3. You can use various events to loop through web pages and pass the stripped HTML node content of each page to the background script.
  4. Now use the background script to talk to an external server (on localhost), a simple one created using Node.js or Python. Just send all the HTML nodes as a string to the server, which simply saves the posted content to files, with appropriate variables to identify page numbers or URLs.
  5. You have now scraped the AJAX content (HTML nodes as a string), but these are partial HTML nodes. You can use your favorite XPath library to load them into memory and use XPath to scrape the information into tables or text.
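
Until the real code is available, steps 1, 2 and 4 might look roughly like the sketch below. It is not the author's extension: the function names, the message shape, the XPath expression `//table[@id="results"]`, and the endpoint `http://localhost:3000/save` are all assumptions made for illustration. In a real extension the two `run...` functions would live in separate files declared in the manifest, and the manifest would need to grant access to the local server.

```javascript
// Step 1 (content script): serialize every element matched by an XPath
// expression into one HTML string. `doc` is the page's `document`.
function serializeByXPath(doc, xpath) {
  const ORDERED_NODE_SNAPSHOT_TYPE = 7; // XPathResult.ORDERED_NODE_SNAPSHOT_TYPE
  const snapshot = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const parts = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    parts.push(snapshot.snapshotItem(i).outerHTML);
  }
  return parts.join('\n');
}

// Step 2: the message handed from the content script to the background script.
// The message shape is hypothetical.
function buildMessage(pageUrl, html) {
  return { type: 'scraped-fragment', pageUrl, html };
}

// Call this from the content script file.
function runContentScript() {
  // Hypothetical XPath for the dynamically rendered table.
  const html = serializeByXPath(document, '//table[@id="results"]');
  chrome.runtime.sendMessage(buildMessage(location.href, html));
}

// Call this from the background script file.
function runBackgroundScript() {
  chrome.runtime.onMessage.addListener((message) => {
    if (message.type !== 'scraped-fragment') return;
    // Step 4: forward the fragment to a local server (hypothetical endpoint).
    fetch('http://localhost:3000/save', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(message),
    });
  });
}
```

The local server then only has to write each posted `html` string to a file named after `pageUrl` or a page counter, which leaves step 5 (parsing the fragments with XPath) as an offline task.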

Please comment if anything is unclear, and I will try to explain it better. (First attempt.) I am also trying to release sample code as soon as possible.

0
Jun 26 2018-11-12T00:


