What are the pros and cons of different ways of analyzing websites?

I would like to write code that examines a website and its assets and produces some statistics and a report. Assets include images. I would like to trace links, or at least try to identify the menus on a page. I would also like to guess which CMS created the site, based on class names and the like.

I'm going to assume that the site is reasonably static, or driven by a CMS, but is not something like an RIA.

Ideas on how I might proceed:

1) Load the site into an iframe. That would be nice, because I could parse it with jQuery. Or could I? It seems cross-site scripting rules would get in my way. I have seen suggestions for working around those problems, but I assume browsers will continue to clamp down on such things. Would a bookmarklet help?

2) A Firefox add-on. This would let me get around the cross-site scripting issues, right? The debugging tools for Firefox (and GreaseMonkey, for that matter) seem to let you do all kinds of things.

3) Fetch the site from the server side. Use server-side libraries for parsing.

4) YQL. Isn't it well suited to parsing sites?
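
To make option 3 a little more concrete, here is a minimal sketch of server-side parsing in Python using only the standard library. The class name `AssetCollector`, the function `collect_assets`, and the sample markup are all made up for illustration; a real tool would first fetch the page and would probably want a more forgiving parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect link targets and image sources from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative URLs
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def collect_assets(html, base_url):
    """Return absolute link and image URLs found in the given HTML string."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    parser.close()
    return {"links": parser.links, "images": parser.images}


# Hypothetical sample page; in practice the HTML would come from an HTTP fetch.
sample = """
<ul class="menu"><li><a href="/about">About</a></li>
<li><a href="contact.html">Contact</a></li></ul>
<img src="img/logo.png" alt="logo">
"""
report = collect_assets(sample, "http://example.com/index.html")
print(report)
```

The same collector could be extended to count assets per type or to record which links sit inside a menu-like list, which is roughly the kind of statistics the report would need.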

+3
7 answers

. , , Id Firefox Addon.

Im . DOM , Javascript. , : Adobe AIR, Firefox Addons, userscripts ..

Fx addon , . A script , , , , , . DOM, JS/CSS/HTML/ ( !)

- Adobe AIR. - , . - DOM . - -, URL-, Javascript ( - )... , .

: Adobe AIR DOM:

  • Ajax, HTMLLoader (loadString IIRC)
  • iframe .

, , , ( , , ). , DOM. . -, JS, childSandboxBridge ( : AIR). script :

window.childSandboxBridge = {
   // ... some methods returning data
}

( - , - )

, - , HTML XHTML. . Ive Apache + PHP, / . , DOM .

.

, , - , browsershots. firefox . Mac OS X , ActionScript, .

, :

  • PHP/ script - , JS-, CSS .. .. .
  • Firefox Addon - DOM . (, , firefox - ). , .
  • Adobe AIR - , , Fx-, .
  • - , -. Linux . .:)
+2

:

a) . Perl Python: curl + bash, .

b) script, python perl. Perl WWW::Mechanize.

Python , www.feedparser.org

c) ( HTTP HEAD), . , CMS (i.d. WordPress ..).

d) Google XML API, - "link: sitedomain.com", , : Python Google. Google.

e) SQLite db, Excel.

+7

(XHTML/HTML) . . , .

iframe - HTML, . , . .

, Python, Java, PHP, , , Javascript , Firefox.

, . XHTML/HTML - , . "", HTML, "img", "object" ..

+3

, Firebug , , . , YSlow Firebug , (, , CSS Javascript-).

+3

№4 (YQL): , , , , - , , . YQL , , , .

YQL , # 2 ( firefox).

, № 1 (Iframe) - , .

, № 3 ( ), , , - , AJAX. , AJAX - ! : " -

AJAX: ajax, AJAX evalScripts: true. . , , javascript :

: http://www.prototypejs.org/api/ajax/updater

: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17

, , : http://aptana.com/jaxer/guide/develop_sandbox.html

(, , ) .NET- WebRobot AJAX, Digg.com. http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx

PHP Curl -. , Curl AJAX: http://www.merchantos.com/makebeta/php/scraping-links-with-php/

, , :

  • AJAX.
  • .
  • , ..
  • [] .
  • .
  • [] .

^ . , ( ).

! AJAX. AJAX. Digg.com , MSN.com ..

+3

.Net , # - .Net. WebBrowser , ( GetElementsByTagName()), , .. ( BASE, ) src href URL- HttpWebRequest, HEAD- , . , , . , , /pagerank ( API Google), , HTML XHTML, URL- , , , , Google ( , ).

0

I would use a script (or a compiled application, depending on the language chosen) written in a language with strong support for networking and text parsing / regular expressions.

  • Perl
  • Python
  • A .NET language of your choice
  • Java

Pick whichever language you like best. With a basic stand-alone script or application you do not have to worry much about browser integration and security issues.
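
As a rough illustration of the regular-expression side of such a script, the sketch below tries to guess the CMS from markers in the raw HTML, as the question asks. The function name `guess_cms` and the signature table are assumptions chosen for the example, not an authoritative fingerprint list, and the generator pattern only handles the common case where `name` comes before `content`.

```python
import re

# Illustrative signatures only (assumptions); real sites may hide or change these.
CMS_SIGNATURES = {
    "WordPress": [r"/wp-content/", r"/wp-includes/"],
    "Drupal": [r"/sites/default/files/", r"Drupal\.settings"],
    "Joomla": [r"/components/com_", r"/media/jui/"],
}

# Matches <meta name="generator" content="..."> when name precedes content.
GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def guess_cms(html):
    """Return (guess_or_None, evidence) for one page of HTML."""
    match = GENERATOR_RE.search(html)
    if match:
        # Many CMSs announce themselves here, so trust it first.
        return match.group(1), ["generator meta tag"]
    scores = {}
    for name, patterns in CMS_SIGNATURES.items():
        hits = [p for p in patterns if re.search(p, html)]
        if hits:
            scores[name] = hits
    if not scores:
        return None, []
    # Pick the CMS with the most matching signatures.
    best = max(scores, key=lambda n: len(scores[n]))
    return best, scores[best]


print(guess_cms('<link href="/wp-content/themes/x/style.css">'))
```

A guess like this is only a heuristic, so a report should probably show the evidence alongside the name rather than state the CMS as fact.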

0

Source: https://habr.com/ru/post/1709248/

