This may or may not be trivial, but I'm working on a piece of software that checks the "end of the line" domain for ads displayed through my web application. I have a list of domains whose ads I do not want to show (Norton.com, for example), but most ad networks serve ads through shortened, cryptic URLs (adsrv.com) that ultimately redirect to Norton.com. So the question is: has anyone built, or does anyone have an idea of how to build, a scraper-like tool that returns the final URL of an ad?
Initial discoveries: Some ads are Flash, some are JavaScript, and some are plain HTML. Emulating a browser is completely viable and would cope with the different ad formats. Not all Flash or JS ads have a noflash or noscript alternative. (A browser may be needed, but as I said, that is fine: something like WatiN, WatiR, WatiJ, or Selenium would do.)
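To make the browser route concrete, here is a rough sketch of what I imagine, assuming Selenium's Python bindings. The function name `landing_url_after_click`, the CSS selector argument, and the fixed wait are all my own placeholders, and I have not verified that a DOM click works on every Flash or JS ad:

```python
import time


def landing_url_after_click(driver, page_url, ad_selector, wait_seconds=5.0):
    """Open page_url, click the ad element, and return the URL the browser lands on.

    `driver` is expected to behave like a Selenium WebDriver.
    `ad_selector` is a hypothetical CSS selector for the ad element.
    """
    driver.get(page_url)
    handles_before = set(driver.window_handles)

    # "css selector" is the locator strategy string behind By.CSS_SELECTOR.
    driver.find_element("css selector", ad_selector).click()

    # Crude wait so that redirects can settle; a real tool would poll instead.
    time.sleep(wait_seconds)

    # Ads often open a new tab or window; switch to it if one appeared.
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if new_handles:
        driver.switch_to.window(new_handles[0])
    return driver.current_url
```

With Selenium installed, this would be called along the lines of `landing_url_after_click(webdriver.Firefox(), "http://example.test/page", "#ad")`, where both the page URL and the selector are made up.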
I'd prefer open source so that I can rebuild it myself. I really appreciate the help!
EDIT: The script needs to click on the ad, since the ad can be Flash, JS, or plain HTML. So is curl a less likely option, given that curl can't click?
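For comparison, this is roughly the HTTP-only part that a curl-style approach can cover. It is a minimal sketch using Python's standard library. The names `BLOCKED_DOMAINS`, `is_blocked`, and `final_url` are my own, and it follows only server-side HTTP redirects, so it sees nothing that happens through a click, JavaScript, or Flash:

```python
from urllib.parse import urlparse
from urllib.request import urlopen

# Example blocklist; Norton.com is the domain mentioned above.
BLOCKED_DOMAINS = {"norton.com"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def final_url(url, timeout=10):
    """Follow HTTP redirects and return the URL of the final response."""
    # urlopen follows 3xx redirects by default; geturl() reports where it ended up.
    with urlopen(url, timeout=timeout) as resp:
        return resp.geturl()
```

The check for a given ad link would then be `is_blocked(final_url(ad_link))`, where `ad_link` is whatever URL the ad network hands out.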