How to find the final destination (URL) of an ad (programmatically)

This may or may not be trivial, but I'm working on a piece of software that checks the "end of the line" domain for ads displayed through my web application. I have a list of domains whose ads I do not want to show (Norton.com is one of them), but most advertising networks serve ads through shortened or cryptic URLs (for example adsrv.com) that ultimately redirect to Norton.com. So the question is: has anyone built, or does anyone have an idea of how to build, a scraper-like tool that returns the final URL of an ad?
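For the blocklist half of the problem, here is a minimal sketch of checking a resolved URL against the list. It assumes the final URL has already been obtained somehow; the `BLOCKED_DOMAINS` constant and the `blocked_destination?` helper are hypothetical names, and Norton.com is the only entry because it is the only domain named in the question.

```ruby
require 'uri'

# Hypothetical blocklist; the question only names Norton.com as an example.
BLOCKED_DOMAINS = %w[norton.com].freeze

# True when the URL's host is a blocked domain or a subdomain of one.
# Unparseable or host-less input is treated as "not blocked".
def blocked_destination?(final_url, blocked = BLOCKED_DOMAINS)
  host = URI.parse(final_url).host
  return false if host.nil?

  host = host.downcase
  blocked.any? { |domain| host == domain || host.end_with?(".#{domain}") }
rescue URI::InvalidURIError
  false
end
```

Matching on the dot-prefixed suffix rather than a bare substring keeps unrelated hosts such as notnorton.com from being flagged.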

Initial discovery: some ads are in Flash, some in JavaScript, and some in plain HTML. Browser emulation is completely viable and would handle the various ad formats. Not all Flash or JS ads have a noflash or noscript alternative. (So a browser may be required, but as stated, that is fine: something like WatiN, WatiR, WatiJ, Selenium, etc.)

I'd prefer open source so I can rebuild it myself. I really appreciate the help!

EDIT: This script needs to click on the ad, since the ad may be Flash, JS, or plain HTML. So cURL is a less likely option, unless cURL can click?

+3
6

PHP:

$k = curl_init('http://goo.gl');
curl_setopt($k, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($k, CURLOPT_USERAGENT, 
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 ' .
'(KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7'); // imitate chrome
curl_setopt($k, CURLOPT_NOBODY, true); // HEAD request only (faster)
curl_setopt($k, CURLOPT_RETURNTRANSFER, true); // don't echo results
curl_exec($k);
$final_url = curl_getinfo($k, CURLINFO_EFFECTIVE_URL); // get last URL followed
curl_close($k);
echo $final_url;

Returns: https://www.google.com/accounts/ServiceLogin?service=urlshortener&continue=http://goo.gl/?authed%3D1&followup=http://goo.gl/?authed%3D1&passive=true&go=true

That should be the final destination. You may also need to set CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER via curl_setopt() if you want to follow HTTPS/SSL links.

+4
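curl --head -L -s -o /dev/null -w '%{url_effective}' <some-short-url>

  • --head restricts the requests to HEAD,

  • -L tells curl to follow redirects,

  • -s silences the progress meter and error messages,

  • -o /dev/null tells curl to throw away the retrieved headers (we do not need them),

  • -w '%{url_effective}' tells curl to write the last URL it fetched to stdout.

As a result, the effective URL is written to stdout, and nothing else.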

+2

What you describe comes down to following URL redirects until the chain ends.

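The Net::HTTP documentation has an example of following redirects.

Alternatively, Ruby's open-uri follows redirects automatically, so after fetching a page you can ask it for the final URL.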

require 'open-uri'

io = URI.open('http://google.com') # Kernel#open no longer accepts URLs as of Ruby 3.0
body = io.read
io.base_uri.to_s # => "http://www.google.com/"

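This shows that the request for the Google URL ended up at http://www.google.com/.

Meta redirects are harder. You have to parse the page, find the meta refresh tag, and extract the URL from it.

Something like: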

require 'nokogiri'

doc = Nokogiri::HTML('<meta http-equiv="REFRESH" content="0;url=http://www.the-domain-you-want-to-redirect-to.com">')

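# content looks like "0;url=..."; split on the first '=' only so URLs with query strings survive
redirect_url = (doc % 'meta[@http-equiv="REFRESH"]')['content'].split('=', 2).last rescue nil

+1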

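cURL can fetch the HTTP headers. If a response has a Location: header, request that URL and repeat; once a response has no Location: header, the URL you are on is the final one.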
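The header-following loop can be sketched in Ruby with Net::HTTP instead of cURL. This is only an illustration under a few assumptions: `final_url`, `HEAD_LOCATION`, and the `max_hops` limit are names made up here, the server is assumed to answer HEAD requests, and only HTTP-level redirects are followed (no meta refresh, JavaScript, or Flash).

```ruby
require 'net/http'
require 'uri'

# Default lookup: send a HEAD request and return the Location header
# when the response is a redirect, otherwise nil.
HEAD_LOCATION = lambda do |uri|
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    response = http.head(uri.request_uri)
    response.is_a?(Net::HTTPRedirection) ? response['location'] : nil
  end
end

# Follow Location headers until none is returned or max_hops is reached.
# `lookup` is injectable so the loop can be exercised without a network.
def final_url(start_url, max_hops: 10, lookup: HEAD_LOCATION)
  uri = URI.parse(start_url)
  max_hops.times do
    location = lookup.call(uri)
    return uri.to_s if location.nil?

    uri = URI.join(uri.to_s, location) # also resolves relative Location values
  end
  raise "gave up after #{max_hops} redirects"
end
```

The hop limit is there so that a redirect loop raises an error instead of running forever.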

0

The Mechanize gem is convenient for this:

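  require 'mechanize'

  # url is the ad link to resolve; Mechanize follows HTTP redirects by default
  agent = Mechanize.new {|a| a.user_agent_alias = 'Windows IE 7'}
  page = agent.get(url)
  final_url = page.uri.to_s # URL of the page that was finally loaded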
0

The solution I ended up with was to simulate a browser, load the ad, and click it. Clicking was the key ingredient. The solutions offered by others work well for a given URL, but they do not handle Flash, JavaScript, etc. I appreciate everyone's help.

0

Source: https://habr.com/ru/post/1772745/

