What I am doing
I am writing a web crawler in OCaml. Using the string_of_uri function (below), defined by nlucaroni in his answer to a previous question I posted, I can get the HTML text of a URL from the Internet.
    let string_of_uri uri =
      try
        let connection = Curl.init ()
        and write_buff = Buffer.create 1763 in
        (* append each chunk libcurl delivers to the buffer and
           report the number of bytes consumed *)
        Curl.set_writefunction connection
          (fun x -> Buffer.add_string write_buff x; String.length x);
        Curl.set_url connection uri;
        Curl.perform connection;
        Curl.global_cleanup ();
        Buffer.contents write_buff
      with _ -> raise (IO_ERROR uri)

(IO_ERROR is an exception I define elsewhere in my code.)
I have already written the code that retrieves a list of all the hyperlinks in the extracted HTML (i.e. all the [LINK] parts in something like <A HREF="[LINK]">text</A>). All of this works great.
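For reference, the extraction step does something along these lines (a simplified sketch using the Str regex library that ships with OCaml; my real code is more involved, and the name extract_links is just illustrative):

    (* collect every double-quoted HREF value from an HTML string,
       in document order *)
    let extract_links html =
      let href = Str.regexp_case_fold "<a href=\"\\([^\"]*\\)\"" in
      let rec loop pos acc =
        try
          ignore (Str.search_forward href html pos);
          loop (Str.match_end ()) (Str.matched_group 1 html :: acc)
        with Not_found -> List.rev acc
      in
      loop 0 []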
Problem
The problem is that some pages redirect you, and I don't know how to follow the redirect. For example, my program finds 0 <A> tags on http://en.wikipedia.org because Wikipedia actually redirects you to http://en.wikipedia.org/wiki/Main_Page . If I give my program that last URL, everything works fine; but if I give it the initial one, it just returns 0 <A> tags.
Unfortunately, there is no documentation at all for ocurl, beyond the function names in its interface. Does anyone have an idea of how I can improve the string_of_uri function above so that it follows any redirects and returns the HTML of the final page it reaches?
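Skimming the function names in curl.mli, I do see a Curl.set_followlocation (and a Curl.set_maxredirs). Purely as a guess, since there are no docs, I imagine the fix is something like the sketch below; whether these are the right options, and whether they map to libcurl's CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS as I assume, is exactly what I am unsure about.

    (* string_of_uri, plus a request that libcurl follow 3xx redirects;
       set_maxredirs is my attempt to guard against redirect loops *)
    let string_of_uri_following uri =
      try
        let connection = Curl.init ()
        and write_buff = Buffer.create 1763 in
        Curl.set_writefunction connection
          (fun x -> Buffer.add_string write_buff x; String.length x);
        Curl.set_followlocation connection true;
        Curl.set_maxredirs connection 10;
        Curl.set_url connection uri;
        Curl.perform connection;
        Curl.cleanup connection;
        Buffer.contents write_buff
      with _ -> raise (IO_ERROR uri)

Is this the right approach, or is there something else I should be setting on the handle?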
I also noticed that applying the Curl.get_redirectcount function to a connection on http://en.wikipedia.org returns 0, which is not what I was expecting, since that page does get redirected to another one.
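For completeness, this is roughly how I am reading that count (a minimal sketch; I am assuming get_redirectcount has to be queried after perform, on the same handle):

    (* fetch the page, discard the body, and report how many
       redirects libcurl says it went through *)
    let redirect_count uri =
      let connection = Curl.init () in
      Curl.set_writefunction connection String.length;
      Curl.set_url connection uri;
      Curl.perform connection;
      let n = Curl.get_redirectcount connection in
      Curl.cleanup connection;
      n

My guess is that the count stays at 0 because libcurl only receives the redirect response and never actually follows it, which would also be consistent with my program seeing the redirect response's body rather than the Main_Page HTML.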
Thanks for any help!
All the best, Surikator.