Following an HTTP redirect in an OCaml web crawler

What I am doing

I am writing a web crawler in OCaml. Using the string_of_uri function (below), written by nlucaroni in an answer to a previous question of mine, I can fetch the HTML text of a URL from the Internet.

    (* The exception carries the URI that failed; declared here so the
       snippet is self-contained. *)
    exception IO_ERROR of string

    let string_of_uri uri =
      try
        let connection = Curl.init ()
        and write_buff = Buffer.create 1763 in
        (* The write callback appends each chunk libcurl receives and
           must return the number of bytes it consumed. *)
        Curl.set_writefunction connection
          (fun x -> Buffer.add_string write_buff x; String.length x);
        Curl.set_url connection uri;
        Curl.perform connection;
        Curl.global_cleanup ();
        Buffer.contents write_buff
      with _ -> raise (IO_ERROR uri)
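For completeness, a minimal driver for this function might look as follows (a sketch only; Curl.global_init initialises libcurl once per program, and the URL is just an example):

    let () =
      (* Initialise libcurl's global state once, before any handle is used. *)
      Curl.global_init Curl.CURLINIT_GLOBALALL;
      print_string (string_of_uri "http://en.wikipedia.org/wiki/Main_Page")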

I have already written the code to retrieve a list of all the hyperlinks in the extracted HTML (i.e. all the [LINK] parts in something like <A HREF="[LINK]">text</A>). All of this works great.
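A minimal sketch of that extraction step, assuming the standard Str library is used (the question does not show the actual parsing code, so the name extract_links and the regex are only illustrations):

    let extract_links html =
      (* Case-insensitive match on <a ... href="...">, capturing the URL. *)
      let href_re = Str.regexp_case_fold "<a[^>]*href=\"\\([^\"]*\\)\"" in
      let rec collect pos acc =
        match Str.search_forward href_re html pos with
        | i -> collect (i + 1) (Str.matched_group 1 html :: acc)
        | exception Not_found -> List.rev acc
      in
      collect 0 []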

Problem

The problem is that some pages redirect you, and I don't know how to follow the redirect. For example, my program finds 0 tags on the http://en.wikipedia.org page because Wikipedia actually redirects you to http://en.wikipedia.org/wiki/Main_Page . If I give this last page to my program, everything is fine; but if I give it the initial one, it just returns 0 <A> tags.

Unfortunately, there is no documentation at all for ocurl, apart from the function names in the interface. Does anyone have an idea of how I can improve the string_of_uri function above so that it follows any redirects and returns the HTML of the final page it reaches?

I noticed that applying the Curl.get_redirectcount function to a connection on http://en.wikipedia.org returns 0, which is not what I expected, since the page is redirected to another page...

Thanks for any help!

All the best, Surikator.

1 answer

This question has already been answered in the comments of this answer. The solution is to add Curl.set_followlocation connection true just above Curl.perform connection .
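Put together, a sketch of the revised function looks like this (based on the code from the question; Curl.set_followlocation is ocurl's binding for libcurl's CURLOPT_FOLLOWLOCATION):

    let string_of_uri uri =
      try
        let connection = Curl.init ()
        and write_buff = Buffer.create 1763 in
        Curl.set_writefunction connection
          (fun x -> Buffer.add_string write_buff x; String.length x);
        Curl.set_url connection uri;
        (* Follow 3xx responses to their target instead of returning the
           redirect response itself. *)
        Curl.set_followlocation connection true;
        Curl.perform connection;
        Curl.global_cleanup ();
        Buffer.contents write_buff
      with _ -> raise (IO_ERROR uri)

This also explains the Curl.get_redirectcount observation: libcurl counts only the redirects it actually followed, so with followlocation disabled the count is always 0. Once it is enabled, querying the count after Curl.perform (and before the handle is cleaned up) should report 1 for http://en.wikipedia.org.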


Source: https://habr.com/ru/post/1334863/

