Perl: HTML scraping from an authenticated website

HTML scraping is pretty well documented from what I can see, and I understand the concept and its implementation, but what is the best method for scraping content that is hidden behind authentication forms? I am referring to scraping content that I legitimately have access to, so what I'm looking for is a way to submit the login data automatically.

All I can think of is to set up a proxy, capture the traffic from a manual login, and then have the script replay that traffic as part of the HTML scraping. As for the language, this will most likely be done in Perl.

Has anyone had experience with this, or any general thoughts?

Edit: This has been answered before, but with .NET. Although it confirms how I think it needs to be done, does anyone have a Perl script for this?

+4
4 answers

Check out the Perl library WWW::Mechanize. It builds on LWP to provide tools for exactly the kind of interaction you describe, and it can maintain state with cookies while it does so!

WWW::Mechanize, or Mech for short, helps you automate interaction with a website. It supports performing a sequence of page fetches, including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or form can be selected, form fields can be filled in, and the next page can be fetched. Mech also keeps a history of the URLs you have visited, which can be queried and revisited.
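To make that description concrete, here is a minimal sketch of a form login with WWW::Mechanize. The sub name `fetch_behind_login` and all of its arguments (the URLs and the form field names) are placeholders I made up for illustration; they would have to be replaced with whatever the target site actually uses.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Log in through an HTML form and return the body of a protected page.
# Every URL and field name passed in is a placeholder for the real site.
sub fetch_behind_login {
    my (%args) = @_;

    # autocheck => 1 makes Mech die on HTTP errors.
    # Mech keeps an in-memory cookie jar by default.
    my $mech = WWW::Mechanize->new( autocheck => 1 );

    # Fetch the login page so that its forms get parsed.
    $mech->get( $args{login_url} );

    # Fill in and submit the form that contains both of these fields.
    $mech->submit_form(
        with_fields => {
            $args{user_field} => $args{username},
            $args{pass_field} => $args{password},
        },
    );

    # Cookies set during login are sent along automatically from here on.
    $mech->get( $args{protected_url} );
    return $mech->content;
}
```

A call would look something like `fetch_behind_login( login_url => 'https://example.com/login', user_field => 'username', pass_field => 'password', username => $user, password => $pass, protected_url => 'https://example.com/members' )`, where every value is hypothetical. Because `submit_form` works on the parsed form, hidden fields already present in the page are submitted with it.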

+4

The Perl LWP module should give you what you need.

Here's a good article that talks about enabling cookies and other authentication methods so that you can log in and let your screen scraper get past the login wall.

+3

Two types of authentication are commonly used: HTTP authentication and form-based authentication.

For a site that uses HTTP authentication, you basically send the username and password as part of every HTTP request that you make to the server.
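For the HTTP authentication case, a rough sketch with LWP might look like the following. It assumes the site uses the Basic scheme (Digest would need a different approach), and the sub names `basic_auth_request` and `fetch_with_basic_auth` are just illustrative.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Build a GET request that carries HTTP Basic credentials.
sub basic_auth_request {
    my ( $url, $user, $pass ) = @_;
    my $req = HTTP::Request->new( GET => $url );
    # Sets the "Authorization: Basic ..." header on this request.
    $req->authorization_basic( $user, $pass );
    return $req;
}

# Fetch a page protected by HTTP Basic authentication.
# The credentials have to accompany every request made this way.
sub fetch_with_basic_auth {
    my ( $url, $user, $pass ) = @_;
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->request( basic_auth_request( $url, $user, $pass ) );
    die 'Request failed: ' . $res->status_line unless $res->is_success;
    return $res->decoded_content;
}
```

Since Basic credentials are only base64-encoded, this is probably only sensible over HTTPS.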

For a site that uses form-based authentication, you usually need to visit the login page, submit the login form, accept and save the cookie the site sets, and then send that cookie with every subsequent HTTP request.
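The form-based flow can be sketched with plain LWP and a cookie jar, roughly as below. The sub names, the URLs, and the form field names are all assumptions for illustration, and the sketch does not handle hidden fields such as CSRF tokens.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Create a user agent that stores cookies in memory and replays them.
sub new_session_agent {
    my $ua = LWP::UserAgent->new;
    $ua->cookie_jar( HTTP::Cookies->new );
    return $ua;
}

# Log in with a form POST, then fetch a page that requires the session.
sub login_and_fetch {
    my (%args) = @_;
    my $ua = new_session_agent();

    # Visit the login page first so any pre-login cookie is stored.
    $ua->get( $args{login_url} );

    # POST the credentials; the field names are site-specific placeholders.
    my $login = $ua->post(
        $args{login_url},
        {
            $args{user_field} => $args{username},
            $args{pass_field} => $args{password},
        },
    );

    # LWP does not follow redirects after a POST by default, and many
    # sites answer a successful login with one, so accept 3xx as well.
    die 'Login failed: ' . $login->status_line
        unless $login->is_success || $login->is_redirect;

    # The cookie jar sends the session cookie with this request.
    my $page = $ua->get( $args{protected_url} );
    die 'Fetch failed: ' . $page->status_line unless $page->is_success;
    return $page->decoded_content;
}
```

A 2xx or 3xx status does not prove that the login worked, so in practice it is worth checking the returned page for something only a logged-in user would see.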

Of course, there are also sites like Stack Overflow that use external authentication, such as OpenID or SAML. These are more difficult to handle when scraping. Usually you will want to find a library to handle them.

+2

Yes, you can use other libraries for your language of choice if it is not ASP.NET.

For example, in Java you can use HttpClient or HttpUnit (which even handles some basic JavaScript).

0

Source: https://habr.com/ru/post/1277642/

