Authenticated website crawl

Question

How can I write a simple script (in cURL / Python / Ruby / Bash / Perl / Java) that logs in to OkCupid and counts how many messages I received each day?

The output would look something like this:

    1/21/2011 1 messages
    1/22/2011 0 messages
    1/23/2011 2 messages
    1/24/2011 1 messages

The main problem is that I have never written a web crawler before. I have no idea how to log in to a site such as OkCupid programmatically, or how to keep the authenticated session when loading different pages.

As soon as I get access to the raw HTML, I will be fine with regular expressions and maps, etc.

web-crawler

pokerface Jan 24 '11 at 21:09

1 answer

pokerface · Answer 1 · 2011-01-25T02:28:34+0000

Here is a solution using cURL that loads the first page of the inbox. A complete solution would repeat the last step for each page of messages. $USERNAME and $PASSWORD must be filled in with your own credentials.

    #!/bin/sh

    ## Initialize the cookie jar
    curl --cookie-jar cjar --output /dev/null https://www.okcupid.com/login

    ## Log in and save the resulting HTML file as loginResult.html
    ## (for debugging purposes). Double quotes let the shell expand
    ## $USERNAME and $PASSWORD, so set both before running the script;
    ## --data-urlencode keeps special characters in them intact.
    curl --cookie cjar --cookie-jar cjar \
         --data 'dest=/?' \
         --data-urlencode "username=$USERNAME" \
         --data-urlencode "password=$PASSWORD" \
         --location \
         --output loginResult.html \
         https://www.okcupid.com/login

    ## Download the inbox and save it as inbox.html
    curl --cookie cjar \
         --output inbox.html \
         http://www.okcupid.com/messages

This method is explained in the cURL video tutorial.
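To go from saved pages to the per-day counts the question asks for, something along these lines might work. It is only a sketch under assumptions: `fetch_pages` and `count_by_day` are names made up for this example, the page URLs are whatever the inbox's paging links turn out to be (they are not known here), and every message is assumed to show its date exactly once in M/D/YYYY form, which may not match the real inbox markup.

```shell
#!/bin/sh

# fetch_pages URL...
# Hypothetical helper: downloads each URL using the session cookies in
# ./cjar (written by the login step) and saves the pages as
# page_1.html, page_2.html, ...
fetch_pages() {
    n=1
    for url in "$@"; do
        curl --silent --cookie cjar --output "page_$n.html" "$url" || return 1
        n=$((n + 1))
    done
}

# count_by_day FILE...
# Hypothetical helper: prints "<date> <count> messages" for each date
# found in the given HTML files, oldest first.
# ASSUMPTION: each message shows its date once, written as M/D/YYYY.
# Adjust the pattern to whatever the real markup contains.
count_by_day() {
    cat "$@" |
        grep -oE '[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}' |
        sort -t/ -k3,3n -k1,1n -k2,2n |
        uniq -c |
        awk '{ print $2, $1, "messages" }'
}
```

After logging in, `count_by_day inbox.html` would report the first page, and `fetch_pages` followed by `count_by_day page_*.html` would cover several pages. Days without any messages do not appear in the output, so the "0 messages" lines would have to be filled in separately.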
