How to parse a website that does not show codes in the view source?

Question

I'm not sure how to describe the problem correctly, but here goes: I want to use mechanize to grab the form and get the input names. However, when I parse the page with mechanize, it does not show the form name or the input names. If I look at the website manually, I have to use Inspect Element to get the input name, and even then it is dynamic: every time I inspect the element, it gives me a different name. Any ideas? By the way, the site I'm trying to parse is https://www.ursa.ucla.edu/logon/logon.asp , if anyone is interested.

Here is what I tried:

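import mechanize

# Open the logon page and select the first form on it
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.open("https://www.ursa.ucla.edu/logon/logon.asp/")
br.select_form(nr=0)
# Dump the raw HTML of the response that was actually received
print(br.response().read())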

Thanks in advance, Richard.

python parsing forms mechanize

ordinaryman09 Jan 22 '12 at 5:22

1 answer

valentinas · Accepted Answer · 2012-01-22T06:36:11+0000

The webpage you are trying to parse is not directly accessible. When you visit https://www.ursa.ucla.edu/logon/logon.asp , it does the following:

The first page redirects you to https://shb.ais.ucla.edu/shibboleth-idp/profile/Shibboleth/SSO?shire=https%3A%2F%2Fwww.ursa.ucla.edu%2FShibboleth.sso%2FSAML%2FPOST&time=1327213354=cookie%3Aa872692c&providerId=https%3A%2F%2Fwww.ursa.ucla.edu%2Fshibboeth-sp (as you can see, it carries a couple of variables: cookie, time, ...)
The second page redirects you to https://shb.ais.ucla.edu/shibboleth-idp/AuthnEngine
The third page redirects you to https://shb.ais.ucla.edu/shibboleth-idp/Authn/RemoteUser
The last page responds with 200 and returns markup containing a form with two hidden input fields. The form submits itself on load, and only after this fifth response do you receive the actual login page.
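To see which of these hops a client actually follows, it can help to log every redirect. The following is only a minimal sketch under stated assumptions: it uses Python 3's standard urllib instead of mechanize, the names RedirectLogger and trace_redirects are made up for illustration, and whether the site still behaves as described above is not verified here.

```python
import urllib.request


class RedirectLogger(urllib.request.HTTPRedirectHandler):
    """Redirect handler that remembers every hop it is asked to follow.

    Hypothetical helper for illustration; not part of mechanize.
    """

    def __init__(self):
        self.hops = []  # list of (status_code, target_url) tuples

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Record the hop, then let the stock handler build the next request.
        self.hops.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def trace_redirects(url):
    """Open url, follow redirects, and return (hops, final_url, body)."""
    logger = RedirectLogger()
    # Assumption: the login flow sets cookies along the way, so keep a
    # cookie jar for the whole chain.
    opener = urllib.request.build_opener(
        logger, urllib.request.HTTPCookieProcessor()
    )
    with opener.open(url) as response:
        return logger.hops, response.geturl(), response.read()
```

Calling trace_redirects("https://www.ursa.ucla.edu/logon/logon.asp") would then return the list of followed hops, the URL of the final response, and its body, which is where the form with the hidden fields should show up if the redirects are followed automatically.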

Now, I don't know how Python handles redirect headers, so you may need to look at the response you actually get. In the best case, it will be the last page with the hidden fields; you will then need to parse them and send a POST request to the same URL to get the real login page. In the worst case, you will need to follow the redirect headers yourself, starting from the first page.
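For the best case, parsing the hidden fields and preparing the POST could look roughly like the sketch below. It is written for Python 3 with the standard library only; HiddenFormParser and build_form_post are hypothetical names, and the sketch assumes the first form on the page is the self-submitting one. It posts to the form's action attribute when there is one and falls back to the page's own URL otherwise.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
import urllib.request


class HiddenFormParser(HTMLParser):
    """Collect the action of the first <form> and its hidden <input> fields."""

    def __init__(self):
        super().__init__()
        self.action = None   # value of the form's action attribute, if any
        self.fields = {}     # hidden input name -> value
        self._in_form = False
        self._done = False   # set once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self._in_form:
            is_hidden = (attrs.get("type") or "").lower() == "hidden"
            if is_hidden and attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_post(page_url, html):
    """Build the POST request that the self-submitting form would send."""
    parser = HiddenFormParser()
    parser.feed(html)
    # An empty or missing action resolves to the page's own URL.
    target = urljoin(page_url, parser.action or "")
    data = urlencode(parser.fields).encode("ascii")
    # Passing data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(target, data=data)
```

The returned request can be sent with the same opener that fetched the page, so that any cookies set during the redirects are sent along; the response to it should be the real login page, whose form can then be inspected in the usual way.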
