Why can't I load this webpage in Python?

Try it yourself: :)

curl http://www.windowsphone.com/en-US/apps?list=free 

result:

  <html><head><title>Object moved</title></head><body> <h2>Object moved to <a href="https://login.live.com/login.srf?wa=wsignin1.0&amp;rpsnv=11&amp;checkda=1&amp;ct=1320735308&amp;rver=6.1.6195.0&amp;wp=MBI&amp;wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fapps%3Flist%3Dfree&amp;lc=1033&amp;id=268289">here</a>.</h2> </body></html> 

or

 # Python 2 (urllib2)
 import random
 import socket
 import urllib2

 def download(source_url):
     try:
         socket.setdefaulttimeout(10)
         # Pick a random browser-like User-Agent for the request.
         agents = ['Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)',
                   'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)',
                   'Microsoft Internet Explorer/4.0b1 (Windows 95)',
                   'Opera/8.00 (Windows NT 5.1; U; en)']
         ree = urllib2.Request(source_url)
         ree.add_header('User-Agent', random.choice(agents))
         resp = urllib2.urlopen(ree)
         htmlSource = resp.read()
         return htmlSource
     except Exception, e:
         print e
         return ""

 download('http://www.windowsphone.com/en-US/apps?list=free')

result:

 <html><head><meta http-equiv="REFRESH" content="0; URL=http://www.windowsphone.com/en-US/apps?list=free"><script type="text/javascript">function OnBack(){}</script></head></html> 

I want to download the actual source of the webpage.

2 answers

It fails because http://www.windowsphone.com tries to set a cookie, which is then checked by https://login.live.com. That site sets another cookie of its own and, if the check succeeds, redirects back to windowsphone.com. A client that does not store and resend cookies never gets past this check.

You should take a look at http://docs.python.org/library/cookielib.html
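As a rough sketch of what the cookielib suggestion looks like in practice, the snippet below builds an opener that keeps cookies across redirects, so a cookie set by one response is sent along with the requests that follow. It is written for Python 3, where cookielib is named http.cookiejar and urllib2 is urllib.request. The helper names build_cookie_opener and download_with_cookies are made up for illustration, and it assumes the site still behaves as described above.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # One of the User-Agent strings from the question's code.
    opener.addheaders = [
        ("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)")
    ]
    return opener, jar


def download_with_cookies(source_url, timeout=10):
    """Fetch source_url, following redirects while keeping cookies."""
    opener, _jar = build_cookie_opener()
    with opener.open(source_url, timeout=timeout) as resp:
        return resp.read()
```

Calling download_with_cookies('http://www.windowsphone.com/en-US/apps?list=free') would then take the place of the download() call in the question.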

If you want to use curl, allow it to create a cookie as follows:

 curl -so /dev/null 'http://www.windowsphone.com/en-US/apps?list=free' -c 'myCookieJar' 

Run "more myCookieJar" in your shell and you will see something like this:

 # Netscape HTTP Cookie File
 # http://www.netscape.com/newsref/std/cookie_spec.html
 # This file was generated by libcurl! Edit at your own risk.

 .www.windowsphone.com TRUE / FALSE 0 WPMSLSS SLSS=1
 login.live.com FALSE / FALSE 0 MSPRequ lt=1320738008&co=1&id=268289

Then run the following (note the -b option before "myCookieJar"):

 curl -so 'windowsphone.html' 'http://www.windowsphone.com/en-US/apps?list=free' -b 'myCookieJar' 

and windowsphone.html will contain the page as you see it in the browser.
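The same two-step idea (save cookies to a file, then send them back) can be sketched in Python 3 with http.cookiejar.MozillaCookieJar, which reads and writes a Netscape-style cookie file. This is only an illustration: the function names are invented, the file is assumed to be written by Python rather than shared with curl, and the file name myCookieJar is simply reused from the curl example.

```python
import http.cookiejar
import urllib.request


def open_cookie_file(path):
    """Return a MozillaCookieJar bound to path, loading it if it exists."""
    jar = http.cookiejar.MozillaCookieJar(path)
    try:
        # ignore_discard=True keeps session cookies, which would
        # otherwise be skipped when loading.
        jar.load(ignore_discard=True)
    except FileNotFoundError:
        pass  # first run: nothing has been saved yet
    return jar


def fetch_and_save_cookies(url, path, timeout=10):
    """Fetch url with the cookies stored in path, then write the jar back."""
    jar = open_cookie_file(path)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    with opener.open(url, timeout=timeout) as resp:
        body = resp.read()
    # Persist whatever cookies were collected, session cookies included.
    jar.save(ignore_discard=True)
    return body
```

A call such as fetch_and_save_cookies('http://www.windowsphone.com/en-US/apps?list=free', 'myCookieJar') would play the role of both curl invocations, since the opener collects and resends cookies while it follows the redirects.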


Flesk's answer really is the solution to this (+1).

Another, more direct way to debug HTTP connections is Netcat, which is basically a more powerful telnet utility.

So, let's say you want to debug what happens in your HTTP request:

 $ nc www.windowsphone.com 80
 GET /en-US/apps?list=free HTTP/1.0
 Host: www.windowsphone.com
 User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

This sends the request headers to the server. You need to press Enter twice, because a blank line marks the end of the headers.

After that, the server will respond:

 HTTP/1.1 302 Found
 Location: https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&checkda=1&ct=1320745265&rver=6.1.6195.0&wp=MBI&wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fapps%3Flist%3Dfree&lc=1033&id=268289
 Server: Microsoft-IIS/7.5
 Set-Cookie: WPMSLSS=SLSS=1; domain=www.windowsphone.com; path=/; HttpOnly
 X-Powered-By: ASP.NET
 X-Server: SN2CONXWWBA06
 Date: Tue, 08 Nov 2011 09:41:05 GMT
 Connection: close
 Content-Length: 337

 <html><head><title>Object moved</title></head><body> <h2>Object moved to <a href="https://login.live.com/login.srf?wa=wsignin1.0&amp;rpsnv=11&amp;checkda=1&amp;ct=1320745265&amp;rver=6.1.6195.0&amp;wp=MBI&amp;wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fapps%3Flist%3Dfree&amp;lc=1033&amp;id=268289">here</a>.</h2> </body></html>

So the server returns 302, the HTTP status code for a redirect, and thereby asks the "browser" to open the URL given in the Location header.
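For anyone who prefers to stay in Python, the Netcat session above can be approximated with a plain socket. The following Python 3 sketch builds the same request text, sends it, and splits the raw reply into status code, headers, and body. The function names are hypothetical, and the parser is deliberately naive: repeated headers such as multiple Set-Cookie lines overwrite each other.

```python
import socket


def build_request(host, path, user_agent):
    """Build the same HTTP/1.0 request that was typed into nc above."""
    lines = [
        "GET %s HTTP/1.0" % path,
        "Host: %s" % host,
        "User-Agent: %s" % user_agent,
        "",  # blank line ends the headers (the second Enter in nc)
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response_head(raw):
    """Split a raw response into (status_code, headers dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Naive: a repeated header name keeps only its last value.
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def fetch_raw(host, path, user_agent, port=80, timeout=10):
    """Send the request over a TCP socket and return all bytes received."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path, user_agent))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Passing the result of fetch_raw to parse_response_head gives the status code directly. When it is 302, headers["location"] holds the URL that a browser would open next.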

Netcat is a great tool for debugging and tracking all kinds of network communication and really helped me when I wanted to go deeper into the HTTP protocol.


Source: https://habr.com/ru/post/1380849/

