Web / Screen Scraping with Google App Engine - code works in python interpreter, but not in GAE

I want to do some web scraping with GAE. (Infinite Campus student information portal, fyi). This service requires you to log in to use the site. I had code that worked using mechanize in regular Python. When I found out that I could not use mechanize in Google App Engine, I ended up using urllib2 + ClientForm. I couldn't get it to log in to the server, so after several hours of fiddling with cookie handling, I ran the same code in a regular Python interpreter and it worked. I found the log file and saw a lot of messages about stripping out the Host header in my request ... I found the source file on Google Code, and the Host header was in an "untrusted" list and was removed from all requests made by user code.

Apparently, GAE removes the host header that IC requires to determine which school system you are logging into, so it appeared as if I could not log in.

How do I get around this problem? I can't specify anything else in my fake form submission to the target site. Why would this be a "security hole" in the first place?

1 answer

App Engine does not strip out the Host header: it forces it to be an accurate value based on the URI you are requesting. Assuming that URI is absolute, the server is not even allowed to consider the Host header anyway, per RFC 2616:

  • If Request-URI is an absoluteURI, the host is part of the Request-URI. Any Host header field value in the request MUST be ignored.

... Could you post the "dummy" code (with the login name, password, and so on masked out) that fails on GAE but works in "plain python"?
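
To make the RFC 2616 rule quoted above concrete, here is a minimal sketch of how a server following that rule would pick the host for a request. The function name `effective_host`, its parameters, and the example hostnames are hypothetical and only for illustration; this is not App Engine's or Infinite Campus's actual code.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2, as used with urllib2


def effective_host(request_uri, headers=None):
    """Pick the host for a request following RFC 2616 section 5.2.

    Hypothetical helper for illustration only.
    """
    headers = headers or {}
    parts = urlsplit(request_uri)
    if parts.scheme and parts.netloc:
        # Absolute URI: the host is part of the Request-URI,
        # so any Host header value is ignored.
        return parts.netloc
    # Relative URI: fall back to the Host header (names are case-insensitive).
    for name, value in headers.items():
        if name.lower() == "host":
            return value
    return None


# Made-up hostnames: the Host header loses to the absolute URI.
print(effective_host("https://portal.example.com/login",
                     {"Host": "district.example.org"}))
# With a relative URI, the Host header is used.
print(effective_host("/login", {"Host": "district.example.org"}))
```

If the target server behaves this way, the school system would have to be selected through the hostname in the URL being fetched, since a hand-set Host header that disagrees with the URL should have no effect.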


Source: https://habr.com/ru/post/1716852/

