Setting up a Python screen scraper that can run on Google App Engine

I want to set up an automated screen scraper, written in Python, that will run in a Google App Engine application. I want it to scrape a site and store the specified results as entities in the application's datastore. I am looking for guidance on what to use. I have seen BeautifulSoup, but I wonder if people can recommend anything else that works on Google App Engine.

+3
4 answers

BeautifulSoup works great on App Engine (just make sure you use 3.0.8, not the iffy 3.1.0). The main alternative, I think, would be html5lib. I have not tried it on App Engine myself, but I believe it works there, though rather slowly; if that is a problem, I think you should stick with BeautifulSoup. For example, this service runs on App Engine and is based on html5lib.

+4

I have had good (albeit slow) results using mechanize and BeautifulSoup. In fact, to save space on Google App Engine, I use the (old) version of BeautifulSoup that is bundled with mechanize.

I keep mechanize in a zip file, mechanize.zip. The listing of this zip file looks like this:

mechanize/
mechanize/__init__.py
mechanize/_auth.py
mechanize/_beautifulsoup.py
mechanize/_clientcookie.py
... etc

Then, in my Python code:

import sys
sys.path.insert(0, 'mechanize.zip')

import mechanize
from mechanize._beautifulsoup import BeautifulSoup
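
The zip trick above relies on Python's ability to import pure-Python modules straight from an archive placed on sys.path. Here is a minimal, self-contained sketch of that mechanism. The package name demo_pkg, the module _helper and the function greet are hypothetical stand-ins for mechanize and its bundled modules, and the archive is built on the fly only so that the example runs anywhere.

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip archive containing a hypothetical package, "demo_pkg",
# laid out the same way as the mechanize.zip listing (package dir at the root).
zip_path = os.path.join(tempfile.mkdtemp(), "demo_pkg.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("demo_pkg/__init__.py", "")
    archive.writestr(
        "demo_pkg/_helper.py",
        "def greet():\n    return 'hello from zip'\n",
    )

# Putting the archive itself on sys.path lets the normal import machinery
# (zipimport) load pure-Python modules from inside it.
sys.path.insert(0, zip_path)

from demo_pkg._helper import greet

message = greet()
print(message)
```

Note that this only works for pure-Python code; compiled extension modules cannot be loaded from a zip archive this way.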
+1

lxml is not an option: it relies on C extensions, which GAE does not allow.

0

I used BeautifulSoup with great success for parsing HTML. The catch is that parsing HTML is all BeautifulSoup does. I ended up writing all the HTTP interactions myself using urlfetch.
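
The split described above, with one piece that fetches pages and another that only parses them, might be sketched roughly as follows. This is an illustration under stated assumptions, not the code I actually run: it uses the standard library's html.parser (modern Python) as a dependency-free stand-in for BeautifulSoup, and the names LinkCollector, scrape_links and fake_fetch are hypothetical. On App Engine the fetch argument would be a small wrapper around urlfetch; here a fake is injected so the example runs without a network.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags; parsing only, no network access."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scrape_links(url, fetch):
    """Fetch `url` with the supplied `fetch` callable, then parse the HTML.

    `fetch` is a hypothetical hook: it takes a URL and returns the page text.
    Injecting it keeps the parsing step independent of how HTTP is done.
    """
    html = fetch(url)
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fake_fetch(url):
    # Stand-in for a real HTTP call, so the sketch runs without a network.
    return '<html><body><a href="/a">A</a> <a href="/b">B</a></body></html>'


links = scrape_links("http://example.com/", fake_fetch)
print(links)  # ['/a', '/b']
```

Keeping the fetcher as a parameter also makes the parsing logic easy to test against saved pages.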

To scrape my target, I need a full-fledged headless browser that can execute the JavaScript on the target site's pages. I think I need to drop the Python application and move to Java so that I can use HtmlUnit; prototyping continues. - mattb

0

Source: https://habr.com/ru/post/1736039/

