I would choose Python because of its excellent libxml2-based libraries, in particular lxml.html and pyquery. Scrapy has its own libxml2 bindings; I have not tried them, and skimming the Scrapy documentation did not leave me very impressed (I have written a lot of scrapers using just these parsers and manual coding). With any of them you get a truly excellent HTML parser that you can query with XPath, and with lxml.html and pyquery (which is also built on lxml) you get CSS selectors as well.
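To make the XPath and CSS selector point concrete, here is a minimal sketch with lxml.html. The markup in `SAMPLE`, the `topic` class and the function names are all made up for illustration. Note that `cssselect()` needs the separate `cssselect` package installed alongside lxml.

```python
from lxml import html

# Hypothetical forum markup, invented for this example.
SAMPLE = """
<html><body>
  <ul id="topics">
    <li><a class="topic" href="/t/1">First thread</a></li>
    <li><a class="topic" href="/t/2">Second thread</a></li>
  </ul>
</body></html>
"""


def topics_xpath(page):
    """Return (href, title) pairs for thread links, selected with XPath."""
    doc = html.fromstring(page)
    links = doc.xpath('//a[@class="topic"]')
    return [(a.get("href"), a.text_content()) for a in links]


def topics_css(page):
    """Same query written as a CSS selector (requires the cssselect package)."""
    doc = html.fromstring(page)
    links = doc.cssselect("a.topic")
    return [(a.get("href"), a.text_content()) for a in links]
```

Both functions should give the same list for this sample; which style you use is mostly a matter of taste.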
If you are doing something small, such as scraping a single forum, I would skip the framework and just do it manually: parallelization and the like are not really required.
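As a rough idea of what "doing it manually" can look like, here is a sketch of a plain sequential loop with no framework. It assumes thread links carry `class="topic"` and the pager link carries `rel="next"`; both are hypothetical, as are the function names and the `max_pages` limit, so adjust the XPath expressions to the real forum.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from lxml import html


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as resp:
        return resp.read()


def scrape_forum(start_url, fetch=fetch, max_pages=50):
    """Walk the listing pages one at a time and collect thread titles."""
    titles, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        doc = html.fromstring(fetch(url))
        # Assumed markup: thread links have class="topic".
        titles += [a.text_content().strip()
                   for a in doc.xpath('//a[@class="topic"]')]
        # Assumed markup: the pager marks its link with rel="next".
        nxt = doc.xpath('//a[@rel="next"]/@href')
        url = urljoin(url, nxt[0]) if nxt else None
    return titles
```

The `fetch` argument is only there so the loop can be tried against canned pages before pointing it at a live site.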