How can I get data from other sites?

Question

I want to create a website that extracts information from other websites and displays it on my own site. I am at the research stage, so I would like to hear some opinions: what is the best solution for this project?

I have heard that Python can do this with a parser. Which approach should I take, and which language should I use?

+4

python database parsing web-scraping

user2484278 Jun 14 '13 at 0:30

5 answers

Tim bender · Answer 1 · 2013-06-14T00:49:36+0000

Python with BeautifulSoup and urllib2 will probably serve you well. That said, it is questionable whether you should be scraping data from other websites at all, and you may find yourself in a constant struggle whenever those websites change their layouts.
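As a rough illustration of the fetch-and-parse approach, here is a minimal sketch. It assumes Python 3, where urllib2 became urllib.request, and it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages. The names HeadingCollector, fetch_html and extract_headings, and the choice of h2 as the target tag, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadingCollector(HTMLParser):
    """Collect the text of every <h2> element (a hypothetical target tag)."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_heading = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is still part of the heading.
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_heading:
            self.headings.append("".join(self._buffer).strip())
            self._in_heading = False


def fetch_html(url):
    """Download a page as text (assumes UTF-8 when no charset is declared)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_headings(html):
    """Return the text of all <h2> elements found in an HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    return parser.headings


# Offline example; a real run would pass fetch_html(some_url) instead.
sample = "<html><body><h2>First story</h2><p>text</p><h2>Second <b>story</b></h2></body></html>"
print(extract_headings(sample))  # ['First story', 'Second story']
```

The parser is tied to the page's current markup: if the site stops using h2 for these items, extract_headings quietly returns an empty list, which is the layout-change struggle in practice.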

Vortico · Answer 2 · 2013-06-14T00:53:31+0000

Requests is for this kind of thing.

Before parsing HTML, check whether the website offers an API. If it does, you're in business!
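If the site does offer a JSON API, consuming it might look like the sketch below. To stay dependency-free it uses the standard library's urllib.request instead of Requests (with Requests, the fetch step would be roughly `requests.get(url).json()`). The endpoint URL, the payload shape with an "articles" list, and the helper names are all made up for illustration.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Request a JSON document from an API endpoint and decode it."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_articles(payload):
    """Pull (title, url) pairs out of a hypothetical {"articles": [...]} payload."""
    return [(item["title"], item["url"]) for item in payload.get("articles", [])]


# Offline example with a made-up payload; a real call would look like
# summarize_articles(fetch_json("https://api.example.com/articles")),
# where the URL is a placeholder, not a real service.
sample = json.loads('{"articles": [{"title": "Hello", "url": "https://example.com/hello"}]}')
print(summarize_articles(sample))  # [('Hello', 'https://example.com/hello')]
```

Because the API returns structured data, there is no markup to parse and nothing that breaks when the site's visual layout changes.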

Ardy dedase · Answer 3 · 2013-06-14T01:04:00+0000

Python has great web scraping capabilities: urllib, BeautifulSoup, XPath, etc. This video will help you get started quickly with Python web scraping: http://www.youtube.com/watch?v=Ap_DlSrT-iE. Its example script uses urllib and BeautifulSoup to scrape the Huffington Post.

If you need a full scraping system (a scraper with a web interface and an admin for publishing your scraped content), this might be a good option for you: https://github.com/holgerd77/django-dynamic-scraper. I would especially suggest it if you are already familiar with Django.

Brent washburne · Answer 4 · 2013-06-14T00:55:36+0000

I prefer using urllib2 to request pages by URL and then using regular expressions to extract the data. This works well if the data comes in small chunks. The code reads well enough: if the line matches /regex/, save the value.
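The "if the line matches, save the value" idea can be sketched as follows. The pattern, the span markup it targets, and the extract_prices helper are hypothetical; the page text is given inline here, whereas in practice it would come from a download step.

```python
import re

# Hypothetical pattern: a price such as "$19.99" inside a span with class "price".
PRICE_PATTERN = re.compile(r'<span class="price">\$(\d+\.\d{2})</span>')


def extract_prices(html):
    """Scan the page line by line and save the captured value from each matching line."""
    prices = []
    for line in html.splitlines():
        match = PRICE_PATTERN.search(line)  # first match on the line only
        if match:
            prices.append(float(match.group(1)))
    return prices


sample = """<div>
<span class="price">$19.99</span>
<span class="name">Widget</span>
<span class="price">$5.00</span>
</div>"""
print(extract_prices(sample))  # [19.99, 5.0]
```

This sketch keeps only the first match per line, so it suits pages where each value sits on its own line; for several values per line, `PRICE_PATTERN.findall(line)` would be the natural substitute.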

albert · Answer 5 · 2013-06-14T02:17:43+0000

You can write a few web spiders to collect data from another site. Using urllib2 or requests, you can download the HTML from the site. BeautifulSoup or PyQuery can help you parse the HTML and get the data you need.
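A very small spider along these lines might look like the sketch below. It assumes Python 3 and uses only the standard library (urllib.request in place of urllib2, html.parser in place of BeautifulSoup or PyQuery). The names LinkCollector, download and crawl, the max_pages limit, and the example.test URLs are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def download(url):
    """Default fetcher: download a page as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=download, max_pages=10):
    """Breadth-first crawl restricted to the start URL's host; returns {url: html}."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue  # already visited
        pages[url] = fetch(url)
        collector = LinkCollector()
        collector.feed(pages[url])
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in pages:
                queue.append(absolute)
    return pages


# Offline example: a fake two-page site served from a dict instead of the network.
site = {
    "http://example.test/": '<a href="/a">A</a> <a href="http://other.test/x">X</a>',
    "http://example.test/a": '<a href="/">home</a>',
}
pages = crawl("http://example.test/", fetch=site.__getitem__)
print(sorted(pages))  # ['http://example.test/', 'http://example.test/a']
```

Passing the fetcher as a parameter keeps the crawl logic testable without network access; the collected HTML in `pages` can then be handed to whatever parser extracts the data you need.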
