How can I get data from other sites?

I want to create a website that extracts information from other websites and displays it on my own site. I am at the research stage, so I would like to hear some opinions on the best approach for this project.

I have heard that Python with a parser can do this. Which approach should I use, and which language?

+4
5 answers

Python with BeautifulSoup and urllib2 will probably serve you well. Of course, it is questionable whether you should be scraping data from other websites, and you may find yourself in a constant struggle if those websites change their layouts.

+4

The Requests library is made for this kind of thing.

Before parsing HTML, check whether the website offers an API. If so, you are already in business!
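To illustrate why an API is easier to work with than HTML, here is a minimal sketch that pulls article titles out of a JSON response. The response shape (an `articles` list whose items carry a `title`) and the function name `titles_from_api_response` are assumptions made up for this example; a real API will document its own structure.

```python
import json


def titles_from_api_response(body):
    """Return the titles found in a JSON response body.

    Assumes a hypothetical payload of the form
    {"articles": [{"title": "..."}, ...]}.
    """
    data = json.loads(body)
    # Skip entries that lack a title instead of raising KeyError.
    return [item["title"] for item in data.get("articles", []) if "title" in item]
```

The `body` string would typically come from an HTTP GET against the API endpoint, for example via Requests. No HTML parsing is involved, so a change to the page layout does not break the code.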

+2

Python has great web scraping capabilities: urllib, BeautifulSoup, XPath, etc. This video will help you get started quickly with Python web scraping: http://www.youtube.com/watch?v=Ap_DlSrT-iE - It uses urllib and BeautifulSoup to scrape the Huffington Post in its example script.

If you need a complete scraper system (a scraper with a web interface and an admin for publishing your scraped content), this might be a good option for you - https://github.com/holgerd77/django-dynamic-scraper - I would especially recommend it if you are already familiar with Django.

+2

I prefer using urllib2 to request pages by URL and then using regular expressions to retrieve the data. This works well if the data comes in small chunks. The code reads well enough: if the line matches /regex/, save the value.
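The "if the line matches the regex, save the value" idea could look roughly like the sketch below. The price pattern and the name `extract_prices` are hypothetical, chosen only for illustration; the page text is assumed to have been downloaded already.

```python
import re

# Hypothetical pattern: a dollar amount such as "$19.99" or "$5".
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(text):
    """Return the first price found on each line of text, as floats."""
    values = []
    for line in text.splitlines():
        match = PRICE_RE.search(line)
        if match:
            # group(1) is the number without the leading dollar sign.
            values.append(float(match.group(1)))
    return values
```

This approach stays readable for small, regular fragments, but it does not understand HTML nesting, so an HTML parser is usually the safer choice for anything larger.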

+1

You can write a few web spiders to collect data from other sites. Using urllib2 or requests, you can download the HTML from a site. BeautifulSoup or PyQuery can then help you parse the HTML and get the data you need.
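The download-then-parse flow described above can be sketched with the standard library alone. Note the substitutions: `urllib.request` is the Python 3 successor to urllib2, and `html.parser` stands in here for BeautifulSoup or PyQuery so that the example has no third-party dependencies. The names `LinkCollector`, `fetch_html`, `extract_links` and the User-Agent string are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (requires network access)."""
    request = Request(url, headers={"User-Agent": "example-spider/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Parse an HTML string and return the links it contains."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

A spider would then call `extract_links(fetch_html(url))` for each page it visits. Keeping the download and the parsing in separate functions makes the parsing step easy to test against saved HTML.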

0

Source: https://habr.com/ru/post/1486139/

