How to Bypass Failed Scrapy Responses (Status Codes 416, 999, ...)

I am writing a script using Scrapy, but I am having problems with failed HTTP responses. In particular, I'm trying to scrape https://www.crunchbase.com/, but I keep getting HTTP status code 416. Can sites block spiders from scraping their contents?

+6
3 answers

What happens is that the website looks at the headers attached to your request, decides that you are not a browser, and therefore blocks your request.

However, there is nothing a website can do to distinguish Scrapy from Firefox / Chrome / IE / Safari if you send the same headers as the browser. In Chrome, open the Network tab of the developer tools and you'll see exactly which headers it sends. Copy those headers into your Scrapy request and everything will work.

You might want to start by sending the same User-Agent header as your browser.

How to send these headers with your Scrapy request is described here.
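A minimal sketch of what that looks like in a Scrapy project's settings.py. The header values below are copied from a Chrome session and are purely illustrative, not required values:

```python
# settings.py -- browser-like defaults so every request looks like Chrome.
# USER_AGENT and DEFAULT_REQUEST_HEADERS are standard Scrapy settings;
# the specific values here are illustrative.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/58.0.3029.96 Safari/537.36')

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}
```

Headers passed explicitly to an individual request override these project-wide defaults.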

+6

You are correct: http://crunchbase.com blocks bots. It still serves an HTML page, "Pardon our Interruption", which explains why they think you are a bot and provides a request form for unblocking (albeit with a 416 status code).

According to Distil Networks' vice president of marketing, Crunchbase uses the Distil Networks grid:

https://www.quora.com/How-does-distil-networks-bot-and-scraper-detection-work

After several attempts, even my browser access was blocked. I submitted an unblock request and was re-enabled. I am not sure about other Distil-protected sites, but you might try asking Crunchbase's management.

+1

You will need to have "br" and "sdch" as accepted encodings if you use Chrome as the user agent.

Here is an example:

    import requests

    html_headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br, sdch',
        'Connection': 'keep-alive',
        'Host': 'www.crunchbase.com',
        'Referer': 'https://www.crunchbase.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
    }
    res = requests.get('https://www.crunchbase.com/', headers=html_headers)

As someone else said earlier: in Chrome, open the developer console (three dots in the upper right corner → More tools → Developer tools, or press Ctrl+Shift+I), go to the "Network" tab, reload the page, click the red dot to stop recording, click on a request, and on the right you will see the "Request Headers" section.
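If you copy the raw "Name: value" lines from that panel, a small helper can turn them into the dict the examples above expect. A stdlib-only sketch; `parse_raw_headers` is a hypothetical helper name and the header values are illustrative:

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn a 'Name: value' block copied from DevTools into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        name, _, value = line.partition(":")
        if name and value:
            headers[name.strip()] = value.strip()
    return headers


raw = """\
Accept: */*
Accept-Encoding: gzip, deflate, br
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
"""
headers = parse_raw_headers(raw)
# headers["Accept-Encoding"] is "gzip, deflate, br"
```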

EDIT: If you want to use a real web engine like WebKit, you probably won't need any of these tricks at all. Here is an example:

    import sys

    import bs4 as bs
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebKitWidgets import QWebPage

    class Client(QWebPage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self.on_page_load)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()

        def on_page_load(self):
            self.app.quit()

    url = 'https://www.crunchbase.com/'
    cont = Client(url).mainFrame().toHtml()
    soup = bs.BeautifulSoup(cont, 'lxml')

Another advantage of this approach is that it executes JavaScript, so it handles dynamically loaded content. For instance, if JavaScript run on page load replaces any text on the page, with this approach you can get the new text.

+1

Source: https://habr.com/ru/post/986021/
