If you send a Chrome User-Agent, you also need to list "br" and "sdch" in the Accept-Encoding header.
Here is an example:
    import requests

    # Request headers as Chrome sends them, including the 'br' and 'sdch' encodings
    html_headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br, sdch',
        'Connection': 'keep-alive',
        'Host': 'www.crunchbase.com',
        'Referer': 'https://www.crunchbase.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36',
    }

    res = requests.get('https://www.crunchbase.com/', headers=html_headers)
As someone else said earlier, you can see the headers your browser sends in Chrome: open the developer tools (three dots in the upper right corner → More tools → Developer tools, or press Ctrl + Shift + I), go to the "Network" tab, reload the page, click the red dot to stop recording, click on the request in the list, and on the right you will see the "Request Headers" section.
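If you copy the request headers out of the developer tools as plain text, something like the sketch below can turn them into a dict for `requests`. The helper name `parse_raw_headers` is made up for this illustration (it is not part of `requests`), and it assumes one "Name: value" pair per line, which is how Chrome shows them as far as I know.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'
        if not line or line.startswith(':'):
            continue
        # Split on the first colon only, so URLs in the value stay intact
        name, sep, value = line.partition(':')
        if not sep:
            continue
        headers[name.strip()] = value.strip()
    return headers


# Sample text as it might be copied from the "Request Headers" section
raw = """
Accept: */*
Accept-Encoding: gzip, deflate, br, sdch
Referer: https://www.crunchbase.com/
"""

html_headers = parse_raw_headers(raw)
print(html_headers)
```

The resulting dict can be passed as `headers=html_headers` to `requests.get`, same as the hand-written one.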
EDIT: If you use a real web engine such as WebKit, you probably won't need any of these tricks. For example:
    import sys

    import bs4 as bs
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebKitWidgets import QWebPage
    from PyQt5.QtWidgets import QApplication


    class Client(QWebPage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self.on_page_load)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()  # blocks until on_page_load quits the event loop

        def on_page_load(self):
            self.app.quit()


    url = 'https://www.crunchbase.com/'  # example URL, taken from the snippet above
    cont = Client(url).mainFrame().toHtml()
    soup = bs.BeautifulSoup(cont, 'lxml')
Another advantage of this approach is that it executes JavaScript, so it also works with dynamically loaded content. For instance, if JavaScript that runs when the page loads replaces some text on the page, this approach gives you the new text.