I wrote a script in Python to scrape some information from a web page. The code works flawlessly when run synchronously. However, I want it to run asynchronously so that it completes the task as quickly as possible without blocking. Since I have never worked with the asyncio library, I am seriously confused about how to do this. I tried wrapping my script in asyncio coroutines, but I seem to have done it wrong. If someone could lend a helping hand, I would be really grateful. Thanks in advance. Here is my failing code:
import requests
from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)

    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()
After execution, I see on the console:
RuntimeWarning: coroutine 'processing_docs' was never awaited
  processing_docs(base_link + titles.attrib['href'])
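The warning points at the core issue: calling an `async def` function only creates a coroutine object; it never runs unless you `await` it (or schedule it as a task). Below is a minimal, network-free sketch of the missing pattern, with hypothetical function names standing in for `quotes_scraper` and `processing_docs`:

```python
import asyncio

async def process_page(url):
    # Stand-in for the real scraping work done per page.
    return f"processed {url}"

async def scrape(base):
    results = []
    # Stand-ins for the hrefs the real script extracts from the page.
    for path in ("/tag/love/", "/tag/books/"):
        # The await here is exactly what the original calls were missing:
        # without it, process_page(...) just creates an un-run coroutine.
        results.append(await process_page(base + path))
    return results

results = asyncio.run(scrape("http://quotes.toscrape.com"))
print(results)
```

Note that even with the `await`s in place, `requests.get` is a blocking call, so the event loop gains nothing while it runs; the usual options are to wrap it in `loop.run_in_executor(...)` or to switch to an async HTTP client such as aiohttp.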