I have a Python 2.7 program that fetches data from websites and uploads the results to a database. It follows the producer-consumer model and is written using the threading module.
Just for fun, I would like to rewrite this program using the new asyncio module (from 3.4), but I cannot figure out how to do this correctly.
The most important requirement is that the program must fetch data from the same website in sequential order. For example, for the URL 'http://a-restaurant.com' it should first fetch 'http://a-restaurant.com/menu/0', then 'http://a-restaurant.com/menu/1', then 'http://a-restaurant.com/menu/2', and so on. If they are not fetched in order, the website stops delivering pages altogether and you have to start again from 0.
However, the fetches for another website ('http://another-restaurant.com') can (and should) run at the same time (the other sites have the same sequential restriction).
The threading module suits this well: I can create a separate thread for each website, and each thread can wait until one page has finished loading before fetching the next one.
Here is a heavily simplified code snippet from the threading version (Python 2.7):
import threading
import urllib2
import Queue

class FetchThread(threading.Thread):
    def __init__(self, queue, url):
        threading.Thread.__init__(self)
        self.queue = queue
        self.baseurl = url
        ...
    def run(self):
        # Pages of one site are fetched strictly one after another:
        # each urlopen() call blocks until the page has finished loading.
        for food in range(10):
            url = self.baseurl + '/' + str(food)
            text = urllib2.urlopen(url).read()
            self.queue.put(text)
        ...

def main():
    queue = Queue.Queue()
    urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
    for url in urls:
        fetcher = FetchThread(queue, url)
        fetcher.start()
    ...
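The consumer half of the producer-consumer pair is hidden behind the '...' and is not really part of the question; roughly, it just drains the queue and writes to the database. A minimal sketch, with the database call stubbed out because the real code does not matter here:

def save_to_database(text):
    pass  # stand-in: the real program uploads the page to a database

def consume(queue):
    while True:
        text = queue.get()  # blocks until one of the FetchThreads produces a page
        save_to_database(text)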
And here is how I tried to do it using asyncio (in 3.4.1):
import asyncio
import aiohttp

@asyncio.coroutine
def fetch(url):
    response = yield from aiohttp.request('GET', url)
    body = yield from response.read_and_close()
    return body.decode('utf-8')

@asyncio.coroutine
def print_page(url):
    page = yield from fetch(url)
    print(page)

l = []
urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
for url in urls:
    # This schedules all pages of all restaurants at once.
    for food in range(10):
        menu_url = url + '/' + str(food)
        l.append(print_page(menu_url))

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(l))
And it fetches and prints everything in some arbitrary order. Well, I guess that is the whole idea of these coroutines. Should I not use aiohttp and just fetch with urllib instead? But would the fetches for the first restaurant then block the fetches for the other restaurants? Am I just thinking about this completely wrong? (This is just a test to try fetching things in sequential order; I haven't gotten to the queue part yet.)
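For what it's worth, my current guess is that I need one coroutine per restaurant that waits for each page before requesting the next, and that only those per-restaurant coroutines should run concurrently. Something like this sketch, reusing fetch() from above (I have no idea whether this is the idiomatic way):

@asyncio.coroutine
def fetch_restaurant(baseurl):
    # Within one restaurant, menu/0 must complete before menu/1 is requested.
    for food in range(10):
        page = yield from fetch(baseurl + '/' + str(food))
        print(page)

urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
# The restaurants themselves run concurrently as separate tasks.
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([fetch_restaurant(url) for url in urls]))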