What is the best way to continuously export information from a Scrapy crawler to a Django application database?

I am trying to create a Django application that functions as storage. Items are scraped from all over the Internet and continuously update the Django project's database (say, every few days). I am using the Scrapy framework to perform the scraping, and although there is an experimental DjangoItem feature (here), my current plan is to export the scraped items to files with Scrapy's exporters and load them into the Django project with loaddata as XML fixtures (docs here). Everything seems reasonable, because if either of the two processes fails, there is an intermediate file between them. Keeping the application modular does not seem like a bad idea either.
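To make the intermediate-file idea concrete, here is a minimal sketch of a Scrapy item pipeline that writes the scraped items out as a Django fixture. It assumes a hypothetical Django app called storage with a model ScrapedItem whose fields match the item's keys, and it uses Django's JSON fixture format rather than XML purely for brevity:

```python
# pipelines.py -- collect scraped items and dump them as a Django fixture.
# "storage.scrapeditem" and the output filename are assumptions for this sketch.
import json


class DjangoFixturePipeline:
    def open_spider(self, spider):
        self.rows = []

    def process_item(self, item, spider):
        # Leaving out "pk" lets loaddata insert new rows and have the
        # database assign primary keys on the Django side.
        self.rows.append({"model": "storage.scrapeditem", "fields": dict(item)})
        return item

    def close_spider(self, spider):
        with open("scraped_fixture.json", "w") as f:
            json.dump(self.rows, f)
```

The file can then be loaded on the Django side with `python manage.py loaddata scraped_fixture.json`, and the pipeline is enabled through the ITEM_PIPELINES setting of the Scrapy project.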

Some problems:

  • These files could become too large to read into memory for Django's loaddata .
  • I am spending too much time on this when there may be a better or simpler solution, such as exporting directly to the database, which in this case is MySQL.
  • No one seems to have written about this process on the Internet, which is strange given that Scrapy seems like an excellent framework to plug into a Django application, in my opinion.
  • There is no definitive guide to manually creating Django fixtures in the Django docs - it seems to be more focused on dumping and reloading fixtures from the application itself.

The existence of the experimental DjangoItem suggests that Scrapy + Django is a popular enough choice for there to be a good solution.

I would really appreciate any solutions, advice or wisdom on this matter.

+6
3 answers
+2

This question is a bit outdated already, but I am currently doing a proper integration of Django + Scrapy. My workflow is as follows: I installed Scrapy as a Django management command, as described in this answer . Then I simply write a Scrapy item pipeline that stores the scraped item in the Django database using Django's QuerySet methods. That's all. I am currently using SQLite for the database and it works like a charm. Perhaps this is still useful for someone.
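As a rough illustration of such a pipeline (not the answerer's actual code), assuming Scrapy is launched from the Django management command so the settings are already configured, and a hypothetical model storage.ScrapedItem with url and title fields:

```python
# pipelines.py -- store each scraped item through the Django ORM.
# The app name, model name and the "url"/"title" fields are assumptions.
from storage.models import ScrapedItem


class DjangoWriterPipeline:
    def process_item(self, item, spider):
        # update_or_create keeps repeated crawls from piling up duplicates;
        # using the URL as the lookup key is an assumption about the item schema.
        ScrapedItem.objects.update_or_create(
            url=item["url"],
            defaults={"title": item.get("title", "")},
        )
        return item
```

The pipeline is activated by adding it to ITEM_PIPELINES in the Scrapy settings; because the spider runs inside a management command, DJANGO_SETTINGS_MODULE is already set and the ORM is ready to use.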

+1

You can use django-dynamic-scraper to create and manage Scrapy scrapers with easy access to Django models. So far, I have not run into any problem it cannot handle that plain Scrapy could not handle either.

Django-dynamic-scraper documentation

+1

Source: https://habr.com/ru/post/893907/
