SQLAlchemy: dynamically create a table from a Scrapy item

I work with SQLAlchemy 1.1 and Scrapy. I am currently using a pipeline to store the extracted data in a SQLite table through SQLAlchemy. I would like to dynamically create a table to hold the item being scraped.

My static pipeline element looks like this:

from sqlalchemy import create_engine, Column, Integer, MetaData, Table, Text
from sqlalchemy.exc import IntegrityError


class SQLlitePipeline(object):

    def __init__(self):
        # "settings" is the Scrapy project's settings module; SETTINGS_PATH is a custom setting.
        db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
        _engine = create_engine(db_path)
        _connection = _engine.connect()
        _metadata = MetaData()
        # The table name and columns are hard-coded here.
        _stack_items = Table("stack_items", _metadata,
                             Column("id", Integer, primary_key=True),
                             Column("value", Text),
                             Column("value2", Text))
        _metadata.create_all(_engine)
        self.connection = _connection
        self.stack_items = _stack_items

    def process_item(self, item, spider):
        try:
            ins_query = self.stack_items.insert().values(
                value=item['value'],
                value2=item['value2'])
            self.connection.execute(ins_query)
        except IntegrityError:
            print('THIS IS A DUP')
        return item

items.py:

 import scrapy


 class Filtered_Item(scrapy.Item):
     value = scrapy.Field()
     value2 = scrapy.Field()

How can I change the pipeline above to dynamically create the table and insert the filtered item's values, instead of having them hard-coded as they are now?

3 answers

Actually there is a package that can help you with this.

Check out: dataset: databases for lazy people

Here is an excerpt from the page:

Features

Automatic schema:

If a table or column is written that does not exist in the database, it will be created automatically.

Upserts:

Records are either created or updated, depending on whether an existing version can be found.

Query helpers:

Simple queries, such as all rows in a table or all distinct values across a set of columns, have dedicated helpers.

Compatibility:

Being built on top of SQLAlchemy, dataset works with all major databases such as SQLite, PostgreSQL, and MySQL.

Scripted exports:

Data can be exported based on scripted configurations, making the process easy and replicable.
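
To make the excerpt concrete, here is a minimal sketch of the auto-schema, upsert, and query helpers in action (the database path, table name, and field names are placeholders, not taken from the question):

 import dataset

 # Connecting creates the SQLite file if it does not exist yet.
 db = dataset.connect("sqlite:///data.db")

 # The table and its columns are created automatically on first write.
 table = db["stack_items"]
 table.insert({"value": "foo", "value2": "bar"})

 # Upsert: update the row whose "value" matches, otherwise insert a new one.
 table.upsert({"value": "foo", "value2": "baz"}, ["value"])

 # Query helpers: all rows, or distinct values of a column.
 for row in table.all():
     print(row["value"], row["value2"])
 print(list(table.distinct("value")))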


This is not a direct answer to the question, but an alternative approach to solving the problem.

How can I change the pipeline above to dynamically create the table and insert the filtered item's values, instead of having them hard-coded as they are now?

What I hear is that you do not want a predefined table schema and want the database to adjust to the fields you are scraping. Well, that sounds like you need a schemaless database.

Consider migrating to MongoDB or another schemaless NoSQL database. The Scrapy documentation even provides an example of a Python + MongoDB pipeline that inserts a scraped item into a MongoDB collection (a "table" in SQL terms) as a JSON document:

 def process_item(self, item, spider):
     self.db[self.collection_name].insert(dict(item))
     return item

And, importantly, no matter what the item fields are, there is no predefined structure imposed on your collection documents.
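
For context, the full pipeline in the Scrapy documentation is roughly shaped like this (collection and setting names follow that example; treat it as a sketch to adapt, not a drop-in):

 import pymongo


 class MongoPipeline(object):

     collection_name = 'scrapy_items'

     def __init__(self, mongo_uri, mongo_db):
         self.mongo_uri = mongo_uri
         self.mongo_db = mongo_db

     @classmethod
     def from_crawler(cls, crawler):
         # MONGO_URI and MONGO_DATABASE come from the project settings.
         return cls(
             mongo_uri=crawler.settings.get('MONGO_URI'),
             mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
         )

     def open_spider(self, spider):
         self.client = pymongo.MongoClient(self.mongo_uri)
         self.db = self.client[self.mongo_db]

     def close_spider(self, spider):
         self.client.close()

     def process_item(self, item, spider):
         # Whatever fields the item happens to have are stored as-is.
         self.db[self.collection_name].insert(dict(item))
         return item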

This is just a thought - I know little about your project requirements and possible limitations.


Here's what I came up with based on Alex's dataset recommendation above:

 import traceback

 import dataset
 from sqlalchemy.exc import IntegrityError


 class DynamicSQLlitePipeline(object):

     @classmethod
     def from_crawler(cls, crawler):
         # Here you get whatever value was passed through the "target" parameter.
         table_name = getattr(crawler.spider, "target")
         return cls(table_name)

     def __init__(self, table_name):
         try:
             # "settings" is the Scrapy project's settings module, as in the original pipeline.
             db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
             db = dataset.connect(db_path)
             self.my_table = db[table_name]
         except Exception:
             traceback.print_exc()

     def process_item(self, item, spider):
         try:
             self.my_table.insert(dict(item))
         except IntegrityError:
             print('THIS IS A DUP')
         return item
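
For reference, the "target" attribute read in from_crawler is just a spider argument, so the table name can be chosen per run. A minimal sketch (spider name and table name are placeholders):

 # On the command line:
 #     scrapy crawl my_spider -a target=my_table
 #
 # Or programmatically; keyword arguments to crawl() become spider attributes,
 # so crawler.spider.target is available when the pipeline is built.
 from scrapy.crawler import CrawlerProcess
 from scrapy.utils.project import get_project_settings

 process = CrawlerProcess(get_project_settings())
 process.crawl("my_spider", target="my_table")
 process.start()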

Hope this helps.

