Multiprocessing bulk insert using peewee

I am working on a simple HTML scraper in Python 3.4, using peewee as the ORM (great ORM, by the way!). My script takes a bunch of sites, extracts the necessary data, and saves it to the database; each site is scraped in a separate process to improve performance, and the stored data must be unique. Data can be duplicated not only across sites but also within a single site, so I want to store each item only once.

Example: Message and Category have a many-to-many relationship. During scraping, the same category appears several times in different messages. The first time, I want to save that category to the database (create a new row). If the same category appears in other messages, I want to associate those messages with the row already created in the db.
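A minimal sketch of the models this describes; only Message and Category come from the question, so the field names and connection details are illustrative assumptions:

```python
from peewee import (MySQLDatabase, Model, CharField, TextField,
                    ForeignKeyField)

db = MySQLDatabase('scraper')  # hypothetical db name; connection params omitted

class BaseModel(Model):
    class Meta:
        database = db

class Category(BaseModel):
    # unique=True lets MySQL itself reject duplicate categories,
    # no matter which process tries to insert them
    name = CharField(unique=True)

class Message(BaseModel):
    body = TextField()

class MessageCategory(BaseModel):
    message = ForeignKeyField(Message)
    category = ForeignKeyField(Category)

    class Meta:
        # unique multi-column index: each (message, category) pair only once
        indexes = ((('message', 'category'), True),)
```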

My question is: do I need to use atomic updates/inserts (insert one record, save, get_or_create the categories, save, insert the new rows into the many-to-many table, save), or can I use bulk insertion somehow? What is the fastest fix for this problem? Maybe some temporary tables shared between the processes, bulk-inserted into the real tables at the end of the run? I am using a MySQL db.
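For the per-record variant described above, a hedged sketch (names follow the model sketch earlier; get_or_create and db.atomic() are documented peewee calls, the function itself is hypothetical):

```python
# Per-record approach: one transaction per message, reusing categories.
def save_message(body, category_names):
    with db.atomic():  # all-or-nothing for this message
        message = Message.create(body=body)
        for name in category_names:
            # returns the existing row if this category was saved before
            category, _created = Category.get_or_create(name=name)
            MessageCategory.get_or_create(message=message, category=category)
```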

Thanks for your answers and your time.

1 answer

You can rely on the database to enforce uniqueness by adding unique=True to fields or by defining unique multi-column indexes. You can also check the peewee docs for get_or_create and for bulk inserts.
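A hedged sketch of the bulk route on MySQL, using the models assumed in the question (insert_many and on_conflict_ignore are documented in the current peewee 3 API; the surrounding function is illustrative):

```python
# Bulk variant: the unique constraints plus INSERT IGNORE do the dedup.
def save_categories_bulk(names):
    rows = [{'name': n} for n in set(names)]  # drop duplicates within this batch
    with db.atomic():
        # on_conflict_ignore() renders INSERT IGNORE on MySQL, so rows
        # hitting the unique index are skipped instead of raising an error
        Category.insert_many(rows).on_conflict_ignore().execute()
    # map names back to ids for building the many-to-many rows
    return {c.name: c.id
            for c in Category.select().where(Category.name.in_(names))}
```

The same pattern works for the join table: bulk-insert the (message, category) pairs with on_conflict_ignore() and let the unique multi-column index filter out repeats.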


Source: https://habr.com/ru/post/1569745/

