SQLAlchemy IntegrityError and bulk data import

I am inserting several tens of thousands of rows into a database that has referential integrity rules. Unfortunately, some of the data rows are duplicates (they already exist in the database). It would be too expensive to check for the existence of every row before inserting it, so I intend to just handle the IntegrityError exceptions thrown by SQLAlchemy, log the error, and continue.

My code would look something like this:

from sqlalchemy.exc import IntegrityError

# establish connection to db etc.
tbl = obtain_binding_to_sqlalchemy_orm()
datarows = load_rows_to_import()

try:
    conn.execute(tbl.insert(), datarows)
except IntegrityError as ie:
    pass  # eat the error and keep going
except Exception as e:
    pass  # do something else

The implicit assumption I am making above is that SQLAlchemy does not roll the multiple inserts into one transaction. If my assumption is wrong, it means that when an IntegrityError occurs, the rest of the insert is aborted. Can someone confirm whether the pseudocode pattern above will work properly, or will I end up losing data because of discarded IntegrityError exceptions?
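One variation I am considering, in case the bulk insert really is atomic: split the rows into chunks with explicit transactions, and retry a failed chunk row by row. This is only a rough sketch, assuming SQLAlchemy 1.x Core; engine, tbl and datarows (here a list of per-row dicts) are the same placeholder names as above.

from sqlalchemy.exc import IntegrityError

CHUNK = 1000  # arbitrary batch size

def import_in_chunks(engine, tbl, datarows):
    ins = tbl.insert()
    for start in range(0, len(datarows), CHUNK):
        chunk = datarows[start:start + CHUNK]
        try:
            # engine.begin() opens an explicit transaction and commits on exit,
            # so a failure can only roll back this chunk, not the whole import
            with engine.begin() as conn:
                conn.execute(ins, chunk)
        except IntegrityError:
            # retry the failed chunk row by row, each in its own small
            # transaction, so that only the duplicate rows are skipped
            for row in chunk:
                try:
                    with engine.begin() as conn:
                        conn.execute(ins, row)
                except IntegrityError:
                    pass  # duplicate row, log it and keep going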

In addition, if someone has a better approach to this, I would be interested to hear it.

2 answers

This may work if you have not started a transaction beforehand, since in that case SQLAlchemy's autocommit feature kicks in. But you must enable it explicitly, as described in the link.
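For example, a minimal sketch of that explicit autocommit, assuming SQLAlchemy 1.x (the autocommit execution option is gone in 2.0) and the placeholder names engine, tbl and datarows from the question:

from sqlalchemy.exc import IntegrityError

# autocommit as an execution option (SQLAlchemy 1.x only)
conn = engine.connect().execution_options(autocommit=True)
ins = tbl.insert()
for row in datarows:
    try:
        conn.execute(ins, row)  # each insert commits on its own
    except IntegrityError:
        pass  # duplicate row, report it and keep going
conn.close()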


I also ran into this problem when I was parsing ASCII data files to import into a table. The problem is that I instinctively and intuitively wanted SQLAlchemy to skip the duplicate rows while letting the unique data through. Or it could be that a sporadic error is raised for a row because of the current SQL engine, for example when unicode strings are not allowed.

However, this behavior is outside the scope of the SQL interface definition. The SQL API, and therefore SQLAlchemy, only understands transactions and commits, and does not account for this kind of selective behavior. Moreover, it sounds dangerous to depend on the autocommit feature, since the insertion stops after an exception, leaving the rest of the data behind.
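That said, some backends do expose duplicate skipping at the SQL level, and SQLAlchemy can emit it via prefix_with(). This is not what I ended up doing, just a sketch, assuming SQLite ("OR IGNORE") or MySQL ("IGNORE"), and assuming rows is a list of per-row dicts with the same table/engine names as below:

# let the database itself skip duplicate rows (dialect-specific!)
ins = table.insert().prefix_with("OR IGNORE")   # SQLite
# ins = table.insert().prefix_with("IGNORE")    # MySQL
with engine.begin() as conn:
    conn.execute(ins, rows)  # rows violating a unique constraint are silently dropped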

My solution (which I am not sure is the most elegant) is to process each row in a loop, catch and log the exceptions, and commit the changes at the very end.

Assume you have somehow acquired the data in a list of lists, i.e. a list of rows that are themselves lists of column values. You then process each row in a loop:

# Python 3.5
from sqlalchemy import Table, create_engine
import logging

# Create the engine
# Create the table
# Parse the data file and save data in `rows`

conn = engine.connect()
trans = conn.begin()  # Disables autocommit

exceptions = {}
totalRows = 0
importedRows = 0
ins = table.insert()

for currentRowIdx, cols in enumerate(rows):
    try:
        conn.execute(ins.values(cols))  # try to insert the column values
        importedRows += 1
    except Exception as e:
        exc_name = type(e).__name__  # save the exception name
        if exc_name not in exceptions:
            exceptions[exc_name] = []
        exceptions[exc_name].append(currentRowIdx)
    totalRows += 1

for key, val in exceptions.items():
    logging.warning("%d out of %d lines were not imported due to %s." % (len(val), totalRows, key))
logging.info("%d rows were imported." % importedRows)

trans.commit()  # Commit at the very end
conn.close()

To maximize the speed of this operation, you should turn off autocommit. I use this code with SQLite, and it is still 3-5 times slower than my old version that used sqlite3 directly, even with autocommit disabled. (The reason I ported to SQLAlchemy was to be able to use it with MySQL as well.)
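One thing I may still try is to drop down to the raw DBAPI connection for the hot loop, which keeps SQLAlchemy for the engine and table setup but avoids its per-statement overhead. A sketch only, assuming SQLite's "?" paramstyle, that rows is a list of value tuples in column order, and with mytable and its three columns made up for illustration:

# bulk insert through the raw DBAPI connection (SQLite paramstyle shown)
raw = engine.raw_connection()
try:
    cur = raw.cursor()
    # OR IGNORE makes SQLite itself drop duplicate rows
    cur.executemany("INSERT OR IGNORE INTO mytable VALUES (?, ?, ?)", rows)
    raw.commit()
finally:
    raw.close()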

This is not the most elegant solution, in the sense that it is not as fast as a direct interface to SQLite would be. If I profile the code and find the bottleneck in the near future, I will update this answer with a solution.


Source: https://habr.com/ru/post/915693/

