Update: potential solution below
I have a large corpus of configuration files consisting of key / value pairs that I am trying to insert into the database. Many keys and values are repeated in configuration files, so I store data using 3 tables. One for all unique key values, one for all unique key values and one list of all key / value pairs for each file.
Problem: I use several parallel processes (and therefore connections) to add raw data to the database. Unfortunately, I get a lot of deadlocks when trying to add values to the key and value tables. I tried several different ways to insert data (shown below), but always ends up with a deadlock error
TransactionRollbackError: deadlock detected
DETAILS: Process 26755 is waiting for ShareLock on transaction 689456; blocked by process 26754. Process 26754 is waiting for ShareLock on transaction 689467; blocked process 26755.
I was wondering if anyone could shed light on what might cause these deadlocks, and maybe point me to some way to solve the problem. Looking at the SQL queries that I use (listed below), I really don't understand why there is any kind of joint dependency.
Thank you for reading!
Example configuration file:
example_key this_is_the_value other_example other_value third example yet_another_value
Table Definitions:
CREATE TABLE keys ( id SERIAL PRIMARY KEY, hash UUID UNIQUE NOT NULL, key TEXT); CREATE TABLE values ( id SERIAL PRIMARY KEY, hash UUID UNIQUE NOT NULL, key TEXT); CREATE TABLE keyvalue_pairs ( id SERIAL PRIMARY KEY, file_id INTEGER REFERENCES filenames, key_id INTEGER REFERENCES keys, value_id INTEGER REFERENCES values);
SQL expressions:
I initially tried to use this statement to avoid any exceptions:
WITH s AS ( SELECT id, hash, key FROM keys WHERE hash = 'hash_value'; ), i AS ( INSERT INTO keys (hash, key) SELECT 'hash_value', 'key_value' WHERE NOT EXISTS (SELECT 1 FROM s) returning id, hash, key ) SELECT id, hash, key FROM i UNION ALL SELECT id, hash, key FROM s;
But even something as simple as this causes dead ends:
INSERT INTO keys (hash, key) VALUES ('hash_value', 'key_value') RETURNING id;
- In both cases, if I get an exception because the inserted hash value is not unique, I use savepoints to roll back the changes and another expression to just select the identifier that I follow.
- I use hashes for a unique field, as some of the keys and values are too long for indexing
Full python code example (using psycopg2) with savepoints:
key_value = 'this_key' hash_val = generate_uuid(value) try: cursor.execute( ''' SAVEPOINT duplicate_hash_savepoint; INSERT INTO keys (hash, key) VALUES (%s, %s) RETURNING id; ''' (hash_val, key_value) ) result = cursor.fetchone()[0] cursor.execute('''RELEASE SAVEPOINT duplicate_hash_savepoint''') return result except psycopg2.IntegrityError as e: cursor.execute( ''' ROLLBACK TO SAVEPOINT duplicate_hash_savepoint; ''' )
Update: Therefore, I believe that I hinted at another stackexchange site:
In particular:
The UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target lines: they will only find target lines that have been committed since the time1 command was run. However, such a target row may already be updated (or deleted or blocked) by another parallel transaction by the time it is found. In this case, the potential updater will wait for the first transaction update to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects and the second updater can continue updating the originally found line. If the first updater completes, the second updater will ignore the row if the first updater has deleted it2, otherwise it will try to apply its operation to the updated version of the row.
While I'm still not entirely sure where the co-dependency is, it seems that processing a large number of key / value pairs without commits can lead to something like this. Of course, if I commit after adding each individual configuration file, a deadlock will not occur.