How to get the auto-increment values for a column after loading a Pandas DataFrame into a MySQL database

I have a Pandas DataFrame (called df) that I would like to load into a MySQL database. The DataFrame has the columns [A, B, C], and the table in the database has the columns [ID, A, B, C]. The ID column in the database is an auto-incrementing primary key.

I can load the DataFrame into the database using the df.to_sql('table_name', engine) command. However, this does not give me any information about the values that the database assigned to the ID column of the inserted rows. The only way I know to get this information is to query the database using the values of columns A, B, C:

 select ID, A, B, C from db_table where (A, B, C) in ((x1, y1, z1), (x2, y2, z2), ...) 
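For reference, a lookup like the one above can be built with placeholders instead of pasting values into the SQL string. This is only a sketch: build_lookup_query is a hypothetical helper name, and the %s placeholder style is an assumption that matches common MySQL DB-API drivers.

```python
def build_lookup_query(rows, table="db_table"):
    """Return (sql, params) for a row-value IN lookup on columns A, B, C.

    `rows` is a list of (A, B, C) tuples. The table name is interpolated
    directly, so it must come from trusted code, not from user input.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    # One "(%s, %s, %s)" group per row to look up
    placeholders = ", ".join(["(%s, %s, %s)"] * len(rows))
    sql = f"select ID, A, B, C from {table} where (A, B, C) in ({placeholders})"
    # Flatten the row tuples into a single parameter list
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_lookup_query([(1, 2, 3), (4, 5, 6)])
```

The resulting pair could then be passed to a DB-API cursor, for example cursor.execute(sql, params). It does not make the query itself any faster; it only avoids quoting problems.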

However, this query takes a very long time when I insert a lot of data.

Is there an easier and faster way to get the values that the database has assigned to the ID column of the inserted rows?

Edit 1: I can assign the ID column myself, as suggested in the answer by user 3364098 below. However, my job is part of a pipeline that runs in parallel. If I assign the ID column myself, there is a chance that the same ID values get assigned to different DataFrames that are loaded at the same time. That is why I would prefer to leave the task of assigning IDs to the database.
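To make the concern concrete, here is a small sketch of what happens when two parallel workers both read the same max(ID) before either of them has inserted anything. The function name assign_ids and the numbers are made up for illustration.

```python
def assign_ids(max_id, n_rows):
    """Return the IDs a worker would assign after reading max(ID) == max_id."""
    return list(range(max_id + 1, max_id + n_rows + 1))


# Both workers read max(ID) = 100 before either one has inserted its rows
worker_a_ids = assign_ids(100, 3)  # [101, 102, 103]
worker_b_ids = assign_ids(100, 2)  # [101, 102]

# IDs claimed by both workers; inserting both batches would violate the primary key
overlap = sorted(set(worker_a_ids) & set(worker_b_ids))
print(overlap)  # [101, 102]
```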

Solution: In the end, I assigned the ID column myself and took a write lock on the table while loading the data, so that no other process can load data with the same ID values. Essentially:

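 import pandas as pd

 try:
     # Hold a write lock so no other process can insert while IDs are assigned
     engine.execute('lock tables `table_name` write')
     # ifnull covers an empty table, where max(ID) would be NULL
     max_id_query = 'select ifnull(max(ID), 0) from `table_name`'
     max_id = int(pd.read_sql_query(max_id_query, engine).iloc[0, 0])
     df['ID'] = range(max_id + 1, max_id + len(df) + 1)
     df.to_sql('table_name', engine, if_exists='append', index=False)
 finally:
     engine.execute('unlock tables')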
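The same read-max-then-insert pattern can be tried out without a database. The sketch below is an in-memory simulation only: a threading.Lock stands in for the table lock and a plain list stands in for the ID column, so it says nothing about MySQL performance. All names are hypothetical.

```python
import threading

table_ids = []                 # stands in for the ID column of the table
table_lock = threading.Lock()  # stands in for the write lock on the table


def load_batch(n_rows):
    """Assign n_rows consecutive IDs and 'insert' them while holding the lock."""
    with table_lock:
        # Reading the maximum and inserting happen under the same lock,
        # so no other thread can claim the same range in between
        max_id = max(table_ids, default=0)
        new_ids = list(range(max_id + 1, max_id + n_rows + 1))
        table_ids.extend(new_ids)
    return new_ids


# Three "parallel loads" of 3, 2 and 4 rows
threads = [threading.Thread(target=load_batch, args=(n,)) for n in (3, 2, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whatever order the threads run in, the nine IDs come out unique and contiguous, because each batch sees the rows of every batch that finished before it.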
+2
2 answers

You can assign an identifier yourself:

 import pandas as pd

 df['ID'] = pd.read_sql_query('select ifnull(max(id),0)+1 from db_table', cnx).iloc[0,0] + range(len(df))

where cnx is your database connection, and then load df as usual.

+2
 import pandas as pd

 df['ID'] = pd.read_sql_query('select MAX(ID)+1 from db_table', cnx).iloc[0,0] + range(len(df))
-1

Source: https://habr.com/ru/post/1262434/

