Deleting millions of rows in MySQL

I recently found and fixed a bug on the site I was working on that resulted in millions of duplicate rows of data in a table that would be quite large even without them (still in the millions). I can easily find these duplicate rows and run a single DELETE query to kill them all. The problem is that trying to delete this many rows in one shot locks up the table for a long time, which I would like to avoid if possible. The only ways I can see to get rid of these rows without taking the site down (by locking the table) are:

  • Write a script that executes thousands of small DELETE queries in a loop. This theoretically gets around the locked-table problem, because other queries will be able to make it into the queue and run between the deletes. But it will still put a fairly heavy load on the database and will take a long time to run.
  • Rename the table and recreate an empty one in its place. Then do the cleanup on the renamed table. Rename the new table back and merge the newly inserted rows into the renamed table. This takes considerably more steps, but should keep the interruption to a minimum. The only tricky part is that the table in question is a reporting table, so while it is renamed away and the empty one sits in its place, all historical reports disappear until I put it back. On top of that, the merge process could be a bit of a pain because of the type of data being stored. Overall, this is my likely choice right now.

I'm just wondering if anyone else has had this problem before and, if so, how you dealt with it without taking the site down and, hopefully, with minimal interruption to the users. If I go with option 2, or a different, similar approach, I can schedule the work to run late at night and do the merge early the next morning, and just let the users know in advance, so that's not a huge deal. I'm just looking to see if anyone has ideas for a better or easier way to do the cleanup.

+62
mysql sql-delete maintenance query-performance
Aug 23 '09 at 16:39
11 answers
DELETE FROM `table` WHERE (whatever criteria) ORDER BY `id` LIMIT 1000 

Lather, rinse, repeat until zero rows are affected. Maybe in a script that sleeps for a second or three between iterations.
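
A minimal sketch of that loop as a shell script, assuming an indexed id column; the database name, table name, and WHERE condition are placeholders for your own schema:

    #!/bin/bash
    # Keep deleting small batches until a pass removes nothing, sleeping
    # between passes so other queries get a turn at the table.
    while true; do
        affected=$(mysql -N -e \
            "DELETE FROM my_table WHERE is_duplicate = 1 ORDER BY id LIMIT 1000;
             SELECT ROW_COUNT();" my_database)
        echo "deleted ${affected:-0} rows"
        [ "${affected:-0}" -eq 0 ] && break
        sleep 2
    done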

+125
Aug 23 '09 at 16:40

I would also recommend adding some constraints to your table to make sure this doesn't happen to you again. A million rows, at 1,000 per shot, will take 1,000 repetitions of the script to complete. If the script runs once every 3.6 seconds, you'll be done in an hour. No worries. Your clients are unlikely to notice.

+7
Aug 23 '09 at 16:50

The following deletes 1,000,000 records, one at a time.

  # Repeat 1000 times: fetch up to 1000 matching ids, turn each into a
  # single-row DELETE, and pipe the statements back into mysql.
  for i in `seq 1 1000`; do
      mysql -e "select id from table_name where (condition) order by id desc limit 1000" \
          | awk '{ if (NR > 1) print "delete from table_name where id = ", $1, ";" }' \
          | mysql
  done

You could also group the ids together and run DELETE FROM table_name WHERE id IN (id1, id2, ..., idN), I'm sure, without much trouble.
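
The grouped variant would generate statements along these lines (the ids and table name are just placeholders):

    delete from table_name where id in (1001, 1002, 1003, 1004, 1005);
    delete from table_name where id in (1006, 1007, 1008, 1009, 1010);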

+6
Apr 13 '16 at 18:57

I had a use case of deleting 1M+ rows from a 25M+ row table in MySQL. I tried different approaches, such as batch deletes (described above).
I found that the fastest way (copying the required records to a new table) was:

  1. Create a temporary table that holds only the ids:

CREATE TABLE id_temp_table (temp_id int);

  2. Insert the ids to be deleted:

insert into id_temp_table (temp_id) select .....

  3. Create a new table, table_new.

  4. Insert into table_new all the records from the original table except the rows that are in id_temp_table:

insert into table_new .... where table_id NOT IN (select temp_id from id_temp_table);

  5. Rename the tables.

The whole process took ~1 hour. In my use case, a simple delete of a batch of 100 records took 10 minutes.
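
Put together, the whole sequence might look like the sketch below. The table and column names (my_table, id, dup_key) are hypothetical, and the duplicate-detection query is just one possible example (keep the lowest id of each duplicate group):

    -- 1. Temp table holding only the ids to delete
    CREATE TABLE id_temp_table (temp_id INT PRIMARY KEY);

    -- 2. One way to fill it: mark every duplicate except the lowest id
    INSERT INTO id_temp_table (temp_id)
    SELECT DISTINCT t2.id
    FROM my_table t1
    JOIN my_table t2 ON t1.dup_key = t2.dup_key AND t1.id < t2.id;

    -- 3. Empty copy of the original table
    CREATE TABLE table_new LIKE my_table;

    -- 4. Copy everything except the rows marked for deletion
    INSERT INTO table_new
    SELECT * FROM my_table
    WHERE id NOT IN (SELECT temp_id FROM id_temp_table);

    -- 5. Atomic swap, keeping the original around until you are sure
    RENAME TABLE my_table TO my_table_old, table_new TO my_table;
    -- DROP TABLE my_table_old;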

+6
Nov 22 '17 at 12:50

I would use mk-archiver from the excellent Maatkit utilities package (a bunch of Perl scripts for MySQL management). Maatkit is from Baron Schwartz, the author of the O'Reilly "High Performance MySQL" book.

The goal is a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much. You can insert the data into another table, which does not have to be on the same server. You can also write it to a file in a format suitable for LOAD DATA INFILE. Or you can do neither, in which case it's just an incremental DELETE.

It's already built for archiving your unwanted rows in small batches and, as a bonus, it can save the deleted rows to a file in case you screw up the query that selects the rows to remove.

No installation required, just grab http://www.maatkit.org/get/mk-archiver and run perldoc on it (or read the website) for the documentation.
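
From memory, a purge run looks roughly like the line below; treat the flag names as approximate and confirm them against the perldoc, and the DSN, table, and WHERE clause are placeholders:

    mk-archiver --source h=localhost,D=my_database,t=my_table \
                --where "created_at < '2009-01-01'" \
                --limit 1000 --purge --sleep 1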

+3
Aug 26 '09 at 0:11

According to the MySQL documentation, TRUNCATE TABLE is a fast alternative to DELETE FROM. Try this:

  TRUNCATE TABLE table_name 

I tried this on 50 million rows and it was done within two minutes.

Note: truncate operations are not transaction-safe; an error occurs when one is attempted in the course of an active transaction or an active table lock.

+2
Mar 06 '16 at 15:20

Do it in batches of, say, 2,000 rows at a time, and commit in between. A million rows isn't a lot, and it will be fast, unless you have many indexes on the table.

+1
Aug 23 '09 at 16:42

For us, the DELETE WHERE %s ORDER BY %s LIMIT %d answer was not an option, because the WHERE criteria were slow (a non-indexed column) and would hit the master.

SELECT from a read replica the list of primary keys that you want to delete. Export them in this kind of format:

 00669163-4514-4B50-B6E9-50BA232CA5EB
 00679DE5-7659-4CD4-A919-6426A2831F35

Use the following bash script to grab this input and chunk it into DELETE statements [requires bash ≥ 4 because of the mapfile built-in]:

sql-chunker.sh (remember to chmod +x me, and change the shebang to point to your bash 4 executable):

 #!/usr/local/Cellar/bash/4.4.12/bin/bash

 # Expected input format:
 : <<!
 00669163-4514-4B50-B6E9-50BA232CA5EB
 00669DE5-7659-4CD4-A919-6426A2831F35
 !

 if [ -z "$1" ]
 then
     echo "No chunk size supplied. Invoke: ./sql-chunker.sh 1000 ids.txt"
     exit 1
 fi

 if [ -z "$2" ]
 then
     echo "No file supplied. Invoke: ./sql-chunker.sh 1000 ids.txt"
     exit 1
 fi

 function join_by {
     local d=$1
     shift
     echo -n "$1"
     shift
     printf "%s" "${@/#/$d}"
 }

 # Read the id file $1 ids at a time and emit one DELETE per chunk
 while mapfile -t -n "$1" ary && ((${#ary[@]})); do
     printf "DELETE FROM my_cool_table WHERE id IN ('%s');\n" `join_by "','" "${ary[@]}"`
 done < "$2"

Call like this:

 ./sql-chunker.sh 1000 ids.txt > batch_1000.sql 

Which will give you a file with the output formatted like this (I used a chunk size of 2):

 DELETE FROM my_cool_table WHERE id IN ('006CC671-655A-432E-9164-D3C64191EDCE','006CD163-794A-4C3E-8206-D05D1A5EE01E');
 DELETE FROM my_cool_table WHERE id IN ('006CD837-F1AD-4CCA-82A4-74356580CEBC','006CDA35-F132-4F2C-8054-0F1D6709388A');

Then execute the statements like this:

 mysql --login-path=master billing < batch_1000.sql 

For those not familiar with --login-path, it's just a shortcut to log in without typing a password on the command line.

+1
Nov 29 '17 at 15:08

I faced a similar problem. We had a really big table, about 500 GB in size, with no partitioning and only one index on the primary_key column. Our master was a hulk of a machine, 128 cores and 512 GB of RAM, and we had multiple slaves too. We tried a few techniques to tackle the large-scale deletion of rows. I will list them all here, from worst to best as we found them:

  1. Fetching and deleting one row at a time. This is the absolute worst you could do, so we did not even try it.
  2. Fetching the first 'X' rows from the database using a LIMIT query on the primary_key column, then checking the row ids to delete in the application and firing a single delete query with the list of primary_key ids. So, two queries per 'X' rows. Now, this approach was fine, but doing it in a batch job deleted about 5 million rows in 10 minutes or so, which made the slaves of our MySQL DB lag by 105 seconds. A 105-second lag from 10 minutes of activity. So, we had to stop.
  3. In this technique, we introduced a 50 ms pause between each successive batch fetch and deletion of size 'X'. This solved the lag problem, but we were now deleting 1.2-1.3 million rows per 10 minutes, compared to the 5 million of technique 2.
  4. Partitioning the table and then dropping entire partitions when they are no longer needed. This is the best solution we have, but it requires a pre-partitioned table. We went with step 3 because we had a very old, non-partitioned table with indexing only on the primary_key column, and creating the partitions would have taken too much time while we were in crisis mode. Here are some partitioning-related links that I found helpful: the official MySQL reference, Oracle DB daily partitioning.

So, IMO, if you can afford the luxury of partitioning your table, go for option 4; otherwise, you are stuck with option 3.
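
As a rough sketch of what option 4 looks like, with a hypothetical table and date column; note that in MySQL the partitioning column has to be part of every unique key, including the primary key:

    -- Range-partition by month so old data can be dropped wholesale
    ALTER TABLE my_table
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p201801 VALUES LESS THAN (TO_DAYS('2018-02-01')),
        PARTITION p201802 VALUES LESS THAN (TO_DAYS('2018-03-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

    -- Dropping a partition discards its rows without a row-by-row delete
    ALTER TABLE my_table DROP PARTITION p201801;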

+1
Nov 03 '18 at 2:31

I think the slowness is due to MySQL's "clustered index", where the actual records are stored within the primary key index, in primary key order. This means that access to a record via the primary key is extremely fast, since it requires only one disk fetch: the record on disk sits right where the index found the matching primary key.

In other databases without clustered indexes, the index itself does not hold the record but just an "offset" or "location" indicating where the record sits in the table file, and a second fetch then has to be made in that file to retrieve the actual data.

You can imagine that when deleting a record in a clustered index, all records above it in the table may have to be moved down to avoid massive holes being created in the index (at least that is how I remember it from a few years ago; later versions may have changed this).

Armed with the above, what we found really sped up deletes in MySQL was to delete in reverse order. This produces the least amount of record movement, because you are deleting records from the end first, so each subsequent delete has fewer records to relocate.
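
In batched form that is just the LIMIT loop from earlier with a descending sort (the table name and condition are placeholders):

    DELETE FROM my_table
    WHERE is_duplicate = 1
    ORDER BY id DESC
    LIMIT 1000;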

0
Mar 19 '19 at 19:46

I haven't scripted anything to do this, and doing it properly would absolutely require a script, but another option is to create a new, duplicate table and select all the rows you want to keep into it. Use a trigger to keep it up to date while the process completes. When it is in sync (minus the rows you want to drop), rename both tables in a transaction so that the new one takes the place of the old. Drop the old table, and voilà!

This (obviously) requires a lot of extra disk space, and it may tax your I/O resources, but otherwise it can be much faster.
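
A rough sketch of the copy-and-swap idea just described; the table, trigger, and column names are hypothetical, ids_to_delete is a hypothetical table of ids you plan to drop, and a complete version also needs UPDATE and DELETE triggers plus a careful cut-over:

    -- Empty copy with the same structure
    CREATE TABLE my_table_copy LIKE my_table;

    -- Keep the copy in sync with new writes while the backfill runs
    CREATE TRIGGER my_table_copy_ins AFTER INSERT ON my_table
    FOR EACH ROW
        INSERT INTO my_table_copy (id, col_a, col_b)   -- placeholder column list
        VALUES (NEW.id, NEW.col_a, NEW.col_b);

    -- Backfill only the rows you want to keep; IGNORE skips rows the
    -- trigger has already copied
    INSERT IGNORE INTO my_table_copy (id, col_a, col_b)
    SELECT id, col_a, col_b
    FROM my_table
    WHERE id NOT IN (SELECT temp_id FROM ids_to_delete);

    -- Swap the tables atomically, then clean up
    RENAME TABLE my_table TO my_table_old, my_table_copy TO my_table;
    DROP TRIGGER my_table_copy_ins;
    -- DROP TABLE my_table_old;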

Depending on the nature of the data, or in an emergency, you could instead rename the old table, create an empty new table in its place, and then select the "keep" rows into the new table at your leisure...

0
Apr 29 '19 at 13:21


