I often write data scrubs that update millions of rows of data. The data is stored in a 24x7x365 MySQL OLTP database using InnoDB. An update may clean every row in the table (in which case the database effectively ends up locking the whole table), or it may clear only 10% of the rows, which can still be millions.
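Concretely, a full scrub boils down to one giant statement like this (using the same made-up table and column names as the pseudo code below):

UPDATE table1
   SET col2 = 'flag set'
 WHERE col2 = 'flag not set';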
To avoid huge transaction sizes and to minimize contention, I usually try to break that single massive UPDATE up into a series of smaller UPDATE transactions. So I end up writing a loop that restricts the UPDATE's WHERE clause by primary key range, along these lines:
(warning: this is just pseudo code to get the point across)
@batch_size = 10000;
@max_primary_key_value = select max(pk) from table1;

for (int i = 0; i <= @max_primary_key_value; i = i + @batch_size)
{
    start transaction;
    update IGNORE table1
       set col2 = 'flag set'
     where col2 = 'flag not set'
       and pk >= i                 -- inclusive lower bound, so boundary rows aren't skipped
       and pk < i + @batch_size;   -- exclusive upper bound = next batch's lower bound
    commit;
}
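For reference, the same loop can be written as an actual MySQL stored procedure. This is only a sketch, using the made-up names from the pseudo code above (the procedure name batch_flag_update is equally made up):

DELIMITER $$
CREATE PROCEDURE batch_flag_update()
BEGIN
    -- flag unflagged rows in fixed-size primary-key ranges,
    -- committing after each range to keep transactions small
    DECLARE batch_size BIGINT DEFAULT 10000;
    DECLARE i BIGINT DEFAULT 0;
    DECLARE max_pk BIGINT;

    SELECT MAX(pk) INTO max_pk FROM table1;  -- NULL on an empty table, so the loop never runs

    WHILE i <= max_pk DO
        START TRANSACTION;
        UPDATE IGNORE table1
           SET col2 = 'flag set'
         WHERE col2 = 'flag not set'
           AND pk >= i
           AND pk < i + batch_size;
        COMMIT;
        SET i = i + batch_size;
    END WHILE;
END$$
DELIMITER ;

You run it with CALL batch_flag_update(); but it still has all the problems described below, it just doesn't need an external driver.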
This approach simply sucks for many reasons.
Each iteration fires an UPDATE whether or not any rows in that key range actually need one, so when only 10% of the table matches, most of the UPDATE statements accomplish nothing but still pay for a scan. If the primary key values are sparsely distributed, 1/2 of the batches (or more) may update zero rows while others do far more work than intended. The whole loop is single-threaded, the constant start/commit cycle adds overhead, and the total wall-clock time ends up far worse than the one big UPDATE would have been.
On top of that, there is no good way to pick the batch size: too small and the per-statement overhead dominates, too large and I am right back to big transactions and lock contention.
Is there a better way to do this?
Matthew Quinlan