I have been trying to fix a relational database for a month, but I cannot find an effective solution.
This is my problem: I have a relational database with 534 M rows and lots of (30) foreign keys.
I can handle normal duplicates by inserting with a union ... group by ... having count(*) = 1, but there are also duplicates with different keys.
Example:
Table 1:

    id | key1 | value
     1 |   11 | a1
     2 |   22 | a1

Table 2:

    key1 | value
      11 | a2
      22 | a2

Foreign key: table1(key1) references table2(key1)
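To make the situation concrete, here is a runnable sketch of the example above (in Python with sqlite3 rather than on the real MySQL server, purely for illustration). Grouping by every column except the key exposes the "duplicates with different keys": rows 11 and 22 in table2 hold the same value.

```python
import sqlite3

# Minimal reconstruction of the example tables from the post.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table2 (key1 INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE table1 (id INTEGER PRIMARY KEY,
                         key1 INTEGER REFERENCES table2(key1),
                         value TEXT);
    INSERT INTO table2 VALUES (11, 'a2'), (22, 'a2');
    INSERT INTO table1 VALUES (1, 11, 'a1'), (2, 22, 'a1');
""")

# Rows 11 and 22 of table2 are duplicates by value, not by key:
# group by all non-key columns and keep only groups with more than one row.
dupes = con.execute("""
    SELECT MIN(key1) AS keep_key, value, COUNT(*) AS n
    FROM table2
    GROUP BY value
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [(11, 'a2', 2)]
```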
I need to find and remove the duplicates, then fix the rows in the parent table that reference them. I have tried three different approaches:
1: PHP script, arrays
Export the tables (dump) -> array_unique, find duplicates, fix the parent arrays -> import the tables.
It is pretty fast, but it needs 80 GB of memory, which may become a problem in the future.
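For reference, the first approach looks roughly like this (sketched in Python rather than PHP, using the toy tables above; at real scale every dump lives in RAM at once, which is where the 80 GB goes):

```python
# Approach 1: dedupe entirely in memory.
# Dump the tables into plain data structures, build an old-key -> kept-key map,
# then rewrite the child rows before re-importing everything.
table2 = {11: "a2", 22: "a2"}              # parent dump: key1 -> value
table1 = [(1, 11, "a1"), (2, 22, "a1")]    # child dump: (id, key1, value)

keep = {}    # value -> first key1 seen with that value
remap = {}   # every old key1 -> the key1 that survives deduplication
for key, value in sorted(table2.items()):
    if value in keep:
        remap[key] = keep[value]           # duplicate row: redirect its key
    else:
        keep[value] = remap[key] = key     # first occurrence: keep it

table2_clean = {key: value for value, key in keep.items()}
table1_fixed = [(i, remap[k], v) for i, k, v in table1]
print(table1_fixed)  # [(1, 11, 'a1'), (2, 11, 'a1')]
```

Note that after the keys are merged, the two table1 rows become plain duplicates themselves and can be handled by the normal group-by pass.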
2: PHP script, SQL queries
Export the tables (dump) -> find duplicates -> send UPDATE queries to the parent table.
It needs hardly any memory, but the tables are really large: 5 queries take 1 second, so 50 M duplicates would take days, months, even years.
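The second approach boils down to one UPDATE/DELETE pair per duplicate key, which is exactly what makes it so slow at 50 M duplicates. A sketch (again Python + sqlite3 standing in for the real PHP + MySQL setup):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table2 (key1 INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, key1 INTEGER, value TEXT);
    INSERT INTO table2 VALUES (11, 'a2'), (22, 'a2');
    INSERT INTO table1 VALUES (1, 11, 'a1'), (2, 22, 'a1');
""")

# One pair of queries per duplicate, with the (keep, dupe) pairs found
# in the dump beforehand -- at 50 M duplicates this loop dominates runtime.
for keep_key, dupe_key in [(11, 22)]:
    con.execute("UPDATE table1 SET key1 = ? WHERE key1 = ?",
                (keep_key, dupe_key))
    con.execute("DELETE FROM table2 WHERE key1 = ?", (dupe_key,))

rows = con.execute("SELECT id, key1 FROM table1 ORDER BY id").fetchall()
print(rows)  # [(1, 11), (2, 11)]
```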
3: INSERT ... ON DUPLICATE KEY UPDATE: I added a duplicate column to store the duplicate keys, defined all columns except the key as a unique key, and then ran
insert ... on duplicate key update duplicate = concat(duplicate, ';', VALUES(key)).
But some tables have more than one key, sometimes I have to define 24 columns as a unique index, and then I run into memory problems again.
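The third approach, sketched on the toy tables. Since this runs through sqlite3 here instead of MySQL, SQLite's ON CONFLICT ... DO UPDATE (available in SQLite 3.24+) stands in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE; the idea is the same, with the unique index covering every non-key column:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A "duplicate" column collects the keys of rows that collided on the
# unique index over the non-key columns.
con.executescript("""
    CREATE TABLE table2 (key1 INTEGER PRIMARY KEY,
                         value TEXT,
                         duplicate TEXT DEFAULT '');
    CREATE UNIQUE INDEX u_value ON table2(value);
""")

for key, value in [(11, 'a2'), (22, 'a2')]:
    # SQLite upsert as a stand-in for MySQL's
    # INSERT ... ON DUPLICATE KEY UPDATE duplicate = concat(duplicate, ';', VALUES(key1))
    con.execute("""
        INSERT INTO table2 (key1, value) VALUES (?, ?)
        ON CONFLICT(value) DO UPDATE
        SET duplicate = duplicate || ';' || excluded.key1
    """, (key, value))

merged = con.execute("SELECT key1, value, duplicate FROM table2").fetchall()
print(merged)  # [(11, 'a2', ';22')]
```

The parent-table fix-up then reads the duplicate column to know which keys were folded into which surviving row.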
I hope I have explained my problem clearly. Do you have any ideas?