I have large datasets with millions of XML records. Each dataset is a complete dump of a database as of a certain point in time.
Between two dumps, new records can be added, and existing ones can be changed or deleted. Assume that the schema remains unchanged and each record has a unique identifier.
What would be the best way to determine the delta between these two datasets (including deletions and updates)?
My plan is to load everything into a DBMS and go from there.
First, load the old dump. Then load the new dump into a second schema, and while doing so check whether each record is new or an update of an existing record. If it is, register its identifier in a new table called "changes".
Once that is done, go through all the records in the old dump and check whether each one has a corresponding record (i.e., the same ID) in the new dump. If not, register it as a deletion.
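To make the comparison concrete, here is a minimal sketch of how the DBMS-side checks could look once both dumps are loaded. It assumes two hypothetical tables, old_records and new_records, each keyed by id and carrying a last_modified column, and uses SQLite from Python purely for illustration; the same joins work in any SQL database.

```python
import sqlite3

# Assumes both dumps have already been parsed and loaded into two tables
# (hypothetical names: old_records, new_records), each with columns
# id (primary key) and last_modified.
conn = sqlite3.connect("dumps.db")

# Records present in the new dump but not in the old one -> additions.
added = conn.execute("""
    SELECT n.id FROM new_records n
    LEFT JOIN old_records o ON o.id = n.id
    WHERE o.id IS NULL
""").fetchall()

# Records present in both dumps but with a different last_modified -> updates.
changed = conn.execute("""
    SELECT n.id FROM new_records n
    JOIN old_records o ON o.id = n.id
    WHERE o.last_modified <> n.last_modified
""").fetchall()

# Records present in the old dump but missing from the new one -> deletions.
deleted = conn.execute("""
    SELECT o.id FROM old_records o
    LEFT JOIN new_records n ON n.id = o.id
    WHERE n.id IS NULL
""").fetchall()

conn.close()
print(len(added), len(changed), len(deleted))
```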
Assuming searching for a record by identifier is an O(log n) operation, this should allow me to do everything in O(n log n) time.
Since I can tell the difference using only the identifier and the last-modified date, I could also load everything into main memory. The time complexity would be the same, but with much less disk I/O, which should make it faster by an order of magnitude.
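A sketch of the in-memory variant, assuming each dump can be reduced to a plain dict mapping id to last_modified (XML parsing left out). With hash-based lookups the comparison itself is O(n) on average, even better than the O(log n)-per-lookup estimate above.

```python
def diff_dumps(old, new):
    """Compare two {id: last_modified} mappings and return the delta.

    `old` and `new` are plain dicts built from the two XML dumps;
    dict lookups are O(1) on average, so the whole comparison is
    roughly O(n) in the number of records.
    """
    old_ids, new_ids = old.keys(), new.keys()
    added = new_ids - old_ids
    deleted = old_ids - new_ids
    changed = {i for i in old_ids & new_ids if old[i] != new[i]}
    return added, changed, deleted


# Hypothetical example data: id -> last-modified timestamp.
old = {1: "2009-01-01", 2: "2009-01-02", 3: "2009-01-03"}
new = {1: "2009-01-01", 2: "2009-02-15", 4: "2009-02-20"}
print(diff_dumps(old, new))  # ({4}, {2}, {3})
```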
Suggestions? (Note: this is more a performance question than anything else.)