How to determine the difference between two large data sets?

I have large datasets with millions of XML records. These datasets are complete dumps of database data up to a point in time.

Between two dumps, new records can be added, and existing ones can be changed or deleted. Assume that the schema remains unchanged and each record has a unique identifier.

What would be the best way to determine the delta between these two datasets (including deletions and updates)?


My plan is to load everything into a DBMS and go from there.

First, load the old dump. Then load the new dump into another schema, checking as I go whether each record is new or is an update of an existing record. If so, I will register its identifier in a new table called “changes”.

Once that is done, I will go through all the records of the old dump and check whether each one has a corresponding record (that is, one with the same ID) in the new dump. If not, I will log it as deleted.

Assuming searching for a record by identifier is an O(log n) operation, this should allow me to do everything in O(n log n) time.
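The plan above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: SQLite stands in for the DBMS, each record has already been reduced to an `(id, last_modified)` pair, and the table names `old_dump`, `new_dump` and the function name `diff_with_dbms` are made up for the example.

```python
import sqlite3

def diff_with_dbms(old_records, new_records):
    """Return (status, id) pairs; each record is an (id, last_modified) tuple."""
    con = sqlite3.connect(":memory:")
    # PRIMARY KEY gives each table a B-tree index, so a lookup by id is O(log n).
    con.execute("CREATE TABLE old_dump (id TEXT PRIMARY KEY, modified TEXT)")
    con.execute("CREATE TABLE new_dump (id TEXT PRIMARY KEY, modified TEXT)")
    con.execute("CREATE TABLE changes (id TEXT PRIMARY KEY, status TEXT)")
    con.executemany("INSERT INTO old_dump VALUES (?, ?)", old_records)

    # Pass 1: load the new dump, checking each record against the old one.
    for rec_id, modified in new_records:
        con.execute("INSERT INTO new_dump VALUES (?, ?)", (rec_id, modified))
        row = con.execute(
            "SELECT modified FROM old_dump WHERE id = ?", (rec_id,)
        ).fetchone()
        if row is None:
            con.execute("INSERT INTO changes VALUES (?, 'added')", (rec_id,))
        elif row[0] != modified:
            con.execute("INSERT INTO changes VALUES (?, 'updated')", (rec_id,))

    # Pass 2: ids present in the old dump but absent from the new one.
    con.execute(
        "INSERT INTO changes "
        "SELECT o.id, 'deleted' FROM old_dump o "
        "WHERE NOT EXISTS (SELECT 1 FROM new_dump n WHERE n.id = o.id)"
    )
    result = con.execute("SELECT status, id FROM changes ORDER BY id").fetchall()
    con.close()
    return result
```

Both passes do one indexed lookup per record, which is where the O(n log n) estimate comes from.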

Since I can detect the differences from just the identifier and last-modified date of each record, I could also load everything into main memory. The time complexity would be the same, but with the added benefit of less disk I/O, which should make it faster by an order of magnitude.

Suggestions? (Note: this is more of a performance question than anything else.)
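A minimal sketch of the in-memory variant, assuming each dump has already been parsed into a dictionary that maps record ID to last-modified stamp (the function name `diff_in_memory` and that dictionary shape are assumptions made for illustration):

```python
def diff_in_memory(old, new):
    """old and new map record id -> last-modified stamp."""
    # Set operations on dict key views find ids unique to one side.
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    # Ids on both sides count as updated when their stamps differ.
    updated = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, updated, deleted
```

With a hash map, each lookup is expected constant time rather than O(log n), so the comparison itself is roughly linear; the sorting here only makes the output order stable.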

5 answers

Take a look at DeltaXML.

(added because StackOverflow does not allow short responses)


As an unusual suggestion, consider using git. Put the first data set under version control, then clean the working directory and copy in the second data set. git is very fast at computing the difference.


Take a look at this post on MSDN, which provides a solution for getting the differences between two DataTables. It should point you in the right direction:

How to compare two data tables:
http://social.msdn.microsoft.com/Forums/en/csharpgeneral/thread/23703a85-20c7-4759-806a-fabf4e9f5be6

You can also take a look at this SO question:
Compare two DataTables to determine rows in one but not the other

I have also seen this approach used several times:

    table1.Merge(table2);
    DataTable changesTable = table1.GetChanges();

    -- a = old dump, b = new dump
    select coalesce(a.id, b.id) as id,
           case
               when a.id is null then 'inserted'
               when b.id is null then 'deleted'
               when a.col != b.col then 'updated'
           end as status
    from a
    full outer join b on a.id = b.id
    where a.id is null
       or b.id is null
       or a.col != b.col
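
The query above can be tried out on a small scale with a sketch like the following. It is not the answer's exact query: SQLite only gained `FULL OUTER JOIN` in version 3.39, so the join is emulated here with two `LEFT JOIN`s and `UNION ALL`, and `IS NOT` is used as SQLite's NULL-safe inequality so that a change to or from NULL also counts as an update. The tables `a` (old) and `b` (new) and the column `col` follow the answer; the function name `delta`, the status labels and the sample data are assumptions.

```python
import sqlite3

DELTA_SQL = """
SELECT b.id AS id,
       CASE WHEN a.id IS NULL THEN 'inserted' ELSE 'updated' END AS status
FROM b LEFT JOIN a ON a.id = b.id
WHERE a.id IS NULL OR a.col IS NOT b.col
UNION ALL
SELECT a.id, 'deleted'
FROM a LEFT JOIN b ON b.id = a.id
WHERE b.id IS NULL
ORDER BY id
"""

def delta(old_rows, new_rows):
    """Each row is an (id, col) tuple; returns (id, status) pairs ordered by id."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, col TEXT)")  # old dump
    con.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, col TEXT)")  # new dump
    con.executemany("INSERT INTO a VALUES (?, ?)", old_rows)
    con.executemany("INSERT INTO b VALUES (?, ?)", new_rows)
    rows = con.execute(DELTA_SQL).fetchall()
    con.close()
    return rows
```

With several data columns, the comparison would need to be repeated for each column, or made once on a precomputed hash of the record.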

Source: https://habr.com/ru/post/896733/

