What Java algorithms or libraries are available for the N-way recursive diff / merge directories?
I need to be able to generate a list of folder trees that have many identical files, and have subdirectories with many similar files. I want to be able to use two-way merge operations to quickly remove as much redundancy as possible.
Objectives:
- Find directory pairs that have many similar files between them.
- Create a short list of directory pairs that can be synchronized with two-way merge to eliminate duplicates
- Should work recursively (there may be nested duplicates of higher level directories)
- Runtime and storage should be O (n log n) in the numbers of directories and files.
- It should be able to use the built-in database or page on the disk to process more files than in memory (100,000 +).
- Optional: generate a pedigree and change the set between folders
- Optional: Sort merge operations by the number of duplicates that they can eliminate.
I know how to use hashes to find duplicate files in approximately O (n) space, but I don’t understand how to do this to find partially overlapping sets between folders and their children.
EDIT: some clarifications The hard part is the difference between the “exact” content (otherwise, hashing hashing will work) and “looks like” (which won't). Basically, I want to feed this algorithm into a set of directories and return to it a set of two-way merge operations that I can perform to minimize duplicates with a minimum of conflicts. It effectively creates an ancestor tree showing which folders are produced from each other.
The ultimate goal is to let me include a bunch of different folders in one common tree. For example, I may have a folder in which programming projects are stored, and then copy part of its contents to another computer to work on it. Then I can create a backup and an intermediate version on a USB flash drive. In addition, I can have 8 or 10 different versions, with slightly different organizational structures or folder names. I need to be able to combine them one step at a time, so I can choose how to incorporate changes at every step of the way.
In fact, this is more or less what I intend to do with my utility (combine a bunch of scattered backups from different points in time). I believe that if I can do it right, I can also publish it as a small open source source. I think the same tricks can be useful for comparing XML trees.
source share