N-way recursive diff / merge algorithm for directory trees?

What Java algorithms or libraries are available for N-way recursive diff / merge of directories?

I have a set of folder trees that contain many identical files, and subdirectories with many similar files. I want to be able to use two-way merge operations to quickly remove as much redundancy as possible.

Objectives:

  • Find directory pairs that have many similar files between them.
  • Create a shortlist of directory pairs that can be synchronized with a two-way merge to eliminate duplicates.
  • Should work recursively (there may be nested duplicates of higher-level directories).
  • Runtime and storage should be O(n log n) in the number of directories and files.
  • Should be able to use an embedded database or page to disk in order to handle more files than fit in memory (100,000+).
  • Optional: generate an ancestry and change sets between folders.
  • Optional: sort merge operations by the number of duplicates they can eliminate.

I know how to use hashes to find duplicate files in roughly O(n) space, but I don't see how to go from that to finding partially overlapping sets of files between folders and their children.
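(For reference, the exact-duplicate detection I have in mind is roughly the following. This is only a minimal sketch, assuming Java 17+ and files small enough to hash in full; the class and method names are made up for illustration.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.*;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DuplicateFinder {

    // Hash a file's contents; SHA-256 is an arbitrary but adequate choice here.
    static String hash(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            return HexFormat.of().formatHex(md.digest());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Group all regular files under 'root' by content hash; any group with more
    // than one entry is a set of exact duplicates.
    static Map<String, List<Path>> duplicates(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(Files::isRegularFile)
                        .collect(Collectors.groupingBy(DuplicateFinder::hash));
        }
    }

    public static void main(String[] args) throws IOException {
        duplicates(Path.of(args[0])).values().stream()
                .filter(group -> group.size() > 1)
                .forEach(System.out::println);
    }
}
```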

EDIT: some clarifications. The hard part is the difference between content that is “exactly the same” (where hashing works) and content that merely “looks alike” (where it won't). Basically, I want to feed a set of directories into this algorithm and get back a set of two-way merge operations I can perform to minimize duplicates with a minimum of conflicts. It effectively builds an ancestry tree showing which folders are derived from which.

The ultimate goal is to let me merge a bunch of divergent folders into one common tree. For example, I may have a folder where my programming projects are stored, then copy part of its contents to another computer to work on it, then back up an intermediate version to a USB flash drive. I may end up with 8 or 10 different versions, with slightly different organizational structures or folder names. I need to be able to merge them one step at a time, so I can choose how to incorporate changes at each step along the way.

In fact, this is more or less what I intend to do with my utility (consolidate a bunch of scattered backups from different points in time). I figure that if I can do it right, I might as well release it as a small open-source tool. I think the same tricks might be useful for comparing XML trees.

1 answer

It seems advisable to work with just file names and sizes (and timestamps, if you find them reliable), so that you don't have to read, hash, or diff all of those files.
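(To make that concrete: the per-file metadata could be gathered in a single pass with something like the sketch below. The FileMeta record and its fields are just an assumption about what you would keep per file, and it assumes Java 16+ for records.)

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// One entry per file: enough to compare by name, size and timestamp
// without reading file contents.
record FileMeta(Path relativePath, long size, long modifiedMillis) {}

class MetadataScanner {
    static List<FileMeta> scan(Path root) throws IOException {
        List<FileMeta> result = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                result.add(new FileMeta(root.relativize(file),
                                        attrs.size(),
                                        attrs.lastModifiedTime().toMillis()));
                return FileVisitResult.CONTINUE;
            }
        });
        return result;
    }
}
```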

This is what comes to mind.

  • Load all of this data from the file system. It will be big, but it will fit in memory.

  • Make a list of candidate directory pairs with similarity scores. For each directory name that appears in both trees, score 1 point for every directory pair that shares that name. For each file name that appears in both trees (unless it appears so often that the match is meaningless), score 1 point for every pair of directories that both contain a file with that name. Score bonus points if the two files are identical. Score bonus points if the file name does not appear anywhere else. Whenever you give points, also give some points to all ancestor pairs, so if a/x/y/foo.txt is similar to b/z/y/foo.txt, then the pairs (a/x/y, b/z/y), (a/x, b/z) and (a, b) all get points. (A rough sketch of this scoring step appears after this list.)

  • If desired, discard all pairs whose scores are too low to bother with, and look at the remaining pairs more critically. So far we have only looked for similarities between directories. Look again, and penalize directory pairs that show signs of not having a common ancestry. (A principled way to do this is to compute the maximum score the two directories could reach if they contained all the same files and all of them matched, and reject the pair if it achieved only a small fraction of that possible score. But it is probably better to do something cheap and heuristic, or to skip this step entirely.)

  • Choose the best-scoring candidate directory pair. Report it. Eliminate those two directories and all of their subdirectories from consideration. Repeat.

Choosing the right data structures is left as an exercise.
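For illustration, the name-matching and ancestor-propagation part of the scoring could look roughly like this. This is only a sketch: it reuses the hypothetical FileMeta record from the earlier sketch, keys pairs by a simple string, uses an arbitrary cut-off for over-common names, and omits the bonus points for matching size, timestamp and name rarity.

```java
import java.nio.file.Path;
import java.util.*;

// Scores pairs of directories (one from each tree, as relative paths) by how many
// file names they share, giving points to every ancestor pair as well.
class PairScorer {

    // Index: file name -> directories (relative to the tree root) containing it.
    static Map<String, List<Path>> indexByName(List<FileMeta> files) {
        Map<String, List<Path>> index = new HashMap<>();
        for (FileMeta f : files) {
            Path dir = f.relativePath().getParent();
            index.computeIfAbsent(f.relativePath().getFileName().toString(),
                                  k -> new ArrayList<>())
                 .add(dir == null ? Path.of("") : dir);
        }
        return index;
    }

    // Returns a map from "dirInTreeA|dirInTreeB" to a similarity score.
    static Map<String, Integer> score(List<FileMeta> treeA, List<FileMeta> treeB) {
        Map<String, List<Path>> byNameA = indexByName(treeA);
        Map<String, List<Path>> byNameB = indexByName(treeB);
        Map<String, Integer> scores = new HashMap<>();

        for (Map.Entry<String, List<Path>> e : byNameA.entrySet()) {
            List<Path> dirsA = e.getValue();
            List<Path> dirsB = byNameB.getOrDefault(e.getKey(), List.of());
            // Skip names so common that a match is meaningless (arbitrary cut-off).
            if (dirsB.isEmpty() || dirsA.size() * dirsB.size() > 100) continue;
            for (Path a : dirsA) {
                for (Path b : dirsB) {
                    // 1 point for the pair itself and for every ancestor pair,
                    // walking both paths upward in lockstep. Bonus points for
                    // matching size/timestamp would be added here.
                    Path pa = a, pb = b;
                    while (pa != null && pb != null) {
                        scores.merge(pa + "|" + pb, 1, Integer::sum);
                        pa = pa.getParent();
                        pb = pb.getParent();
                    }
                }
            }
        }
        return scores;
    }
}
```

Picking the best candidate pair is then just a matter of sorting the map entries by score and skipping pairs whose directories have already been consumed.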

This algorithm does not try to find similar files that have different names. You could do that across large files using something like the rsync algorithm, but I'm not sure you need it.

This algorithm also makes no serious attempt to determine whether two files are really alike. It simply scores 1 point for a matching file name and bonus points for a matching size and timestamp. Of course, you could diff them to assign a more accurate score; I doubt it would be worth it.


Source: https://habr.com/ru/post/1300084/

