Git diff of the same files in two directories always results in "renamed"

git diff --no-index --no-prefix --summary -U4000 directory1 directory2

This works as expected in that it returns a diff of all files between the two directories. Files that are added are output as expected, and files that are removed also result in the expected diff output.

However, because diff treats the file path as part of the file name, files with the same name in the two different directories produce diff output flagged as renamed instead of changed.

  • Is there a way to tell git not to consider the full path to the file in diff and look only at the file name, as if the files came from the same directory?

  • Is there a way for git to know whether a copy of the same file in another directory has actually been renamed? I don't see how, unless it has some way to compare the files' MD5s or something (that's probably a bad assumption lol).

  • Would using branches instead of directories solve this problem easily, and if so, what would the branch version of the above command be?

1 answer

There are several questions here whose answers are intertwined. Let's start with rename and copy detection, then move on to branches.

Rename Detection

However, because diff treats the file path as part of the file name, files with the same name in the two different directories produce diff output flagged as renamed instead of changed.

This is not entirely correct. (The text below is meant to address both your questions 1 and 2.)

Although you are using --no-index (presumably to make Git work on directories outside a repository), Git's diff code behaves the same way in all cases. To diff (compare) two files in two trees, Git must first determine file identity. That is, there are two sets of files: those in the "left side" or source tree (the first directory name), and those in the "right side" or destination tree (the second directory name). Some files on the left are the same file as some files on the right. Some files on the left are different files that have no corresponding right-side file, i.e., they have been deleted. Finally, some files on the right are new, i.e., they have been created.

Files that are “the same file” do not have to have the same path name. In this case, these files have been renamed.

Here's how it works in detail. Note that the "full path name" changes slightly when using git diff --no-index dir1 dir2 : the "full path name" is what remains after removing the dir1 and dir2 prefixes.

When comparing the left-side and right-side trees, files that have the same full path name are normally identified automatically as "the same file." We put all of these files into a queue of "files to be diffed," and none of them will show up as renamed. Note the word "normally" here: we will come back to this in a moment.

This leaves us with the two remaining file lists:

  • paths that exist on the left but not on the right: source without destination
  • paths that exist on the right but not on the left: destination without source
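
A minimal Python sketch of this first pass, assuming each tree is given as a list of paths that still carry their directory prefix. The function names and sample paths are hypothetical, and the code models only the bookkeeping described above, not Git's own implementation:

```python
from pathlib import PurePosixPath


def strip_prefix(paths, prefix):
    """Return the set of paths with the leading directory prefix removed."""
    return {str(PurePosixPath(p).relative_to(prefix)) for p in paths}


def partition(left_paths, right_paths, left_dir, right_dir):
    """Split two trees into same-name pairs, unpaired sources, unpaired destinations."""
    left = strip_prefix(left_paths, left_dir)
    right = strip_prefix(right_paths, right_dir)
    paired = sorted(left & right)        # same full path name: queued for diffing
    sources = sorted(left - right)       # source without destination
    destinations = sorted(right - left)  # destination without source
    return paired, sources, destinations


paired, sources, destinations = partition(
    ["dir1/a/b/c.txt", "dir1/old.txt"],
    ["dir2/a/b/c.txt", "dir2/new.txt"],
    "dir1",
    "dir2",
)
print(paired)        # ['a/b/c.txt']
print(sources)       # ['old.txt']
print(destinations)  # ['new.txt']
```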

Naïvely, we can simply declare that all of these source files have been deleted and all of these destination files have been created. You can tell git diff to behave this way: pass the --no-renames flag to disable rename detection.

Or, Git can go on to use a smarter algorithm: pass the --find-renames and/or -M <threshold> flag to do this. In Git versions 2.9 and later, rename detection is enabled by default.

Now, how does Git decide that a source file has the same identity as a destination file? They have different paths; which right-side file does a/b/c.txt on the left correspond to? It might be d/e/f.bin, or d/e/f.txt, or a/b/renamed.txt, and so on. The actual algorithm is relatively simple, and in the past it did not take the final component of the name into account (I'm not sure whether it does now; Git is constantly evolving):

  • If there are source and destination files whose contents match exactly, pair them up. Because Git hashes contents, this comparison is very fast. We can compare the left-side a/b/c.txt, by its hash ID, with every file on the right just by looking at all of their hash IDs. So we first run through all the source files, find the destination files that match, put the new pairs into the diff queue, and pull them out of the two lists.

  • For all remaining source and destination files, run an algorithm that is efficient, but unsuitable for git diff output, to compute "file similarity." A source file that is at least <threshold> similar to some destination file causes a pairing, and that pair of files is removed from the lists. The default threshold is 50%: if you enable rename detection without choosing a specific threshold, two files that are still in the lists at this point and are at least 50% similar get paired.

  • Any remaining files are either deleted or created.
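
The three steps above can be sketched in Python. This is an illustration under stated assumptions, not Git's code: blob_id follows the way Git hashes file contents in SHA-1 repositories (a "blob <size>" header, a NUL byte, then the data), but the similarity score here comes from difflib.SequenceMatcher, which is a stand-in for Git's own similarity algorithm. The function names, the sample files, and the "best score wins" tie-breaking are all hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def blob_id(data: bytes) -> str:
    """Hash content the way Git hashes a blob (SHA-1 object format)."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def detect_renames(sources, destinations, threshold=0.5):
    """sources/destinations map path -> bytes; returns (renames, deleted, created)."""
    sources, destinations = dict(sources), dict(destinations)
    renames = []

    # Step 1: exact matches, found by comparing hash IDs only.
    by_hash = {}
    for path, data in destinations.items():
        by_hash.setdefault(blob_id(data), []).append(path)
    for src in list(sources):
        candidates = by_hash.get(blob_id(sources[src]))
        if candidates:
            dst = candidates.pop(0)
            renames.append((src, dst, 1.0))
            del sources[src], destinations[dst]

    # Step 2: inexact matches (SequenceMatcher is NOT what Git uses).
    for src in list(sources):
        best, best_score = None, 0.0
        for dst, data in destinations.items():
            score = SequenceMatcher(None, sources[src], data).ratio()
            if score > best_score:
                best, best_score = dst, score
        if best is not None and best_score >= threshold:
            renames.append((src, best, best_score))
            del sources[src], destinations[best]

    # Step 3: whatever is left is deleted (sources) or created (destinations).
    return renames, sorted(sources), sorted(destinations)


renames, deleted, created = detect_renames(
    {"a/b/c.txt": b"hello\n", "notes.txt": b"one\ntwo\nthree\nfour\n", "gone.txt": b"xyz"},
    {"d/e/f.txt": b"hello\n", "notes2.txt": b"one\ntwo\nthree\nfive\n", "fresh.txt": b"12345678"},
)
for src, dst, score in renames:
    print(f"rename {src} -> {dst} ({score:.0%} similar)")
print("deleted:", deleted)
print("created:", created)
```

The exact-match step never reads file contents twice: it only compares hash IDs, which is why it is so much cheaper than the pairwise similarity step.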

Now that we have found all the pairings, git diff goes on to diff the paired, same-identity files, and tells us that deleted files are deleted and newly created files are created. If the two path names of a same-identity pair differ, git diff says the file has been renamed.

The arbitrary-file-pairing code is expensive (even though the same-name-gives-a-pair code is very cheap), so Git has a limit on how many names go into these source and destination pairing lists. This limit is configured through git config diff.renameLimit. The default has grown over the years and is now several thousand files. You can set it to 0 (zero) to make Git always use its own internal maximum.

Breaking pairs

Above, I said that files with the same name are normally paired automatically. This is usually the right thing to do, so it is Git's default. In some cases, however, the left-side file named a/b/c.txt is not actually related to the right-side file named a/b/c.txt; it is really related to the right-side a/doc/c.txt, for example. We can tell Git to break pairings of files that are "too different."

We saw the "similarity index" used above to form pairs of files. The same similarity index can be used to split pairs: -B20%/60%, for example. The two numbers need not add up to 100%, and you can in fact omit either one or both: each has a default value if you turn on -B mode.

The first number is the point at which a file that is already paired by default can be put onto the rename-detection lists. With -B20%, if the files are 20% dissimilar (i.e., only 80% similar), the file goes onto the "source for renames" list. If it is never taken as a rename, it can re-pair with its automatic destination, but at that point the second number, the one after the slash, takes effect.

The second number sets the point at which a pairing is definitely broken. For example, with -B/70%, if the files are 70% dissimilar (i.e., only 30% similar), the pairing is broken. (Of course, if the file was taken away as a rename source, the pairing is already broken.)
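
As a rough illustration of how the two numbers interact, here is a small Python sketch. The function name, the three result labels, and the use of "at least" for both comparisons are assumptions made for this example; Git's exact boundary handling is not spelled out above. Percentages are kept as integers to avoid floating-point surprises.

```python
def break_decision(similarity_pct, rename_break_pct, definite_break_pct):
    """Classify a same-name pair, given -B<rename_break>%/<definite_break>%."""
    dissimilarity = 100 - similarity_pct
    if dissimilarity >= definite_break_pct:
        return "broken"                   # shown as a delete plus a create
    if dissimilarity >= rename_break_pct:
        return "rename-source candidate"  # may still re-pair with its own name
    return "kept"                         # ordinary same-name pair


# The -B20%/60% example from the text:
print(break_decision(95, 20, 60))  # kept
print(break_decision(80, 20, 60))  # rename-source candidate
print(break_decision(30, 20, 60))  # broken
```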

Copy detection

Besides the usual pairing and rename detection, you can ask Git to find copies of source files. After running all the usual pairing code, including finding renames and breaking pairs, if you have specified -C, Git will look for "new" (i.e., unpaired) destination files that were actually copied from existing sources. There are two modes for this, depending on whether you specify -C twice or add --find-copies-harder: one that considers only source files that have been modified (the single -C case), and one that considers every source file (the double -C or --find-copies-harder case). Note that "the source file was modified" means, in this case, that the source file is already in the pairing queue (if it is not, it is not "modified" by definition) and that its corresponding destination file has a different hash ID. Again, this is a very cheap test, which helps keep the single -C option cheap.
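
The difference between the two modes comes down to which source files are even considered. A minimal Python sketch, assuming each tree is a mapping from path to content hash ID; the function name and the sample hash IDs are hypothetical:

```python
def copy_source_candidates(left, right, harder=False):
    """Return the source paths that copy detection would consider.

    left/right map path -> content hash ID for the source and destination trees.
    """
    if harder:
        # Double -C or --find-copies-harder: every source file is a candidate.
        return sorted(left)
    # Single -C: only sources that are paired by name and whose hash ID changed.
    return sorted(p for p, h in left.items() if p in right and right[p] != h)


left = {"a.txt": "h1", "b.txt": "h2", "c.txt": "h3"}
right = {"a.txt": "h1", "b.txt": "h2-modified", "new.txt": "h3"}

print(copy_source_candidates(left, right))               # ['b.txt']
print(copy_source_candidates(left, right, harder=True))  # ['a.txt', 'b.txt', 'c.txt']
```

In the cheap mode only b.txt qualifies, because it is the only paired file whose hash ID differs between the two sides.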

Branches don't matter

Would using branches instead of directories solve this problem easily, and if so, what would the branch version of the above command be?

Branches make no difference here.

In Git, the term branch is ambiguous. See What exactly do we mean by "branch"? For git diff, however, a branch name simply resolves to a single commit, namely the tip commit of that branch.

I like to draw Git branches like this:

 ...--o--o--o   <-- branch1
       \
        o--o--o   <-- branch2

Each small round o represents a commit. The two branch names are simply pointers in Git: each points to one specific commit. The name branch1 points to the right-most commit on the top row, and the name branch2 points to the right-most commit on the bottom row.

Each commit in Git points back to its parent or parents (most commits have only one parent, while a merge is simply a commit with two or more parents). This is what forms the chain of commits that we also call a "branch". A branch name points directly to the tip of a chain. 1

When you run:

 $ git diff branch1 branch2 

all Git does is resolve each name to its corresponding commit. For example, if branch1 names commit 1234567... and branch2 names commit 89abcde..., this just does the same thing as:

 $ git diff 1234567 89abcde 

Git diff takes two trees

Git doesn't even really care that these are commits. Git just needs a left-side or source tree and a right-side or destination tree. These two trees can come from a commit, because a commit names a tree: the tree of any commit is the source snapshot taken when you made that commit. They can come from a branch, because a branch name names a commit, which names a tree. One of the trees can come from Git's "index" (aka "staging area" aka "cache"), since the index is basically a flattened tree. 2 One of the trees can be your work-tree. One or both trees can even be outside anything Git controls (hence the --no-index flag).

Git can diff just two files, of course

If you run git diff --no-index /path/to/file1 /path/to/file2, Git will simply diff the two files, i.e., treat them as a pair. This bypasses all the pairing and rename-detection code entirely. If no amount of fiddling with the --no-renames, --find-renames, --rename-threshold, etc., options does the trick, you can explicitly diff file paths rather than directory (tree) paths. For a large set of files, this will be painful, of course.
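
The chain "branch name names a commit, which names a tree" can be modelled with two lookups. This Python sketch is a toy model only: the refs and commits tables and the tree labels are made up for illustration, reusing the abbreviated commit IDs from the example above.

```python
# Hypothetical lookup tables standing in for a repository.
refs = {"branch1": "1234567", "branch2": "89abcde"}
commits = {
    "1234567": {"tree": "tree-of-1234567"},
    "89abcde": {"tree": "tree-of-89abcde"},
}


def resolve_commit(name):
    """A branch name resolves to its tip commit; a commit ID resolves to itself."""
    return refs.get(name, name)


def resolve_tree(name):
    """Every commit names exactly one tree (its snapshot)."""
    return commits[resolve_commit(name)]["tree"]


def diff_inputs(left, right):
    """The pair of trees that a diff of left vs. right would compare."""
    return resolve_tree(left), resolve_tree(right)


print(diff_inputs("branch1", "branch2"))
print(diff_inputs("1234567", "89abcde"))  # the same pair of trees
```

Both calls produce the same pair of trees, which is why using branch names instead of commit IDs (or directories) does not change how the pairing and rename detection behave.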


1 There can be more commits beyond this point, and it is still the tip of its chain. Moreover, multiple names can point to a single commit. I draw this situation as:

 ...--o--o   <-- tip1
          \
           o--o   <-- tip2, tip3

Note that commits that are "behind" more than one branch name are, in effect, on all of those branches. Thus, both bottom-row commits are on both branches tip2 and tip3, while both top-row commits are on all three branches. However, each branch name resolves to one, and only one, commit.
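
One way to make "is on a branch" concrete is reachability: a commit is on a branch if you can get to it by following parent links from the branch tip. A small Python sketch of the drawing above, with hypothetical commit labels A and B for the top row and C and D for the bottom row (the history before A is left out):

```python
# Each commit lists its parents; the names are labels for this example only.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
tips = {"tip1": "B", "tip2": "D", "tip3": "D"}


def reachable(commit):
    """Return every commit reachable from the given one via parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def branches_containing(commit):
    """Names of all branches whose tip can reach the commit."""
    return sorted(name for name, tip in tips.items() if commit in reachable(tip))


print(branches_containing("A"))  # ['tip1', 'tip2', 'tip3']
print(branches_containing("C"))  # ['tip2', 'tip3']
```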

2 In fact, to make a new commit, Git simply converts the index, as it stands right now, into a tree using git write-tree, and then makes a commit that names that tree (and that uses the current commit as its parent, and has an author, a committer, and a commit message). The fact that Git uses the existing index is why you must git add your updated work-tree files to the index before committing.

There are some convenience shortcuts that let you tell git commit to add files to the index, for example git commit -a or git commit <paths>. These can be a bit tricky, as they do not always produce the index you might expect. See the --include vs --only options to git commit <paths>, for instance. They also work by copying the main index to a new, temporary index, and this can produce surprising results, because if the commit succeeds, the temporary index is copied back over the normal index.


Source: https://habr.com/ru/post/1259949/

