What is the impact of a large number of branches in a Git repository?

Does anyone know what effect a large number of branches (2000+) has on a Git repository? Do git pull or git fetch slow down because of the many branches? If there is a difference, please point to benchmarks.

+12
4 answers

As others have noted, branches and other refs are just files in the file system (not entirely true because of packed refs) and are pretty cheap, but that does not mean their number cannot affect performance. See the Bad push with lots of refs thread on the Git mailing list for a recent (December 2014) example of Git performance suffering from the presence of 20k refs in a repository.

If I recall correctly, some of the ref processing was O(n²) a few years ago, but it has been improved since. There is a repo-discuss thread from March 2012 that contains some potentially useful details, albeit possibly dated and specific to JGit.

Various versions of the Scaling Gerrit presentation discuss (among other things) potential problems with a high ref count, but they also note that several sites run repositories with over 100k refs. We have a repository with ~150k refs, and I don't think we are seeing performance issues.

One aspect of having a lot of refs is the size of the ref advertisement sent at the beginning of some Git transactions. The advertisement of the 150k-ref repository mentioned above is about 10 MB, i.e. every git fetch operation downloads that much data.
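You can gauge both numbers for any repository with git ls-remote, which prints exactly the refs a fetch would be offered. A minimal sketch against a throwaway local repository (the 200-branch count is purely illustrative):

```shell
set -e
# Create a disposable repo with many branches, then measure how many refs
# a fetch would be advertised and roughly how large that advertisement is.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m init
for i in $(seq 1 200); do git -C "$repo" branch "branch-$i"; done
refs=$(git ls-remote "$repo" | wc -l)    # one line per advertised ref
bytes=$(git ls-remote "$repo" | wc -c)   # rough advertisement size in bytes
echo "advertised refs: $refs (~$bytes bytes)"
rm -rf "$repo"
```

Run the same two ls-remote commands against a real remote URL to see what your own fetches pay for.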

So yes, don't ignore the issue completely, but you shouldn't lose any sleep over 2000 refs.

+16

March 2015: I don't have benchmarks, but one way to keep git fetch reasonable even when the remote repository has a large set of branches is to specify a more specific, less general refspec than the default one:

 fetch = +refs/heads/*:refs/remotes/origin/* 

You can add as many fetch refspecs for a remote as you want, effectively replacing the catch-all refspec above with more specific ones that include only the branches you actually need (even if there are thousands of branches in the remote repo):

 fetch = +refs/heads/master:refs/remotes/origin/master
 fetch = +refs/heads/br*:refs/remotes/origin/br*
 fetch = +refs/heads/mybranch:refs/remotes/origin/mybranch
 ....
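You can set such refspecs with git config rather than editing .git/config by hand. A small sketch (repository paths and branch names here are made up for illustration):

```shell
set -e
# Build a throwaway "remote" with two branches.
remote=$(mktemp -d)
git init -q "$remote"
git -C "$remote" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m init
git -C "$remote" branch mybranch
git -C "$remote" branch otherbranch

# In a fresh clone-like repo, replace the default catch-all refspec
# '+refs/heads/*:refs/remotes/origin/*' with one specific refspec.
clone=$(mktemp -d)
git init -q "$clone"
git -C "$clone" remote add origin "$remote"
git -C "$clone" config --unset-all remote.origin.fetch || true
git -C "$clone" config --add remote.origin.fetch \
    '+refs/heads/mybranch:refs/remotes/origin/mybranch'

git -C "$clone" fetch -q origin
fetched=$(git -C "$clone" for-each-ref refs/remotes/origin | wc -l)
echo "remote-tracking refs after fetch: $fetched"  # only mybranch was fetched
rm -rf "$remote" "$clone"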

April 2018: git fetch itself got faster in Git 2.18 (Q2 2018).

See commit 024aa46 (14 Mar 2018) by Takuto Ikuta (atetubou).
(Merged by Junio C Hamano in commit 5d806b7, 09 Apr 2018)

fetch-pack.c: use oidset to check existence of loose objects

When fetching from a repository with a large number of refs, 'git fetch' checks whether each ref exists in the local repository among both packed and loose objects, so it ends up issuing many lstat(2) calls for non-existent loose objects, which makes it slow.

Instead of issuing one lstat(2) call per ref advertised by the remote to see whether the object exists in loose form, enumerate all existing loose objects into a hashmap in advance and use it for the existence check when the number of refs exceeds the number of loose objects.

With this patch, the number of lstat(2) calls in git fetch drops from 411412 to 13794 for the chromium repository, which has more than 480000 remote refs.

I took timing statistics for git fetch on the chromium repository, run 3 times on Linux with an SSD.

 * with this patch
   8.105s  8.309s  7.640s   avg: 8.018s
 * master
   12.287s 11.175s 12.227s  avg: 11.896s

On my MacBook Air, which has a slower lstat(2):

 * with this patch
   14.501s
 * master
   1m16.027s

git fetch on a slow disk is greatly improved by this.
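The commit's "enumerate once instead of stat'ing per ref" idea works because loose objects are plain files under .git/objects/xx/, so a single directory walk lists them all. A minimal sketch of that enumeration, using a throwaway repository:

```shell
set -e
# One directory walk over .git/objects finds every loose object,
# which is what fetch-pack now preloads into a set instead of
# calling lstat(2) once per advertised ref.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m init
loose=$(find "$repo/.git/objects" -type f \
          -not -path '*/pack/*' -not -path '*/info/*' | wc -l)
echo "loose objects: $loose"   # the initial commit plus its tree
rm -rf "$repo"
```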


Note that the hashmap used in packfile was further improved in Git 2.24 (Q4 2019).

See commit e2b5038, commit 404ab78, commit 23dee69, commit c8e424c, commit 8a973d0, commit 87571c3, commit 939af16, commit f23a465, commit f0e63c4, commit 6bcbdfb, commit 973d5ee, commit 26b455f, commit bb6ee5, commit d22245a, commit d0a48a0, commit 12878c8, commit e010a41 (06 Oct 2019) by Eric Wong (ele828).
Suggested by: Phillip Wood (phillipwood).
(Merged by Junio C Hamano in commit 5efabc7, 15 Oct 2019)

For instance:

packfile : use hashmap_entry in delta_base_cache_entry

Signed-off-by: Eric Wong
Reviewed-by: Derrick Stolee

The hashmap_entry_init function is intended to take a pointer to struct hashmap_entry, not a pointer to struct hashmap.

This went unnoticed because hashmap_entry_init accepts a "void *" argument instead of "struct hashmap_entry *", and struct hashmap is larger than struct hashmap_entry, so it can be cast to one without corrupting data.

This has the beneficial side effect of reducing the size of delta_base_cache_entry from 104 bytes to 72 bytes on 64-bit systems.

+5

Yes and no. Locally this is mostly not a problem, although it still affects a few local commands; in particular, describing a commit in terms of the available refs (git describe).

Over the network, Git performs the initial ref advertisement when connecting for updates. This is described in the pack protocol document. The problem here is that your network connection may be slow or have high latency, so the initial advertisement can take a while. Eliminating this requirement has been discussed, but as always, compatibility concerns complicate things. The most recent discussion of this issue is here.

You will probably also want to look at a recent discussion on Git scaling. There are many axes along which you might want Git to scale, and most of them have been discussed there. I think it gives a good idea of what Git is already good at and where it could use some work. I would summarize it for you, but I don't think I could do it justice; there is a lot of useful information there.

+4

To answer your question, you need to know how Git handles branches. What is a branch?

A branch is just a reference to a commit in the local repo, so creating branches is very cheap. The .git directory contains the metadata Git uses: creating a branch simply creates a ref file for the local branch and a reflog entry. In other words, creating a branch creates a file and a reference, and a file system can easily handle 2000 files.
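To see this concretely, here is a small sketch using a throwaway repository. It assumes the default "files" ref backend; with packed refs or the newer reftable backend the hash lives elsewhere, but the branch is still just a pointer to a commit:

```shell
set -e
# A freshly created branch is a tiny file holding a commit hash.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m init
git -C "$repo" branch feature
ref=$(cat "$repo/.git/refs/heads/feature")   # contents of the branch file
head=$(git -C "$repo" rev-parse HEAD)        # the commit it points to
echo "branch file contents: $ref"
rm -rf "$repo"
```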

I recommend reading 3.1 Git Branching - Branches in a Nutshell, which will help you better understand how branches are handled.

+3

Source: https://habr.com/ru/post/983300/

