Git is really slow for 100,000 objects. Any fixes?

I have a "fresh" git-svn repo (11.13 GB) that contains over 100,000 objects.

I ran

 git fsck 
 git gc 

on the repo after the initial checkout.

Then I tried to run

 git status 

which takes anywhere from 2m25.578s to 2m53.901s.

I benchmarked git status by issuing the command

 time git status 

five times, and all of the runs fell between the two times listed above.

I am doing this on Mac OS X, locally, not through a virtual machine.

There is no way it should be taking this long.

Any ideas? Help?

Thanks.

Edit

I have a coworker sitting right next to me with a comparable box. Less RAM, and running Debian with the jfs file system. His git status runs in 0.3 seconds on the same repo (it is also a git-svn checkout).

Also, I recently changed my file permissions (to 777) on this folder, and that brought the time down considerably (why, I have no idea). Now I can get it done anywhere between 3 and 6 seconds. That is manageable, but still a pain.

+46
performance git git-svn
Jul 22 '10 at 22:10
12 answers

It came down to a couple of items that I can see right now.

  • git gc --aggressive
  • Opening up file permissions to 777

There must have been something else going on, but these were the things that clearly made the biggest impact.

+24
Jul 26 '10 at 23:03

git status has to look at every file in the repository each time it runs. You can tell it not to look at trees you aren't working with:

 git update-index --assume-unchanged <trees to skip> 

source

From the man page:

When these flags are specified, the object names recorded for the paths are not updated. Instead, these options set and unset the "assume unchanged" bit for the paths. When the "assume unchanged" bit is on, git stops checking the working tree files for possible modifications, so you need to manually unset the bit to tell git when you change the working tree file. This is sometimes helpful when working with a big project on a filesystem that has a very slow lstat(2) system call (e.g. cifs).

This option can also be used as a coarse file-level mechanism to ignore uncommitted changes in tracked files (akin to what .gitignore does for untracked files). git will fail (gracefully) in case it needs to modify this file in the index, e.g. when merging in a commit; thus, in case the assumed-untracked file is changed upstream, you will need to handle the situation manually.

Many operations in git depend on your filesystem having an efficient lstat(2) implementation, so that st_mtime information for working tree files can be cheaply checked to see if the file contents have changed from the version recorded in the index file. Unfortunately, some filesystems have inefficient lstat(2). If your filesystem is one of them, you can set the "assume unchanged" bit on paths you have not changed to cause git not to do this check. Note that setting this bit on a path does not mean git will check the contents of the file to see if it has changed; it makes git omit any checking and assume it has not changed. When you make changes to working tree files, you have to explicitly tell git about it by dropping the "assume unchanged" bit, either before or after you modify them.

...

In order to set the "assume unchanged" bit, use the --assume-unchanged option. To unset it, use --no-assume-unchanged.

The command looks at the core.ignorestat configuration variable. When this is true, paths updated with git update-index paths... and paths updated with other git commands that update both the index and working tree (e.g. git apply --index, git checkout-index -u, and git read-tree -u) are automatically marked as "assume unchanged". Note that the "assume unchanged" bit is not set if git update-index --refresh finds the working tree file matches the index (use git update-index --really-refresh if you want to mark them as "assume unchanged").




Now, obviously, this solution is only going to work if there are parts of the repo that you can conveniently ignore. I work on a project of a similar size, and there are certain trees that I don't need to check on a regular basis. The semantics of git-status make it a generally O(n) problem (n being the number of files). You need domain-specific optimizations to do better than that.
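For illustration, here is a minimal sketch of how that can be applied; the vendor/ path is only a placeholder, not something from the original question:

 # mark everything under a tree you rarely touch as "assume unchanged"
 git ls-files -z vendor/ | xargs -0 git update-index --assume-unchanged

 # list the paths currently marked (lowercase status letter in the first column)
 git ls-files -v | grep '^[a-z]'

 # drop the bit again before working in that tree
 git ls-files -z vendor/ | xargs -0 git update-index --no-assume-unchanged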

Note that if you work in a merge-based workflow, i.e. if you integrate changes from upstream by merging instead of rebasing, this solution becomes less convenient, because an upstream change to a file marked assume-unchanged becomes a merge conflict. You can avoid this problem with a rebase-based workflow.

+15
Jul 25 '10 at 23:45

One long-term solution is to augment git to cache the file system state internally.

Karsten Blees has done so for msysgit, which dramatically improves performance on Windows. In my experiments, his change took the "git status" time from 25 seconds down to 1-2 seconds on my Win7 machine running in a virtual machine.

Karsten's changes: https://github.com/msysgit/git/pull/94

Discussion of the caching approach: https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion
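As a side note, and as an assumption beyond what this answer states, that caching work later shipped in Git for Windows as the core.fscache setting, which can be toggled like this:

 git config core.fscache true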

+5
Oct 17 '13 at 15:29

All in all, my mac is OK with git, but if there are a lot of loose objects it gets much slower. It seems that hfs is not very good with lots of files in a single directory.

 git repack -ad 

followed by

 git gc --prune=now 

will create a single pack file and remove any loose objects left over. It can take a while to run.
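As a rough sketch of how to check whether loose objects are actually the problem (git count-objects is standard git; the sequence below is just one way to do it):

 # how many loose objects and pack files are there right now?
 git count-objects -v

 # repack everything into a single pack and drop redundant packs
 git repack -ad

 # delete the loose objects that are no longer referenced
 git gc --prune=now

 # confirm the loose-object count has dropped
 git count-objects -v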

+4
Mar 06 '14 at 20:13

You can try passing the --aggressive switch to git gc and see if this helps:

 # this will take a while ...
 git gc --aggressive

Alternatively, you can use git filter-branch to remove old commits and/or files if you have things in your history that you don't need (e.g. old binaries).
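A minimal sketch of that approach, with a placeholder path for the unwanted file (this rewrites history, so anyone sharing the repo has to re-clone or rebase afterwards):

 # remove one large file from every commit; the path is only an example
 git filter-branch --index-filter \
   'git rm --cached --ignore-unmatch path/to/old-binary.bin' \
   --prune-empty -- --all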

+3
Jul 22 '10

git status should be faster in Git 2.13 (Q2 2017), due to:

  • an optimization around handling arrays of strings (see "ways to improve git status performance")
  • better "read cache" management.

On that last point, see commit a33fc72 (April 14, 2017) by Jeff Hostetler ( jeffhostetler ).
(Merged by Junio C Hamano - gitster - in commit cdfe138 , April 24, 2017)

read-cache : force_verify_index_checksum

Teach git to skip verification of the SHA-1 checksum at the end of the index file in verify_hdr() , which is called from read_index() , unless the " force_verify_index_checksum " global variable is set.

Teach fsck to force this verification.

The checksum verification is for detecting disk corruption, and for small projects the time it takes to compute SHA-1 is not that significant, but for gigantic repositories this calculation adds significant time to every command.




Git 2.14 improves git status performance again by taking the "untracked cache" into account, which allows Git to skip reading untracked directories if their stat data has not changed, using the mtime field of the stat structure.

See Documentation/technical/index-format.txt .

See commit edf3b90 (May 08, 2017) by David Turner ( dturner-tw ).
(Merged by Junio C Hamano - gitster - in commit fa0624f , May 30, 2017)

When git checkout , git merge , etc. manipulate the in-core index, various pieces of information in the index extensions are discarded from the original state, as they are usually not kept up to date and in sync with the operation on the main index.

The untracked cache extension is now copied across these operations, which would speed up "git status" (as long as the cache is properly invalidated).
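If you want to try the untracked cache yourself (available since Git 2.8; this is a sketch, not part of the quoted change), it can be enabled per repository:

 # check whether the filesystem supports the untracked cache
 git update-index --test-untracked-cache

 # turn it on for this repository
 git config core.untrackedCache true
 git update-index --untracked-cache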




More generally, writing the index will also be faster with Git 2.14.x/2.15.

See commit ce012de , commit b50386c , commit 3921a0b (August 21, 2017) by Kevin Willford (``).
(Merged by Junio C Hamano - gitster - in commit 030faf2 , August 27, 2017)

We used to spend more cycles than necessary allocating and freeing a piece of memory while writing each index entry out.
This has been optimized.

[That] would save anywhere between 3-7% when the index had over a million entries, with no performance degradation on small repos.




December 2017 update: Git 2.16 (Q1 2018) will offer an additional improvement, this time for git log , as the code to iterate over loose object files just got optimized.

See commit 163ee5e (December 04, 2017) by Derrick Stolee ( derrickstolee ).
(Merged by Junio C Hamano - gitster - in commit 97e1f85 , December 13, 2017)

sha1_file : use strbuf_add() instead of strbuf_addf()

Replace use of strbuf_addf() with strbuf_add() when enumerating loose objects in for_each_file_in_obj_subdir() . Since we already check the length and hex values of the string before consuming the path, we can prevent extra computation by using the lower-level method.

One consumer of for_each_file_in_obj_subdir() is the abbreviation code. OID ( object identifier ) abbreviations use a cached list of loose objects (per object subdirectory) to make repeated queries fast, but there is significant cache load time when there are many loose objects.

Most repositories do not have many loose objects before repacking, but in the GVFS case (see " Announcing GVFS (Git Virtual File System) "), the repos can grow to have millions of loose objects.
Profiling 'git log' performance in Git For Windows on a GVFS-enabled repo with ~2.5 million loose objects revealed that 12% of the CPU time was spent in strbuf_addf() .

Add a new performance test to p4211-line-log.sh that is more sensitive to this cache loading.
By limiting to 1000 commits, we more closely resemble user wait time when reading history into a pager.

For a copy of the Linux repo with two ~512 MB packfiles and ~572K loose objects, running 'git log --oneline --parents --raw -1000' had the following performance:

  HEAD~1            HEAD
 ----------------------------------------
  7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%
+3
Apr 27 '17 at 21:12

For what it's worth, I recently found a large discrepancy in git status times between my master and dev branches.

To cut a long story short, I tracked the problem down to a single 280 MB file in the project root directory. It was an accidental checkin of a database dump, so it was fine to delete it.

Here before and after:

 ⚡ time git status
 # On branch master
 nothing to commit (working directory clean)
 git status  1.35s user 0.25s system 98% cpu 1.615 total

 ⚡ rm savedev.sql

 ⚡ time git status
 # On branch master
 # Changes not staged for commit:
 #   (use "git add/rm <file>..." to update what will be committed)
 #   (use "git checkout -- <file>..." to discard changes in working directory)
 #
 #       deleted:    savedev.sql
 #
 no changes added to commit (use "git add" and/or "git commit -a")
 git status  0.07s user 0.08s system 98% cpu 0.157 total

I have 105,000 objects in the store, but it seems that large files are more of a threat than lots of small files.
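If you suspect something similar, here is a rough way to look for oversized tracked files (the 20-line cut-off is arbitrary, and sizes are reported in KB):

 # list the 20 largest tracked files
 git ls-files -z | xargs -0 du -k | sort -rn | head -20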

+2
Oct 02 '11 at 16:40

You can also try git repack

+1
Jul 22 '10

Maybe you are using a virus scanner? I have tested some big projects here on Windows and Linux and it was damn fast!

I don't think you need to run git gc on a cloned repo (it should be clean already).

Is your hard drive OK? IOPS and R/W per second? Maybe it is damaged?
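As a rough way to sanity-check the disk on OS X while git status runs (iostat ships with the system; the one-second interval is arbitrary):

 # watch disk throughput once per second while running git status in another terminal
 iostat -d -w 1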

0
Jul 22 '10

Spotlight may be trying to index the files. Perhaps turn off Spotlight for your code directory. Check Activity Monitor and see which processes are running.
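A small sketch of how to check for that (per-folder exclusion is done via System Preferences > Spotlight > Privacy; mdutil works per volume, and the volume name below is a placeholder):

 # see whether Spotlight's indexer is busy while git status is slow
 ps aux | grep -E 'mds|mdworker' | grep -v grep

 # check indexing status, or disable it for an entire volume
 mdutil -s /
 sudo mdutil -i off /Volumes/YourVolume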

0
Jul 22 '10

I would try creating a partition with a different file system. HFS+ has always been sluggish for me compared to doing similar operations on other file systems.

0
Jul 24 '10

Try running the prune command; it will get rid of unneeded objects:

 git remote prune origin 

0
Mar 10 '16 at 11:49


