Why can't Git handle large files and large repositories?

Dozens of questions and answers on SO and elsewhere emphasize that Git cannot handle large files or large repositories. A number of workarounds are suggested, such as git-fat and git-annex, but ideally Git would handle large files and large repos natively.

If this limitation has existed for years, is there a reason it has not yet been removed? I assume there is some technical or design constraint baked into Git that makes support for large files and large repos extremely difficult.

There are lots of related questions, but none of them seems to explain why this is such a big hurdle.

+6
3 answers

Basically, it comes down to tradeoffs.

One of the questions you linked has Linus himself weighing in:

[...] CVS, ie it really ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

And yes, then there are the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

Just as you will not find a data structure with both O(1) access and O(1) insertion, you will not find a content tracker that does everything fantastically well.

Git deliberately chose to be better at some things, to the detriment of others.


Disk usage

Since git is a DVCS (distributed version control system), everyone has a copy of the entire repo (unless you use the relatively recent shallow clone).

This has some really nice benefits, which is why DVCSs such as git have become insanely popular.

However, a 4 TB repo on a central server with SVN or CVS is manageable, whereas if you use Git, nobody will be thrilled to carry all of it around.
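As a rough illustration of the escape hatch mentioned above, here is a small Python sketch (my own, not from the answer) that drives git's shallow and partial clone options via subprocess. The repository URL and target directories are placeholders, and partial clone requires server-side support.

```python
import subprocess

REPO_URL = "https://example.com/big-repo.git"   # placeholder repository

# Shallow clone: only the most recent commit, no deep history.
subprocess.run(
    ["git", "clone", "--depth", "1", REPO_URL, "big-repo-shallow"],
    check=True,
)

# Partial clone: full history, but file contents (blobs) are downloaded lazily,
# only when a checkout or diff actually needs them.
subprocess.run(
    ["git", "clone", "--filter=blob:none", REPO_URL, "big-repo-partial"],
    check=True,
)
```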

Git has excellent mechanisms for minimizing the size of your repo by creating delta chains ("diffs") between files. Git is not limited by paths or commit order when creating them, and they really do work very well... kind of like gzipping the entire repo.

Git puts all these little diffs into packfiles. Delta chains and packfiles make object retrieval a bit slower, but they are very effective at minimizing disk usage. (There are those tradeoffs again.)

That mechanism doesn't work as well for binary files, which tend to differ quite a bit even after a "small" change.
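To make the contrast concrete, here is a crude sketch (mine, not git's actual packfile code) comparing how much a one-line edit changes a text file versus its compressed form, which stands in for a typical binary asset:

```python
import difflib
import zlib

# Two versions of a "source file": one line edited out of a thousand.
text_v1 = "".join("line %d: some configuration value\n" % i for i in range(1000))
text_v2 = text_v1.replace("line 5:", "line 5 (edited):", 1)

# A line-based delta between the text versions is tiny compared to the file.
delta = list(difflib.unified_diff(text_v1.splitlines(), text_v2.splitlines(), lineterm=""))
print("text size:", len(text_v1), "bytes; delta size:", sum(len(l) + 1 for l in delta), "bytes")

# Compressed data stands in for a binary asset (image, archive, ...): the same
# one-line edit changes a large share of the bytes, so deltas buy very little.
blob_v1 = zlib.compress(text_v1.encode())
blob_v2 = zlib.compress(text_v2.encode())
unchanged = sum(1 for a, b in zip(blob_v1, blob_v2) if a == b)
print("binary size:", len(blob_v1), "bytes; bytes unchanged after the edit:", unchanged)
```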


History

When you check in a file, you have it forever and always. Your grandchildren's grandchildren's grandchildren will download your cat gif every time they clone your repo.

This, of course, is not unique to Git, but being a DVCS makes the consequences more significant.

Although files can be deleted, git's content-based design (every object ID is the SHA of its contents) makes deleting those files difficult, invasive, and destructive to history. In contrast, I can delete an obsolete binary from an artifact repository or an S3 bucket without affecting the rest of my content.
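For reference, here is a small sketch of that content addressing: a blob's object ID is the SHA-1 of a short header plus the file's bytes, which is why every tree and commit that (transitively) references a large file has to be rewritten to truly remove it.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Object ID of a git blob: SHA-1 over 'blob <size>\\0' plus the content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Matches `git hash-object --stdin` for the same bytes:
print(git_blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```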


Complexity

Working with really large files requires a lot of careful work to keep operations to a minimum and never load the whole thing into memory. That is extremely difficult to do reliably in a program with a feature set as complex as git's.
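As a tiny illustration of that discipline (my own sketch, not git code), here is how you hash a large file in fixed-size chunks so that only a small buffer is ever resident in memory; the path is a placeholder.

```python
import hashlib

def hash_file_streaming(path: str, chunk_size: int = 1024 * 1024) -> str:
    """SHA-1 of a file computed in fixed-size chunks, never loading it whole."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)   # at most ~1 MiB resident at a time
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

# The naive version needs the whole file in RAM at once:
#   hashlib.sha1(open(path, "rb").read()).hexdigest()
print(hash_file_streaming("/path/to/huge/file.bin"))   # placeholder path
```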


Conclusion

Ultimately, developers who say "don't put large files in Git" are a bit like those who say "don't put large files in databases." They don't like it, but the alternatives have disadvantages (Git integration in one case, ACID compliance and foreign keys in the other). In reality it usually works fine, especially if you have enough memory.

It just doesn't work as well as the thing it was designed for.

+9

It is not true that git "cannot handle" large files. It's just that you probably don't want to use git to manage a repository of large binary files, because a git repo contains the complete history of every file, and delta compression is much less effective on most kinds of binary files than on text files. The result is a very large repo that takes a long time to clone, uses a lot of disk space, and can be unacceptably slow for other operations because of the huge amount of data it has to work through.

Alternatives and add-ons, such as git-annex, store the contents of large binary files separately, in a way that breaks git's usual assumption that every previous state of the repository is available offline at any time, but avoids shipping so much data.
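For a sense of that workflow, here is a hedged sketch of the basic git-annex commands (the file name and commit message are placeholders): the large content lives in the annex rather than in the normal object database, and is fetched only on demand.

```python
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

run("git", "annex", "init")                    # enable git-annex in the current repo
run("git", "annex", "add", "big-video.mp4")    # content goes into the annex;
                                               # git itself only tracks a small pointer
run("git", "commit", "-m", "Add video via git-annex")

# On another clone, the pointer is present but the content is not; fetch it
# only when it is actually needed:
run("git", "annex", "get", "big-video.mp4")
```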

+2

This is because every checkout contains every version of every file.

Now, there are ways for git to mitigate this, such as binary deltas and shallow clones, but every git client will still have at least two copies of every file (one in the working tree, one in the repository). Whether that is a problem for you depends on your circumstances.
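If you want to see the "two copies" effect in an existing clone, here is a small sketch (my own, with a placeholder path) that compares the size of the working tree with the size of the .git directory:

```python
import os

def dir_size(path: str) -> int:
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # skip broken symlinks and the like
    return total

repo = "/path/to/some/clone"          # placeholder: any local git clone
git_dir = dir_size(os.path.join(repo, ".git"))
work_tree = dir_size(repo) - git_dir  # everything outside .git
print("working tree:", work_tree, "bytes")
print(".git (full packed history):", git_dir, "bytes")
```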

0

Source: https://habr.com/ru/post/984562/

