What is the mathematical structure that represents a Git repository?

I am learning Git, and it would be great if someone could describe the mathematical structure that represents a Git repository. For example: it is a directed acyclic graph; its nodes are commits; its nodes have labels (at most one label per node, and no label is used twice), which represent branches; and so on. (I know this description is incorrect, I'm just trying to explain what I'm looking for.)

2 answers

In addition to the links in Nevik Rehnel's comment (copied here for posterity: eagain.net/articles/git-for-computer-scientists and gitolite.com/gcs) and the observation that the commit graph forms a Merkle tree, I will add a few notes.

  • There are four types of objects in the object store: commit, tree, annotated tag, and blob (file).
  • A commit object contains exactly one tree reference (that tree can, of course, point to more trees), a possibly empty list of parent SHA-1 hashes (which should be the hashes of other commits), an author (name, email address, and timestamp), a committer (same form as the author), and the commit text.
  • A tree object contains a list of (mode, sub-object, file name) entries, repeated zero or more times. If the sub-object is another tree, the file name is a directory. If it is a blob, the entry represents a file. The mode looks like a POSIX file mode, and if it is 120000 (the file mode for a symbolic link), the "contents" of the file are really the target of the symbolic link. Some mode value is (ab)used for submodules, but I forget which. The R and W mode bits are not saved, only the X bit (and even that is ignored if the repo configuration says to ignore it).
  • An annotated tag object contains a reference to an object, a tagger (name, email address, and timestamp), and the tag text. The referenced object is usually a commit, but a tag object can point to any object (even another tag object).
  • Labels (branches, tags, reflog entries, etc.) live outside the object store. For an annotated tag, there is an external label that points to the annotated tag object inside the object store. For a lightweight tag, the external label points directly to a commit.
  • There is no restriction that there be only one root commit. Any commit without parents is a root.
  • Git almost never creates an empty tree (which would be an empty directory), with two exceptions: there is always an empty tree in every repo, and if you make an initial empty commit (with git commit --allow-empty) it uses this empty tree. (Since an empty tree has no sub-objects, its SHA-1 value is a constant.)
  • The description "DAG" is usually meant for the graph formed by following the parent hashes of commits. However, a tree object must never contain itself in any of its sub-trees, and if you did manage to create a cyclic tree structure, you would not be able to check it out (because it is infinitely recursive). Assuming you cannot create two different trees with the same checksum (if you could, you would break git), you won't find a tree T1 that contains a tree T2 that contains another tree whose checksum is that of T1. Thus trees are also implicitly DAGs, and when attached to the commit DAG, they form one large DAG. :-)
  • Unreferenced objects in the object store will eventually be garbage collected by git gc. The empty tree seems to be immune to collection. Everything found in the refs/ and logs/ directories and in the packed-refs file (in .git, or, for bare repositories or when $GIT_DIR is set, wherever that is) acts as a reference, as do the special names (HEAD, ORIG_HEAD, etc.); I'm not sure whether other random files created in .git and containing a valid SHA-1 will act as references or not.
  • The index has some format that I have never dug into. It contains references to objects in the object store. When you git add a file, git puts the file into the object store and places the (non-text) SHA-1 hash in the index file. These are valid references that prevent garbage collection.
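
To make the object descriptions above concrete, here is a minimal Python sketch of how object ids can be computed. It assumes the usual encoding (a "<type> <size>" header, a NUL byte, then the body, all hashed with SHA-1) and sorts tree entries in a simplified way; the helper names (git_object_id, tree_body, commit_body) and the author line are made up for illustration and are not part of git.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL + body, returned as hex."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries):
    """Serialize (mode, name, hex_id) entries: '<mode> <name>' NUL <20 raw bytes>.

    Simplification: entries are sorted by plain name; real git applies a
    slightly different rule to directory names.
    """
    out = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        out += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return out

def commit_body(tree_id, parent_ids, author, committer, message):
    """Serialize a commit: one tree, zero or more parents, author, committer, text."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return "\n".join(lines).encode()

# The empty tree has no entries, so its id is the same in every repository.
EMPTY_TREE = git_object_id("tree", b"")
print(EMPTY_TREE)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# A one-file snapshot and a root commit (no parents) pointing at it.
blob_id = git_object_id("blob", b"hello\n")
tree_id = git_object_id("tree", tree_body([("100644", "hello.txt", blob_id)]))
who = "A U Thor <author@example.com> 0 +0000"  # hypothetical identity
root_id = git_object_id("commit", commit_body(tree_id, [], who, who, "root\n"))
child_id = git_object_id("commit", commit_body(tree_id, [root_id], who, who, "child\n"))
```

Because every id is derived from the object's content, a commit can only name parents that already exist, which is one way to see why the graph cannot contain a cycle.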

I think the most relevant answer should include the most important characteristic of Git revision trees: the cryptographic signature (each revision includes the hash of its parent revision along with the commit details).

This is known as a Merkle tree: http://en.wikipedia.org/wiki/Merkle_tree


See an earlier answer for some background: ( Git: how to handle commit so that the file versions exist as a whole (and not just as differences) )

Background

Delta storage was popularized by RCS, CVS, Subversion, and others (SourceSafe?), mostly because the model made it simple to transmit changesets, since they were already in delta form. Modern VCSes (mostly the distributed ones) have evolved away from this and put the emphasis on data integrity.

Data integrity

Because of the design of the object database, Git is very robust and will detect any corrupted bit anywhere in a snapshot or in the entire repo. See this post for more information on the cryptographic properties of Git repositories: Linux - Git vs. data corruption?
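
The detection idea can be sketched with a toy content-addressed store in Python. This is not git's storage format: the dictionary, the helper names (object_id, store, find_corrupt) and the sample data are assumptions made for illustration. In a real repository, git fsck is the command that performs this kind of check.

```python
import hashlib

def object_id(data: bytes) -> str:
    """Toy id: plain SHA-1 of the data (git also hashes a type/size header)."""
    return hashlib.sha1(data).hexdigest()

def store(objects: dict, data: bytes) -> str:
    """Store data under the id derived from its own content."""
    oid = object_id(data)
    objects[oid] = data
    return oid

def find_corrupt(objects: dict) -> list:
    """Re-hash every object and report the ids whose content no longer matches."""
    return sorted(oid for oid, data in objects.items() if object_id(data) != oid)

objects = {}
first = store(objects, b"first snapshot")
second = store(objects, b"second snapshot")
clean_report = find_corrupt(objects)      # nothing is damaged yet

# Flip a single bit in one stored object, as a failing disk might.
damaged = bytearray(objects[first])
damaged[0] ^= 0x01
objects[first] = bytes(damaged)
damaged_report = find_corrupt(objects)

print(clean_report)               # []
print(damaged_report == [first])  # True
```

The key under which an object is stored doubles as its checksum, so no separate integrity metadata is needed.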

In techno-babble: commit histories form cryptographically strong Merkle trees. When the SHA-1 sum of the tip commit (HEAD) matches, it mathematically follows that

  • the tree contents
  • the branch history (including all sign-offs and committer/author credentials)

are identical. This is a huge security feature of Git (and of other SCMs that share this design).
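
A toy hash chain in Python illustrates the claim. It is a sketch only: commit_id and tip_of are hypothetical helpers, the serialization is far simpler than git's, and "mathematically follows" holds only as long as nobody can produce SHA-1 collisions.

```python
import hashlib

def commit_id(snapshot: bytes, parents: list, message: str) -> str:
    """Toy commit id covering the snapshot hash, the parent ids, and the message."""
    h = hashlib.sha1()
    h.update(hashlib.sha1(snapshot).digest())
    for parent in parents:
        h.update(bytes.fromhex(parent))
    h.update(message.encode())
    return h.hexdigest()

def tip_of(history) -> str:
    """Chain (snapshot, message) pairs, oldest first, and return the tip id."""
    parents = []
    tip = None
    for snapshot, message in history:
        tip = commit_id(snapshot, parents, message)
        parents = [tip]  # each commit names the previous one as its parent
    return tip

original = [(b"v1", "initial"), (b"v2", "fix bug"), (b"v3", "add feature")]
same_again = list(original)
# Alter only the oldest snapshot and keep everything after it unchanged.
tampered = [(b"v1 with backdoor", "initial")] + original[1:]

print(tip_of(original) == tip_of(same_again))  # True
print(tip_of(original) == tip_of(tampered))    # False
```

Comparing one tip id is therefore enough to compare two entire histories, because a change anywhere in the past propagates into every later id.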


Source: https://habr.com/ru/post/953033/

