How are version control stories stored and calculated?

Consider this simple python code that demonstrates a very simple version control option for a dictator:

def build_current(history): current = {} for action, key, value in history: assert action in ('set', 'del') if action == 'set': current[key] = value elif action == 'del': del current[key] return current history = [] history.append(('set', '1', 'one')) history.append(('set', '2', 'two')) history.append(('set', '3', 'three')) print build_current(history) history.append(('del', '2', None)) history.append(('set', '1', 'uno')) history.append(('set', '4', 'four')) print build_current(history) for action, key, value in history: if key == '2': print '(%s, %s, %s)' % (action, key, value) 

Please note that using the history list you can restore the current dictionary to any state that it once existed. I consider this a “forward assembly” (due to the lack of a better term), because to create the current dictionary you need to start from the very beginning and process the entire history list. I consider this the most obvious and direct way.

As I’ve already heard, early version control systems used this “direct build” process, but they weren’t optimal, as most users care more about the latest build versions. In addition, users do not want to download the entire story when they only care about seeing the latest build.

Then my question is, what other approaches exist for storing stories in a version control system? Perhaps you can use "reassembly"? This may allow users to download only the latest versions without requiring the whole story. I also saw several different formats for storing history, namely: changes, snapshots and patches. What are the differences between change sets, snapshots, and patches?

Of today's popular version control tools available, how do they store their stories and what are the benefits of their various designs?

+11
git python version-control svn mercurial
Jan 11 '12 at 18:26
source share
3 answers

You mentioned these 3 methods of storing (file) history:

  • patch : patch is (usually textual, but binary patches also possible) representing the difference between two files. This is the output of the unix diff command and can be applied using the unix patch. Many version control systems use patches to store file history (for example, SVN, CVS, GIT ..). Sometimes these patches are technically called "delta" as the Greek letter "Δ" describing the difference between two things.
  • changeet : a set of changes is a term that combines changes that are “related to each other” to different files in the same object. Not all version control systems support changes (the most notable CVS and SourceSafe). The developer uses a set of changes to avoid broken assemblies (for example: change the method signature in one file, change the call in the second file. You must have both changes to start the program, otherwise you will receive an error message). See also the difference between a set of changes and a patch .
  • snapshots : these are full copies of the state of this file / file system up to this point. They are usually quite large, and their use depends on the performance characteristics. A snapshot is always redundant for the list of patches, however, for faster loading of information, sometimes version control systems mix or combine patches and snapshots.

Subversion uses aka patches in FSFS repositories and backward deltas in BDB repositories. Please note that these implementations have different characteristics:

  • advanced deltas are quickly fixed, but slow during checks (as the "current" version needs to be restored)

  • Deltas back are quickly checked, but slow to fix like new Deltas must be built to build a new current and rewrite the previous "current" as a bunch of deltas

Also note that FSFS uses the "skipping delta" algorithm, which minimizes transitions to restore a specific version. However, this throughput is not size-optimized like mercury snapshots; it simply minimizes the number of "revisions" needed to create the full version regardless of the overall size.

Here's a little ascii art (copied from spec) of a 9-version file:

 0 <- 1 2 <- 3 4 <- 5 6 <- 7 0 <------ 2 4 <------ 6 0 <---------------- 4 0 <------------------------------------ 8 <- 9 

where "0 <- 1" means that the delta base for revision 1 is version 0.

The number of transitions does not exceed log (N) for N revisions.

Also a very good effect for FSFS is that the older version will be written only once , after which they will be read only by further steps. Therefore, subversion repositories are very stable: as long as you do not have an HW crash on your hard drive, you should be able to get a working repository, even if some corruption occurred in the last commit: you still have old versions.

In BDB Backend, you constantly rewrite the current version of checks / commits, which makes this process prone to data corruption. Also, since you only save the full text in the current version, corrupting commit data is likely to destroy large parts of your story.

+10
Jan 11 2018-12-12T00:
source share

I think subversion made a few attempts back. But I can explain what I know better: Mercury shots.

Mercurial uses a forward assembly scheme. But in order for each revision to be easily reconfigurable, there are resynchronization points: every time when the size of all the deltas needed to rebuild the revision is more than two times the text, the full text version (compressed image) and all subsequent deltas are saved computed relative to this new snapshot.

This means that you never need to read more than 3 times the size of the text to get any version.

You can find more information in the Hg Book .

+8
Jan 11 2018-12-12T00:
source share

As a more general answer, you need to distinguish between CVCS (centralized VCS like CVS, SVN, Perforce, ClearCase, ...) from DVCS ( Distributed VCS like Git or Mercurial ).
These include various workflows and uses .

In particular, data exchange between the CVCS client and its server will be more important than with DVCS (which really needs a delta when clicking or pulling the entire repo)

This is why delta is very important for most operations in CVCS, which is important only for certain operations and for various reasons in DVCS.

The deltas are described in the book by Eric Snick in two books:

Repository = File System * Time

A tree is a hierarchy of folders and files. Delta is the difference between two trees. Theoretically, these two trees do not have to be connected. However, in practice, the only reason we calculate the difference between the two is because one of them is derived from the other. Some developers started with the N tree and made one or more changes, resulting in the N + 1 tree.

We can think of the delta as a set of changes. In fact, many SCM tools use the term change set for this purpose. A change set is simply a list of changes that express the difference between two trees.

The delta value is important (see this thread ): delta of a triangle or inverse delta.

Some SCM tools use some kind of compromise design. In one approach, instead of storing just one full tree and representing any other tree as a delta, we will cover several more full trees along the way.

You can see the evolution for the “old” VCS in Eric Raymond Understanding Version Control Systems .

Many modern version control tools use binary delta files to store storage.
One popular delta file algorithm is called vcdiff .
It lists the byte ranges that have been changed. This means that it can process any files, binary files or text. As an added benefit, the vcdiff algorithm compresses data at the same time.

Keep in mind that triangle control also affects Directed Acyclic Graphs (DAGs) created to represent the story (see the direction of the arrows in the ProGit book and is inconvenient behind the DAG ).

You can find the specification for managing a triangle for:

Veracity supports two types of DAG:

  • The DAG tree stores the history of the directory structure versions from the file system. Each node from the DAG represents one version of the entire tree.

  • The database (or "db") of the DAG stores the version history of the database or a list of records. Each node DAG represents one state of a complete database.

These last points illustrate that the third (fourth?) Generation of VCS should deal with the distribution of not only files (tree), but also databases (for various purposes)

+4
Jul 15 '12 at 9:18
source share



All Articles