Efficiently rewriting (rebase -i) a lot of history with git

I have a git repository containing about 3,500 commits and 30,000 distinct files in the latest version. It represents about three years of work by several people, and we have permission to open-source all of it. I am trying to release the whole history, not just the latest version. To do this, I'm interested in "going back in time" and inserting the license header at the top of files at the point where they were created. I actually have this working, but it takes about 3 days running entirely from a ramdisk and still requires a little manual intervention. I know it can be much faster, but my git-fu is not quite up to the task.

Question: how can I accomplish the same thing much faster?

What I'm doing now (automated in a script, but bear with me...):

  • Identify all the commits in which a new file was added to the repository (there are only about 500 of them, fwiw):

    git whatchanged --diff-filter=A --format=oneline 
  • Set the GIT_EDITOR environment variable to my own script, which replaces pick with edit on the first line of the file only (you'll see why shortly). This is the core of the operation:

     perl -pi -e 's/pick/edit/ if $. == 1' $1 
  • For each commit from the git whatchanged output, invoke an interactive rebase starting just before the commit that added the file (a sketch of the full driver loop follows this list):

     git rebase -i decafbad001badc0da0000~1 
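
Tying the three steps together, here is a minimal sketch of what such a driver loop might look like. The script names (first-pick-to-edit.sh, insert-license-headers.sh) are hypothetical placeholders, not my actual scripts, and it assumes each git rebase -i returns control to the loop when it stops at the edit:

    # Hypothetical driver loop; both script names are placeholders.
    export GIT_EDITOR=/path/to/first-pick-to-edit.sh    # the perl one-liner above

    # Commit lines in the whatchanged output are the ones not starting with ':'
    # (if whatchanged prints headers for commits with no added files, extra
    # filtering is needed here).
    for commit in $(git whatchanged --diff-filter=A --format=oneline |
                    grep -v '^:' | cut -d' ' -f1); do
        git rebase -i "${commit}~1"          # stops with the target commit checked out
        /path/to/insert-license-headers.sh   # the idempotent header inserter
        git commit --amend --no-edit         # keep the original commit message
        git rebase --continue                # the expensive part
    done

Note that iterating newest-to-oldest (the order whatchanged emits) keeps the remaining target hashes valid: each rebase rewrites only the target commit and its descendants, never the older commits still in the list.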

My custom GIT_EDITOR (that perl one-liner) changes pick to edit, and rebase then drops us into a shell to make changes to the new file. Another simple header-inserter script looks for a known unique marker in the header I am trying to insert (for known file types only: *.[chS] in my case). If the marker is not there, the script inserts the header and git adds the file. This naive technique has no knowledge of which files were actually added in the current commit, but it ends up doing the right thing and is idempotent (safe to run several times against the same file), and it is not where the whole process is bottlenecked anyway.
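
For illustration, the idempotency check can be as small as this (the marker string and header file path are hypothetical placeholders):

    # Hedged sketch of the idempotent insert for one file "$f":
    if ! grep -q 'Unique-License-Marker' "$f"; then
        cat /path/to/license-header.txt "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        git add -- "$f"
    fi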

At this point, we are happy that we have updated the current commit, and we invoke:

  git commit --amend
  git rebase --continue

The rebase --continue is the expensive part. Since we invoke git rebase -i once per commit in the whatchanged output, that is a lot of rebasing. Nearly all of the time this script spends running goes to watching the "Rebasing (2345/2733)" counter increment.

It is also not just slow. Conflicts arise periodically and need to be resolved. This happens at least in the following cases (and probably more): (1) when the "new" file is actually a copy of an existing file with some changes in its very first lines (e.g., #include directives). This is a genuine conflict, but it can usually be resolved automatically (yes, there is a script that handles it). (2) when a file is deleted. This is trivially resolvable by simply confirming that we want to remove it with git rm. (3) There are a few places where diff just seems to misbehave, e.g., where the change is only the addition of a blank line. Other, more legitimate conflicts require manual intervention, but overall these are not the biggest bottleneck. The biggest bottleneck is, absolutely, just sitting there staring at "Rebasing (xxxx/yyyy)".
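
A rough sketch of how cases (1) and (2) might be triaged from a script, using git status --porcelain; the auto-merge helper name is a hypothetical placeholder, and paths containing spaces are not handled:

    # Assumption: only header-copy conflicts (UU) and delete conflicts (UD/DU) occur.
    git status --porcelain | while read -r status file; do
        case "$status" in
            UD|DU) git rm -- "$file" ;;                       # case (2): deletion
            UU)    /path/to/auto-merge-header.sh "$file" \
                       && git add -- "$file" ;;               # case (1): copied file
        esac
    done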

Currently, the individual rebases are initiated from newer commits to older ones, i.e., starting at the top of the git whatchanged output. This means the first rebase touches yesterday's commits, and we eventually rebase commits from 3 years ago. Going from "newer" to "older" seems counterintuitive, but so far I am not convinced it matters, unless we change more than one pick to edit per rebase invocation. I am afraid to do that, because conflicts do arrive, and I do not want to deal with a tidal wave of them rippling through a single all-at-once attempt. Maybe someone knows a way to avoid that? I have not been able to come up with one.

I have started looking at the internal workings of git objects! It seems like there should be a much more efficient way to walk the object graph and just make the changes I want.
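
For anyone who wants to poke at the same internals, these plain plumbing commands show the commit → tree → blob structure that any such rewrite would have to traverse:

    git cat-file -p HEAD          # commit object: tree hash, parent hashes, message
    git ls-tree HEAD | head -3    # tree entries: one blob/tree hash per path
    git rev-list --all | wc -l    # all the commits a full rewrite must touch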

Please note that this repository was converted from an SVN repository, where we did not actually use tags or branches (I have already filtered those away with git filter-branch), so we have the convenience of straight-line history. No git branches or merges.

I am sure I have omitted some critical information, but this post already seems too long. I will do my best to provide more details on request. In the end, I may just need to publish my various scripts, which is a possibility. My goal is to figure out how to rewrite history this way in a git repository; not to debate other viable licensing and release methods.

Thanks!

Update 2012-06-17: Blog post with all the gory details.

+6
2 answers

Use:

 git filter-branch -f --tree-filter '[[ -f README ]] && echo "---FOOTER---" >> README' HEAD 

Essentially this adds a footer line to the README, and the history will look as if the line has existed since the file was created. I am not sure whether it will be efficient enough for you, but it is the correct way to do it.

Build a custom script, and you will probably end up with a good project history; too much "magic" (rebase, perl, editor-replacement scripts, etc.) may end up losing or changing project history in unexpected ways.

jon (the OP) used this basic template to achieve the goal, with significant simplification and speed-up:

 git filter-branch -d /dev/shm/git --tree-filter \
     'perl /path/to/find-add-license.pl' --prune-empty HEAD

A few critical observations:

  • Using the -d <directory> option, pointed at a ramdisk directory (e.g., /dev/shm/foo), improves speed significantly.

  • Perform all changes from a single script using the language's built-in functions; forking out to small utilities (e.g., find) slows the process down many times over. Avoid this:

     git filter-branch -d /dev/shm/git --tree-filter \
         'find . -name "*.[chS]" -exec perl /path/to/just-add-license.pl \{\} \;' \
         --prune-empty HEAD

This is a sanitized version of the perl script the OP used:

 #!/usr/bin/perl -w
 use File::Slurp;
 use File::Find;

 my @dirs   = qw(aDir anotherDir nested/DIR);
 my $header = "Please put me at the top of each file.";

 foreach my $dir (@dirs) {
     if (-d $dir) {
         find(\&Wanted, $dir);
     }
 }

 sub Wanted {
     /\.c$|\.h$|\.S$/ or return;    # *.[chS] files only
     my $file     = $_;
     my $contents = read_file($file);
     $contents =~ s/\r\n?/\n/g;     # convert DOS or old-Mac line endings to Unix
     unless ($contents =~ /Please put me at the top of each file\./) {
         write_file($file, {atomic => 1}, $header, $contents);
     }
 }
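
One way to sanity-check such a script before committing to the full rewrite (a suggested step, not something from the answer): run it by hand in a scratch clone, since --tree-filter executes it from the root of each rewritten tree:

    cd /tmp/scratch-clone       # throwaway copy of the repository
    perl /path/to/find-add-license.pl
    git diff --stat             # inspect what the filter would change
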
+4

Blobs are content-addressable. You cannot modify a single file in isolation without changing its hash, which changes the tree referenced by any commit that includes it, and therefore every commit that descends from it. Basically, you have to rewrite the world, as I understand the problem. I suppose I can imagine an algorithm that did all of this by working over the DAG in reverse (parents before children), with a large hash table mapping original hashes to modified ones, rewriting each object only once.
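
As a rough illustration of that idea (not code from the answer): a bash-4 sketch that walks commits oldest-first and rebuilds each one on top of its already-rewritten parents. The tree rewrite itself is left as a stub, and author/committer metadata is not preserved:

    # Sketch only (bash 4): map old commit hashes to new ones, rewriting each
    # object exactly once; oldest-first so parents are rewritten before children.
    declare -A map
    for c in $(git rev-list --reverse HEAD); do
        newtree=$(git rev-parse "$c^{tree}")    # stub: a real rewriter edits blobs here
        parents=""
        for p in $(git rev-list --parents -n 1 "$c" | cut -s -d' ' -f2-); do
            parents="$parents -p ${map[$p]}"
        done
        new=$(git log -1 --format=%B "$c" | git commit-tree "$newtree" $parents)
        map[$c]=$new
    done
    git update-ref refs/heads/rewritten "${map[$(git rev-parse HEAD)]}"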

If you already have a working solution (even one that takes three days), is it really worth trying to optimize it? I can't imagine getting new code written, debugged, and running correctly in less time than the three days the naive solution took to produce its results.

-1
