I have a git repository containing about 3,500 commits and 30,000 distinct files in the latest revision. It represents about three years of work from several people, and we have received permission to make it all open source. I am trying to release the entire history, not just the latest version. To do this, I want to "go back in time" and insert a license header at the top of each file at the point where it was created. I actually have this working, but it takes about 3 days running entirely out of a ramdisk, and it still requires a little manual intervention. I know it can be a lot faster, but my git-fu isn't quite up to the task.
Question: how can I accomplish the same thing much faster?
What I am doing now (automated in a script, but please bear with me...):
Identify all the commits in which a new file was added to the repository (there are only 500 of them, fwiw):
git whatchanged
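The same list can also be produced without parsing git whatchanged output. The following is a minimal sketch, not the exact script used here: it relies on git log with --diff-filter=A, and the function name list_adding_commits is made up for illustration.

```shell
#!/bin/sh
# Print the hash of every commit that added at least one file, oldest first.
#   --diff-filter=A  keep only commits whose diff contains an added path
#   --format=%H      print just the full commit hash
#   --reverse        oldest commit first instead of newest first
# list_adding_commits is a hypothetical helper name, not part of git.
list_adding_commits() {
    git log --diff-filter=A --format=%H --reverse
}
```

Run inside the repository, list_adding_commits | wc -l gives the number of rebases the procedure described here would need.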
Define the GIT_EDITOR environment variable to be my own script, which replaces pick with edit only once, on the first line of the file (you will see why shortly). This is the core of the operation:
perl -pi -e 's/^pick/edit/ if $. == 1' "$1"
For each commit from the git whatchanged output above, invoke an interactive rebase starting just before the commit that added the file:
git rebase -i decafbad001badc0da0000~1
My custom GIT_EDITOR (that perl one-liner) changes pick to edit, and we are dropped into a shell to make the change to the new file. Another simple header-inserter script looks for a known unique pattern in the header I am trying to insert (only in known file types, *.[ChS] for me). If the pattern is not there, the script inserts the header and runs git add on the file. This naive technique has no knowledge of which files were actually added in the current commit, but it ends up doing the right thing and is idempotent (it is safe to run several times against the same file), and it is not where the process bottlenecks anyway.
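The idempotent check-then-prepend step might look roughly like the sketch below. It is an illustration under stated assumptions rather than the real script: the MARKER and HEADER values are placeholders for the actual license text, insert_header is a hypothetical name, and the git add step is left to the caller.

```shell
#!/bin/sh
# Placeholder marker and header; substitute the real license text.
MARKER='Licensed under the Example License'
HEADER="/* $MARKER. See LICENSE for details. */"

# Prepend HEADER to one file unless MARKER is already present.
# Safe to call repeatedly on the same file (idempotent).
insert_header() {
    file=$1
    case $file in
        *.[ChS]) ;;            # only the known file types
        *) return 0 ;;         # anything else is left untouched
    esac
    [ -f "$file" ] || return 0
    if grep -qF "$MARKER" "$file"; then
        return 0               # header already there: nothing to do
    fi
    tmp=$(mktemp) || return 1
    # Build the new content in a temp file, then write it back in place
    # so the original file keeps its permissions.
    { printf '%s\n' "$HEADER"; cat "$file"; } > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

A caller would run insert_header on each candidate file and then git add it; because the marker check comes first, files that already carry the header are not modified a second time.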
At this point we are happy that the current commit has been updated, and invoke:
git commit
rebase --continue is the expensive part. Since we call git rebase -i once for each revision in the whatchanged output, this is a lot of rebasing. Almost all of the time the script runs is spent watching the "Rebasing (2345/2733)" counter increment.
It is also not just slow. Periodically, conflicts arise that must be resolved. This can happen in at least these cases (but likely more): (1) when a "new" file is actually a copy of an existing file with some changes made to its very first lines (for example, #include). This is a genuine conflict, but in most cases it can be resolved automatically (yes, there is a script that deals with this). (2) when a file is deleted. This is trivially resolved by confirming with git rm that we do want to delete it. (3) There are places where it seems that diff just misbehaves, for example where the change is only the addition of a blank line. Other, more legitimate conflicts require manual intervention, but overall they are not the biggest bottleneck. The biggest bottleneck is absolutely just sitting there staring at "Rebasing (xxxx/yyyy)".
Currently, the individual rebases are initiated from newer commits to older commits, i.e. starting from the top of the git whatchanged output. This means that the very first rebase affects yesterday's commits, and that eventually we will be rebasing commits from 3 years ago. Going from "newer" to "older" seems counter-intuitive, but so far I am not sure it matters unless we change more than one pick to edit when invoking rebase. I am afraid to do that because conflicts do arise, and I do not want to deal with a tidal wave of conflict ripples from trying to rebase everything in one go. Maybe someone knows a way to avoid that? I have not been able to come up with one.
I started looking at the internal workings of git objects! It seems like there should be a much more efficient way to walk the object graph and just make the changes I want to make.
Please note that this repository came from an SVN repository where we did not actually use tags or branches (I have already removed them with git filter-branch), so we have the convenience of a linear history. No git branches or merges.
I'm sure I have left out some critical information, but this post already seems overly long. I will do my best to provide more information on request. In the end, I may just need to publish my various scripts, which is a possibility. My goal is to figure out how to rewrite history in this way in a git repository, not to debate other viable licensing and code-release methods.
Thanks!
Update 2012-06-17: Blog post with all the gory details.