We know a few things about Git here at the Linux Foundation -- after all, we employ Linus Torvalds, who is unquestionably the father of Git. This same Linus once famously said, back in 1996: “Only wimps make backups: real men just upload their important stuff on ftp, and let the rest of the world mirror it.” If you’re cringing, he did add a smiley face after that -- but he was only half-joking, and Git itself is proof that his views did not change much in the nearly 10 years that separated that quote from the birth of Git.
That’s what is so awesome about Git -- it’s distributed. Every time someone clones a repository, not only do they get the latest source code, but also the entire history of the project. Every commit, every branch, every tag -- everything. As long as someone somewhere in the world has a functioning clone of a Git repository, they have in their hands a full, tamper-evident backup of the entire thing, from start to finish. If tomorrow the entirety of GitHub’s infrastructure is suddenly sucked into a black hole, it will certainly inconvenience a lot of people, but all software projects hosted on GitHub will survive.
However, despite the fact that Git repositories do not require centralized infrastructure, it is handy to have some place that is considered the “golden master” for the latest and greatest code. For example, git.kernel.org is known to be such a golden master for Linux kernel development; android.googlesource.com is the same for Android development, etc. It makes sense; while it’s true that anyone can host a Git repository on a random server in their basement, doing that for large projects will quickly become problematic.
This piggie had RAM
Take linux.git, for example. Did you know that whenever someone issues a fresh “git clone” against linux.git, git-daemon eats up 1.5GB of RAM on the server and then sends out 750MB of data? If 20 people decide to clone linux.git at the same time, that’s 30GB of RAM and 15GB of traffic. Unless you own a few datacentres with your own fibre links, or have the money to put things into the cloud (or both), you’ll realize that it would be really nice if you could spread the load around. In other words, you’ll want to set up a few mirrors.
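The arithmetic above is easy to sketch in a few lines of Python. The per-clone figures are the ones quoted in this article; the function name, the mirror count and the assumption that load scales linearly and spreads evenly are ours, for illustration only.

```python
# Rough load estimate for simultaneous fresh clones of linux.git.
# Assumption: cost scales linearly with the number of clones and is
# spread evenly across servers. Real-world load will be messier.

RAM_PER_CLONE_GB = 1.5       # RAM git-daemon uses per fresh clone
TRAFFIC_PER_CLONE_GB = 0.75  # data sent per fresh clone (750MB)


def estimate_clone_load(concurrent_clones, mirrors=1):
    """Return (ram_gb, traffic_gb) per server (hypothetical helper)."""
    clones_per_server = concurrent_clones / mirrors
    return (clones_per_server * RAM_PER_CLONE_GB,
            clones_per_server * TRAFFIC_PER_CLONE_GB)


if __name__ == "__main__":
    for mirrors in (1, 4):
        ram, traffic = estimate_clone_load(20, mirrors)
        print(f"{mirrors} server(s): {ram:.2f}GB RAM, "
              f"{traffic:.2f}GB traffic each")
```

With a single server the function reproduces the 30GB and 15GB figures above; every mirror you add divides the per-server share accordingly.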
On the surface, mirroring with Git is really easy. You just do “git clone --mirror” on the remote system and after it’s done, you have a full mirror of the repository. However, that’s just the easy first step. The real difficulty is keeping that mirror updated so it receives the commits from the master on a regular basis. You could set up a cron job that would do “git remote update” every 5 minutes, and for one or two repositories that would be just fine. However, if you are attempting to mirror hundreds and hundreds of repositories, not only will that cron job not finish in 5 minutes, but the administrator of the master server will want to have a few sharp words with you.
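For reference, the naive approach described above might look something like this in Python. It is only a sketch of the cron-driven method, not a recommendation: the helper names are made up for illustration, and the script contacts the master once per repository on every run, which is exactly why it does not scale.

```python
import subprocess


def clone_mirror_cmd(url, dest):
    """Command for the one-time initial mirror clone."""
    return ["git", "clone", "--mirror", url, dest]


def update_mirror_cmd(repo_dir):
    """Command that refreshes an existing mirror from its remotes."""
    return ["git", f"--git-dir={repo_dir}", "remote", "update"]


def update_all(repo_dirs, run=subprocess.run):
    """Update each mirror in turn; return the directories that failed.

    `run` is injectable so the loop can be exercised without Git.
    """
    failed = []
    for repo_dir in repo_dirs:
        # One round trip to the master per repository, changed or not.
        result = run(update_mirror_cmd(repo_dir))
        if result.returncode != 0:
            failed.append(repo_dir)
    return failed
```

A cron job would simply call update_all() with the list of local mirror directories. Note that the list has to be maintained by hand, and that every run touches every repository.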
More importantly, there is no way to discover when new repositories are added on the master, so manual intervention will always be required to add new repositories and delete obsolete ones.
What you really need is a way for the master to let the mirrors know “hey, the following repositories have been updated -- download the latest updates now.” Projects like gitolite solve this by setting up a pre-arranged trusted framework of masters and slaves, with the master issuing “git push --mirror” to each slave. This generally works great, but has the following important disadvantages:
The master-slave setup must be centrally managed. There is no way for someone who just wants a mirror for their school or geographical area to set one up without arranging a trusted relationship with the administrators running the master.
If a slave is down temporarily, it will miss all the updates pushed out by the master. To catch up, one of three things must happen: the mirror admins must remember to do a full “git remote update” in each repository to pull in all the latest changes (assuming their remote origin is set right), the master admins must issue a “gitolite mirror push” to it, or there must be a trusted agreement between the slaves and the master that allows slave administrators to connect to the master via SSH and request a mirror push.
Chaining replication is tricky. For example, if you want to mirror from master to public-slave-1, and then from public-slave-1 to internal-slave-2, setting that up is non-trivial.
Basically, gitolite mirroring is well-suited for replicating Git repositories between a cluster of trusted systems and comes up short when the goal is to offer Git repository mirroring to anyone who wants it.
We really needed something that would allow anyone in the world to quickly and efficiently mirror our Git repositories. The Linux kernel is often considered one of the most successful software projects of all time, and this is largely due to the fact that its source is available to anyone interested in participating. The tool we developed, which we called Grokmirror (because “grok” is a mirror of “korg”), makes it easy to reproduce all of the kernel.org repositories and keep them up-to-date -- easy not just for us, but also for everyone who is interested in running such a mirror on their end. All you have to do is install “grok-pull” on your system and point it to the kernel.org manifest file.
Grokmirror is not just limited to kernel.org, of course -- we invite all other projects that are hosting large collections of Git repositories (KDE, GNOME, Fedora) to start providing their own grokmirror manifests. It requires a little bit more work on the master side, but once it’s done, you should be able to offer hassle-free mirroring to anyone who is interested without any extra work required from the master mirror administrators.
Here are some of the features offered by grokmirror:
The manifest (manifest.js.gz) is a static file, so HTTP setup is very simple.
Clients only download the new manifest if it’s newer than theirs.
Clients only pull Git repositories that have actually changed.
Grok-pull efficiently handles shared repositories to save space.
Grok-pull can keep track of as few or as many repositories as you want.
Grok-fsck can be set up to routinely check your Git mirrors for any corruption.
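To make the manifest idea more concrete, here is a minimal Python sketch of how a client could use such a file. It is not grokmirror’s actual code: the function names are hypothetical, and the manifest layout (a gzipped JSON object mapping each repository path to a record with a “modified” timestamp) is an assumption made for the example. The conditional request uses the standard HTTP If-Modified-Since header, which is one common way to skip downloading a static file that has not changed.

```python
import gzip
import json
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url, local_mtime):
    """Build a request asking for the manifest only if it changed
    after `local_mtime` (seconds since the epoch)."""
    stamp = formatdate(local_mtime, usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})


def load_manifest(raw_bytes):
    """Decode a gzipped JSON manifest into a dict."""
    return json.loads(gzip.decompress(raw_bytes).decode("utf-8"))


def plan_sync(local, remote):
    """Compare two manifests and decide what work is needed.

    Assumed layout (hypothetical): {repo_path: {"modified": timestamp}}.
    Returns (to_clone, to_update, to_delete) as sorted lists of paths.
    """
    to_clone = sorted(set(remote) - set(local))
    to_delete = sorted(set(local) - set(remote))
    to_update = sorted(
        path for path in set(remote) & set(local)
        if remote[path]["modified"] > local[path]["modified"]
    )
    return to_clone, to_update, to_delete


if __name__ == "__main__":
    # Made-up data, for illustration only.
    local = {"/a.git": {"modified": 10}, "/old.git": {"modified": 5}}
    remote = {"/a.git": {"modified": 20}, "/new.git": {"modified": 1}}
    print(plan_sync(local, remote))
```

The point of the design is that the master only has to publish one small static file. Mirrors work out for themselves which repositories are new, changed or gone, and contact the master’s Git service only for those.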