May 12, 2004

BitKeeper after the storm - Part 2

Author: Joe Barr

In Part 1 of this interview, we learned just how much Linus Torvalds and others have increased their productivity through the use of BitKeeper to handle kernel patches. In this conclusion to the interview, we examine the consequences of that increase. Is it good or bad for the Linux kernel that more patches than ever are being applied? Both Larry McVoy, author of BitKeeper, and Linus Torvalds, creator of Linux, offer their opinions.

NF: Thinking back to the chant "Linus doesn't scale" and having clearly demonstrated that with the right tools he has scaled, is there any concern on your part that the accelerated pace of Linux development we're seeing today might be taking too great a toll on Linus, or that the quality of Linux might suffer?

McVoy:
Good questions. I'm going to answer in opposite order because the first
one is a longer answer.

I don't think that the quality is suffering. We run our company on Linux
and we see Linux steadily improving. There are definitely things going into
the kernel that I don't agree with (mostly fake realtime stuff or
fine-grained threading that less than .001% of the machines in the world
will ever use) but I'm not the guy who gets to choose. So if I leave
my personal views aside and try to be objective about it, it certainly
seems to me that the kernel keeps getting better. 2.6 looks pretty good
and the rate of change is dramatically higher. If the faster pace was
going to cause problems, I suspect it would have done so by now.

The first question is more involved. The short answer is that I think
that rather than taking a toll, Linus is more relaxed and able to spend
more time doing what he should do, educating people, teaching them good taste,
acting as a filter, etc. He and I talk periodically and he certainly seems
more relaxed to me. I've seen him take interest in people issues that he
would have let slide when he was under more pressure.

The longer answer, which addresses why the increased pace is not
taking a toll on Linus, requires some background. If you look at software
development, there are two common models, each optimizing one thing at
the expense of the other. I call the two models "maintainer" and
"commercial."

Development models

The maintainer model is one where all the code goes through one person who
acts as a filter. This model is used by many open source projects where
there is an acknowledged leader who asserts control over the source base.
The advantage of this model is that the source base doesn't turn into
a mess. The bad changes are filtered out. The disadvantage is that it
is slow; you are going only as fast as the maintainer can filter.

The commercial model is one where changes are pushed into the tree as
fast as possible. This could be called the "time to market model."
Many commercial efforts start out in maintainer mode but then switch
to commercial mode because in the commercial world, time to market
is critical. The advantage of the commercial model is speed (gets to
market first) and the disadvantage is a loss of quality control.

  • Commercial model: Very fast, lower quality
  • Maintainer model: Slow, higher quality
  • Maintainer+BK: Fast, higher quality

Scaling development

Everyone knows that small team development works well but problems
emerge as the team grows. With a team of five or six people, filtering all
changes works fine -- one person can handle the load.

What happens when you try to grow the team? Commercial and open source
efforts diverge at this point, but both have growing pains.

The commercial approach is to abandon the filtering process and move
quickly to get something out the door. It's simply not effective to
try and filter the work of a few hundred developers through one person;
nobody can keep up with that load. The commercial world has tried many
different ways to have their cake and eat it too. Management would love
to have speed and quality, but the reality is that if they get speed then
they sacrifice quality.

The maintainer-model process has scaling problems as well. It works as long
as the maintainer can keep up and then it starts to fall apart. For a
lot of open source projects, it works really well because the projects
never get above five or six people. That may seem small, but the reality is
that most good things have come from small teams. But some projects are
bigger than that: the Linux kernel, X11, KDE, Gnome, etc. Some projects
are much larger -- the 2.5/2.6 branch of the Linux kernel shows more than 1,400 different people who have committed using BitKeeper.

It is obvious that trying to keep up with the efforts of more than
1,000 people is impossible for one person, so how do maintainer-model
projects scale? They divide and conquer. Imagine a basic building block
consisting of a set of workers and a maintainer. I think of these as
triangles with the maintainer at the top and the workers along the bottom.
You can start out with a maintainer and a couple of workers and you keep
adding until you can't fit any more in the triangle. When the triangle is
full you create another layer of maintainers. The top triangle is filled
with the ultimate maintainer who then delegates to sub-maintainers.
So what were workers are now the first line of maintainers. Each of
those sub-maintainers is leader of a second level triangle, and there are
several of those below the top triangle. All I'm describing is a log(N)
fan-in model where the same process of filtering is applied in layers.
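The fan-in arithmetic can be sketched in a few lines of Python. This is only an illustration of the log(N) idea, not anything taken from BitKeeper or the kernel process: the function name `maintainer_layers` is hypothetical, and the fan-in of six is an assumption borrowed from the "five or six people" team size mentioned earlier.

```python
def maintainer_layers(developers: int, fan_in: int) -> int:
    """Count the layers of maintainers needed so that nobody has to
    filter the work of more than `fan_in` people directly.

    Illustrative model only: every triangle is assumed to be filled
    to the same capacity, which real projects never are.
    """
    if developers < 1 or fan_in < 2:
        raise ValueError("need at least 1 developer and a fan-in of 2 or more")

    layers = 0
    heads = developers  # people whose work still needs someone to filter it
    while heads > 1:
        # Ceiling division: one maintainer per (possibly partial) group.
        heads = -(-heads // fan_in)
        layers += 1
    return layers


if __name__ == "__main__":
    # Assumed fan-in of 6; the developer counts are sample values.
    for n in (6, 36, 1_400, 1_000_000):
        print(f"{n:>9} developers -> {maintainer_layers(n, 6)} layers")
```

Under these assumptions, the depth of the tree grows very slowly as the number of developers climbs, which is the property the layered model relies on.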

The Linux kernel had moved to this model before they started using
BitKeeper and it was troublesome. What is not explicitly stated in the
layered maintainer model is that as you add these layers the workers are
farther away from the authoritative version of the tree and all versions
of the tree are changing. The farther away from the tree the more merging
is required to put all the versions together. The sub-maintainers
of Linux, who are the usual suspects like Dave Miller, Greg KH, Jeff
Garzik, etc., were in "merge hell" every time Linus did a new release.
Maintainer mode worked quite well for small teams but as it scaled up,
the divide and conquer solution forced the sub-maintainers to pay the price
in repeated and difficult merging.

Scaling maintainer mode with BitKeeper

BitKeeper was designed with the maintainer model in mind, to enable that model
(among others) by removing some of the repeated work such as merging.
We knew that the maintainer model would be dominated by trees with various
differences being merged and remerged constantly, so good merging had to
be a key BitKeeper feature. BitKeeper is sufficiently better at merging that it
allows the model described above to work and to scale into the hundreds
or thousands of developers. The fact that BitKeeper works well in this
model is a big part of why the sub-maintainers all thought things were
ten times better. For them, it was easily ten times better because they
were doing much less work, because BitKeeper was doing all the merging
for them. The sub-maintainers were doing more work and BitKeeper made
most of that work go away, so the improvement for them was dramatic.

The fan-in/fan-out variation of the maintainer model is the way that
Linus reduces his load. A sub-maintainer emerges as someone who can be
trusted, a sub-section of the kernel is spun off as a somewhat autonomous
sub-project, Linus works with that person to make sure that the filtering
is done well, and the development scales a little further.

The point of this long-winded response to your question is to explain
why the increased rate of change hasn't taken a toll on Linus. If a
tool can support the maintainer plus multiple sub-maintainers (and even
sub-sub-maintainers and so on) then the top-level maintainer can learn
over time which of his sub-maintainers can be trusted to do a good job
of filtering. There are some people from whom Linus pulls changes
and more or less trusts the changes without review. He's counting on
those sub-maintainers to have filtered out the bad changes and he has
learned which ones actually do it. There are other people who send
in patches and Linus reads every line of the patch carefully.

If I've done a good job explaining, then you can see how this model can
scale. It's log(N), and log(N) approaches can handle very big Ns easily.
The goal of the model is to make sure that changes can happen quickly
but be carefully filtered even with a large number of developers.
Without BitKeeper doing a lot of the grunt work, a project has to choose
between the faster commercial model and the more careful maintainer model,
but with BitKeeper you get to have your cake and eat it too. The process
moves fast, close to as fast as the commercial model, but without losing
the quality control that is so crucial to any source base, large or small.

To some extent, Linus's job becomes one of working with sub-maintainers
to make sure they are as good as he is at filtering. He still does a
lot of "real work" himself but he is scaling by enabling other people
to do part of his job.

NF: Linus, since the number of patches handled has gone up so dramatically, do you still have time to give them the same sort of attention you did the old way?

Torvalds: Larry already answered; I'll just throw in my 2 cents.

To me, the big thing BK allows me to do is to not worry about the people I
trust, and who actively maintain their own subsystems. Merging with them
is easier on both sides, without losing any history or commentary.

So the answer to your question is that to a large degree BK makes it much
easier to give attention to those patches that need it, by allowing me
to not have to care about every single patch. That, in turn, is what makes
it possible for me to take many more patches.

So in that sense, I don't give the "same sort of attention" that I did in
the old way. But that's the whole point -- allowing me (and others, for
that matter) to scale better, exactly because I can direct the attention.

A lot of my time used to be taken up by the "obvious" patches -- patches
that came in from major subsystem maintainers that I trusted. That has
always been the bulk of the work, and the patches that require attention
are comparatively few. But when everything was done with patches, I
basically needed to do the same thing for the "hard" cases as for the
"easy" ones. And a fair amount of the work was just looking at the email
to decide into which category it fell.

That's where BK helps.

There is another part to it too -- BK allows me to give much more control
to the people I trust, without losing track of what is going on.

Traditionally, when you have multiple people working on the same source
tree, they all have "write access" to whatever source control management
system they use. That in turn leads to having to have strict "commit"
rules, since nobody wants anybody else to actually make changes without
having had a chance to review the changes. That in turn tends to mean that
the limiting point becomes the "commit review" process. Either the
process is very lax ("we'll fix the problems later," which never works),
or the process is so strict that it puts a brake on everybody.

In contrast, the distributed nature of BK means that I don't give any
"write access" to anybody up-front, but once they are done and explain
what they did, we can both just synchronize, and there is no issue of
patches being "stuck" in the review process.

So not only does BK allow me to concentrate my attention on stuff I feel I
need to think about (or the other side asks me to think about, which is
actually more common), but it also allows me to literally give people more
control. That makes it much easier to pass the maintenance around more,
which is, after all, what it's all about.
