March 10, 2008

Better source control for your coding projects

Author: Travis Snoozy

The proper use of source control systems is a critical skill for programmers to have, and something that many of them have to pick up through observation, trial, and error in the workplace. For students, or people who primarily program as a hobby, the learning process can be particularly slow and painful. Here are some examples and a discussion of best practices you can use to avoid common source control pitfalls.

The basic purpose of a source control system is to allow you to work without worrying. If you break the software you're working on, decide the changes you're making aren't such a good idea, or otherwise make a mistake, source control allows you to go back to the last version you checked in. It also enables multiple people to work on the codebase at once without destroying each other's work. These two properties alone make source control critical for any software developed by more than one person.

Many source control systems, both open source and proprietary, are available for use today. Which one you use will depend on several factors, though in many situations, the decision will have already been made for you. Usually, the specific source control system in use is not terribly important; most modern systems are almost interchangeable, save for the occasional niche feature. However, if you find that your source control system is hindering rather than helping your team, you may want to evaluate alternatives. You can read more about CVS, darcs, and Subversion on their respective Web pages. Subversion has a book-length manual that is particularly good; while it is Subversion-specific, it still covers many aspects of source control use that can be applied to other systems as well.

A critical mistake that many programmers make is not checking in frequently enough. It can be difficult to fight the urge to hold off on a check-in because you know the code is broken, incomplete, or otherwise not ready. While that is a good reason not to check code into a main development area, it is not a good reason to forgo check-ins altogether. The main problem is that you put all of your changes since the last check-in at risk whenever you make further alterations to your code. The more changes you make without checking in, the more you stand to lose if you make a mistake.

In order to check in without worrying about the completeness of a large set of changes, you need to understand the concept of branches. A branch lets you pretend that you have a separate source control repository set up for a specific purpose. However, branches are better than separate repositories, because everyone on your team can have access to the changes that you make in your branch, and conversely, you can have direct access to the changes that others make to the main line of development. When your work has reached the point where you feel comfortable doing a "real" check-in, you can merge your branch back into the main line of development. A branch used in this fashion is usually called a personal branch (if you are the only one checking into it) or a feature branch (if more than one person is collaborating on the work).

The darcs code management system considers every checkout to be its own branch. Just check out your code and record your changes as patches at appropriate intervals.

$ darcs get your-local-branch

Subversion has a straightforward syntax for merging branches, but a lot of manual bookkeeping goes into determining the correct values to pass to the -r parameter.

$ svn merge svn:// svn:// -r5:9
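
One way to take some of the guesswork out of the -r bookkeeping is to ask Subversion where the branch began. The sketch below relies on the output format of svn log --stop-on-copy -q, whose last entry is the revision in which the branch was created. The helper name branch_start_rev, the author name, and the sample log text are made up for illustration; this is one possible approach, not the only one.

```shell
# Hypothetical helper: read `svn log --stop-on-copy -q` output on stdin and
# print the oldest revision number listed (the one that created the branch).
branch_start_rev() {
    awk '/^r[0-9]+ \|/ { rev = substr($1, 2) } END { print rev }'
}

# Made-up sample of the log format, newest revision first.
sample_log='------------------------------------------------------------------------
r9 | alice | 2008-03-07 10:15:00 -0800 (Fri, 07 Mar 2008)
------------------------------------------------------------------------
r7 | alice | 2008-03-05 16:40:00 -0800 (Wed, 05 Mar 2008)
------------------------------------------------------------------------
r5 | alice | 2008-03-03 09:00:00 -0800 (Mon, 03 Mar 2008)
------------------------------------------------------------------------'

start=$(printf '%s\n' "$sample_log" | branch_start_rev)
echo "merge range: -r${start}:HEAD"
```

Against a real repository, you would pipe the actual log for your branch URL into the helper instead of the sample text. You still have to remember which revisions you have already merged, which is the bookkeeping mentioned above.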

Branching and merging tend to be the most difficult tasks you can perform in source control systems, but they are also among the most useful. If you can master merging, you can make excellent use of the power of source control.

Another common mistake that people make with source control is making check-ins that do more than one thing. Checking in frequently helps to reduce this problem, but even a check-in that changes only two lines of code can do too many things -- if those two lines of code are unrelated to each other, and fix two different problems. Each check-in should do one, and only one, logical thing. A good rule of thumb is that each bug should usually have one check-in associated with fixing it (although some wide-reaching bugs may require more than one check-in to fix). Also, most small features should have one check-in associated with them. Medium or larger features should be done in multiple check-ins, at logical points in a branch, with a single merge back into the main development area once the feature is operational.

The major reason for having a one-to-one mapping between logical changes and check-ins is to save time when bugs arise: Quality Assurance has an easier time identifying where a bug might have been introduced if each check-in does only one thing, and that one thing is clearly explained in a single sentence in the check-in log. Having each check-in to the mainline be a complete, self-contained change also makes it easier to undo problematic check-ins if the bugs can't be fixed in a reasonable way. If a check-in had two logical changes, or (even worse) one-and-a-half logical changes, then reverting the entire check-in would undo more changes than necessary -- possibly taking out unrelated features or even bugfixes.

Darcs has a check-in procedure that lets you select exactly which changes you want to check in -- even if those changes are all in the same file. This can be handy if you frequently forget to check in an old change before starting on a new one.

$ darcs record
hunk ./README 7
Shall I record this change? (1/?) [ynWsfqadjkc], or ? for help: y
hunk ./TODO 1
Shall I record this change? (2/?) [ynWsfqadjkc], or ? for help: n
hunk ./TODO 23
Shall I record this change? (3/?) [ynWsfqadjkc], or ? for help: y
What is the patch name? Bazbar
Do you want to add a long comment? [yn]n
Finished recording patch 'Bazbar'

Having check-ins change only one thing at a time is beneficial to developers as well, because it tends to encourage them to not keep too many modified files in their copies of the source tree. In my experience, the leading cause of build breaks tends to be devs either forgetting to check something in, or checking in something that they didn't mean to. By restricting yourself to making one change at a time, all of the modified files in your source tree can be checked in without you having to worry about breaking the build.
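
A quick check before each check-in can catch the forgotten-file case. The following sketch assumes Subversion, where svn status marks files it does not know about with a leading "?". The function name and the sample status lines are hypothetical.

```shell
# Hypothetical helper: read `svn status` output on stdin and print the paths
# of unversioned files (lines starting with "?").
unversioned_files() {
    sed -n 's/^?[[:space:]]*//p'
}

# Made-up sample: one modified file, and one new file nobody has added yet.
sample_status='M      src/parser.c
?      src/parser_util.c'

forgotten=$(printf '%s\n' "$sample_status" | unversioned_files)
if [ -n "$forgotten" ]; then
    echo "Not under source control yet:"
    echo "$forgotten"
fi
```

In day-to-day use you would pipe the real output of svn status into the helper. Generated files such as object code will also show up, so it works best once those are listed in your ignore settings.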

The last common mistake is actually from a management perspective -- specifically, about how releases and the source control system relate to one another. Almost all projects have the notion of a release: some version of the software that's considered good enough to send off to users. Most projects should also have the concept of development and stable releases; that is, a development version that gets new features, and a stable version that only has bug fixes applied to it. For folks who aren't used to releasing, the obvious way to do any kind of release is to simply make a tarball when the code in the source repository looks good. However, the source control system can (and should) be used to enhance the release process.

Rather than simply cut a tarball for every release, every release should correspond to a tag in your source control system. Tags are what the name implies -- little snippets of text attached to, in this case, a specific point in time in your repository. When the code is at the point where you want to make your tarball, you tag that exact instance before you make the tarball. This allows you to regenerate the exact same release (if you, say, accidentally delete your tarball), as well as keep track of what check-ins occurred between any two releases.

Tagging is usually a simple operation to execute; even in CVS, it's easy to make a tag if your local checkout represents the release you want to make:

$ cvs tag release_1_0_0
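
The underscores in that tag name are not an accident: CVS tag names must start with a letter and cannot contain periods. If you tag often, a small helper can keep the names consistent. The function name cvs_tag_name and the release_ prefix convention are assumptions made for this sketch.

```shell
# Hypothetical helper: turn a dotted version such as 1.0.1 into a CVS-safe
# tag name such as release_1_0_1 (CVS tag names cannot contain periods).
cvs_tag_name() {
    printf 'release_%s\n' "$(printf '%s' "$1" | tr '.' '_')"
}

tag=$(cvs_tag_name 1.0.1)
# Print the command rather than run it, so the sketch works without a checkout.
echo "cvs tag $tag"
```

Once the tag exists, cvs export -r with that tag name and your module name should give you the same source tree again, which is how a lost tarball can be rebuilt.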

Tagging is the easy part. The more difficult issue that your source control system can assist with is ensuring that only bug fixes are applied to your progressive stable releases. The proper approach here involves the judicious use of branches. When you're ready to make a release that you intend to stabilize (e.g., 1.0), you should make a new branch, and tag that branch right off the bat. Then, you can continue to add new features on the mainline, and apply only bug fixes to your stable branch, continuing to tag releases (1.0.1, 1.0.2, etc.) at appropriate points on that branch. Eventually, you'll want to start stabilizing the new features that you've written in the mainline, at which point you simply need to make a new branch (1.1) and repeat the stabilization process. In this manner, you can continue to support many different versions of your software as your project or business model requires.
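
The stabilization cycle described above might look something like the following in Subversion, where branches and tags are both made with svn copy. This is only a sketch: the repository URL, the trunk/branches/tags layout, and the function names are assumptions, and by default the script prints the commands instead of running them.

```shell
# Hypothetical repository URL and layout; replace with your own.
REPO=${REPO:-svn://svn.example.invalid/project}

# Print each command instead of running it unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the stable branch for a release series and tag its first release.
start_stable_series() {
    series=$1    # e.g. 1.0
    run svn copy "$REPO/trunk" "$REPO/branches/$series" -m "Create $series stable branch"
    run svn copy "$REPO/branches/$series" "$REPO/tags/$series.0" -m "Tag $series.0"
}

# Tag a later bug-fix release from the stable branch.
tag_bugfix_release() {
    series=$1    # e.g. 1.0
    patch=$2     # e.g. 1
    run svn copy "$REPO/branches/$series" "$REPO/tags/$series.$patch" -m "Tag $series.$patch"
}

start_stable_series 1.0
tag_bugfix_release 1.0 1
```

New features keep going into the trunk while only bug fixes are checked into branches/1.0. When the trunk is ready to stabilize, calling start_stable_series again with 1.1 starts the next series.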

