November 5, 2010

Weekend Project: Get to Know Your Source Code with FOSSology


If you work with open source software of any kind — whether at work or as a volunteer — then you understand the importance of license compliance and keeping track of copyright ownership. But as a project grows, those tasks can get tricky, even when everyone is on the same page. That is exactly the problem that led Hewlett Packard (HP) to create FOSSology, an open source tool you can use to analyze a source code tree for this type of metadata and more. This weekend, why not set it up and dig into your source code — you might be surprised at what you find.

FOSSology was originally built as an internal tool at HP, to help engineers follow the large company's IT governance policies when working with open source software written elsewhere. Even if your company or project isn't as big as HP, any time you blend code from different authors or want to borrow a routine from another open source project, it can get tricky to follow all the rules. Code under certain licenses can be combined in one executable, while code under others must run as a separate process. If you customize an open source application for internal use, you may also need to keep track of authorship, even more so if you send patches upstream.

FOSSology is a client-server application with a Web-based front end that takes care of most of the nuts-and-bolts of these tasks for you. Users can upload individual files or package archives, then schedule analysis jobs. The server can unpack and rapidly scan through thousands of files, logging copyright statements, license statements, and other metadata. The canonical uses for FOSSology include locating files within a larger code tree that have missing, incomplete, or incompatible licensing — events that can happen accidentally, but cause huge headaches further down the road.

The latest release is version 1.2.1, from October of 2010, and it can recognize more than 600 different open source licenses, including many distinct versions and variations on common license choices. It does this by matching the wording of the license itself, catching inline references (such as "This software is released under the GPL2"), or even abbreviated license statements (such as CC-BY-NC). It can also detect code authorship, through inline copyright statements, names and email addresses in comments, and external AUTHORS files. The authorship feature is new in the 1.2.x series, and shows that FOSSology is growing into a more robust code auditing tool.
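To get a feel for what catching inline references and abbreviated license statements involves, here is a minimal Python sketch. It is not FOSSology's scanner: the three patterns and the detect_licenses name are made up for illustration, and the real license templates are far more extensive.

```python
import re

# Hypothetical patterns for illustration only; these are NOT FOSSology's
# license templates, which cover hundreds of licenses and variants.
LICENSE_PATTERNS = {
    "GPLv2": re.compile(r"\bGPL\s*(?:v(?:ersion)?\s*)?2\b", re.IGNORECASE),
    "GPLv3": re.compile(r"\bGPL\s*(?:v(?:ersion)?\s*)?3\b", re.IGNORECASE),
    "CC-BY-NC": re.compile(r"\bCC[- ]BY[- ]NC\b", re.IGNORECASE),
}


def detect_licenses(text):
    """Return the sorted names of all patterns that match somewhere in text."""
    return sorted(
        name for name, pattern in LICENSE_PATTERNS.items() if pattern.search(text)
    )


print(detect_licenses("This software is released under the GPL2"))
```

Even this toy version hints at why a real scanner needs hundreds of templates: every new spelling of a license name means another pattern.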

Installing FOSSology

FOSSology comes packaged for most popular Linux distributions, including Red Hat Enterprise Linux, Debian, and Ubuntu. The 1.2.1 release might be too new for your distro's package management system, though, in which case you should consider installing it from source. This is particularly important for Ubuntu 10.04 users, because the current Debian FOSSology package is misconfigured for Ubuntu. Specialized instructions for correcting the packaging problem are available on the FOSSology Web site.

To install from source, you will need an Apache Web server, the PostgreSQL database, and a recent PHP build. In addition, there are numerous lower-level dependencies that the server side process uses to unpack and process source code. Most of these will already be installed on a production Linux server, but there is a full list on the project site.
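Before building, you could run a quick check of your own for the three big prerequisites. The Python sketch below is an assumption-laden convenience, not part of FOSSology: the executable names (apache2 or httpd, psql, php) are guesses that vary by distribution, and the project's own pre-install script remains the authoritative check.

```python
import shutil

# Assumed executable names; they differ between distributions and are not
# taken from the FOSSology documentation.
PREREQUISITES = {
    "Apache": ["apache2", "httpd"],
    "PostgreSQL": ["psql"],
    "PHP": ["php"],
}


def missing_prerequisites(prereqs=PREREQUISITES, which=shutil.which):
    """Return the components for which no candidate executable is on PATH."""
    return [
        name
        for name, candidates in prereqs.items()
        if not any(which(candidate) for candidate in candidates)
    ]


missing = missing_prerequisites()
print("Missing:", ", ".join(missing) if missing else "none")
```

Keep in mind that server binaries often live in /usr/sbin, which may not be on a regular user's PATH, so a "missing" result is a prompt to look closer rather than proof that the package is absent.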

The installation process includes pre-install and post-install scripts that check the dependencies, set up the PostgreSQL database and tables, and install the license templates that form the backbone of the license scanner. This process actually takes longer than the source-compilation step, but it is fun to watch the list of licenses fly by — if you thought the open source ecosystem consisted of just GPL-versus-BSD, you may find it educational.

A brief snippet from FOSSology's cavalcade of open source licenses...

Creating default canonical name: Free Art License v1.2
Creating default canonical name: FreeBSD
Creating default canonical name: Free clause
Creating default canonical name: Free Software License B v1.0
Creating default canonical name: Free Software License B v1.1
Creating default canonical name: FreeType
Creating default canonical name: Free use no change clause
Creating default canonical name: Free with copyright clause
Creating default canonical name: Free with files clause
Creating default canonical name: FSF
Creating default canonical name: GFDL
Creating default canonical name: Giftware
Creating default canonical name: Glide
Creating default canonical name: GNU Free Documentation License v1.1
Creating default canonical name: GNU Free Documentation License v1.2
Creating default canonical name: gnuplot
Creating default canonical name: Government clause

Once the post-install script is finished, you can run a test (as root) with /usr/local/lib/fossology/fossology-scheduler -t. This will check your configuration and report any errors. Last but not least, you will need to configure your Apache server to serve up the FOSSology front-end. The app uses the rather atypical location /usr/local/share/fossology/www/ for its Web content; the INSTALL file has Apache VirtualHost configuration examples to get you started.

Analyze This

FOSSology's default admin username and password are "fossy"/"fossy" — so the first thing you should do once the Web interface is running is change them. It is also a good (though not mandatory) idea to create a separate user account in addition to the admin account.

The Web interface has six main functions listed across the top of the screen: Search, Browse, Upload, Organize, Jobs, and Admin. There is also an "Obsolete" menu item that holds deprecated functions, but you should not get used to seeing them. You can upload a source code package from the Upload page — either uploading a local file, supplying a remote file URL, or specifying a file already on the Web server (though it must be readable by the Apache process for this to work).

When you upload a file, you have the option to schedule any of FOSSology's analytical jobs against the file, so you can add them to the job queue without delay. The current release offers six analysis "agents": licenses, copyrights, MIME-types, metadata, packaging, and "buckets." Buckets are user-defined pools of licenses that you can use to simplify your analysis (say, "free" and "non-free," or "compatible with MyProject" and "incompatible with MyProject"). You can also schedule jobs after the upload is complete by visiting the Jobs -> Agents menu item.

Also under the Jobs menu you will find the Queue, where you can track the progress of the analysis jobs scheduled on the server, and MyJobs, which shows you just the tasks that you personally have scheduled. If you have administrator-level privileges, you can reset, delete, or alter the priority of specific jobs.

The Browse, Search, and Organize features are there to offer users access to a persistent library of FOSSology's data analysis. You can group uploaded files into nested folders (which may be very helpful for businesses with different projects residing on the same server), or search for a specific file anywhere in the source repository.

In practice, you may be most concerned at any one moment with finding files that match a certain set of criteria — say, files for which FOSSology detected no license at all. The file browser is designed to help you do that. You can navigate through the uploaded files and get views on the data based on all of FOSSology's analysis jobs: sorting by license, by your user-defined buckets, or by copyright ownership.
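The idea behind a by-license view is simple enough to sketch in a few lines of Python. The data shape here (a dictionary mapping file paths to lists of detected license names) is hypothetical and is not FOSSology's data model; the point is only to show how files with no detected license fall out of a simple grouping.

```python
from collections import defaultdict


def group_by_license(scan_results):
    """Map each license name to the sorted list of files it was detected in.

    scan_results is a dict of file path -> list of license names (a
    hypothetical shape chosen for this sketch). Files with an empty list
    are collected under the key "No license found".
    """
    groups = defaultdict(list)
    for path, licenses in scan_results.items():
        for name in licenses or ["No license found"]:
            groups[name].append(path)
    return {name: sorted(paths) for name, paths in groups.items()}


results = {
    "src/main.c": ["GPLv3"],
    "src/util.c": ["GPLv3"],
    "macros/report.mac": [],
    "db/schema.sql": ["No commercial usage"],
}
for license_name, paths in sorted(group_by_license(results).items()):
    print(license_name, paths)
```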

Features and limitations

Despite the long list of 600+ known licenses, it is important to remember that FOSSology cannot automatically detect every possible license, because anyone can rephrase or fork an existing license and slap it onto his or her source code at any time. A corollary is that FOSSology gives you a good picture of the license and copyright makeup of a block of code, but even its advanced heuristics are not a substitute for examining the code yourself when there are problems.

In a large project, you might have C or C++ source, shell scripts, SQL, Python, and user-contributed macros. FOSSology might detect GPLv3 licensing on the main application, no license information on the macros, and a loosely-defined "No commercial usage" statement in the SQL. How those fit together cannot be automatically determined; it may be that because the SQL is not necessary for executing the application, it is not a GPL violation to combine it in one package with the GPL'ed C/C++. On the other hand, if it is used as part of the build process, it probably would be considered part of a "derivative work" under the GPL, and you would need to contact the author to ask for relicensing. If the author's email address or URL is located somewhere in the comments, you are in luck; otherwise you may have to excise and rewrite the code.

FOSSology does a good job of working around these areas of uncertainty. First, although its license templates can match specific versions of licenses (say the GPLv2 versus the GPLv3), it does not attempt to strictly categorize every file it finds when the result is ambiguous. Instead, it includes broader categories for files with indeterminate licensing. Second, the "bucket" system allows you to do your own categorization. Buckets can be defined to match on several parameters, using logical operators, and you can always edit your existing bucket definitions and re-scan the code repository. It might take a few tries to define your own categorization scheme and make it fit perfectly.
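To make the bucket idea concrete, here is a rough Python sketch of rules combined with logical operators. It does not use FOSSology's bucket definition syntax; the helper names (has, any_of, all_of, negate, assign_bucket) and the bucket names are all hypothetical.

```python
def has(license_name):
    """Rule that matches when license_name is among the detected licenses."""
    return lambda licenses: license_name in licenses


def any_of(*rules):
    """Logical OR over rules."""
    return lambda licenses: any(rule(licenses) for rule in rules)


def all_of(*rules):
    """Logical AND over rules."""
    return lambda licenses: all(rule(licenses) for rule in rules)


def negate(rule):
    """Logical NOT of a rule."""
    return lambda licenses: not rule(licenses)


def assign_bucket(licenses, buckets, default="unclassified"):
    """Return the name of the first bucket whose rule matches."""
    for name, rule in buckets:
        if rule(licenses):
            return name
    return default


# Hypothetical bucket definitions, checked in order.
BUCKETS = [
    ("free", all_of(any_of(has("GPLv2"), has("GPLv3")),
                    negate(has("No commercial usage")))),
    ("non-free", has("No commercial usage")),
]

print(assign_bucket({"GPLv3"}, BUCKETS))
print(assign_bucket({"GPLv3", "No commercial usage"}, BUCKETS))
```

Because the buckets are just data, refining your scheme means editing the list and running the classification again, which mirrors the edit-and-re-scan cycle described above.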

The last thing to note about FOSSology is that although the documentation uses the term "repository," the system is not a version control or source code management tool. It is an auxiliary tool, one that your company or project can use in conjunction with a VCS, but it is primarily designed to mine data from an existing body of source code. Whatever policy you have in place that FOSSology helps you audit (license compliance, copyright assignment, and so on), FOSSology can help you check a large volume of work against it. But it won't make your contributors automatically follow all of the rules.

Extra Credit: Agents of the Future

Without a doubt, keeping track of licenses and copyright data for a software project is vitally important to open source: copyright law is the foundation of what makes open source work and keeps free software free. But FOSSology is a flexible enough system that it could do other things in addition to these important tasks.

The application is modular in design; everything from the file-unzip-and-unpacker to the license scanner is just an "agent" that FOSSology's scheduler manages in the job queue. There are intriguing new possibilities on the project's roadmap, in the form of other agents that we may see in future releases.
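The agents-plus-scheduler design can be illustrated with a toy Python sketch. This is not FOSSology's scheduler or its agent interface: the Scheduler class, its methods, and the two sample agents are invented here, and an "upload" is reduced to a plain list of file names.

```python
from collections import deque


class Scheduler:
    """Toy job queue: agents are plain callables registered under a name."""

    def __init__(self):
        self.agents = {}
        self.queue = deque()

    def register(self, name, agent):
        """Make an agent available for scheduling."""
        self.agents[name] = agent

    def schedule(self, agent_name, upload):
        """Queue one job; reject agents that were never registered."""
        if agent_name not in self.agents:
            raise KeyError("unknown agent: " + agent_name)
        self.queue.append((agent_name, upload))

    def run(self):
        """Run queued jobs in first-in, first-out order and collect results."""
        results = []
        while self.queue:
            agent_name, upload = self.queue.popleft()
            results.append((agent_name, self.agents[agent_name](upload)))
        return results


def extensions(files):
    """Sample agent: the sorted set of file extensions in an upload."""
    return sorted({name.rsplit(".", 1)[-1] for name in files if "." in name})


scheduler = Scheduler()
scheduler.register("count", len)
scheduler.register("extensions", extensions)
scheduler.schedule("count", ["main.c", "build.sh", "README"])
scheduler.schedule("extensions", ["main.c", "build.sh", "README"])
print(scheduler.run())
```

In a design like this, adding a new kind of analysis only means registering another callable; the queue itself does not need to change.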

Some extend the existing functionality, such as the ability to attach licenses to selected files, or modify them, directly within the FOSSology application. Others build the app out in new directions, like performing a diff between two source trees, or tagging branches and files. It may be a little far-fetched, but it is possible that future versions of the tool will offer developers tighter integration with VCSes and IDEs. That would certainly be welcome news; any tool that makes license compliance easier is of benefit to the entire open source community.
