Beagle and Tracker are projects that allow you to index your files so you can quickly search filesystems. Both projects started out with the intention of being used with the GNOME desktop, but have recently made a push to be desktop-independent and work with KDE and other desktop environments. Over two days, we'll compare their usability and performance.
Both projects are standalone tools, but they are also often incorporated into other utilities. If you are interested in which applications use them, you can read the lists for Beagle and Tracker. Depending on how your distribution packages Beagle support, you might find that an extension for Firefox is available so that Beagle can index every Web page you view.
Beagle is written in C# and runs on Mono, while Tracker is written in C. I'll avoid the language war over which is more efficient or uses more RAM, because that discussion would need more than an article in itself. The choice of language alone does not make either project a clear winner, nor does it necessarily mean that Tracker is more efficient.
Both projects rely on a daemon that runs in the background and indexes your files. Both daemons attempt to keep the index up to date in real time as you change the filesystems that you want to index. Both projects provide an RDF interface for querying your index. Both projects default to indexing your home directory when started. Both projects can index mounted media, though there might be some setup required depending on how Beagle is shipped for your distribution.
In its default configuration, Beagle maintains some system-wide indexes of public directories such as /usr/share/doc and shares these indexes among all users on the system. To see which system-wide indexes are available, execute
Beagle makes use of extended attributes (EAs) to store some metadata for each file, and can fall back to an SQLite database when EAs are not available. However, the Beagle FAQ says that using SQLite "as the primary store is slow and noticeably degrades performance." This could be an issue if you intend to index filesystems mounted with NFS, as there is no EA support for NFS in Linux kernel 2.6 by default. The Beagle Web site recommends using static indexes on the server that exports NFS filesystems to avoid the "extremely slow operation" of indexing files over NFS.
The Beagle Web site is more extensive than the Tracker site, offering more in-depth information on how to configure and tweak Beagle.
Installation and usability
Beagle is available as a 1-Click install for openSUSE 11, in universe for Ubuntu Hardy, and in the standard Fedora 9 repositories. Tracker is available in the Fedora and Ubuntu Hardy repositories, and has been packaged a few times with the openSUSE Build Service. In this article I'll use Beagle version 0.3.7-4.fc9.x86_64 and Tracker version 0.6.6-2.fc9 on a Fedora 9 x86_64 installation. If you're using Fedora, you'll want to install the tracker-search-tool package as well in order to use Tracker.
For casual use of either tool, how easily you can specify what you would like to be indexed is likely to be a major factor in deciding which tool to use. Shown below are screenshots of the preferences tools of both Beagle (beagle-settings) and Tracker (tracker-preferences).
While Beagle includes support for "static indexes" that are not updated as the filesystem changes but only when explicitly crawled, the preferences dialog does not expose this functionality for configuration. Since the Beagle Web site notes that indexing NFS is extremely slow without these static indexes, the inability to configure them from the dialog is an unfortunate omission.
Searching other Beagle installations over the network is considered an experimental feature. It is exposed through the Beagle preferences, but it's not yet ready for use. I could not set a custom password for exposing my index for network search, and when I exposed my index I could not see it when I attempted to add a remote search-enabled host using the same preferences. Perhaps the dialog deliberately hides any index that is served from the local machine.
While the Tracker preferences include options for controlling its use when running on battery power, Tracker does not include any option to enable more aggressive indexing when the screensaver is active. For many systems the screensaver being active is not an indication that the system is idle, as background compiles may still be occurring in development environments. The Tracker preferences allow you to directly specify directories that are watched for changes and ones which are checked for changes only at startup (static indexing).
An interesting configuration option that is available in the Tracker preferences but not in Beagle is the ability to specify the maximum amount of text and the maximum number of unique words to index per file. Although the defaults are conservative, being able to set explicit upper limits per file might be attractive when you are running Tracker on a large filesystem that might contain a few rare files that look like text to Tracker but are in fact not text. The unique-word limit is likely to be less an optimization of the total number of words in the index and more a second test to ensure that no single file holds up the Tracker indexing daemon for an extended period of time.
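The idea behind such per-file limits can be sketched in a few lines of shell. This is only an illustration of the concept, not how Tracker implements it: the function names unique_words and looks_like_text are hypothetical, and the sketch assumes a head that accepts -c, as the GNU and BSD versions do.

```shell
# Count the unique words in the first max_bytes bytes of a file.
# Hypothetical helper, not part of Tracker.
unique_words() {
    file=$1
    max_bytes=$2
    head -c "$max_bytes" "$file" |
        tr -cs '[:alnum:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sort -u |
        awk 'NF { n++ } END { print n + 0 }'
}

# Succeed only if the capped text stays within the unique-word limit.
# Arguments: file, maximum bytes of text, maximum unique words.
looks_like_text() {
    count=$(unique_words "$1" "$2")
    [ "$count" -le "$3" ]
}
```

A binary file that is misdetected as text tends to produce a very large number of distinct "words", so a check of this kind lets an indexer give up on it early.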
I used the Linux HOWTO text files in both HTML (103MB) and PDF (78MB) to test how well each of the indexing packages worked. Because both tools are designed to work in the background, getting precise speed measurements for each tool on these directories is not simple, short of using Beagle's static indexing functionality. To do so, I started both daemons and allowed them to index the home directory contents. From there, I added the individual HTML and PDF directories in turn. I took the difference in the time field reported by ps between the moment before the directory was added and the point after CPU activity died down as the time the daemon required for that indexing task.
Adding a new directory to watch using tracker-preferences required the Tracker daemon to be restarted, so the measurement was simply the time field once the daemon had settled down again. Note that there was a short delay after the daemon was restarted before it started to index the files. Since this delay is not counted in the ps time field, it does not affect the overall comparison.
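The before-and-after measurement described above can be sketched as a small shell helper. This is a minimal sketch rather than the script used for this article: the function name cpu_seconds is hypothetical, and it assumes that ps prints cumulative CPU time in the [DD-]HH:MM:SS form that procps uses for its time column.

```shell
# Convert a ps TIME value ([DD-]HH:MM:SS, or MM:SS) into whole seconds.
# cpu_seconds is a hypothetical helper name, not part of Beagle or Tracker.
cpu_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; rest = $0
        # An optional day count is separated from the rest by a dash.
        if (split($0, d, "-") == 2) { days = d[1]; rest = d[2] }
        n = split(rest, f, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + f[i]
        print days * 86400 + secs
    }'
}

# Hypothetical usage; replace 12345 with the PID of the indexing daemon.
# before=$(cpu_seconds "$(ps -o time= -p 12345)")
# ... add the directory, then wait for CPU activity to die down ...
# after=$(cpu_seconds "$(ps -o time= -p 12345)")
# echo "indexing used $((after - before)) seconds of CPU time"
```

Because the time field counts CPU time rather than wall-clock time, idle periods such as the delay after a daemon restart do not inflate the result.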
By default Beagle seemed to temper its indexing much more than Tracker, so I set export BEAGLE_EXERCISE_THE_DOG=1 for the benchmark to force Beagle to run at full speed. The bulk of the indexing time was spent in the beagled-helper process rather than the beagled one. The performance page in the Tracker preferences was set to index files at maximum speed and use additional memory for faster indexing.
|Daemon||Time To Index HTML||Time To Index PDF|
To see what information will be indexed, and thus available when searching for files, you can use tracker-extract; the Beagle counterpart used below is beagle-extract-content. I found that the Beagle tool provided more details on the PDF and HTML files that I used for testing than the Tracker tool did. I had to supply the MIME type to tracker-extract to get any information at all.
$ tracker-extract /home/howto/pdf/Automount.pdf
$ tracker-extract /home/howto/pdf/Automount.pdf application/pdf
$ beagle-extract-content /home/howto/pdf/Automount.pdf
Filter: Beagle.Filters.FilterPdf (determined in .29s)
Timestamp = 2008-07-11 06:11:05 (Utc)
beagle:FileType = document
dc:appname = htmldoc 1.8.21 Copyright 1997-2002 Easy Software Products, All Rights Reserved.
dc:title = Automount mini-Howto
fixme:page-count = 9
Automount mini Howto
Automount mini Howto
Table of Contents
Automount mini Howto ........
Tracker sports a multilingual word stemmer that might be an advantage if you have documents in many human languages. A word stemmer converts many forms of a word into a common form that can be used to find all variants. For example, removing the "ing" ending in English words might form part of a stemming algorithm. Tracker includes support for tagging files so that you can find them again using the tags. The tracker-tag tool makes it simple to add, remove, and list the tags associated with a file from the command line.
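To make the stemming idea concrete, here is a deliberately crude suffix stripper in shell. It is a toy under stated assumptions, not Tracker's stemmer: the function name naive_stem and the three suffix rules are invented for illustration, and a real stemmer applies many more rules and guards against over-stripping short words.

```shell
# Strip one common English suffix so that related word forms share a stem.
# naive_stem is a hypothetical name; this is not the algorithm Tracker uses.
naive_stem() {
    word=$1
    case $word in
        *ing) word=${word%ing} ;;
        *ed)  word=${word%ed} ;;
        *s)   word=${word%s} ;;
    esac
    printf '%s\n' "$word"
}

naive_stem indexing
naive_stem indexed
```

Both calls print "index", which is why a search for one form can match documents that contain the other.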
Tomorrow I'll compare the search interfaces offered by both tools, and their query syntax.
As a disclaimer, I should mention that I work on a "competing" open source metadata extraction and indexing project: libferris. I am unaffiliated with either Tracker or Beagle, and in this article I will consider only these two projects, as impartially as I can. If you are a KDE user, you might also like to consider Strigi for your index and search needs.