April 23, 2007

Recoll: A search engine for the Linux desktop

Author: Dmitri Popov

Desktop search engines are all the rage these days. While Beagle may be the most popular desktop search engine for Linux, there are alternatives. If you are looking for a lightweight and easy-to-use yet powerful desktop search engine, you might want to try Recoll. Unlike Beagle, Recoll doesn't require Mono, it's fast, and it's highly configurable. Recoll is based on Xapian, a mature open source search engine library that supports advanced features such as phrase and proximity search, relevance feedback, document categorization, boolean queries, and wildcard search.

Recoll can handle plain text, HTML, OpenOffice.org documents, Mozilla Thunderbird and Evolution email messages, and LyX and Scribus files. In addition to those native formats, Recoll can work with other file types by using external helper applications. For example, the Xpdf software provides support for PDF files, while Word, PowerPoint, and Excel documents are handled by Antiword and catdoc. If you want to enable support for document types that require external helpers, you have to install the helper apps separately using your distro's package manager (a list of the required external helpers is available at Recoll's Web site).
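
Before indexing, it can be useful to check which helper programs are already installed. The following Python sketch does that by looking for each executable on the PATH. The executable names (pdftotext from Xpdf, antiword, and catppt and xls2csv from the catdoc package) are assumptions based on the packages named above, not a list taken from Recoll itself, so check Recoll's Web site for the authoritative list.

```python
import shutil

# Assumed helper executables, keyed by the document type they convert.
# These names are illustrative; Recoll's own list may differ.
HELPERS = {
    "PDF": "pdftotext",        # shipped with Xpdf
    "Word": "antiword",
    "PowerPoint": "catppt",    # part of the catdoc package
    "Excel": "xls2csv",        # part of the catdoc package
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return the document types whose helper is not found on the PATH."""
    return sorted(fmt for fmt, exe in helpers.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_helpers()
    if missing:
        print("No helper found for:", ", ".join(missing))
    else:
        print("All assumed helpers are installed.")
```

The lookup function is passed in as a parameter only so the check can be exercised without touching the real system.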

Recoll stores all internal data in Unicode UTF-8 format, but it can index files with different character sets, encodings, and languages into the same index.
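
The general idea of bringing text in different encodings into one UTF-8 store can be illustrated with a few lines of Python. This is only a sketch of the principle, not Recoll's actual code, and it assumes that the source encoding of each document is already known; the sample strings and the to_utf8 name are made up for the example.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Three made-up documents, each stored in a different encoding.
samples = [
    ("café".encode("latin-1"), "latin-1"),
    ("Привет".encode("koi8_r"), "koi8_r"),
    ("naïve".encode("utf-8"), "utf-8"),
]

# After conversion, every document is UTF-8 and can share one index.
normalized = [to_utf8(raw, enc) for raw, enc in samples]
texts = [data.decode("utf-8") for data in normalized]

if __name__ == "__main__":
    for text in texts:
        print(text)
```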

Since Recoll's Web site provides binary packages for most major Linux distributions -- such as Fedora, SUSE, Ubuntu, and Debian -- you can install it easily using your distro's package manager. You can then launch Recoll by choosing Recoll from the Applications -> Accessories menu (in Ubuntu) or running the recoll command in a terminal window.

During the first run, you will be prompted to create a default set of configuration files that will contain all of Recoll's settings. Recoll doesn't provide a GUI configuration tool, so you have to edit the configuration files manually. Fortunately, Recoll's user manual provides a detailed description of the configuration options that you can tweak. However, since Recoll's default settings cover all the basics, you might not need to edit them.

Like any desktop search engine, Recoll must index documents before it can search them. By default, Recoll indexes the files in your home directory, but you can specify a different location or add more. During the first run Recoll builds a full index, which can take some time. Once Recoll has built an index, you can update it manually using the recollindex command. You can also run recollindex as a cron job. Alternatively, you can run the recollindex -m command, which runs as a daemon that indexes modified files in real time.
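
The update options above can be wrapped in a small Python script. This is a minimal sketch: only the recollindex command and its -m flag come from the text, while the function names and the 3:00 a.m. schedule are arbitrary choices made for illustration.

```python
import shutil
import subprocess


def build_index_command(realtime=False):
    """Return the argument list for an index update.

    Only `recollindex` and its `-m` flag are used, as described in the text.
    """
    cmd = ["recollindex"]
    if realtime:
        cmd.append("-m")
    return cmd


def update_index(realtime=False):
    """Run recollindex; return its exit code, or None if it is not installed."""
    cmd = build_index_command(realtime)
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


def cron_line(minute=0, hour=3):
    """Format a crontab entry that runs a plain index update once a day."""
    return f"{minute} {hour} * * * recollindex"


if __name__ == "__main__":
    # Print the crontab entry instead of installing it.
    print(cron_line())
```

The printed line can be added to your crontab with crontab -e, while calling update_index() triggers a manual update from Python.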

Once the files have been indexed, Recoll is ready to go. To perform a simple search, enter one or more search terms into the search field and press the Search button. Besides searching for all or any of the specified terms, Recoll also lets you search for file names and perform more advanced searches using wildcards and boolean operators.

Recoll supports three types of wildcards. The * wildcard matches any number of characters (e.g. writ* returns writer, written, and writing). The ? wildcard matches exactly one character (e.g. b?ll returns ball, bull, and bell). The [] wildcard lets you specify a set of matching characters, e.g. [a-h] or [1-5].

To perform a boolean search, select the Query Language item from the drop-down menu next to the search field. You can then use boolean operators to construct more complex searches. For example, the following search:

from:"tristram shandy" linux AND openoffice -windows

finds documents containing the phrase "tristram shandy" in the from field (useful when searching email messages) as well as the words "linux" and "openoffice" but not the word "windows".
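
The three wildcards described above behave much like shell-style patterns, so Python's standard fnmatch module is a convenient way to get a feel for them. The sketch below is not Recoll's matcher, and the term list is made up; note that in fnmatch the * wildcard also matches an empty string, and Recoll may differ in such details.

```python
from fnmatch import fnmatchcase

# A made-up list standing in for the terms stored in an index.
TERMS = [
    "ball", "bell", "bill", "bull", "brawl",
    "write", "writer", "writing", "written", "wrote",
]


def expand(pattern, terms=TERMS):
    """Return the terms that match a shell-style wildcard pattern."""
    return [term for term in terms if fnmatchcase(term, pattern)]


if __name__ == "__main__":
    print(expand("writ*"))     # * matches any run of characters
    print(expand("b?ll"))      # ? matches exactly one character
    print(expand("b[a-e]ll"))  # [] matches one character from a set
```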

The Advanced Search feature can be used to create even more advanced queries. The default fields (called Clauses) allow you to specify a wide range of criteria, such as proximity, an unlimited number of search terms (you can add extra fields by pressing the Add clause button), excluded words, and wildcards. You can also narrow your search to specific file types or a specific directory.

When you perform a search, Recoll displays the results in the main window. Each search result contains a file type icon, relevance in %, and context surrounding the search term. There are also two links: the Preview link allows you to quickly preview the document in a separate window, while the Edit link opens the file for editing in an appropriate application.

Finally, Recoll also features a Term Explorer tool (Tools -> Term Explorer) that can come in handy when you don't remember the exact spelling of a particular search term. Basically, it acts as a mini search engine that searches the index. This allows you to see all the derivatives of the entered search terms and select the one you need.
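
To illustrate the kind of lookup that helps with a half-remembered spelling, here is a short Python sketch that uses the standard difflib module to find terms close to a misspelled word. This is a stand-in technique chosen for illustration, not the method Term Explorer uses, and the term list and the suggest function are hypothetical.

```python
import difflib

# A made-up list standing in for the terms stored in an index.
INDEX_TERMS = ["receive", "received", "receiver", "recipe", "recoll", "record"]


def suggest(term, index_terms=INDEX_TERMS, limit=5):
    """Return up to `limit` index terms that look similar to `term`."""
    return difflib.get_close_matches(term, index_terms, n=limit, cutoff=0.6)


if __name__ == "__main__":
    # "recieve" is a deliberate misspelling of "receive".
    print(suggest("recieve"))
```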

Recoll looks deceptively simple, but it is a powerful desktop search engine. To get the most out of it, make sure to read Recoll's user manual, paying particular attention to the tips and tricks section.

Dmitri Popov is a freelance writer whose articles have appeared in Russian, British, US, German, and Danish computer magazines.
