Linux.com


Linux Desktop Search Engines Compared


I have a large electronic library (over 15,000 books) and I was looking for a way to cope with this mass of information. I didn't like the idea of a special catalog, since it would take a lot of manual work to enter the metadata. Besides, my books are in various formats: HTML, RTF, DOC, PDF, and DjVu. These files lack metadata far too often, so I thought a local indexing service with full-text search might solve my problem. I knew there were more options to choose from than just Google, but I could not find a good, up-to-date comparison. Even the table in Wikinfo's Comparison of desktop search software contained too many errors, as I discovered.

I had to compare them myself.

My task made some requirements essential and others irrelevant. I was especially interested in support for a wide range of file types, in the ability to add new ones (EPUB, fb2, html.zip), and in an extensive query language. All software, except for Google Desktop Search (GDS) and DocFetcher, was installed from the Ubuntu 9.10 repositories.

I have no special preferences regarding the backend: it may be a Xapian- or Lucene-based tool, or even a custom backend. That said, Xapian usually requires more disk space, and there is never too much space on a desktop.

Beagle

http://beagle-project.org

The list of supported file types is quite large: it includes typical office files, source code, LaTeX source, images, audio and video files, RPM and DEB packages, e-mail from Evolution, Thunderbird and KMail, IM and IRC logs, RSS feeds and many more. Plus, you are free to extend it. I could add new file types by editing one file: /etc/beagle/external-filters.xml.

The indexing process can run in two modes: CPU-lenient and CPU-intensive (the latter enabled with the EXERCISE_THE_DOG environment variable). The search engine is based on Lucene.Net. I have no idea why the developers chose this exotic platform to implement Beagle, but Beagle works, and it works well.

Beagle understands regexps only in a very limited form (the * wildcard). You can search for phrases, exclude words (-word), use the OR operator, specify when the file was created (on, before, after, and between dates!), limit the search to a file type, and name the directory in which to look for the files. Unfortunately, you cannot name a directory and have Beagle search the whole tree under it.

You can even use the metadata of audio and image files, as in the examples from the manual:
artist:Beatles ext:mp3 OR ext:ogg -album:"Abbey Road"
You can also restrict the search to mail attachments, music genres, mailing lists, IM correspondents and much, MUCH more.

Beagle tends to create huge log files in ~/.beagle/Logs.

Beagle has a web interface. It is very easy to start, but not so easy to make use of, since the supposed links to the results are not actually links.

The Beagle web site includes information on the query syntax and on extending Beagle, but finding it is next to impossible unless you use Google. The description of the query syntax is here.

The index for a 45-Gb home partition was only about 700 Mb.

Google Desktop Search

http://desktop.google.com/linux/

Google Desktop Search supports OpenOffice.org and MS Office files, PDF, HTML, TXT, audio and image files, and email from Thunderbird. Strangely enough it does not index zipped archives.

I could not add new file types, not even plain text with a different extension. I was pretty sure that GDS supports stemming, but not regexps. To my surprise, stemming did not work in GDS. Nor did regexps. It does not even support AND and OR keywords.

Otherwise, the query syntax is acceptable. You can name the directory where the file you are looking for is located, or a directory somewhere under which the file is supposed to be. You can search for phrases or exclude words. I used GDS for some years, and it works great as long as you use it the way Google intended. While suitable for an average office cubicle, it was next to useless for my purposes.

The index size was about 1.7 Gb for 50 Gb of data.

Recoll

http://www.lesbonscomptes.com/recoll/

A large number of file types is supported natively, including plain text, HTML, maildir and mailbox files, OpenOffice.org, MS Office 2007, AbiWord, LyX, KWord and Scribus files, and GAIM logs. Many more are supported with external helpers: DOC, XLS, PDF, DjVu, MP3, image files, and so on. Feel free to add to the list, it's easy: one file establishes associations between file extensions and MIME types, another one specifies how the data is extracted from a file of a certain MIME type, and the third one defines the applications used to open each MIME type.
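The three-step association described above can be pictured as a chain of lookups. The following is only a sketch of the idea, not Recoll's actual configuration format: the table contents, the MIME type for fb2, and the helper program names (`fb2-to-text`, `epub-to-text`) are made up for illustration.

```python
import os

# Hypothetical tables standing in for the three configuration files.
EXT_TO_MIME = {
    ".fb2": "text/x-fictionbook",        # assumed MIME type, for illustration
    ".epub": "application/epub+zip",
}
MIME_TO_EXTRACTOR = {
    "text/x-fictionbook": ["fb2-to-text"],      # made-up helper name
    "application/epub+zip": ["epub-to-text"],   # made-up helper name
}
MIME_TO_VIEWER = {
    "text/x-fictionbook": "fbreader",
    "application/epub+zip": "fbreader",
}


def resolve(path):
    """Return (mime, extractor_argv, viewer) for a path, or None if unknown."""
    ext = os.path.splitext(path)[1].lower()
    mime = EXT_TO_MIME.get(ext)
    if mime is None:
        return None
    extractor = MIME_TO_EXTRACTOR.get(mime)
    argv = extractor + [path] if extractor else None
    return mime, argv, MIME_TO_VIEWER.get(mime)


if __name__ == "__main__":
    print(resolve("/library/book.fb2"))
    print(resolve("/library/notes.xyz"))
```

Adding a new file type then amounts to adding one entry to each table, which is why the process feels easy in practice.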

Recoll is built around the Xapian engine.

I had an impression that the indexing process takes much longer with Recoll than with the other tools. When indexing RTF with unrtf, Recoll created a heap of WMF files in my home directory. Recoll has no indexing daemon that would run in the background all the time. Instead, Recollindex is to be launched from time to time (with cron, for example).

The manual mentions stemming support, but also points out that it is done the other way round. Stemming is not applied when the database is built, as in other indexing engines; the query is stemmed instead. Unfortunately, my version gave different results when searching for the plural 'notebooks' and the singular 'notebook,' so I assume stemming does not work in my installation of Recoll. Recoll understands regexps pretty well, which to a certain degree compensates for the problems with stemming.
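To make the "other way round" idea concrete, here is a toy sketch of query-time stemming. It assumes an index that stores words exactly as they occur, plus a separate table from stems to indexed words; the stemmer is a deliberately crude plural stripper, and none of this is Recoll's or Xapian's real code.

```python
def toy_stem(word):
    """Very crude English stemmer: strips a trailing plural 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def build_stem_db(index_terms):
    """Map each stem to the set of indexed terms that share it."""
    stem_db = {}
    for term in index_terms:
        stem_db.setdefault(toy_stem(term), set()).add(term)
    return stem_db


def search(word, index, stem_db=None):
    """Return sorted document ids; expand the query only if a stem table exists."""
    if stem_db is None:
        terms = {word.lower()}
    else:
        terms = stem_db.get(toy_stem(word), {word.lower()})
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return sorted(hits)


if __name__ == "__main__":
    # The index keeps unstemmed terms: term -> ids of documents containing it.
    index = {"notebook": {1, 2}, "notebooks": {2, 3}}
    stem_db = build_stem_db(index)
    print(search("notebook", index))            # [1, 2]
    print(search("notebooks", index))           # [2, 3]
    print(search("notebook", index, stem_db))   # [1, 2, 3]
    print(search("notebooks", index, stem_db))  # [1, 2, 3]
```

Without the stem table, the singular and the plural return different hits, which is the symptom described above.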

Recoll has a rich query language, modeled after the Xesam End User Search language (see here). As with Beagle, you can use the dir: prefix to limit the search path to one directory, but you cannot specify a directory tree. Alas! Other useful prefixes include title, author, and ext (for file type).

The search client, recoll, is a GUI program, but with the -t option it runs in text mode. It means that instead of specifying a directory tree, I can just grep the results for a string, like this:
recoll -t -q \"jack london\"|grep /library/fiction/adventure
Note that for the command line client, you have to escape quotation marks to denote a phrase search.
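The same trick can be scripted. The sketch below runs the text-mode client with the `-t -q` options shown above and keeps only the result lines that mention a path inside a given subtree. It assumes that each result line contains the file's path or URL and that recoll accepts the quoted phrase as a single argument; both are assumptions, not documented behavior.

```python
import subprocess


def under_tree(lines, subtree):
    """Keep the result lines that mention a path inside subtree."""
    prefix = subtree.rstrip("/") + "/"
    return [line for line in lines if prefix in line]


def recoll_search(query, subtree):
    """Run `recoll -t -q QUERY` and filter its output (needs recoll installed)."""
    result = subprocess.run(
        ["recoll", "-t", "-q", query],
        capture_output=True, text=True, check=True,
    )
    return under_tree(result.stdout.splitlines(), subtree)


if __name__ == "__main__":
    # No shell is involved here, so the phrase quotes need no backslashes.
    for line in recoll_search('"jack london"', "/library/fiction/adventure"):
        print(line)
```

Unlike the plain grep, the filter appends a trailing slash, so a sibling directory such as /library/fiction/adventure2 does not match.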

Recoll, unlike some other tools, has a decent user manual, containing information on query syntax and adding support for new file types.

The index size put a damper on my enthusiasm. For a 50-Gb home directory it was more than 5 Gb.

Strigi

http://strigi.sourceforge.net/

Strigi supports regular expressions. Theoretically, Strigi should support plain text files, PDF, DEB, and RPM packages, OpenOffice.org documents, and zipped files. Besides, Strigi was the only program that successfully indexed EPUB files without customization, interpreting them as just plain ZIP-archives with HTML, NCX, etc. inside.

There's little I can say about this program. The daemon kept crashing when I tested it so I could not even finish building the index for my home directory. The client erroneously classified a lot of hits as being "email."

The incomplete(?) index size was about 750 Mb.

Tracker

http://projects.gnome.org/tracker

Tracker is part of the GNOME Project and it tries to adhere to various useless technologies, like DBus. Tracker introduces the concept of file tags, thus overcomplicating the task of file management. I admit that the notion of file tags might be reasonable, but only if it is supported universally, so that tags can be freely backed up, copied, etc. Fortunately, tags are not obligatory in Tracker.

The full list of supported file types is unavailable, but the web site mentions image, audio, video, and text files, source code, applications, playlists, IM conversations, and so on. There is no support for email, bookmarks, or contacts yet, though. The indexing daemon segfaulted occasionally and I could not finish indexing.

As a matter of fact, Tracker was designed as a metadata search tool (its full name is MetaTracker), but the normal use case is just full-text search. Tracker was written to work well even on machines with 128 or 256 Mb of RAM. Judging by the slowness of indexing, this claim could be true. I was wrong about Recoll: the slowest indexer was not Recoll but Tracker.

I could not find a good user manual.

DocFetcher

http://docfetcher.sourceforge.net/en/index.html

Supported file types: HTML, plain text, PDF, Microsoft Office (doc, xls, ppt), Microsoft Office 2007 (docx, xlsx, pptx), OpenOffice.org Writer, Calc, Draw, and Impress, RTF, AbiWord (abw, abw.gz, zabw), CHM, Visio, SVG.

DocFetcher is written in Java. Indexing is fast and easy on the CPU. DocFetcher comes in two flavors: a binary installable package and a "portable" version, which you can run right from your home directory.

DocFetcher supports regular expressions (at least * and ?), phrase search, the AND and OR keywords, and search in content or in metadata (the author and title fields are supported). It does not index zipped files. It is easy to add new filename extensions that are treated as yet another text or HTML file, but I could not add a new file type that needs special treatment. For me this means that I cannot process custom XML to convert the content to the proper charset. It's a problem.

An interesting query feature is boosting terms: "You can assign custom weights to words, thus increasing or decreasing the level of matching for a particular document if the weighted word occurs in it. This allows you to influence the relevance sorting of the result page. Example: dog^4 cat will bring up the documents with 'dog' in it on the top of the result page."
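A toy model shows how such weights change the ranking. This is only a sketch under simple assumptions: a document's score is the sum of weight times occurrence count over the query terms, and unweighted terms count as 1. Real engines use more elaborate relevance formulas, and DocFetcher's actual scoring may differ.

```python
def parse_query(query):
    """Split a query like 'dog^4 cat' into {'dog': 4.0, 'cat': 1.0}."""
    weights = {}
    for token in query.split():
        term, _, boost = token.partition("^")
        weights[term.lower()] = float(boost) if boost else 1.0
    return weights


def score(text, weights):
    """Sum of weight * occurrence count over all query terms."""
    words = text.lower().split()
    return sum(weight * words.count(term) for term, weight in weights.items())


def rank(docs, query):
    """Return document names ordered from best to worst match."""
    weights = parse_query(query)
    return sorted(docs, key=lambda name: score(docs[name], weights), reverse=True)


if __name__ == "__main__":
    docs = {"a.txt": "cat cat cat", "b.txt": "dog"}
    print(rank(docs, "dog cat"))    # ['a.txt', 'b.txt']
    print(rank(docs, "dog^4 cat"))  # ['b.txt', 'a.txt']
```

With the boost, a single "dog" outweighs three occurrences of "cat", so the dog document moves to the top.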

The manual can be found in the downloaded archive, but it is very brief.

Pinot

http://pinot.berlios.de/

Like Tracker and Strigi, Pinot is built for DBus. It uses the same Xapian engine as Recoll, so I could use Pinot's text-mode client to query the database built by the Recoll indexer. Pinot can use other databases, but I was not interested in this option. The crawler takes a huge share of RAM and CPU. It ate up 70% of the RAM on my PC, causing some other programs to crash, so I had to leave it running overnight to complete indexing.

The documentation consists of one Readme file and a couple of web pages. Quoting these web pages, "The following document types are supported internally :
  • plain text
  • HTML
  • XML
  • mbox, including attachments and embedded documents
  • MP3, Ogg Vorbis, FLAC
  • JPEG
  • common archive formats (tar, Z, gz, bzip2, deb)
  • ISO 9660 images
"The following document types are supported through external programs:
  • PDF (pdftotext required)
  • RTF (unrtf required)
  • OpenDocument/StarOffice files (unzip required)
  • MS Word (antiword required)
  • PowerPoint (catppt required)
  • Excel (xls2csv required)
  • DVI (catdvi required)
  • DjVu (djvutext required)
  • RPM (rpm required)"

Indeed, new file types are defined in the file external-filters.xml, which is very similar (but not identical, the Pinot developers warn) to the file of the same name used by Beagle.

I have to say that these external programs made indexing of PDF, RTF, and other files a difficult task. Indexing a PDF document took up to two minutes.

Conclusion

Recoll and Pinot may be considered good alternatives to Beagle, but the size of the Xapian index database leaves just one choice for me: Beagle.
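To put the reported index sizes on a common scale, the short calculation below expresses each index as a percentage of the data it covers. The figures are the ones quoted in the sections above; Recoll's is a lower bound ("more than 5 Gb"), and the conversion assumes 1 Gb = 1024 Mb.

```python
# Figures as reported above: (index size in Mb, data size in Gb).
REPORTED = {
    "Beagle": (700, 45),
    "Google Desktop Search": (1.7 * 1024, 50),
    "Recoll": (5 * 1024, 50),  # "more than 5 Gb", so this is a lower bound
}


def index_ratio(index_mb, data_gb):
    """Index size as a percentage of the indexed data (1 Gb = 1024 Mb)."""
    return 100.0 * index_mb / (data_gb * 1024)


def ratios(reported):
    """Return {tool: percentage rounded to one decimal place}."""
    return {name: round(index_ratio(*sizes), 1) for name, sizes in reported.items()}


if __name__ == "__main__":
    for name, percent in ratios(REPORTED).items():
        print(f"{name}: about {percent}% of the indexed data")
```

By this measure Beagle's index is roughly 1.5% of the data, Google Desktop Search's about 3.4%, and Recoll's at least 10%.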
Comments (23)
Rene Bon Ciric
Thanks for the review
written by Renich, December 01, 2009
Nice review. I'd like to see some improvement from Tracker. GNU should provide an integration (which they will for sure) with the rest of gnome...

I'm curious about Pinot too...

Thanks for the great review!
Michele Barbiero
Good review.
written by Michele Barbiero, December 02, 2009
Yes, it's a good review, I didn't know some of these engines. Your experience with Strigi is strange. For me, Strigi 0.7.0 (which comes with the KDE 4.3.3 release) works great: it also works with tags and comments (Nepomuk). I don't run all the KDE stuff, but the command "nepomukserver" loads all the programs needed by desktop search. For 6.1 Gb the index file is 16.4 Mb.
Toby Haynes
Tracker may improve
written by Toby Haynes, December 02, 2009
Tracker is still a relative newcomer to the search scene (currently at version 0.6) but recent developments for GNOME 3.0 mean that Tracker will become a critical part of the GNOME environment to support Zeitgeist. I suspect this means that there will be a lot of pressure put on the Tracker team as it moves towards a 1.0 release.

I'm intrigued that beagle was stable for you. I ran it fairly constantly for a year or so and gave up. When it worked, it worked well. However, it kept tripping on files it couldn't understand, hanging the indexing threads and consuming all the CPU. Maybe it's improved since 2008.
Ulrik Mikaelsson
What about RAM and disk Usage?
written by Ulrik, December 02, 2009
Interesting roundup, although I'm not sure of the relevance of the usual application of a "Desktop Search Engine". Most notably, I'm missing info on how heavy each of the indexers were on the RAM and Disk.

For your particular need, you may need quick indexing of a big library, (and small on-disk-index) more than anything else, but personally, I'm more interested in a lightweight indexer, that does not bring the rest of the system to a Crawl while running in the background, either by consuming ridiculous amount of RAM, or forcing out commit:s to disk constantly. That brought Beagle out of the question the last time I tried it, chewing 100s of megs of RAM, while doing nothing.

It would be interesting with the same tests, in a controlled but realistic environment (you don't mention what other load the machine experienced while indexing), and with some quantitative numbers attached. (initial index time, partial index update, query-time, and the RAM, CPU, and disk-load over the different tests).
Alister Fiend
I get entirely different results with Recoll
written by machiner, December 03, 2009
Just to detail...

"I had an impression that the indexing process takes much longer with Recoll than with the other tools. When indexing RTF with unrtf, Recoll created a heap of WMF files in my home directory. Recoll has no indexing daemon that would run in the background all the time. Instead, Recollindex is to be launched from time to time (with cron, for example)."

Not so with my install on Lenny. Having recently tried GDS, Tracker, Beagle and my usual, Recoll I can report that Google was about to take days to index my data, with Tracker only just behind it and I didn't bother to allow Beagle to finish after some hours. In fact, I just rebuilt my desktop today, reassembled collections from current and the past (data) and my ~/ is currently 101GB. lol, yet to be pruned...

I set to index as I walked out the door today. Returning almost exactly an hour later my ~/ was indexed. 'Course, I don't know how long it actually took.

Having noticed that I was missing some external helpers I installed those and reindexed. 20 minutes later it was finished. Now, ogginfo is a sore spot...

I saw no such wmf files created to my ~/ although I only have ~100 rtf files.

Recoll has an indexing daemon. Moreover you can start recollindex -m and there ya go, real-time indexer. It's pretty wicked.

I was under the impression (perhaps from documentation) that a cron job runs the indexer every night but to tell the truth I'm not too sure and I never checked.

My index is 1.5Gb on 101GB of data. Put that in your Beagle and smoke it!

I remember writing a lame article once about Beagle and how much it rocked. Then I found Recoll. Boy was my impression of "rocks!" changed. The really cool thing is, though, that we both have a tool that we like using because it fits how we want to use it. Rock on.
minaev
Thanks, guys
written by Dmitri Minaev, December 05, 2009
Glad it was helpful to you.

Michele,

Indeed, I was sure Strigi would score much better than it did. There
must be something wrong with my setup.

Note that I run neither GNOME nor KDE. Perhaps there's some problem
with package dependencies in Ubuntu and some libraries are
missing, or the like?


Ulrik,

The most lightweight crawlers were Google and Beagle. Strigi did not
stress my system, either, but it never worked long enough to make any
serious conclusions. Could be another option, though, in a more
suitable (KDE'fied) environment, I assume.

machiner,

Ubuntu annoys me more and more every day and I think I will return to
Debian again quite soon. I hope I will have a better experience with
Recoll then.
Fabrice Colin
Thanks !
written by Fabrice Colin, December 06, 2009
Hi,
I am the developer of Pinot. Thanks for the article !
What are the specs of your machine (memory, CPU) ?
You mentioned up to two minutes to index a PDF document. How big was that particular PDF ? Was it mainly text, did it include graphics ?
How big was the final index ?
J.F. Dockes
Recoll questions
written by J.F. Dockes, December 06, 2009
Hello,

I'm the developer of Recoll. It really seems that this is the place to be for desktop search developers these days (hi Fabrice). Thanks for taking the time to perform this very interesting comparison. I have two questions about the Recoll issues:

- About the "dir:" directive. This is intended to select the whole subtree, and seems to work for me. I'd be interested by more detail about how it doesn't work for you.

- With stemming on, searching for an english word or its plural should yield the same results (!) I'd be quite interested by the output of the "Show query" links (at the top of the GUI result list) for the singular and plural cases.

You can reach me as jfd at recoll dot org if you wish.

Thanks,

jf
minaev
...
written by Dmitri Minaev, December 07, 2009
Fabrice,

I tested Pinot on Lenovo M52 PC with Intel Core2 Duo 4300 CPU and 1Gb RAM. The final index size was comparable to that of Recoll, around 5Gb, if I remember correctly.

Today I tried to re-create the index with Pinot, launched pinot-dbus-daemon and left home. Later, I connected to the test machine with ssh, but I found that the daemon crashed. I tried to restart it but it was impossible:

"Couldn't open bus connection: /bin/dbus-launch terminated abnormally with the following error: Autolaunch error: X11 initialization failed."

This is one of the reasons why I dislike dbus so much.

Unfortunately, I don't remember what file it was that took two minutes to be indexed, but most probably it was one of those 100Mb PDFs. Never mind, I believe it is hardly possible to increase the speed of external pdf-to-text converters.
minaev
...
written by Dmitri Minaev, December 07, 2009
J.F., I am really sorry -- dir: keyword does search under the specified directory. I cannot understand what I did wrong when I could not make it work.

As for the stemming, it doesn't work for me. The queries look pretty straightforward:

(notebook:(wqf= 11))

returns 127 results and

(notebooks:(wqf= 11))

gives 45 hits.

The reason is, probably, shown below, in the error message I get when I try to search from the command line:

:2:../rcldb/stemdb.cpp:284:stemExpand: error accessing stem db. dbdir [/home/minaev/.recoll/xapiandb] lang [english]
J.F. Dockes
Recoll stemming
written by J.F. Dockes, December 07, 2009
About the "dir:" thing: please don't be sorry, these tools are really difficult to test..

About stemming, yes, the error message explains why it doesn't work, the two queries should be identical. It looks like the indexing did not finish properly, either because it crashed, or hit a disk occupation limit, or was interrupted, and it did not create the stemming "dictionary", which is normally done at the end.

It's difficult to know the reason without looking at the logs. The current version is a little too greedy about what it tries to index, for example it's not afraid of going after multi-GB ".xsession-errors" files, which usually ends up in smoke. The next version will be a bit more prudent (and efficient).
minaev
...
written by Dmitri Minaev, December 07, 2009
Well, I don't think the indexer crashed. I tried to run `recollindex -s' to rebuild the stem database after the indexer had finished, but the error persisted.
J.F. Dockes
...
written by J.F. Dockes, December 07, 2009
Sorry about the trouble.

Could you please try to bump the verbosity level to 4, either through the GUI:
Preferences->Indexing configuration->Verbosity level, or by setting "loglevel=4" in ~/.recoll/recoll.conf,
then retry "recollindex -s english", this is quite unusual.
minaev
...
written by minaev, December 07, 2009
Sure. I sent the results by e-mail.

Thanks a lot to you all for what you, Fabrice and other developers do!
minaev
...
written by minaev, December 07, 2009
I knew it was my fault.

The man page says that "At the time of this writing, the following languages (abbreviations) are recognized: ...english (en)". So, I used the abbreviated language name, en. When the full language name was given, the correct stemming database was created and it worked. For Russian, stemming still doesn't work, though.
J.F. Dockes
...
written by J.F. Dockes, December 07, 2009
This is at the very minimum a documentation bug (and the software should detect the problem too).
The reason why this area is so raw is that most users never have to deal with stemmer names. The basic mechanism is that the right stemmer is determined from the locale, and the stemming database automatically created at the end of indexing, without any need for user input.

If it is needed to specify the stemmer (which may mostly make sense if there are separate indexes for document sets in different languages), the normal approach is to use the GUI index configuration tool, where the names are selected from a list. Anyway, I'll clarify the doc and try to make recollindex a bit smarter. If you still have time for the russian stemmer issue, I'll be glad to take a look (maybe by email?)
Cheers,
jf
Fabrice Colin
...
written by Fabrice Colin, December 08, 2009
Dmitri, thanks for your reply.
The latest release can actually be built without dbus if necessary.
The FAQ (should be installed with the README, latest at http://svn.berlios.de/wsvn/pin...rev=0&sc=0) has tips about reducing memory usage, and instructions on how to compact indexes.
Feel free to post on pinot-discuss if you have any comment/question.
Pawel
...
written by Pawel, December 18, 2009
Strange results. I always found Beagle slow compared to Strigi and I wouldn't call it lightweight.
minaev
...
written by Dmitri Minaev, December 18, 2009
Pawel,

Unfortunately, I had no chance to really test Strigi. I gave it another chance recently, but it still crashes. Must be something wrong with my Ubuntu.

Beagle is not the engine I would recommend if someone asked me for advice on a really lightweight indexer. But compared to the alternatives, it treats the CPU and HDD with care.
Pawel
...
written by Pawel, December 20, 2009
@Dmitri Minaev

Thank you for explaining this.
Ron Liebman
...
written by Ron_, December 22, 2009
I am new to Linux, but I have had no success in trying to index e-mail content using Beagle under Ubuntu. It seems to capture Thunderbird header information, but no content or attachments. Beagle does index file content in my home directory and (with Firefox add-on) content from my visited web pages. But until it can capture e-mail content as well, it seems like a flawed tool.
minaev
...
written by Dmitri Minaev, December 30, 2009
Ron,

You should check Beagle's configuration. The backend responsible for the Thunderbird mail is disabled by default (at least, in Ubuntu). Open the Beagle search tool, select Search -> Preferences -> Data Sources and put a tick at "Thunderbird".
Ron Liebman
...
written by Ron_, January 03, 2010
Thanks, Dmitri. That's not it. Thunderbird data source was enabled.
