Index and search with KDE’s new Strigi

Author: Ben Martin

The Strigi project is the core of the index and search technology for KDE 4. Strigi is designed to be small and fast, and it can be installed and used with or without KDE 4, as we’ll see.

Strigi uses plugins to handle its indices, filetypes, and metadata extraction. Currently the filesystem index can be stored in SQLite 3, Xapian, CLucene, or Hyper Estraier. The filetype plugins allow Strigi to get at the text content of files that are not plain text, such as PDF or office file formats. The metadata extraction plugins can tell Strigi information about files, such as the ID3 metadata tags from audio files.

The Strigi distribution includes the main indexing daemon strigidaemon as well as a small collection of clients: xmlindexer, strigiclient, strigicmd, deepfind, and deepgrep. Strigiclient has a GUI and allows you to start and stop the daemon as well as populate and search your indices.

For those using Fedora 8, Strigi is available from the standard updates repository and can be installed by issuing yum install strigi.

The strigi-libs-0.5.7 RPM I tested did not include CLucene indexing support. This is a major setback, because CLucene is the back end Strigi tries to use by default. This leads to a fatal error when you try to start the daemon:

$ /usr/bin/strigidaemon
Unknown backend type: clucene

Hopefully this flaw will be fixed in a later package. To get around it, I installed the CLucene, Qt4, and exiv2 development packages and built Strigi from source. Strigi uses cmake instead of the autotools to configure its build environment. Building on a Fedora 8 64-bit platform, I found that I had to pass a -D option to cmake in order to get CLucene detected. The commands below will install Strigi from source on a Fedora 8 64-bit machine:

# rpm -e strigi strigi-libs
# yum install clucene-core-devel qt4-devel cmake file-devel exiv2-devel
...
$ tar xjvf .../strigi-0.5.7.tar.bz2
$ cd strigi-0.5.7
$ cmake -G "Unix Makefiles" -DLIB_SUFFIX=64
$ make
...
[100%] Built target indextester
[100%] Built target strigiclient
$ sudo make install

By default Strigi comes with a no-frills graphical client for searching your filesystems. For those running KDE or GNOME, strigiapplet allows you to run a Strigi query from your desktop panel. KDE users also get a KIO slave that allows you to search directly from within Konqueror. As the KIO slave and applet are distributed in the strigiapplet package, you should install kdebase-devel and kdelibs-devel to build the applet and KIO slave.

Getting started

Now let’s take a look at how to use Strigi to extract metadata, create indexes, and start searching. The xmlindexer command is handy for seeing the world as Strigi does. Running xmlindexer on a file will cause Strigi to extract all the information from that file that it might choose to index. An example is shown below for a simple text file. If you specify an absolute path to xmlindexer then the core#url value will include the full file path. Running xmlindexer against an audio file might also show you audio.title, audio.artist, audio.album, and content.genre metadata.

[~]$ cd /tmp
[tmp]$ date > testfile1.txt
[tmp]$ xmlindexer testfile1.txt
nthreads: 2
<?xml version='1.0' encoding='UTF-8'?>
<metadata>
 <file uri='testfile1.txt' mtime='1200635133'>
  <value name='http://freedesktop.org/standards/xesam/1.0/core#url'>testfile1.txt</value>
  <value name='http://freedesktop.org/standards/xesam/1.0/core#size'>29</value>
  <value name='http://freedesktop.org/standards/xesam/1.0/core#sourceModified'>1200635133</value>
  <value name='http://strigi.sf.net/ontologies/0.9#depth'>0</value>
  <value name='http://freedesktop.org/standards/xesam/1.0/core#fileExtension'>txt</value>
  <value name='http://freedesktop.org/standards/xesam/1.0/core#name'>testfile1.txt</value>
  <value name='http://strigi.sf.net/ontologies/0.9#parentUrl'></value>
  <value name='http://strigi.sf.net/ontologies/0.9#depth'>0</value>
  <text>Fri Jan 18 15:45:33 EST 2008 </text>
 </file>
</metadata>

Running the strigiclient command brings up its graphical user interface. The Edit menu allows you to set filters to tell Strigi which files you wish to have indexed and which to exclude. Each filter is a glob that is matched against file names and tells Strigi either to index or to ignore any file that matches. A glob allows you to use a * character to match any string and the ? character to match any single character. For more information see glob(7). The filters are evaluated in order, and if none match, then the file will be indexed by default. You should also be able to list the names of all the indexed files from the Edit menu, though this didn’t seem to work for me.
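The ordered evaluation described above, where the first matching filter decides and unmatched files are indexed, can be sketched in a few lines of shell. This only illustrates the logic and is not Strigi's own code: the function name should_index and the action:glob rule format are made up for the example.

```shell
#!/bin/sh
# Sketch of ordered filter evaluation. NOT Strigi's implementation:
# should_index and the "action:glob" rule format are hypothetical.
# Usage: should_index FILENAME RULE...
# Returns 0 if the file would be indexed, 1 if it would be ignored.
should_index() {
    name=$1
    shift
    for rule in "$@"; do
        action=${rule%%:*}    # text before the first colon
        pattern=${rule#*:}    # text after the first colon
        # An unquoted $pattern in a case label is matched as a glob.
        case $name in
            $pattern)
                # The first matching rule decides; later rules are skipped.
                if [ "$action" = include ]; then return 0; else return 1; fi
                ;;
        esac
    done
    return 0    # no rule matched: index by default
}

if should_index "main.o" "exclude:*.o" "include:*.txt"; then
    echo "main.o: index"
else
    echo "main.o: ignore"
fi
```

Because evaluation stops at the first match, a broad rule such as include:* placed first would hide every rule after it, so the order of the filters matters.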

The stop daemon button works as a toggle and allows you to start and stop the Strigi daemon. You can set the directories that Strigi will index with the add and remove directory buttons at the bottom of the GUI. The text field at the bottom right of the window allows you to submit a query.

The strigicmd program allows you to manipulate Strigi indexes and search them from the command line. To query, you must supply the index type and its location, along with the actual query you wish to perform. In the query examples below I first search for a file by an exact match on its file size. The second query looks for any files with “alice” in their file name. The third query finds any files that were modified after the alice13a.txt file. The final query restricts the results to only those modified within 24 hours of alice13a.txt. Such a query can be handy when you know that you have modified a collection of files at a similar time but can’t recall the exact list of files you changed. Unfortunately, I could not get Strigi to accept human-readable time values; it uses the raw Unix time_t value, the number of seconds since the epoch.

$ strigicmd query -t clucene -d ~/.strigi/clucene 'size=153477'
n backends: 1
Results for search "size=153477"
"/home/ben/guten/alice13a.txt" matched
- mimetype:
- sha1:
- size: 153477
- mtime: Sat Jan 12 13:23:59 2008
- fragment:
- http://freedesktop.org/standards/xesam/1.0/core#fileExtension: txt
- http://freedesktop.org/standards/xesam/1.0/core#name: alice13a.txt
- http://strigi.sf.net/ontologies/0.9#depth: 0
- http://strigi.sf.net/ontologies/0.9#parentUrl: /home/ben/guten
Query "size=153477" returned 1 results
$ strigicmd query -t clucene -d ~/.strigi/clucene 'name:*alice*' | grep 'http://freedesktop.org/standards/xesam/1.0/core#name'
n backends: 2
- http://freedesktop.org/standards/xesam/1.0/core#name: alice13a.txt.desktop
- http://freedesktop.org/standards/xesam/1.0/core#name: alice13a.txt
- http://freedesktop.org/standards/xesam/1.0/core#name: alice13a.txt
- http://freedesktop.org/standards/xesam/1.0/core#name: alice-copy.txt
$ strigicmd query -t clucene -d ~/.strigi/clucene 'sourceModified>=1200108239' | grep 'http://freedesktop.org/standards/xesam/1.0/core#name'
n backends: 2
- fragment:
- http://freedesktop.org/standards/xesam/1.0/core#name: konq_history
- http://freedesktop.org/standards/xesam/1.0/core#name: katepartindentjscriptrc
...
$ strigicmd query -t clucene -d ~/.strigi/clucene 'sourceModified>=1200108239 sourceModified
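Because the queries above take raw epoch seconds, it can help to let date(1) do the conversion. The sketch below assumes GNU date (for the -d option), as shipped with Fedora, and the helper name epoch_of is made up for the example. The input time is the mtime of alice13a.txt expressed in UTC, which works out to 1200108239, the value used in the listing.

```shell
#!/bin/sh
# Sketch: turn a human-readable time into the epoch seconds that the
# sourceModified queries expect. Assumes GNU date (the -d option).
# epoch_of is a hypothetical helper name, not part of Strigi.
epoch_of() {
    date -u -d "$1" +%s    # -u: read and convert the time as UTC
}

start=$(epoch_of "2008-01-12 03:23:59")   # time given in UTC
end=$((start + 24 * 60 * 60))             # 24 hours later

echo "lower bound query: sourceModified>=$start"
echo "upper bound epoch: $end"
```

The lower-bound string can be passed to strigicmd query as in the third example above. The upper bound is printed only as a number here.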

You can also search for files by their names or parts thereof. If you do not specify a prefix such as size or name, then you will perform a fulltext search on your index, as shown in the second example below. You can use xmlindexer on files to get an idea of what prefixes are available for use in queries.

$ strigicmd query -t clucene -d ~/.strigi/clucene 'name:alice*'
n backends: 1
Results for search "name:alice*"
"/home/ben/guten/alice13a.txt" matched
- mimetype:
- sha1:
- size: 153477
- mtime: Sat Jan 12 13:23:59 2008
- fragment:
- http://freedesktop.org/standards/xesam/1.0/core#fileExtension: txt
- http://freedesktop.org/standards/xesam/1.0/core#name: alice13a.txt
- http://strigi.sf.net/ontologies/0.9#depth: 0
- http://strigi.sf.net/ontologies/0.9#parentUrl: /home/ben/guten
Query "name:alice*" returned 1 results
$ strigicmd query -t clucene -d ~/.strigi/clucene 'mine'
...
Query "mine" returned 3 results
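To see which field names a file exposes, the value names in xmlindexer output can be reduced to their last component. This is a sketch that assumes the output format shown earlier and that the part after the # is what a query prefix matches. That holds for size, name, and sourceModified in the examples above, but I have not verified it for every field. The function name list_fields is made up for the example.

```shell
#!/bin/sh
# Sketch: list the short field names found in xmlindexer output.
# list_fields is a hypothetical helper; it reads XML on standard input.
list_fields() {
    grep -o "<value name='[^']*'" |   # one match per output line
        sed "s/.*#//; s/'\$//" |      # keep the text after the last '#'
        LC_ALL=C sort -u              # drop duplicates such as depth
}

# Example with two lines copied from the xmlindexer output above:
list_fields <<'EOF'
<value name='http://freedesktop.org/standards/xesam/1.0/core#size'>29</value>
<value name='http://strigi.sf.net/ontologies/0.9#depth'>0</value>
EOF
```

On a machine with Strigi installed, the same function could be fed directly, as in xmlindexer testfile1.txt | list_fields.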

At this stage I noticed that a search for "Wonderland" did not find the novel "Alice in Wonderland". I downloaded this novel from Project Gutenberg into alice13a.txt. Not finding this file means that full text searching is not finding all the files that it rightly should. To debug this I built the SQLite 3 Strigi back end, and found that the information from alice13a.txt was not making it to the index.

Knowing that Strigi uses libmagic, and having worked on file indexing code in the past, I suspected that perhaps Strigi was ignoring alice13a.txt on purpose. The libmagic library is used to detect the filetype and MIME type of files. Its command-line tool is file. Using file -i to see what MIME type file reports, I saw that two files were reported as application/octet-stream, one of which was alice13a.txt. By copying only the first part of alice13a.txt, including the string "Alice in Wonderland", to alice-copy.txt, I was able to convince the file command (and thus libmagic) that the copy was indeed a text document.
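A quick way to repeat this check over a whole directory is to ask file for each MIME type and flag anything not reported as text. This sketch assumes file(1) is installed and that -i prints the MIME type, as it does on Linux. The helper names is_text_mime and report_non_text are made up, and the script says nothing definite about how Strigi itself acts on libmagic's answer.

```shell
#!/bin/sh
# Sketch: flag files that file(1) does not report as text.
# is_text_mime and report_non_text are hypothetical helper names.

# Succeeds if a MIME string such as "text/plain; charset=us-ascii"
# has the top-level type "text".
is_text_mime() {
    case $1 in
        text/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Prints "MIME<TAB>FILE" for every argument not detected as text.
report_non_text() {
    for f in "$@"; do
        mime=$(file -b -i -- "$f")    # -b: omit the file name, -i: MIME output
        is_text_mime "$mime" || printf '%s\t%s\n' "$mime" "$f"
    done
}

# Example: report_non_text ~/guten/*.txt
```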

I then updated my Strigi index by clicking "start indexing" in strigiclient. After this I could successfully find alice-copy.txt by typing Wonderland into the search widget. Reliably detecting plain text files can be an issue, given the various character encodings that might be used to store plain text; that said, I did not expect that a Project Gutenberg text would be ignored by Strigi.

Strigi's deepfind and deepgrep commands allow you to crawl into archive files. Deepfind only accepts a single optional command-line argument that specifies where to start the recursive find; if omitted, the current directory is used. Each file that Strigi can find has its path printed, one path per line. Deepfind is useful if you know the file name you are after and you want to find all versions of that file, even when some of them are inside a tar.gz file.
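Since deepfind prints one path per line, picking out every copy of a known file name comes down to filtering on the last path component. The sketch below assumes paths inside archives are printed as archive/path, as in deepgrep's output. The helper name match_basename is made up for the example, and the sample paths are invented so the block runs without Strigi installed.

```shell
#!/bin/sh
# Sketch: keep only the paths whose final component equals NAME.
# match_basename is a hypothetical helper; it reads paths on stdin.
match_basename() {
    awk -F/ -v n="$1" '$NF == n'    # $NF is the text after the last slash
}

# Invented sample paths standing in for deepfind output:
match_basename FindQt4.cmake <<'EOF'
/tmp/strigi-0.5.7.tar.bz2/strigi-0.5.7/cmake/FindQt4.cmake
/tmp/notes/FindQt4.cmake.orig
EOF

# With Strigi installed: deepfind /tmp | match_basename FindQt4.cmake
```

Matching on the whole final component, rather than using a plain grep for the name, avoids hits on files such as FindQt4.cmake.orig.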

Deepgrep is meant to offer the same archive crawling as deepfind while working more like the standard grep(1) command. Until a fix was made last week, a bug present in many releases prevented deepgrep from functioning at all. The fix will be included in the next Strigi release. Deepgrep is a handy tool for finding out which archives contain information that might be of interest without having to expand any archives:

$ deepgrep documentation /tmp/strigi-0.5.7.tar.bz2
/tmp/strigi-0.5.7.tar.bz2/strigi-0.5.7/cmake/FindQt4.cmake: # ask qmake for the documentation directory
/tmp/strigi-0.5.7.tar.bz2/strigi-0.5.7/src/streamanalyzer/analyzerconfiguration.h: * TODO: write proper documentation of the pattern syntax.
/tmp/strigi-0.5.7.tar.bz2/strigi-0.5.7/src/streamanalyzer/analyzerconfiguration.h: * See the documentation for the non-const version of this function.
...

Strigi is quick at adding files into its index, but it's not perfect. It would be nice for the graphical search clients to offer more assistance in setting up common searches. For example, the strigiapplet allows you to type in a search string but offers no assistance in selecting which fields might be used to form your query. Having the search clients make an effort to parse human-readable time strings would make searching for files that were modified within a given period simpler.

All in all, Strigi is handy for finding files based on either their metadata, such as file size, ID3 tags, or file name, or their text content.
