Adding search to your Web site with Xapian and Omega

Author: Ben Martin

With Xapian and Omega you can quickly build a powerful search interface for your Web site. You’ll be able to index your HTML, PDF, and PHP content and search for it by metadata or words contained in the documents.

The shared library that implements the actual index is called Xapian. Omega is a set of tools built by the Xapian team that lets you use the library for indexing and searching even if you are not a software developer. Since Omega uses Xapian, if your distribution’s package repository includes Omega, then installing it will also install Xapian as a dependency.

Omega is packaged as a 1-Click install for openSUSE and is available in Ubuntu Hardy Universe, but only Xapian is packaged for Fedora 9 as xapian-core-devel. I installed the Xapian package and built version 1.0.8 of Omega from source on a 64-bit Fedora 9 machine with the usual ./configure; make; sudo make install commands. Along with the application, this installs the documentation, configuration file, and manual pages. You should also execute the commands below to set up the Web search interface.

# mkdir -p /var/lib/omega/data/default
# cp -av xapian-omega-1.0.8/templates /var/lib/omega
# chown -R root.apache /var/lib/omega
# find /var/lib/omega -exec chmod +s {} \;
# cp -av /usr/local/lib/xapian-omega/bin/omega /var/www/cgi-bin

First, the index directory is created and the Omega site templates are copied to the directory in /var/lib where Omega expects to find them by default. The ownership of that directory is set to allow the apache group access, and the setgid bit is set (chmod +s sets it along with the setuid bit) so that new files Omega creates in those directories will also be in the apache group. The final command places the omega binary into the cgi-bin directory so that you can execute Omega from the Web browser.

Getting started

Omega uses the omindex command to create, populate, and update content indexes. Omega allows you to have many indexes, each one being stored in its own directory. I’ll use the default index location in the commands below. You should specify the path where Omega is to store an index, the URL or relative path for the Web site you are indexing, and the filesystem path containing the files you wish to index. The invocation of omindex shown below will create an index containing every file in your Web DocumentRoot. Because we’re indexing the entire Web server, the --url parameter can simply be /.

# grep DocumentRoot /etc/httpd/conf/httpd.conf
...
DocumentRoot "/var/www/html"
...
# omindex --db /var/lib/omega/data/default --url / /var/www/html
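If you would rather not copy the DocumentRoot by hand, the lookup can be scripted. The following is a minimal sketch, not part of Omega: the helper name docroot_from_conf and the CONF variable are hypothetical, and the code assumes the DocumentRoot path contains no spaces. It prints the omindex command for review rather than running it.

```shell
#!/bin/sh
# Sketch: read DocumentRoot from an Apache config file and build the
# omindex command from the article. docroot_from_conf and CONF are
# hypothetical names introduced for this example.

docroot_from_conf() {
    # Print the first uncommented DocumentRoot value with quotes removed.
    # Assumes the path contains no spaces.
    awk '$1 == "DocumentRoot" { gsub(/"/, "", $2); print $2; exit }' "$1"
}

CONF=${CONF:-/etc/httpd/conf/httpd.conf}
if [ -r "$CONF" ]; then
    root=$(docroot_from_conf "$CONF")
    # Print the command instead of executing it.
    echo "omindex --db /var/lib/omega/data/default --url / $root"
fi
```

Commented-out lines such as `# DocumentRoot ...` are skipped because their first field is not exactly DocumentRoot.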

If you host many Web sites in your DocumentRoot, or if you would like to be able to limit your search results to specific sites, you can use the --url option together with the path that is supplied as the final argument to tell Omega that it is indexing distinct Web sites. The --url can be an absolute URL, or, more conveniently, you can supply just the relative part of the URL.

For example, assume that you have the Linux HOWTO files and your business Web site being served from a single DocumentRoot. The commands below will index these two sites with Omega, preserving the information that they are two sites, which will allow you to search for results only on a specific site.

# omindex --db /var/lib/omega/data/default --preserve-nonduplicates --url /howto /var/www/html/howto
# omindex --db /var/lib/omega/data/default --preserve-nonduplicates --url /mysite /var/www/html/mysite

When you run omindex it will normally delete any document from the index that was not seen in the filesystem. If you always run omindex over the whole Web site, this behaviour is exactly what you want, since when a file is deleted from the filesystem it should also be deleted from the index. When you are running omindex multiple times as shown here, you will want to tell it to leave things in the index even though it has not seen them in the current indexing operation; that’s the reason for the --preserve-nonduplicates option. This way, when you index the second URL, mysite, omindex will not delete all the index entries for the howto files just because it does not see those files during the current index run.
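With more than a couple of sites, the per-site invocations can be wrapped in a loop. This is a minimal sketch under stated assumptions: the database path, DocumentRoot, and site names are the ones used in this article, while index_site and the DRY_RUN switch are hypothetical helpers. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: index several sites into one Omega database.
# DB, DOCROOT and the site names come from the article; index_site and
# DRY_RUN are hypothetical helpers introduced for this example.
DB=/var/lib/omega/data/default
DOCROOT=/var/www/html
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands, anything else = run them

index_site() {
    site=$1
    # Build the omindex command line as the positional parameters.
    set -- omindex --db "$DB" --preserve-nonduplicates \
        --url "/$site" "$DOCROOT/$site"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

for s in howto mysite; do
    index_site "$s"
done
```

Run it with DRY_RUN=0 once the printed commands look right.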

Another advantage of using the above commands is that you can update the index for each site independently. If one of the sites is particularly large, you can also update only a specific subdirectory of that site using the invocation shown below. Note that the URL and the site’s base path must remain the same; you simply append, as a separate argument, the subdirectory within the site whose index you would like to freshen.

# omindex --db /var/lib/omega/data/default --url /mysite --preserve-nonduplicates /var/www/html/mysite quickly/changing/dir

Omega natively supports indexing a large range of file types. HTML, plain text, and PHP are of course supported, along with document formats like PDF, PostScript, OpenOffice.org, Microsoft Office, AbiWord, RTF, DjVu, and more esoteric formats like Perl POD and TeX DVI files.

Once you have added documents to the Omega index, you can check whether you are able to search for them using the omega command as shown below. You need to make sure you have the right omega binary. I found that the texlive package included an omega binary that was in my path, and my system used it in preference to the omega binary shipped with Omega, so I had to use the full path to the command I wanted. The command below will perform a ranked full-text query for the word “load” on the index created above.

# /usr/local/lib/xapian-omega/bin/omega 'P=load' HITSPERPAGE=10

Because the Omega binary used above is the same one used for the CGI interface, you have to pass the full text query using the P= CGI parameter prefix. Once you know that Omega is finding your index and you can execute a query successfully, you can try using the Web interface. Verifying that Omega is working from the command line before trying the Web interface takes one factor out of the equation if the Omega Web interface doesn’t work right off the bat. To use the Web interface, load the http://localhost/cgi-bin/omega URL into your Web browser and search your Web site using Omega. This screenshot shows the Omega search interface being used to find HOWTO documents.
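The same query can be sent to the CGI program as a URL, provided the query text is percent-encoded. The sketch below builds such a URL; urlencode and omega_url are hypothetical helper names, the host and CGI path are the ones used in this article, and the encoder assumes ASCII input.

```shell
#!/bin/sh
# Sketch: build a search URL for the Omega CGI program.
# urlencode and omega_url are hypothetical helpers; the base URL and the
# P and HITSPERPAGE parameters are the ones used in the article.

urlencode() {
    # Percent-encode every character except letters, digits and . _ ~ -
    # Assumes ASCII input.
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

omega_url() {
    # $1 = query text, $2 = hits per page (defaults to 10)
    printf 'http://localhost/cgi-bin/omega?P=%s&HITSPERPAGE=%s\n' \
        "$(urlencode "$1")" "${2:-10}"
}

omega_url 'load average'
```

If curl is installed, something like `curl -s "$(omega_url 'load average')"` should fetch the result page.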

The query syntax that Omega uses includes support for the boolean AND, OR, XOR, and NOT operators. You can also use + and - to specify terms which must or must not be present in the documents you are seeking. If you do not use + or -, your query terms are evaluated in a ranked fashion: a resulting document might lack some of the terms you entered if Omega judges it relevant enough to the other terms in your query.

The NEAR and ADJ keywords let you find documents in which words are located close to each other. The difference between these keywords is that for ADJ, order is important: the query term before ADJ must appear before the query term after ADJ in the document. For NEAR, the query terms must be near each other in the document, but the order in which they appear doesn’t matter. By default, the query terms may be up to 10 words apart for both NEAR and ADJ. You can specify a different distance by appending /n after NEAR or ADJ, where n is the number of words that the query terms can be apart.

You can query for an exact phrase by enclosing it in double quotes. Query terms containing slashes or the @ character are automatically treated as phrase queries. If your query term contains a capital letter it is considered a proper name and thus is not stemmed before searching. Normally text in an index undergoes a process called stemming where a word is converted into a more basic form. For example, the three related words stemmer, stemming, and stemmed might simply become stem. See the project’s query syntax page for full details.

Omega also indexes some metadata about each file. You can include this metadata in your queries to constrain which results are returned by starting a query term with a capital letter prefix followed by the metadata restriction. For example, the files indexed with omindex using --url /howto will have a pathname metadata field that is /howto. If you want to find a howto about Ethernet, and your Omega index contains numerous Web sites but you only want to search the howto documents, you can use the query ethernet P/howto, where P is the prefix for the pathname field. See the documentation for full details on the other prefixes.
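The query forms described above can be tried from the command line in the same way as the earlier test query. This is a sketch only: the search terms are illustrative, run_query and DRY_RUN are hypothetical helpers, and the binary path is the one from my source install. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: try several query forms against the default index.
# The search terms are made up for illustration; run_query and DRY_RUN
# are hypothetical helpers introduced for this example.
OMEGA=/usr/local/lib/xapian-omega/bin/omega
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands, anything else = run them

run_query() {
    if [ "$DRY_RUN" = 1 ]; then
        printf "%s 'P=%s' HITSPERPAGE=10\n" "$OMEGA" "$1"
    else
        "$OMEGA" "P=$1" HITSPERPAGE=10
    fi
}

run_query 'ethernet AND driver'    # boolean AND
run_query '+ethernet -wireless'    # must contain / must not contain
run_query 'ethernet NEAR driver'   # close together, either order
run_query 'network ADJ/3 card'     # "network" first, distance set to 3
run_query '"network card"'         # exact phrase
run_query 'ethernet P/howto'       # only documents under /howto
```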

You can change which template is used by altering the FMT CGI parameter used to call omega. The FMT parameter contains the page “format” that you want Omega to use.

Examples of using the FMT page format parameter include using the “topterms” page format to have Omega suggest extra query terms across the top of the result set that might help you refine your query. Using the page format “godmode” lets you search for documents by ID and also see a list of query terms that would have found each document. Of course, the list of query terms (distinct words) for a document might be quite long, so the use of godmode should be restricted or disabled to prevent users from bogging down your system. Using godmode helps when you are learning the more powerful query abilities. Using the “xml” FMT makes Omega produce search results in XML instead of as an HTML Web page.
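To compare the page formats side by side, you can generate one URL per format and open each in a browser. A minimal sketch, assuming the CGI URL used earlier in this article; fmt_urls is a hypothetical helper and it expects a single-word query, since it does no URL encoding.

```shell
#!/bin/sh
# Sketch: print one search URL per page format named in the article.
# BASE is the CGI URL used in the article; fmt_urls is a hypothetical
# helper introduced for this example.
BASE=http://localhost/cgi-bin/omega

fmt_urls() {
    # $1 is a single-word query; multi-word queries would need encoding.
    for fmt in topterms godmode xml; do
        printf '%s?P=%s&FMT=%s\n' "$BASE" "$1" "$fmt"
    done
}

fmt_urls load
```

The last URL printed requests the XML page format, which is the one to fetch from a script if you plan to post-process the results.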

With Xapian and Omega you can provide both full-text and metadata search for your Web site. You can also use the XML format page to obtain search results from Omega and post-process them directly into your current Web site without diverting the user to the CGI program. This allows you to use Omega to quickly add search to your site without breaking the look and feel by including Omega Web pages directly. Xapian and Omega support advanced features such as proximity searching, which should please power users.
