Build your own search engine with ht://Dig

Author: Johnathon Williams

Most Linux users know how easily they can run a Web server on their favorite distros. Unfortunately, serving pages is one thing — finding them is another. That’s when many users turn to ht://Dig.

ht://Dig is more than a simple search script for a Web site. It combines a powerful collection of command-line search utilities with an easy-to-use CGI script. Properly configured, they work together to form a robust, extensible search engine for a domain or intranet.

Like Google, ht://Dig can search PDF, PostScript, Microsoft Word, Microsoft Excel, and Microsoft PowerPoint files, in addition to the expected plain text and HTML files. Unlike some search utilities, it maintains its database in plain text files, keeping software dependencies low.

ht://Dig is available as a set of stable binary packages for all the major distros. Most split the program into two packages: htdig, which contains the command-line utilities, and htdig-web, which contains the CGI script. Download and install both from your favorite repository; binaries and source code are also available from the project’s site. As of this writing, the most recent production version is 3.1.6.
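For example, on an RPM-based distro the installation might look something like this (package names vary from distro to distro, so check your repository first):

yum install htdig htdig-web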

Out of the box, ht://Dig is limited to searching plain text and HTML files. Fortunately, a number of conversion utilities can expand its reach. This tutorial includes instructions for indexing PostScript, PDF, Microsoft Word, Microsoft PowerPoint, and Microsoft Excel files. But first, you need to install the following additional packages:

  • Catdoc
  • Xpdf (for pdftotext)
  • Ghostscript (for ps2ascii)
  • xlHtml (for xlHtml and pptHtml)

These conversion utilities act as plug-ins for ht://Dig by converting foreign file types to plain text. All four are available as binary packages for the major distros. (xlHtml is slightly more difficult to find, but good binaries are out there.) Install them from your CDs or favorite repository.
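On an RPM-based distro, the installation might look something like this (the package names are illustrative and differ between distros; xlHtml in particular may have to be fetched separately):

yum install catdoc xpdf ghostscript xlhtml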

Finally, you need doc2html. This collection of Perl scripts serves as a go-between for ht://Dig and the other utilities. Download it from ht://Dig’s parsers directory. I recommend you unpack the archive to /opt/local/htdig/scripts/. That way, you won’t need to change as many paths when we configure the script.
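Assuming you downloaded the archive to your home directory, the unpacking might look like this (the archive name is illustrative; use the version you actually downloaded):

mkdir -p /opt/local/htdig/scripts
cd /opt/local/htdig/scripts
tar xzf ~/doc2html.tar.gz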

Configure the conversion utilities

The main script in the doc2html collection is doc2html.pl. Open it in your favorite text editor and scroll down until you see these lines:

######## Full paths to conversion utilities ##########
######## YOU MUST SET THESE ##########
######## (comment out those you don't have) ##########

Below them is a list of variables that call the conversion utilities. Before you can use a given utility, you must set the corresponding variable to the full path of its installed binary.

Let’s do the conversion utility for Microsoft Word documents first. Find these lines:

#version of catdoc for Word6, Word7 & Word97 files:
my $CATDOC = '';

Go to the second line, and insert the path for your installation of catdoc between the quotation marks. For my installation, the modified line looks like this:

my $CATDOC = '/usr/bin/catdoc';

Now activate the rest of the utilities in the same fashion. Double-check the path for each application on your system and correct it if necessary:

  • For Microsoft Excel: my $XLS2HTML = '/usr/bin/xlhtml';
  • For Microsoft PowerPoint: my $PPT2HTML = '/usr/bin/ppthtml';
  • For PDF: my $PDF2HTML = '/opt/local/htdig/scripts/pdf2html.pl';
  • For PostScript: my $CATPS = '/usr/bin/ps2ascii';

The configuration file includes variables for WordPerfect, Flash, Shockwave, and rich text file types as well, but you do not have to install and configure every utility available. Feel free to pick and choose.

Lines that point to missing conversion utilities should be left alone. Pay special attention to this, because the instructions included in doc2html.pl itself are incorrect. The comment ######## (comment out those you don't have) ########## is wrong: do not comment out the variables that point to utilities you don’t have.
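In other words, a skipped utility should keep its variable assignment with an empty value. Taking WordPerfect as a hypothetical example (the exact variable name in your copy of doc2html.pl may differ):

my $WPD2HTML = '';   # leave unused entries empty; do not comment them out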

When you’re finished, save and close the file.

You need to configure one more script from doc2html. PDF files require a little extra help, so open pdf2html.pl in your text editor. Scroll down until you see my $PDFTOTEXT and my $PDFINFO. Change the quoted value of each to the full path for pdftotext and pdfinfo, respectively. On my system, the modified lines look like this:

my $PDFTOTEXT = "/usr/bin/pdftotext";
my $PDFINFO = "/usr/bin/pdfinfo";

When these are correct, save and close the file.

Configure ht://Dig — essentials

ht://Dig’s main configuration file is htdig.conf. On my system, it installs in /etc/htdig/. Open this file in your text editor, scroll down to the end, and find start_url. (This is also listed near the beginning of the file, but it is commented out and should be ignored.) Set this variable to the URL or URLs that you want to index. Erase the comment sign, if present, and enter your addresses after the colon. Separate multiple URLs with white space.

Make your entries here as specific as possible. If the files you want to index are contained in a specific directory, then give the URL with the path to that directory, instead of the base URL alone. This will keep your index as lean (and as fast) as possible.

The edited line should look something like this:

start_url: http://www.foo.com/PDFs/ http://www.foo.com/word-docs/

Next, scroll down until you see limit_urls_to: ${start_url}.

You don’t need to change this line, but you should be aware of it: it prevents ht://Dig from following links to URLs other than those you specify. Without it, the program would try to index the entire Internet. The only circumstance in which you would change it is a domain or intranet with zero outbound links.
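If you ever do want ht://Dig to roam beyond start_url (say, following links anywhere on your domain while start_url points at a single subdirectory), you could loosen the line to a prefix like this; the host name is just an example:

limit_urls_to: http://www.foo.com/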

Now it’s time to tell ht://Dig how to play nice with the conversion utilities. Make some whitespace below the previous line and enter the following; note that the trailing backslashes let the single external_parsers value continue across several lines:


external_parsers: application/pdf->text/html /opt/local/htdig/scripts/doc2html.pl \
    application/postscript->text/html /opt/local/htdig/scripts/doc2html.pl \
    application/msword->text/html /opt/local/htdig/scripts/doc2html.pl \
    application/msexcel->text/html /opt/local/htdig/scripts/doc2html.pl \
    application/vnd.ms-excel->text/html /opt/local/htdig/scripts/doc2html.pl \
    application/vnd.ms-powerpoint->text/html /opt/local/htdig/scripts/doc2html.pl

The above code assumes that you installed doc2html in the recommended location. If not, change the path in the second half of each line to match your installation. Each new file type requires a line in this section, so if you decide to install one of the conversion utilities that I skipped, don’t forget to add a line for it, too. You can find sample lines for each of the conversion utilities at the top of doc2html.pl.

Finally, scroll to the bottom of the configuration file and find local_urls_only: true. This line limits ht://Dig to files on its home computer. If you plan to index files on other machines, you should comment out this line by placing a pound sign in front of it.
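After the change, the line should look like this:

#local_urls_only: true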

Configure ht://Dig — little things

By now your configuration file is in pretty good shape, but there are still a couple of things you can do.

Scroll down until you see max_head_length. This value determines how many bytes from each document are stored in the index. Comments in htdig.conf suggest you set this to about 50,000 if you want to index most of the text in each document. Just to be sure, I set mine to 80,000, and tests confirmed that it grabbed every scrap of text from my files. A higher number will increase the size of your database, so remember to balance the quality of your index against the free space on your hard drive.
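The finished line in htdig.conf is a single attribute:

max_head_length: 80000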

Below this, you will find max_doc_size. This value sets the size cutoff for indexed files. Leaving it at the default setting of 100KB will mean that your 1.5MB PDF files are ignored. I wanted to index a collection of very large PDFs, so I set mine to 2.5MB. (This value must be set in bytes, too, so 2.5MB becomes 2500000.)
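The corresponding line looks like this:

max_doc_size: 2500000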

When you’re finished, save your work and close the file.

Index and search

Now comes the fun part. Open a terminal, and type rundig -vvv.

Rundig is a script that calls all of the ht://Dig utilities necessary to create your index. The -vvv flag tells the script to be as verbose as possible, so you can see if any serious errors occur. Indexing can take a long time for a large domain or intranet, so be prepared to wait. The process is also a CPU hog, so don’t count on playing Quake to pass the time. (Not on the same machine, anyway.)

When your index is ready, open your Web browser and point it to http://localhost/htdig/search.html. If that doesn’t bring up the search page, double-check your install location for the htdig-web package. On my machine, ht://Dig installs its default search page to /var/www/html/htdig/search.html and its CGI script to /var/www/cgi-bin/htsearch.

Use the provided search form to perform searches on your new index. If you’re not seeing as many results as you expect, try increasing the value for max_head_length in htdig.conf.

Issues and limitations

One common problem is that ht://Dig creates a new index each time rundig is invoked, rather than simply updating its existing index. This keeps your index lean by purging dead links, but it also immediately destroys your existing database, halting searches until the new one is complete. To keep the old database available for searching while the new one is being built, index with this command: rundig -a. The -a option builds the new index in alternate work files and swaps it into place only once indexing is finished.
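To re-index regularly without interrupting searches, you might schedule the command with cron. A crontab entry like this one rebuilds the index at 3 a.m. every night (adjust the path to rundig for your system):

0 3 * * * /usr/bin/rundig -a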

No hard-coded limits cap the database size, so ht://Dig will happily store gigabytes of search data in its index. Make sure that you’ve allocated plenty of room for the partition that contains the default database location, /var/lib/htdig. If you need more, you can change the database location by altering the database_dir variable in htdig.conf.
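For example, to move the databases to a roomier partition (the path here is hypothetical):

database_dir: /srv/htdig/db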

Learning more

This article serves only as a brief introduction to the power of ht://Dig. To learn more, check out the official documentation, and sign up for the mailing lists.

Happy digging!
