December 8, 2005

PhpDig excels at small Web site indexing

Author: Colin Beckingham

Webmasters looking to provide search capabilities for their site would do well to try out PhpDig, a Web spider and search engine written in PHP with a MySQL backend. There are other open source search engines, all of which have their own advantages. PhpDig just happens to suit the needs of my Information Technology for Greenhouses and Horticulture site. Here's how I got it working.

Webmasters with small sites know the problem of providing useful site search capabilities. Typically, visitors enter keywords in a search box and the search engine returns a ranked list of pages related to the query. This is a useful service -- provided the visitor can tune the search and the results returned are reliable and relevant.

Some Webmasters rely on Google for this service. A listing in Google or another mainstream engine is a must-have in practical terms, so it is easy enough to piggyback on the main engine with a site-specific search. That works only if Google understands your site and keeps coming back for updates, which isn't always the case.

Large search engines boast of indexing billions of pages, but we are only interested in digesting a hundred pages or so. We need them indexed on a regular basis: daily, or at least more often than Google might do it.

It is also important to know if our site is responding correctly by providing public pages, hiding private pages, and following links correctly. Since Google uses algorithms that it doesn't share, we have no way of predicting the indexing results or doing any testing in advance. Advance testing is useful if, for example, you have private files that you want to be sure will not be indexed, but you are relying on your robots.txt file to deny access to bots. If we make a spelling mistake in robots.txt, our private pages could go in Google's cache for the world to read. We also need to control what words are indexed and customize our own search and result pages.
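A spelling mistake in robots.txt can be caught before any spider visits. The following is a minimal sketch, not part of PhpDig: check_robots is a hypothetical helper, and the list of directive names it accepts is an assumption chosen for illustration. It prints every line whose directive name is not on that list.

```shell
# List lines in a robots.txt file whose directive name is not recognized.
# The set of accepted names below is an assumption for this sketch.
check_robots() {
    awk -F: '
        /^[ \t]*(#|$)/ { next }                    # skip comments and blank lines
        {
            name = tolower($1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)      # trim surrounding whitespace
            if (name != "user-agent" && name != "disallow" &&
                name != "allow" && name != "sitemap" && name != "crawl-delay")
                printf "line %d: unknown directive \"%s\"\n", NR, name
        }
    ' "$1"
}

# Usage, with a local copy of the file:
#   check_robots robots.txt
```

A misspelled line such as "Disalow: /private/" would be reported with its line number, while correctly spelled lines produce no output.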

Enter PhpDig

PhpDig will index your site as frequently as you like via a cron job. Results are consistent and testable within minutes. PhpDig will crawl one or more Web sites, following links within each domain according to known rules, and store the results in a MySQL database.
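A scheduled re-crawl might look like the sketch below. Several things here are assumptions rather than facts from this article: the install path, the reindex function name, and the command-line form (running spider.php from the admin directory with the argument "all"), which is given from memory of the PhpDig documentation and should be checked against your copy.

```shell
# Hypothetical locations; adjust to the actual install.
PHP_BIN="${PHP_BIN:-php}"
PHPDIG_ADMIN_DIR="${PHPDIG_ADMIN_DIR:-/srv/www/htdocs/phpdig/admin}"

# Re-crawl from the command line. The "all" argument (update every known
# site) is an assumption; pass a URL instead to crawl a single site.
reindex() {
    ( cd "$PHPDIG_ADMIN_DIR" && "$PHP_BIN" -f spider.php "${1:-all}" )
}

# Example crontab entry for a script that ends by calling reindex,
# running daily at 03:15 (paths are hypothetical):
#   15 3 * * * /home/User/bin/reindex.sh >> /home/User/work/reindex.log 2>&1
```

Keeping the PHP binary and directory in variables makes it easy to point the same script at a different install.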

Users can then use a search form provided by PhpDig to enter criteria and see immediately which pages appear to be relevant. The results page carries no commercial advertising.

Two sites that use PhpDig are the New Mexico State Library site and the California State University at Dominguez Hills. Note that Dominguez Hills has boosted the number of results per page to 20 from the default of 10. This is a good example of the ability to control output depending on individual site requirements.

What follows is a beginner's attempt to get PhpDig working. My environment consists of a home workstation with a dialup connection to the Internet, and a hosted server. The home machine runs SUSE Enterprise Server 9.0 with Apache 2 and PHP5, and the remote server runs Apache 2 and PHP4. MySQL database space is limited to 10MB. I did not run into any surprises, such as PHP code working on one machine with PHP5 but not on the other with PHP4.

Clearly, this environment places some restrictions on what I can achieve. I don't want to run PhpDig directly on my remote server for a number of reasons. First, I want to see the results before publishing them. Next, I need to know how much space will be taken up by the results, since I have limited space on the remote server. Lastly, I don't want to run into any timeouts or difficulties on the remote server, since I don't have any control over how it runs.

PhpDig provides a number of ways of getting the search data onto the remote server, including facilities that work if you can connect directly to the MySQL database. My database is hidden behind a firewall, so I have to follow a more indirect path. My approach is to use my home server to do the actual crawling, test locally, judge the size of the resulting files and data tables, and then move the data to the remote server to allow users to perform searches.

The local install

PhpDig itself is a small download of about 280KB. My first mistake was to unzip the package on a Windows machine and transfer the files to the destination directory on the home server. This messed up the permissions of some of the files and directories, so I started again by moving the entire zip file to my local server and unzipping it there with the command:

unzip phpdig-1.8.8.zip -d /srv/www/htdocs/

Next you have to modify the config.php file. Change the provided name and password lines to ones that work on your system:


define('PHPDIG_ADM_USER','newUsername'); //Username
define('PHPDIG_ADM_PASS','newPassword'); //Password

Next, move on to the database configuration process. I can afford the luxury of a dedicated MySQL database for PhpDig on my home server, but not on my hosted server. Later I would need to move the tables to a shared database where they can be distinguished from other tables by name alone, so I also specified a prefix of pdig_ for the tables. In no time at all the tables were in place and I was ready to perform the first crawl of my site.

First spider

It took PhpDig a bit more than five minutes to crawl about 20 Web pages of various lengths and complexity over the dialup connection. Not all of this was due to the speed of the connection; the server is a bit slow and processing does take time.

The point was that it was done when it was convenient to me. I was able to watch the process as it was under way, and I noted a number of links that were not followed correctly. The fault was mine; the links were not expressed clearly enough for the spider to follow them. After some code corrections and parameter changes instructing PhpDig to crawl to the correct depth, I was able to check that certain common words would be correctly reported in search results. PhpDig ignores, and therefore excludes from the index, the words listed in a common_words list. You can remove a word from the list so that it is indexed even though it is common, which is useful for words with special meanings on your site, or add words that you want the indexer to skip.

So far so good. PhpDig had indexed the site, and I was able to check thoroughly that the results were useful.

Now to find out how much space this information was taking up on my home server. Examining the unzipped PhpDig directory showed that the PHP scripts and associated files needed about 800KB of space. The MySQL data in the tables took up about 500KB, roughly 5% of my 10MB allocation. This gave me some idea of how much space PhpDig would require on the remote server.
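The percentage is simple arithmetic, sketched here for reuse as the index grows. pct_of_quota is a hypothetical helper, and the sketch assumes 1MB = 1024KB, so the 10MB allocation is 10240KB.

```shell
# Print what percentage of a quota a given usage represents (both in KB).
pct_of_quota() {
    awk -v used="$1" -v quota="$2" 'BEGIN { printf "%.1f\n", used * 100 / quota }'
}

# About 500KB of table data against a 10MB (10240KB) allocation:
pct_of_quota 500 10240    # prints 4.9

# The inputs can be measured with standard tools, for example:
#   du -sk /path/to/phpdig                       # script files, in KB
#   mysql -u User -p phpdig -e 'SHOW TABLE STATUS'   # Data_length and Index_length, in bytes
```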

I now had a functioning search facility, but because it was on my home server I was the only one who could use it. I needed to be able to search from the remote server, so the PhpDig files needed to be uploaded and configured there too. I would not be spidering with this copy, just searching, but to be sure that all the files needed for searching were present I decided to upload everything.

Install to the remote server

Installing the software on the remote server was a repeat of the process to get the files on the home server, with some exceptions:

  • I used a text editor to modify config.php before transfer.
  • Transferring the files by FTP created the file set with the correct permissions.
  • In the database configuration screen I specified the name of my database.
  • I repeated the same table name prefix as for the tables on the home server.
  • I ignored the setup screen prompting for a Web site to crawl.

To get the data to the remote server I used the following command on the home server:

mysqldump --comments=0 --add-drop-table -u User -p phpdig > /home/User/work/phpdig.sql

This prompts for the password and then generates the output as a series of SQL statements in a large text file.

The --comments=0 option suppresses comment lines in the output. The --add-drop-table option adds a DROP TABLE statement before each CREATE TABLE statement, so each table is dropped and recreated when the file is loaded and ends up holding only the current data.

Updating the remote tables is now a matter of getting the dumped instructions to the server and then processing them. You can use FTP to move them to the secure side of the server firewall and then run a custom script to load the tables, or use phpMyAdmin. The phpMyAdmin utility can accept a gzipped or bzipped file, so while still on the home server I used gzip to compress the MySQL dump file. After compressing the file, I logged into phpMyAdmin and uploaded the gzipped file. After waiting patiently for my dialup connection to complete the upload, and for the server to process the files, a refresh of the table list in phpMyAdmin showed that the tables were indeed there, with the required data.
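The compression step can be sketched as follows. compress_dump is a hypothetical helper name; it uses gzip -c so that the uncompressed dump is kept alongside the compressed copy, whereas plain gzip would replace the original file.

```shell
# Compress a dump file for upload, keeping the uncompressed original.
compress_dump() {
    gzip -c "$1" > "$1.gz"
}

# Usage:
#   compress_dump /home/User/work/phpdig.sql   # writes phpdig.sql.gz next to it
```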

In my experience, using FTP and a custom loading script is much faster than the phpMyAdmin route. A custom script would open the dump file and, since it is just a list of SQL statements, read them one at a time and execute each until the end of the file is reached.
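The loop such a script performs can be sketched in shell. This is an illustration of the idea, not the script used for this article: each_statement and show_statement are hypothetical names, and the sketch assumes a statement ends wherever a line ends with a semicolon, which can split a statement early if quoted data happens to end a line that way.

```shell
# Hand each SQL statement in a dump file to a handler command.
# Simplifying assumption: a statement ends where a line ends with ";".
each_statement() {
    dump_file=$1
    shift
    stmt=""
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -z "$stmt" ]; then
            case $line in
                "" | --* | "#"*) continue ;;   # skip blank and comment lines between statements
            esac
        fi
        # Append the line, separated by a newline if the statement has begun.
        stmt="${stmt:+$stmt
}$line"
        case $line in
            *";")
                "$@" "$stmt" < /dev/null   # run the handler on the complete statement
                stmt=""
                ;;
        esac
    done < "$dump_file"
}

# Example handler that only prints the first line of each statement:
show_statement() {
    printf '%s\n' "$1" | head -n 1
}

# Usage:
#   each_statement /home/User/work/phpdig.sql show_statement
```

Where shell access to the database is available, the mysql client can read the whole file directly, as in mysql -u User -p phpdig < phpdig.sql, and no statement splitting is needed.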

Searching for refinement

The search page now worked just as if the site had been spidered from the remote server. The process of getting to this point is not as quick and efficient as it would be on a dedicated remote server, but the gains in the ability to see the spider working, check results quickly and easily, and refresh the data on my own schedule more than make up for the inconvenience.

PhpDig is quite versatile and can be modified and extended in a number of ways. As the documentation points out, the config.php file in the includes directory is the key to most things, including setting preferences for the style and format of the search and results pages. An option that may be valuable for improved security is the ability to keep working scripts in a directory outside the publicly accessible directories, by moving certain files and setting paths according to the instructions inside config.php.

You may also want to adjust the list of keywords that are deemed to be very common. The file to check is common_words.txt (along with the related files for other languages) in the /includes/ subdirectory.
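Before editing the list, it can help to check whether a given word is already on it. The sketch below assumes the file holds one word per line, which should be verified against the actual file; is_common_word is a hypothetical helper name.

```shell
# Succeed if the word is on the common-words list (and so skipped by the indexer).
# Assumes one word per line; the match is case-insensitive and whole-line.
is_common_word() {
    grep -qixF -- "$1" "$2"
}

# Usage (the path is relative to the PhpDig install directory):
#   is_common_word greenhouse includes/common_words.txt && echo "not indexed"
```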

PhpDig may be a small tool, but it provides small publishers with great control, testability, and convenience. Just don't expect it to spider the entire Internet for you.
