Searching database content with Sphinx


Author: Federico Kereki

If you use Google or any other search engine, you already are a user of full text searching: the capability to search for a word or group of words within many texts for the best matches for your query. Sphinx is a full text search engine for database content, which you can integrate with other applications. (You can test it or use it with a command-line tool, but Sphinx is most useful as part of a Web site, not as a standalone utility.)

Sphinx (short for “SQL Phrase Index”) can index large amounts of text — up to 2³² (a bit over 4 billion) database records, with several text fields each. It supports both simple and complex search expressions; for example, you can look for documents that include the word RELATIVITY and either SPECIAL or GENERAL, but not QUANTUM.

Andrew Aksyonoff began developing Sphinx in 2001. He needed a package with high search quality and speed, and moderate hardware requirements for indexing, but couldn’t find one off the shelf. Sphinx is at version 0.9.7 now, with 0.9.8 in development. The Sphinx Web site isn’t updated too frequently, but all updates have to do with important features or releases.

Sphinx works in a LAMP (Linux/Apache/MySQL/PHP) context. Many content management systems store their data in MySQL and are coded in PHP, which makes them a perfect fit for Sphinx. You can also run Sphinx on BSD (FreeBSD, NetBSD), Solaris, and even Windows. For coders, Sphinx provides Ruby, Perl, and Python libraries, but no Java.

The program is free software, and can be redistributed or modified under the terms of the GPLv2 or any later version.

Indexing

Sphinx can natively index not only MySQL but also PostgreSQL tables, and it can read XML text files as well, though XML input comes with several limitations and is not as flexible as using a database. The main idea is that you use Sphinx to index some table or tables from your database, and then let users search those tables with Sphinx working in the background.

You can index the text fields of any table you wish. All indexed tables must include an ID field, so Sphinx can retrieve the full record when needed. Sphinx can even mix records from different tables, with the only requirement that the ID fields should be unique over all tables. You can index as many as 32 text fields, and several “attributes” (either integer values or timestamps) that are not used for searching, but for further filtering; for example, you can make Sphinx limit its search to records only from a specified date onward.
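As a rough sketch, assuming a hypothetical forum database, the corresponding source definition in the sphinx.conf configuration file could look like this (the table, column, and credential names are made up, and the sql_attr_* directives follow the 0.9.8-style sample configuration, so check the sample file shipped with your version):

    source posts
    {
        type     = mysql
        sql_host = localhost
        sql_user = sphinxuser
        sql_pass = secret
        sql_db   = forum

        # the first column must be the unique document ID;
        # the remaining text columns become the searchable fields
        sql_query = SELECT id, title, body, author_id, date_added FROM posts

        # attributes: not full-text searched, but available for
        # filtering, sorting, and grouping
        sql_attr_uint      = author_id
        sql_attr_timestamp = date_added
    }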

The indexing procedure is specified by a configuration file (examples are included with the software) that defines which database to access, which tables to read, which fields to use, and so on; a configuration sketch follows the list below. Among other things, you can specify:

  • whether to use “stemming.” Stemming means that if you look for a word like “swim,” the search will also find derived words like “swimming” and “swimmer.” English and Russian stemmers are available (with French, Spanish, Portuguese, Italian, German, Dutch, Swedish, Norwegian, Danish, Finnish, Hungarian, and Turkish stemming on the way in 0.9.8, according to the Sphinx Web site), as is Soundex (an English-only approximate phonetic matching algorithm that tries to bring together words that sound somewhat alike).
  • whether to apply “stop words” — usually short words that are so common that you do not wish to include them in the search, because they would clutter up the index files. “the”, “a”, and “of” are likely examples.
  • the minimum word length: words shorter than this won’t be included in the index files. A typical value could be four.
  • the encoding used for the data. You can have Windows-like SBCS data (a single-byte character set, where every character must fit within one byte) or Linux/Unix-like UTF-8 files (where a character might be represented by more than a single byte).
  • character folding rules. Folding determines whether uppercase and lowercase letters are considered equivalent, and whether accented foreign characters such as “á” and “ü” are treated like their unaccented equivalents. German speakers should note that currently there is no way to convert “ß” into “ss” or “ö” into “oe”; fussy or classicist English users would require processing “æ” and “œ” as pairs of letters too.
  • word delimiters. All characters not specified in the character set above are considered to be word delimiters, and not included in the indexing procedure.
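Putting several of these options together, an index definition for the source sketched above might read as follows (the paths and the stopwords file are placeholders):

    index posts
    {
        source       = posts
        path         = /var/data/sphinx/posts

        morphology   = stem_en      # English stemming: "swimmer" matches "swim"
        stopwords    = /var/data/sphinx/stopwords.txt
        min_word_len = 4            # ignore words shorter than four characters
        charset_type = utf-8        # or sbcs for single-byte data

        # folding: digits, letters, and underscore are word characters,
        # and uppercase letters are mapped to lowercase; everything else
        # acts as a word delimiter
        charset_table = 0..9, A..Z->a..z, _, a..z
    }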

For datasets over 100GB in size, you can have distributed indices and searching. That is, you can split the data into chunks on different machines and search them in parallel in order to reduce response times and server load. Searches are distributed across the machines, but the results are grouped together before they are reported to the user. This approach is also suggested for high-availability setups and for multiprocessor machines, though details are somewhat lacking in the documentation.
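A distributed setup is, in essence, one more index definition. The following sketch (the host names and the 0.9.x default searchd port of 3312 are assumptions to adapt) searches one local chunk and two remote chunks in parallel:

    index dist_posts
    {
        type  = distributed

        # one chunk served by this machine...
        local = posts_chunk1

        # ...and two chunks served by searchd instances on other boxes
        agent = box2.example.com:3312:posts_chunk2
        agent = box3.example.com:3312:posts_chunk3
    }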

If you need real-time updating, you can use a workaround with databases that only grow without modifying old records. It’s based on the concept of a “delta update” (“delta” meaning just the differences between the old table and the new one). The idea is to have two sources and two indices: one for the main bulk (which supposedly gets reindexed infrequently) and one for the daily updates (smaller, so it can be reindexed quickly), as the sketch below shows. Once in a while (whenever the delta part grows enough) you reindex all the data and start again with an empty delta. This is not a definitive solution; the author is working on quick real-time updates for version 1.0.
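A sketch of the scheme, following the live-update recipe in the Sphinx documentation: sph_counter is a small helper table you create yourself to remember where the main index ends, and the table and column names are again made up:

    source main
    {
        # ...connection settings as in the earlier sketch...

        # remember the highest document ID covered by the main index
        sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM posts
        sql_query     = SELECT id, title, body FROM posts \
                        WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
    }

    source delta : main
    {
        # the delta picks up only the records added since the last main reindex
        sql_query_pre =
        sql_query     = SELECT id, title, body FROM posts \
                        WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
    }

    # reindex the small delta frequently (e.g. from cron), and rebuild
    # everything once in a while:
    #     indexer --rotate delta
    #     indexer --rotate main delta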

If you have to deal with records that can be updated (i.e., not just added), then you have to rebuild the index from scratch. On the other hand, since indexing is fast, running a full index creation might not be such a problem.

Searching

You can search the indexed texts in two ways: by using a (provided) standalone program, or by coding your own search by using the standard provided APIs, which in turn use the searchd daemon that must be running in the background. You can even set up MySQL to use a pluggable storage engine (SphinxSE) which would let you search from within SQL, without any drivers or APIs. However, this storage engine requires compiling MySQL itself from source, and might not be what you need for other processing, so this option is probably most useful for developers to study.
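As a minimal sketch of what such code looks like, here is a query through the bundled Python API (sphinxapi.py, which in this era targets Python 2); the index name and the 0.9.x default searchd port are assumptions to adapt:

    import sphinxapi

    client = sphinxapi.SphinxClient()
    client.SetServer('localhost', 3312)    # host and port where searchd listens

    result = client.Query('relativity', 'posts')   # query string, index name
    if result is None:
        print client.GetLastError()
    else:
        print 'total matches:', result['total_found']
        for match in result['matches']:
            print 'document id:', match['id'], 'weight:', match['weight']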

In every case, if you are used to the search operators of common Web sites such as Google, you will find yourself at home with the available operators (a code sketch follows the list):

  • AND queries: ELLINGTON & ARMSTRONG searches for documents including both ELLINGTON and ARMSTRONG. By the way, you need not include the ampersand: ELLINGTON ARMSTRONG produces the same result.
  • OR queries: ELLINGTON | ARMSTRONG produces documents with ELLINGTON or ARMSTRONG, or both.
  • NOT queries: ELLINGTON -ARMSTRONG produces results with ELLINGTON and without ARMSTRONG. However, you cannot do a search like -ARMSTRONG alone; that would implicitly involve nearly all documents.
  • Field search: @title SOLITUDE @author ELLINGTON would look for SOLITUDE in the “title” field, and for ELLINGTON in the “author” field; users obviously must know what fields are available.
  • Phrase search: if you look for "DUKE ELLINGTON" the search will return only those documents with the words DUKE and ELLINGTON next to each other.
  • Proximity search: a variant of the above is "DUKE ELLINGTON"~10, which specifies that both words should appear less than 10 words away from each other.
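With the Python API, these operators become available once you switch the client to extended matching; a brief sketch (again with made-up index and field names):

    import sphinxapi

    client = sphinxapi.SphinxClient()
    client.SetServer('localhost', 3312)

    # extended mode enables the field, phrase, and proximity operators
    client.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED)
    result = client.Query('@title SOLITUDE @author ELLINGTON', 'posts')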

A few more possibilities, illustrated in a sketch after this list:

  • The search API allows you to assign weights to fields, so a match in a certain field will be considered more “significant” or important than a match in other fields.
  • The search API also lets you sort search results by relevance, timestamp, attributes, or by a SQL-like expression.
  • If you need just statistics (for example, you might be interested in how many posts there are by month, or how many were written by a certain person) you can group the results instead of getting every single document, and each group will have one best match (that is, just one record — the closest one — out of all the records in the group) and the count you want.
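A sketch of these calls in the Python API, using the attributes from the earlier configuration example (the weights and the one-week cutoff are arbitrary):

    import time
    import sphinxapi

    client = sphinxapi.SphinxClient()
    client.SetServer('localhost', 3312)

    # a title match counts far more than a body match; the list follows
    # the index's field order (0.9.8 adds SetFieldWeights() to name fields)
    client.SetWeights([100, 1])

    # only documents from the last week, newest first
    now = int(time.time())
    client.SetFilterRange('date_added', now - 7 * 24 * 3600, now)
    client.SetSortMode(sphinxapi.SPH_SORT_ATTR_DESC, 'date_added')

    # one best match per author, with a per-author count in '@count'
    client.SetGroupBy('author_id', sphinxapi.SPH_GROUPBY_ATTR, '@count desc')

    result = client.Query('ellington', 'posts')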

You cannot do “pattern searching” as in “UNIT* ST?T?S”. This is a common restriction among full text engines, because such queries are hard to do in an efficient way no matter what indices you might have.

In conclusion

The top two points of this product are its speed and its reported capability to deal with really large data sets. I didn’t run my own benchmarks, but the site reports an indexing speed of up to 10MB per second on modern CPUs, and an average query time of less than a tenth of a second for 2-4GB text sources. As for large data sets, the numbers are even more impressive: the site cites many Sphinx-powered sites (including one from MySQL AB itself), with one installation indexing up to 1.5TB worth of text coming from more than 1 billion online forum posts.

On the negative side, Sphinx’s documentation isn’t up to the same high level as its performance. For example, in order to get the PHP interface to work, I had to study the sample code and deduce which calls to use and which parameters to pass. The same goes for the APIs for other languages, and I couldn’t find the binary protocol specification. The more complex setups (distributed searching and indexing) could do with fuller examples and some diagrams. Examples are oriented to MySQL; you are on your own if you use PostgreSQL. The XML processing capabilities are limited and too strict; being able to deal with more generic XML schemas would come in very handy. Finally, if you need real-time updates right now and cannot work with the “delta” solution, you will have to wait for version 1.0.

While the developers should fix some points, the good features of Sphinx outweigh the difficulties. Even though it’s not yet at a 1.0 release, Sphinx seems a stable and powerful search tool, and I’m planning to use it more extensively.
