If you use Google or any other search engine, you already are a user of full text searching: the capability to search for a word or group of words within many texts for the best matches for your query. Sphinx is a full text search engine for database content, which you can integrate with other applications. (You can test it or use it with a command-line tool, but Sphinx is most useful as part of a Web site, not as a standalone utility.)
Sphinx (short for "SQL Phrase Index") can index large amounts of text -- up to 232 (a bit over 4 billion) database records, with several text fields each. It supports both simple and complicated search expressions, so for example you can look for documents that include the words RELATIVITY and either SPECIAL or GENERAL, but not QUANTUM.
Andrew Aksyonoff began developing Sphinx in 2001. He needed a package with high search quality and speed, and moderate hardware requirements for indexing, but couldn't find one off the shelf. Sphinx is at version 0.9.7 now, with 0.9.8 in development. The Sphinx Web site isn't updated too frequently, but all updates have to do with important features or releases.
Sphinx works in a LAMP (Linux/Apache/MySQL/PHP) context. Many content management system use MySQL for data and are coded in PHP, which make them a perfect fit for Sphinx. You can also run Sphinx on BSD (FreeBSD, NetBSD), Solaris, and even Windows. For coders, Sphinx provides Ruby, Perl, and Python libraries, but no Java.
The program is free software, and can be redistributed or modified under the terms of the GPLv2 or any later version.
Sphinx can natively process not only MySQL but also PostgreSQL tables, and also read XML text files, but there are several limitations with using XML, which is not as flexible as using a database. The main idea is that you use Sphinx to index some table or tables from your database, and then you will allow a user to search those tables by using Sphinx in the background.
You can index the text fields of any table you wish. All indexed tables must include an ID field, so Sphinx can retrieve the full record when needed. Sphinx can even mix records from different tables, with the only requirement that the ID fields should be unique over all tables. You can index as many as 32 text fields, and several "attributes" (either integer values or timestamps) that are not used for searching, but for further filtering; for example, you can make Sphinx limit its search to records only from a specified date onward.
The indexing procedure is specified by a configuration file (examples are included with the software) that defines what database to access, what tables to read, what fields to use, and so on. Among other things, you can specify:
For large datasets over 100GB in size, you can have distributed indices and searching. That is, you can split the data in chunks on different machines, searching in parallel in order to reduce response times and server load. Searches will be distributed across the machines, but the results will be grouped together before they are reported to the user. This approach is also suggested for high availability setups and for multiprocessor machines, though details are somewhat lacking in the documentation.
If you need real-time updating, you can use a workaround with databases that just grow without modifying old records. It's based on the concept of a "delta update" ("delta" implies just the differences between the old table and the new one). The idea is having two sources and two indices: one for the main bulk (which supposedly will get reindexed not too frequently) and one for the daily updates (smaller, so it can be reindexed quickly). Once in a while (whenever the delta part grows enough) you could reindex all the data, and start again with an empty delta. This is not a definitive solution; the author is working on quick real-time updates for version 1.0.
If you have to deal with records that can be updated (i.e., not just added), then you should opt to rebuild the index from scratch. On the other hand, since indexing is fast, having to run a full index creation might not be such a problem.
You can search the indexed texts in two ways: by using a (provided) standalone program, or by coding your own search by using the standard provided APIs, which in turn use the searchd daemon that must be running in the background. You can even set up MySQL to use a pluggable storage engine (SphinxSE) which would let you search from within SQL, without any drivers or APIs. However, this storage engine requires compiling MySQL itself from source, and might not be what you need for other processing, so this option is probably most useful for developers to study.
In every case, if you are used to the search operators in common Web sites (such as Google, for example) you will find yourself at home with the available operators:
ELLINGTON & ARMSTRONG searches for documents including both ELLINGTON and ARMSTRONG. By the way, you need not include the ampersand: ELLINGTON ARMSTRONG produces the same result.ELLINGTON | ARMSTRONG produces documents with ELLINGTON or ARMSTRONG, or both.ELLINGTON -ARMSTRONG produces results with ELLINGTON and without ARMSTRONG. However, you cannot do a search like -ARMSTRONG alone; that would implicitly involve nearly all documents.@title SOLITUDE @author ELLINGTON would look for SOLITUDE in the "title" field, and for ELLINGTON in the "author" field; users obviously must know what fields are available."DUKE ELLINGTON" the search will return only those documents with the words DUKE and ELLINGTON next to each other."DUKE ELLINGTON"~10, which specifies that both words should appear less than 10 words away from each other.A few more possibilities:
You cannot do "pattern searching" as in "UNIT* ST?T?S". This is a common restriction among full text engines, because such queries are hard to do in an efficient way no matter what indices you might have.
The two top points with this product should be its speed, and the reported capability to deal with really large data sets. I didn't try to run my own benchmarks, but the site reports an indexing speed up to 10MB per second with modern CPUs, and an average query time less than 1/10th second for 2-4GB text sources. For the latter point, the numbers are even more impressive, and the site cites many Sphinx-powered sites (including one from MySQL AB itself), with one installation indexing up to 1.5TB worth of text coming from more than 1 billion online forum posts.
On the negative side, Sphinx's documentation isn't up to the same high level as its performance. For example, in order to get the PHP interface to work, I had to study the sample code and deduce what calls to use and which parameters to pass. This also happens with the APIs for other languages, and I couldn't find the binary protocol specification. The more complex setups (distributed searching and indexing) could do with fuller examples and some diagrams. Examples are oriented to MySQL; you are on your own if you use PostgreSQL. The XML processing capabilities are low and too strict; being able to deal with more generic XML schemes would come in very handy. Finally, if you need real time updates right now, unless you can work with the "delta" solution, wait until version 1.0.
While the developers should fix some points, the good features of Sphinx outweigh the difficulties. Even though it's not yet at a 1.0 release, Sphinx seems a stable and powerful search tool, and I'm planning to use it more extensively.
Note: Comments are owned by the poster. We are not responsible for their content.
Searching database content with Sphinx
Posted by: Anonymous [ip: 85.232.103.165] on August 25, 2007 01:52 PM1) 0.9.8 dev version introduces 64bit ids and thus allows to search more than 4 billion documents (anyone out there really hitting this limit already?);
2) API calls are not covered in the doc indeed; but each call is documented in sphinxapi.php source.
#