Open source search technology goes beyond keywords


Author: Michael Stutz

For several years a group of academic researchers has been quietly working on a new kind of search engine, one that recognizes the semantic meaning of a query rather than treating it only as keywords to be matched literally. The technology is licensed under the GPL, and a desktop version is imminent.

In its simplest form, semantic indexing can recognize synonyms: a search in an inventory database for “fruit,” for example, could turn up documents listing “apples” and “oranges.”
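As a rough sketch of how that can work (a toy Python example, not the project's own code), a query can be expanded with terms that co-occur with it in the same documents, so “fruit” also pulls in documents that only mention “apples” or “oranges.” The tiny corpus and the co-occurrence threshold are assumptions made for the example.

```python
# Minimal sketch of co-occurrence-based query expansion (illustrative only).
from collections import defaultdict

docs = {
    1: "fresh fruit shipment apples oranges",
    2: "apples in cold storage",
    3: "oranges and citrus crates",
    4: "office furniture inventory",
}

# Count how often each pair of terms appears in the same document.
cooc = defaultdict(int)
for text in docs.values():
    terms = set(text.split())
    for a in terms:
        for b in terms:
            if a != b:
                cooc[(a, b)] += 1

def expand(query, min_cooc=1):
    """Return the query term plus terms that co-occur with it often enough."""
    related = {b for (a, b), n in cooc.items() if a == query and n >= min_cooc}
    return {query} | related

def search(query):
    terms = expand(query)
    return [doc_id for doc_id, text in docs.items()
            if terms & set(text.split())]

print(search("fruit"))   # matches docs 1, 2, and 3, not just the literal hit
```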

Aaron Coburn, lead developer of the Semantic Indexing Project at Middlebury College, says that his team is currently documenting its open source search toolkit and finishing up a new desktop search application that should be released later this month.

All of the source code is available for download, published under the terms of the GNU General Public License. The project’s core technology is the Semantic Engine, which is distributed with its C++ code, Perl bindings, and all the necessary code for building the GUI. There’s also a Subversion repository for development versions. The new desktop application, called the Standalone Engine, will be available later this month.

Already, Coburn and his team have demonstrated their work with a number of disparate search projects — from a database of research notes by author Stephen Johnson to the descriptions of artwork held in the British Museum in London.

Most impressive of all has been the graphic visualization of novels. Coburn says this particular demonstration began as a close collaboration with a Spanish professor who wanted to make a searchable ebook reader for Don Quixote.

“Later,” Coburn says, “we started adding as many Project Gutenberg texts as possible, in whatever languages we happened to know — English, French, German, Polish, Russian.”

To this, Coburn added some software to visualize the semantic data in the database, and the search software became a powerful tool for plot visualization. He began using it to make visualizations of characters in Jane Austen novels, charting their various interactions through the course of the narrative. “And the algorithms seemed to do a really good job of detecting how the characters interacted!”

He’s since applied this visualization tool to other novels, including Samuel Richardson’s Clarissa — one of the largest novels in the English language — and the classic Chinese novel Dream of the Red Chamber.

Another project that demonstrates the search technology is Blog Census, a Web crawler that can identify which sites are weblogs. It was followed by the Discourse Analysis project, in which Coburn indexed the writing of thousands of political columnists and bloggers and then applied keyword visualization and analysis to the indexed text.

Project history

The Semantic Engine has its origins, Coburn says, in a summit on the future of information technology held in 2001 by the National Institute for Technology and Liberal Education (NITLE).

Coburn says that NITLE had “invited a small group of experts to speak about new technologies that would significantly affect the Liberal Arts in the next five years.” There were presentations on varied academic hot topics such as XML, process-driven learning tools, and Latent Semantic Analysis (LSA). It was the last that seemed most interesting to Coburn’s team.

After the summit, Coburn says, NITLE conducted a study where a college instructor had to create a course syllabus, then apply the appropriate Learning Object Metadata to it. It took the instructor about 45 minutes to create the syllabus — and four hours to apply the metadata. “From this,” Coburn says, “we thought it would be extremely useful to have a tool that could either automatically generate metadata, or find information in large collections that lacked the metadata commonly found in library catalogs.”

NITLE’s Maciej Ceglowski and the organization’s principal consulting scientist, John Cuadrado, began work on a project to build an open source search engine based on latent semantic indexing (LSI) technology, where a search for documents containing term X will also match documents containing term Y, if a significant number of documents contain both X and Y together.
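In rough terms, LSI works by factoring a term-document matrix so that terms which tend to appear together collapse into shared “concepts.” The toy Python sketch below illustrates the general technique with a truncated SVD; it is not NITLE's code, and the vocabulary, matrix, and choice of two concepts are assumptions for the example.

```python
# Illustrative sketch of latent semantic indexing with a truncated SVD.
import numpy as np

vocab = ["fruit", "apples", "oranges", "furniture"]
# Term-document matrix: rows are terms, columns are documents.
A = np.array([
    [2, 1, 0, 0],   # fruit
    [1, 2, 0, 0],   # apples
    [1, 0, 2, 0],   # oranges
    [0, 0, 0, 3],   # furniture
], dtype=float)

# Keep the k strongest latent "concepts".
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T          # documents in concept space

def query_vec(terms):
    q = np.array([1.0 if t in terms else 0.0 for t in vocab])
    return q @ U[:, :k]                          # fold the query into concept space

def rank(terms):
    q = query_vec(terms)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)                     # best-matching documents first

print(rank(["fruit"]))   # the furniture-only document should rank last
```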

They crafted various components in Perl and C++ — a part-of-speech “tagger” pulls out terms from documents, while another tool contains the search algorithm — and then they constructed what Ceglowski calls an “elaborate demo” of the software.

“The technology inside was real, and we got a lot built, but as a project it was always too understaffed to turn into actual software,” says Ceglowski, who is no longer employed by NITLE.

But around 2003, NITLE received a grant from the Andrew W. Mellon Foundation to pursue the work in earnest. “Now there was a budget and the possibility to dedicate staff to the project,” says Coburn. “I began working with NITLE in 2001, building Web applications, but it wasn’t until then that I began working on the [semantic indexing] project.”

At that point, Coburn says, the team ran into problems with the scalability of the LSI algorithm. They replaced it with a new context-graph-based algorithm that produced the same kind of results but scaled far better. This, he says, was all implemented in Perl: “and the results were good!”
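Coburn doesn't spell out the algorithm's internals, but the general idea of a context graph can be sketched like this: terms and documents form a graph, and a query pushes activation out from its terms toward nearby documents. Everything in the toy sketch below, the corpus, the number of steps, and the decay factor, is assumed for illustration and is not the project's actual algorithm.

```python
# Plausible sketch of a context-graph search: spreading activation over a
# bipartite term-document graph (illustrative assumptions throughout).
from collections import defaultdict

docs = {
    "d1": ["fruit", "apples", "oranges"],
    "d2": ["apples", "storage"],
    "d3": ["oranges", "citrus"],
    "d4": ["furniture", "storage"],
}

# Each term node links to the documents that contain it, and vice versa.
edges = defaultdict(set)
for doc, terms in docs.items():
    for t in terms:
        edges[t].add(doc)
        edges[doc].add(t)

def spread(query_term, steps=3, decay=0.5):
    """Push activation out from the query term for a few steps."""
    activation = {query_term: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, weight in activation.items():
            for neighbor in edges[node]:
                nxt[neighbor] += decay * weight / len(edges[node])
        for node, weight in nxt.items():
            activation[node] = activation.get(node, 0.0) + weight
    # Return only document nodes, most activated first.
    return sorted(((n, w) for n, w in activation.items() if n in docs),
                  key=lambda item: -item[1])

print(spread("fruit"))   # d1 first, then d2/d3 via shared terms; d4 stays unreached
```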

They began to create applications to showcase the tools, refining the code all the while. “The code we used was pretty fast — but it was, after all, based on using the Perl interpreter, loading the entire graph into memory and having it persist there. It was, well, a bit unstable,” Coburn says. “So we changed things around to store the graph in a MySQL database. And once all of the prototyping and experimentation slowed down, we rewrote the entire thing in C++ with an accompanying GUI using the Qt framework.”
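The design change he describes, persisting the graph in a database rather than in interpreter memory, can be sketched as follows. The project used MySQL (and later C++ with Qt); the snippet uses Python's built-in sqlite3 module only so it runs self-contained, and the table layout is an assumption.

```python
# Sketch of storing graph edges in a database instead of in memory.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edges (
        term   TEXT NOT NULL,
        doc_id TEXT NOT NULL,
        weight REAL NOT NULL DEFAULT 1.0,
        PRIMARY KEY (term, doc_id)
    )
""")
conn.executemany(
    "INSERT INTO edges (term, doc_id, weight) VALUES (?, ?, ?)",
    [("fruit", "d1", 1.0), ("apples", "d1", 1.0), ("apples", "d2", 1.0)],
)

# A term's neighbors are now a query, not an in-memory structure.
rows = conn.execute(
    "SELECT doc_id, weight FROM edges WHERE term = ?", ("fruit",)
).fetchall()
print(rows)   # [('d1', 1.0)]
```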

Coburn says that the project currently has funding into 2007, but the team is looking for both a host and funding to continue the work beyond that time. Meanwhile, the team continues to research and implement real-world applications of its search tools.

“I hope that the tools will help us start to think about data differently and to find patterns in text that we may not have noticed using keyword searches,” he says.