The state of distributed search

Author: Tom Walker

Even as commercial search engines like Google, Yahoo, and MSN Search grow more dominant, a new distributed search engine with the unglamorous name Grub is taking a different tack. Grub aims to harness participants’ unused bandwidth and CPU cycles to crawl more than 10 billion Web documents and maintain an up-to-date index of the Web.

Distributed search, which is still very much an emerging field, works in a way that will be familiar to people who donate processor time to projects like distributed.net and SETI@Home. By downloading and running a screensaver, you contribute some of your bandwidth and CPU cycles to spider and analyze Web pages. When you’re not using your computer, the Grub client is activated. Your computer crawls the Web, following hyperlinks from document to document. The more people who do this, the bigger the index gets.
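
To make that concrete, here is a minimal sketch, in Python, of the fetch-and-extract step such a client performs on each page. It illustrates the technique only; it is not Grub’s actual implementation, and the coordination protocol that hands out URLs and collects results is reduced to a comment.

    # A simplified model of a distributed crawler's per-page work:
    # fetch a document, then extract the hyperlinks to follow next.
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # Resolve relative links against the page's own URL.
                        self.links.append(urljoin(self.base_url, value))

    def crawl_one(url):
        """Fetch one page and return the hyperlinks found on it."""
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
        parser = LinkExtractor(url)
        parser.feed(html)
        return parser.links

    # A real client would run only while the machine is idle, asking a
    # coordination server for URLs to crawl and reporting results back.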

Incorporating peer-to-peer technology has been a goal of search developers since the days of Napster, but until now it has remained largely the domain of academic research papers, with many dismissing it as an unachievable pipe dream. If you’re Google, though, with a multi-million-dollar cluster of Linux servers doing your spidering, it’s more like a nightmare, and the ever-decreasing cost of domestic bandwidth means that, dream or nightmare, it’s starting to look more and more possible. We asked Google for a reaction to Grub and P2P search, but the company is currently subject to the SEC “quiet period” before its IPO.

Andre Stechert, Grub’s director of technology, says, “distributed search makes it possible to deliver results that are fresher, more comprehensive, and more relevant. It achieves this by making it feasible to maintain up-to-date crawl data and metadata in the 10+-billion document range and to use heavyweight page analysis algorithms that look deeper than keywords.” The Grub project is part of LookSmart (a name you might remember from the “old days” of search), which intends to use results from Grub and its community directory project, Zeal, to improve its existing search engines.

Analysts think LookSmart may be on to something. Susan Feldman, research vice president of content technologies at IDC, says, “a volunteer spidering effort supporting new search engines would help small, niche, or topical search engines to develop.”

In open source software, many eyes make all bugs shallow; similarly, proponents hope, distributed crawling will produce better results than any one company’s proprietary server farm. Distributed crawling puts the community in charge of the results. It moves Web crawling from the cathedral to the bazaar. People all over the world can collaborate to build a better search index, and they can do it using bandwidth and hardware that they’ve already paid for but aren’t using. Grub doesn’t even require any time or effort from users beyond downloading and installing the client, which generally takes less than a minute. Grub’s crawler is available for both Windows and Linux.

Grub doesn’t directly provide a search engine; rather, Grub clients crawl the Web to build an index that can be used either to improve an existing search engine or to create an entirely new one. Today only WiseNut, which is also part of corporate parent LookSmart, is using crawl results from Grub to improve its index, but Grub makes its results available to anyone who wants to use them through an XML-based query API.
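
Consuming such an API is easy to sketch. The short Python example below sends a query and reads back an XML response; the endpoint, query parameter, and element names are invented for illustration and are not Grub’s actual interface.

    # A hedged sketch of querying an XML-based crawl index.
    # The endpoint, parameter, and element names are hypothetical.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    def query_crawl_index(term):
        params = urllib.parse.urlencode({"q": term})       # hypothetical parameter
        url = "http://api.example.org/query?" + params     # hypothetical endpoint
        with urllib.request.urlopen(url, timeout=10) as response:
            tree = ET.parse(response)
        # Assume each <result> element carries a URL and a last-crawled date.
        return [(r.findtext("url"), r.findtext("crawled"))
                for r in tree.iter("result")]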

A few test searches using WiseNut show that while the concept may be good, the execution is incomplete. For any search term you care to try, WiseNut returns far fewer results than the big search engines, in part because the Grub index is still small compared to most search engines’ indexes.

Promising prospects

Distributed crawling could give open source search engines the leg up they need. Until now, open source search projects such as Nutch have had to rely on donations to afford the hardware and bandwidth needed to search a significant portion of the Web. On its Web site, Nutch says, “We estimate that a two-hundred-million page demo system will require from $10,000 to $200,000 in hardware, depending on how much query traffic we wish to handle.” Unless you’re Wikipedia, you just don’t have people willing to donate that kind of money.

Andre Stechert believes that community Web crawling projects could help. Distributed crawl indexes, he says, “are today where [Internet] directories were before the advent of open directories like dmoz. They are proprietary and, for the most part, redundant. Creating and maintaining them incurs a large cost for the companies owning these infrastructures and for ISPs and Webmasters who are hosting the sites being crawled…. We think it’s important to provide a community-based crawl database because [building one independently] is a distraction from the main problem at hand: providing relevant, speedy answers to user queries. The challenge of bringing up a production-scale, production-quality crawling infrastructure is significant enough to chew up lots of time from any team. By lowering the barrier to entry for search engine research, we hope to focus more of the research effort where value is being created for the world at large.”

That’s what distributed search is all about: lowering barriers to entry. Just as open directories helped Google and others challenge Yahoo’s seemingly unassailable position (just look at the list of sites using data from DMOZ, which includes many well-known search engines), distributed crawling could do the same for Google’s competitors. Imagine a world where any new player in the search market can focus solely on providing an excellent user interface, without having to worry about the logistics of crawling the enormous World Wide Web. That is the future that projects like Grub promise. As Stechert says, “distributed computing is an enabling platform for the future of search.” Grub currently spiders about 100 million URLs per day.

IDC’s Susan Feldman envisions what allowing search engines to focus on interface instead of crawling might lead to. “As broadband connections to the Internet become more prevalent,” she says, “the next leap forward may well be a conversational and visual approach to searching that lets users interact with systems as if they were engaging in a natural dialogue of questions and answers. The visualization would enable them to understand the scope of a collection at a glance, and to then drill down into the topics that are most interesting to them.” The improvements could go beyond the interface: for example, someone could build a search engine on top of a distributed crawl index in which the community has even more influence over the results. Users could comment on results, or use web-of-trust algorithms to rearrange or vote on them.
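
As a toy illustration of that last idea, the sketch below re-ranks results by user votes weighted with a per-voter trust score, one very simple reading of a web of trust. The data model is invented for illustration; no such engine exists today.

    # A toy community re-ranking pass: each vote counts in proportion
    # to how much the community trusts the voter. Hypothetical data model.
    def rerank(results, votes, trust):
        """results: list of URLs in the engine's original order;
        votes: (voter, url, +1 or -1) tuples; trust: voter -> weight."""
        score = {url: 0.0 for url in results}
        for voter, url, direction in votes:
            if url in score:
                score[url] += direction * trust.get(voter, 0.0)
        # Highest community-weighted score first.
        return sorted(results, key=lambda url: score[url], reverse=True)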

This future of search is one that we can all get behind.