open source technology Nutch (www.nutch.org).
Clustering Engine is a system for clustering textual data. It automatically categorizes search results on the fly into hierarchical clusters.
Search results clustering attempts to overcome the problem of information overload. The user interfaces of most search engines are based on keyword queries and endless lists of matching documents. Unfortunately, even when exceptional ranking algorithms are used, relevance sorting inevitably equates quality with some notion of popularity of what can be found on the Web.
When an overview of a topic or an in-depth analysis of a certain subject is required, search engines usually fail to deliver it. It seems natural to expect that a search engine should return not only the most popular documents matching a query but also another, potentially more comprehensive, overview of the subject the query covers.
One approach is to automatically group search results into thematic categories, called clusters.
Assuming the cluster descriptions are informative about the documents they contain, the user spends much less time following irrelevant links.
The user also gets an excellent overview of the subjects present in the search results and thus knows when the query needs to be reformulated. Because clustering is performed dynamically for each query, the discovered groups reflect the actual structure of the results rather than some predefined categories.
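To make the idea of per-query grouping concrete, here is a minimal sketch in Python. It is not the algorithm used by Clustering Engine, which is not described here. It assumes a much simpler technique: each result snippet is assigned to the most widely shared term it contains, and that term serves as the cluster label. The function names, the stopword list, and the min_size parameter are hypothetical, and the output is flat rather than hierarchical.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; real systems use larger ones).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "is", "on", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cluster_results(query, snippets, min_size=2):
    """Group snippet indices under the most widely shared term they contain.

    Labels are derived from the snippets themselves for each query,
    not taken from a predefined category list.
    """
    query_terms = set(tokenize(query))
    # Query terms occur in (almost) every result, so they cannot tell topics apart.
    doc_terms = [set(tokenize(s)) - query_terms for s in snippets]
    # Document frequency: in how many snippets each term occurs.
    df = Counter(term for terms in doc_terms for term in terms)
    clusters = defaultdict(list)
    for index, terms in enumerate(doc_terms):
        candidates = [t for t in terms if df[t] >= min_size]
        if candidates:
            # Prefer the term shared by the most snippets; break ties alphabetically.
            label = min(candidates, key=lambda t: (-df[t], t))
        else:
            label = "other"
        clusters[label].append(index)
    return dict(clusters)


if __name__ == "__main__":
    results = [
        "Jaguar car dealer prices",
        "Jaguar cat habitat in the rainforest",
        "New Jaguar car review",
        "Jaguar big cat facts",
        "Jaguar Mac OS X release",
    ]
    print(cluster_results("jaguar", results))
    # prints {'car': [0, 2], 'cat': [1, 3], 'other': [4]}
```

For the ambiguous query "jaguar", the sketch separates results about cars from results about cats and leaves the unrelated result in a catch-all group, so a user could pick a topic before reading individual links.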
Clustering makes search results more "user friendly". Instead of scrolling through a page trying to find something, users can select the right topic first and see only the results for that topic.
With Clustering Engine, users can find relevant search results more easily."