Apache Lucene Helps Online European Library Open Its Virtual Doors

When a group of European museums set out to put images of all of their paintings, drawings, sculptures, photos and other artifacts into a centralized online collection for the world to view, they turned to open source enterprise search software to make it happen.

Today, that virtual library, called Europeana, is just beginning to bring together the culture, history and scientific resources of Europe in one place online for the use of students, researchers and residents around the world.

Sjoerd Siebinga, a senior developer for Europeana in The Hague, The Netherlands, said that using open source applications has made it easier for the many nations, museums and other institutions to share their content while dealing with unique technical difficulties, including multiple spoken languages and cultural differences.

The project is being spearheaded by the European Commission, which helps enact laws and run the operations for the 27 member nations of the European Union.

“Part of the reason to use open source was driven by the European Commission” directly, Siebinga said. “The data needs to be available on the local level, so it was mandated that what they build can be shared with even a small library in Bulgaria with only 100 records” to place in the collection. Open source makes it possible to use open standards that promote that kind of ease of construction, no matter how large or small the organization, he said. “We’re open-sourcing the complete code base so other libraries can contribute code and [applications].”

A two-year pilot project began in July 2007 after the idea for the virtual library was first proposed in 2005 by the leaders of seven of the Commission’s members. Today, the project is still in its development stages. It went live as a prototype last November on the Web and is expected to be officially launched in release form sometime next year, Siebinga said.

The backbone of the Europeana platform is the Apache Lucene open source search library, along with Apache Solr, an enterprise search server that exposes an API similar to those for Web services, according to the Solr community. Solr is based on Lucene.

Siebinga said he had worked with Solr previously, so he was happy that Lucene and Solr were chosen by Europeana’s leaders for the massive project.

“Its faceted search allows you to drill down deeply,” he said. “It gave us a tremendous amount of functionality straight out of the box. We did benchmarking and it was very fast. At one point in time, we needed to scale up and Solr gave us easy capabilities to do so.”
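
For readers unfamiliar with faceted search, the following is a minimal sketch of what a faceted Solr request can look like. It is not Europeana's code. The host, the core name `artifacts`, and the field names `country` and `type` are assumptions made up for illustration; only the request parameters (`q`, `fq`, `rows`, `wt`, `facet`, `facet.field`, `facet.mincount`) are standard Solr parameters. The function only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=None, rows=10):
    """Build a Solr select URL that asks for facet counts alongside results."""
    params = [
        ("q", text),              # the user's search terms
        ("rows", rows),           # number of documents to return
        ("wt", "json"),           # response format
        ("facet", "true"),        # turn faceting on
        ("facet.mincount", 1),    # hide facet values with zero matches
    ]
    # One facet.field parameter per field to count values for.
    params += [("facet.field", name) for name in facet_fields]
    # Each filter narrows the result set: this is the "drill down" step.
    params += [("fq", f'{field}:"{value}"')
               for field, value in (filters or {}).items()]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host, core and field names, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/artifacts",
    "rembrandt",
    facet_fields=["country", "type"],
    filters={"country": "Netherlands"},
)
print(url)
```

In a typical faceted interface, the response lists how many matching items fall under each value of the requested fields. Clicking a value adds another filter, which is how a user narrows millions of records to a handful.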

They looked at other open source products, including relational databases like PostgreSQL, but none of them would have easily done what they needed to accomplish, he said. “We would have had to do a lot of other development” with the alternatives.

By using Lucene and Solr, “we are essentially building this huge library even when smaller entities that are involved don’t have the high technology to do it for themselves.”

Matters were complicated because each of the museums and organizations had its own way of storing its data. “They presented their data to us in their own formats, then we provided them with the tools to transform the data into Europeana’s internal data schemes.”
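
The article does not describe Europeana's tools or its internal schema, so the following is only a rough sketch of the general idea: each contributor supplies a mapping from its own field names to a shared set of names. Every field name and sample record here is hypothetical.

```python
def to_common_schema(record, field_map):
    """Rename a provider's fields to shared names, dropping unmapped fields."""
    return {
        target: record[source]
        for source, target in field_map.items()
        if source in record
    }


# Hypothetical mapping for one contributor; not Europeana's real schema.
museum_field_map = {"titel": "title", "maker": "creator", "jaar": "year"}

# Hypothetical source record in the contributor's own format.
source_record = {
    "titel": "Harbor view",
    "maker": "Unknown",
    "jaar": 1652,
    "inventory_no": "A-17",
}

common = to_common_schema(source_record, museum_field_map)
print(common)
```

Keeping the mapping as plain data rather than code means a small institution would only need to describe its own fields in order to contribute records.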

The 27 languages of the 27 member nations of the European Union are each supported in the prototype version.

“Giving all this information to the users is almost like drinking from a fire hose,” Siebinga said. “The more difficult part is how to find stuff,” so one feature expected in the release version is a timeline function that can allow users to find information based on specific years, decades or centuries.
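
How the planned timeline will be built is not stated. As a simple illustration of browsing by decade or century, the sketch below groups items by the starting year of the period they fall in. The sample items and function names are invented for the example.

```python
from collections import defaultdict


def period_start(year, span):
    """Return the first year of the span-year period containing `year`."""
    return (year // span) * span


def group_by_period(items, span):
    """Group (title, year) pairs by period: span=10 for decades, 100 for centuries."""
    groups = defaultdict(list)
    for title, year in items:
        groups[period_start(year, span)].append(title)
    return dict(groups)


# Made-up sample items, for illustration only.
items = [("City map", 1652), ("Harbor sketch", 1658), ("Newsreel", 1931)]

by_decade = group_by_period(items, 10)
by_century = group_by_period(items, 100)
print(by_decade)
print(by_century)
```

Here a key of 1650 stands for the years 1650 through 1659, and a key of 1600 stands for 1600 through 1699.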

While Europeana will be free for students and most individual users, some kind of fee structure could be created for commercial users to help make its use sustainable for the organizations that include their artifacts, he said. Money for the project is so far being contributed by government cultural ministries in the member nations.

So far, some five million artifacts are featured in Europeana, including maps, paintings, drawings, pottery, books, newspapers, letters, diaries and archival papers. Multimedia files are also included, from music and spoken word pieces stored on cylinders, tapes, discs and radio broadcasts, to video in the form of old films, newsreels and TV broadcasts.

Anil Uberoi, marketing director for Lucid Imagination, a San Mateo, CA-based company that offers paid enterprise-level support and services for Lucene and Solr users, said the two projects are helping Europeana because they allow unstructured data to be catalogued and searched, for far less money than similar products from proprietary vendors.

Some 4,000 to 5,000 user organizations are working with Lucene around the world, he said, and many of those are now hiring Lucid to provide enterprise-level support. Lucid opened its doors this past January.

“The opportunity was there,” Uberoi said. “We have the guy who wrote the original code. He works here. We are very close to the project.”