Ranking the Web With Radical Transparency


Ranking every URL on the web in a transparent and reproducible way is a core concept of the Common Search project, says Sylvain Zimmer, who will be speaking at the upcoming Apache: Big Data Europe conference in Seville, Spain.

The web has become a critical resource for humanity, and search engines are its arbiters, Zimmer says. However, the only search engines currently available are for-profit entities, so the Common Search project is creating a nonprofit engine that is open, transparent, and independent.

We spoke with Zimmer, who founded Jamendo, dotConferences, and Common Search, to learn more about why nonprofit search engines are important, why Apache Spark is such a great match for the job, and some of the challenges the project faces.

Sylvain Zimmer, Founder of Common Search

Linux.com: Could you provide some background on the Common Search project? Why is a nonprofit search engine needed?

Sylvain Zimmer: Search engines are the arbiters of the web: they decide which websites and what information we get when we search for something online. As many studies have shown, it is easy to misinform or manipulate an audience with tailored search results.

We think it is critical for the Internet and ultimately for our society to have a healthy diversity in its sources of information. That means having both commercial and non-commercial search engines available, so that we can compare their results and watch out for biases.

Linux.com: The website mentions “radical transparency” as a core value of the project. Can you explain what that means and why it’s important?

Sylvain: Indeed, the cornerstone of Common Search is the transparency and reproducibility of our results.

Being transparent means that you can actually understand why our top search result came first, and why the second had a lower ranking. This is why people will be able to trust us and be sure we aren’t manipulating results. However, for this to work, it needs to apply not only to the results themselves but to the whole organization. This is what we mean by “radical transparency.” Being a nonprofit doesn’t automatically clear us of any ulterior motives; we need to go much further.

As a community, we will be able to work on the ranking algorithm collaboratively and in the open, because the code is open source and the data is publicly available. We think that this means the trust in the fairness of the results will actually grow with the size of the community.

As Eric S. Raymond said, “given enough eyeballs, all bugs are shallow.” We think this also applies to search engine results!

Linux.com: Why did you choose Apache Spark? What are some features of Spark that make it well suited for this project?

Sylvain: When we choose languages and frameworks for building Common Search, we consider two factors: technology and community. We want to use technologies that are close to the state-of-the-art and well suited for the unique challenges of a search engine, but we also want to position ourselves within vibrant communities that can be a source of talented contributors.

Spark is quite unique, because it fits both needs almost perfectly: it was built specifically for fast, large-scale data processing, and it is one of the most active Apache projects. It also supports Python, which is our main back-end language, so choosing it made a lot of sense.

Linux.com: What are some challenges that remain to be solved?

Sylvain: We definitely need to raise awareness about the need for a nonprofit search engine. It seems quite obvious when explained clearly, but many people are still unaware of the dangers of having only for-profit search engines on the market.

One of the biggest challenges, though, is simply convincing people that we can actually build a useful service as a nonprofit. I think few people believed Wikipedia would survive vandalism and grow to become one of the top destinations on the web, but it actually happened!

What we need the most right now is for many new contributors to join the project. We made sure the project is very welcoming for newcomers with lots of documentation, simple tutorials, and easy issues to get started on. This is really our main focus because each new contributor gets us closer to a better, fairer web 🙂

Hear from leading open source technologists from Cloudera, Hortonworks, Uber, Red Hat, and more at Apache: Big Data and ApacheCon Europe on November 14-18 in Seville, Spain.