Evolution of Apache Hadoop

The year 2016 will see Americans lining up to elect their new president. While passion and sentiments will dictate the outcome of the elections on the surface, deep down, modern technology will be at play, helping determine who will be the next president. These elections will harness the power of Big Data on a scale never seen before. We have already seen the role that Big Data played in the 2012 elections, and it’s only going to get bigger. This Big Data revolution is led by, as expected, open source and Apache Hadoop, in particular.

Brief History of Apache Hadoop

Almost a decade ago, Yahoo! asked its developers to work on a great web processing structure for the company in order to modernize its infrastructure. The team created an internal project that could handle what they needed. However, the developers wanted to open source the project so that they could collaborate with other players like Facebook and Twitter on the project.

Yahoo! had many patents on search, and their lawyers, for all the right reasons, didn’t want the patents to go into the public domain. It was a deadlock. The team didn’t want to take on such a proprietary project, which would be deprived of collaboration, so they started to look around and found Hadoop, a little component written for the Nutch search engine project by Doug Cutting and Mike Cafarella.

Yahoo! folks realized that their own prototype was ahead of Hadoop at that point, but the fact that it could not be open sourced led them to make a tough decision. “We thought, it’s early enough, even if we are ahead of Hadoop as it was then, it makes no sense to develop proprietary infrastructure. So we abandoned our internal project and decided to adopt Hadoop. We also convinced Doug to join us,” said Arun C Murthy, founder and Architect at Hortonworks, who worked with Yahoo! back then.

The team worked very hard to improve Hadoop. “When we started on Hadoop, it worked on barely 2-3 machines. We did a lot of work, and at some point we had 100-plus people working on software and then we reached the point where in 2008 we went to production with a web search app on Hadoop, WebMap. This was an app where we were trying to grab the entire web,” added Murthy.

Beyond Web Search

Apache Hadoop, however, was meant to do much more than just power web search at Yahoo! Back then, Yahoo! was doing basic customization for users based on the IP address. “We found that they could offer content customized based on different factors, such as usage patterns and search history which goes beyond IP addresses. It not only allowed Yahoo! to serve better, personalized content but also to offer more suited ads, thus leading to monetization,” recalled Murthy.

And, this technology went beyond mere ads and customization; it went beyond Yahoo!

Today, almost every major industry utilizing Big Data is using Hadoop in one form or another, and it has brought a sea change to those industries. Advertising and the content industry clearly benefit from such analytic capabilities. The health industry also gains from it: by analyzing different kinds of data, a company can offer medicine at the right place, at the right time, and in the right quantity.

The insurance industry is also taking great advantage of it. By using data gathered through a tracking device installed in cars, for example, companies can offer better rates to careful drivers, and higher rates to reckless ones. The oil industry is using it, governments are using it, and even security agencies are heavy users of Big Data as analytics plays a critical role in national security.

In a nutshell, Hadoop is everywhere.

What Made Hadoop Such a Huge Success?

Many factors contributed to Hadoop’s success. The Apache Software Foundation (ASF) offered the environment it needed to attract the best developers. Doug Cutting, the founder of Hadoop, told me in an interview, “Apache’s collaborative model was critical to Hadoop’s success. As an Apache project, it has been able to gather a diverse set of contributors that together refine and shape the technology to be more useful to a wider audience.”

Hadoop benefitted greatly from the infrastructure of the Apache Software Foundation, “…be it communications (mailing lists, bug tracking etc) or hardware resources for helping with software development processes like building/testing the projects,” said Vinod Kumar Vavilapalli, a member of the Hadoop Project Management Committee.

The foundation also offered the project “legal shelter for the said contributions via the non-profit legal entity and the Apache Software License for the code. Besides these structural benefits, the diverse communities that form the foundation also help in fostering a collaborative, meritocratic environment,” added Vinod.

Hadoop Is Like a Solar System

Apache Hadoop, once a tiny component of another Apache project, is now a star in its own right, with different open source components revolving around it.

In talking about the evolution of Apache Hadoop, Vinod said, “It’s been a long and fantastic journey for Apache Hadoop since its modest beginnings.” Apache Hadoop has in fact become much more than a single Big Data project.

“Today, Hadoop together with its sister projects like Apache Hive, Apache Pig, Apache HBase, Apache Tez, Apache Oozie, Apache Spark, and nearly 20 (!) other related projects has spawned an entirely new industry aimed at addressing the big data storage and processing needs of our times,” said Vinod.

A recent addition to Hadoop’s ecosystem is YARN, which stands for Yet Another Resource Negotiator. It sits on top of HDFS, the Hadoop Distributed File System, and essentially acts as the operating system for Hadoop. It has transformed the “Hadoop project from being a single-type data-processing framework (MapReduce) to a much-larger-in-scope cluster-management platform that facilitates running of a wide variety of applications and frameworks all on the same physical cluster resources,” stated Vinod.

“Then there are many data access engines that can plug into YARN such as Spark, Hive (for SQL), or Storm (for Streaming). But that isn’t enough for enterprises – they need security (Apache Ranger), data governance (Apache Atlas) and operations (Apache Ambari) capabilities. We have teams working on each of these projects and many more,” added Murthy.
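For readers unfamiliar with MapReduce, the "single-type data-processing framework" mentioned above, the idea can be sketched in a few lines of plain Python. This is only an illustration of the programming model, not Hadoop's actual API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(docs))
```

In a real cluster, the map and reduce steps would run in parallel on many machines, with the framework handling the grouping in between; the sketch only shows the shape of the computation.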

Communities Are the Leaders of True Open Source Projects

Community-driven projects are always better than company-owned ones; they attract more talent and derive more benefit from it. No matter how large a company is, it can’t hire everyone. The developer with the right skills may be working for a competitor. When you create a community driven by open source, the developer being paid by your competitor actually works to improve your code.

Such a community-driven development model was also pivotal to Hadoop’s success. “There is no leader or primary contributor. We’re all peers, reviewing and refining each other’s contributions, building consensus to move the project forward. Contributors are vital to open source. They provide improvements motivated by need. Contributors direct the project and drive it forward. One initially seeds an open source project with some useful code, but it’s not really alive until folks are contributing,” said Cutting.

“And that’s why ASF is a great place to collaborate,” said Murthy, “I can influence it with my code, but no one owns it except for ASF and they are like Switzerland — you don’t worry about ASF doing anything nasty to your code. That’s why everyone from Microsoft to IBM are comfortable putting their IP in the ASF.”

Community-Driven Capitalism

Many successful open source projects have struck a fine balance between the cathedral and the bazaar; open source is as much about entrepreneurship as it is about community. Apache Hadoop allowed early contributors such as Murthy to create companies like Hortonworks, which now offers open source software support subscriptions and training and consulting services. The company now serves industries including retail, healthcare, financial services, telecommunications, insurance, oil and gas, and manufacturing.

Open Source Is Becoming the Norm in Enterprise

The deep penetration of Apache Hadoop in multi-billion dollar industries is yet another example of open source becoming the norm in the enterprise segment; you hardly hear about proprietary technologies at all.

Jim Zemlin, the executive director of the Linux Foundation, says, “Organizations have discovered that they want to shed what is essentially commodity R&D and software development that isn’t core to their customers and build all of that software in open source.”

This approach allows them to focus on their core business instead of building every single component used in their product. Sam Ramji, CEO of Cloud Foundry, summed it up nicely: “Users want it and it’s a more efficient way to build software. The time for open source is here, even if it has not taken over the world yet. I think in 10 years from now we won’t even have a word open source, we will just say software.”