Using Apache Hadoop to Turn Big Data Into Insights


The Apache Hadoop framework for distributed processing of large data sets is supported and used by a wide-ranging community — including businesses, governments, academia, and technology vendors. According to John Mertic, Director of ODPi and the Open Mainframe Project at The Linux Foundation, Apache Hadoop provides these diverse users with a solid base and allows them to add on different pieces depending on what they want to accomplish.

John Mertic, Director, ODPi and Open Mainframe Project
As a preview to Mertic’s talk at Apache: Big Data Europe in Seville, Spain, we spoke with him about some of the challenges facing the project and its goals for growth and development.

Apache Hadoop has a large and diverse community and user base. What are some of the various ways the project is being used for business and how can the community meet those needs?

If you think of a use case where a business needs to answer a question with data, the chances that they are using Apache Hadoop are fairly high. The platform is evolving to become the go-to strategy for working with data in a business. Hadoop’s ability to turn data into business insights speaks to the flexibility and depth of both Hadoop and the Big Data ecosystem as a whole.

The Big Data community can help to increase the adoption of Apache Hadoop through consistent encouragement of interoperability and standardization across Hadoop offerings. These efforts will not only help to mitigate the risks associated with implementing differing platforms, but also streamline new development, promote open source architectures, and eliminate confusion about functionality.

What is the most common misconception about Apache Hadoop?

The most common misconception about Apache Hadoop is that it is just a project of The Apache Software Foundation, and one containing only YARN, MapReduce, and HDFS. In reality, as it’s brought to market by platform providers like Hortonworks, IBM, Cloudera, or MapR, Hadoop can be equipped with 15-20 additional projects that vary across platform vendors, like Hive, Ambari, HCFS, etc. To use an analogy, Apache Hadoop is like Mr. Potato Head. You start with a solid base and can add different pieces depending on what you are trying to accomplish. What an end user thinks of as Apache Hadoop is often more than the core project itself, and thus it may seem quite amorphous.

What are its strengths, and what value does it bring to users?

The Hadoop ecosystem enables a multitude of strategies for dealing with and capitalizing on data in any enterprise environment. The breadth and depth of the evolving platform now enable businesses to consider this growing ecosystem as part of their strategy for data management.

Can you describe some of the current challenges facing the project?

There certainly are compatibility gaps with Apache Hadoop and, while technologists are tackling some of these by creating innovative new projects, I think having a tighter feedback loop of real-life usage from businesses, to help the technologists closest to the project understand the challenges and opportunities, will be crucial to increasing adoption. Obtaining those use cases directly from user to project can help solidify and mature these projects quickly.

The breadth of the ecosystem also has side effects, most commonly end-user confusion and mismatched enterprise software expectations. These arise when end users turn to Hadoop from the mature world of enterprise data warehouses with the same expectations but don’t see the same stability in this new ecosystem.

What are the project’s goals and strategies for growth?

ODPi’s goals for the Big Data community at large are to solve end-user challenges more directly, remove the investment risks for legacy companies considering a move to Hadoop through universal standardization, and connect the technology more directly to business outcomes for potential enterprise users.
