On the Rise: Six Unsung Apache Big Data Projects


Countless organizations around the world are now working with data sets so large and complex that traditional data processing applications can no longer drive optimized analytics and insights. That’s the problem that the new wave of Big Data applications aims to solve, and the Apache Software Foundation (ASF) has recently graduated a slew of interesting open source Big Data projects to Top-Level status. That means that they will get active development and strong community support.

Most people have heard of Apache Spark, a Big Data processing framework with built-in modules for streaming, SQL, machine learning and graph processing. IBM and other companies are pouring billions of development dollars into Spark initiatives, and NASA and the SETI Institute are collaborating to analyze terabytes of complex deep space radio signals using Spark’s machine learning capabilities in a hunt for patterns that might betray the presence of intelligent extraterrestrial life.

However, several other recently elevated Apache Big Data projects deserve attention, too. In fact, some of them may produce ecosystems of activity and development that will rival Spark’s. In conjunction with this week’s ApacheCon North America conference and Apache: Big Data events, this article will round up the Apache Big Data projects that you should know about.

Here are six projects on the rise:


Apache recently announced that Kylin, a Big Data project born at eBay, has graduated to Top-Level status. Kylin is an open source Distributed Analytics Engine designed to provide an SQL interface and multi-dimensional analysis (OLAP) on Apache Hadoop, supporting extremely large datasets. It is still widely used at eBay and at a few other organizations.

“Apache Kylin’s incubation journey has demonstrated the value of Open Source governance at ASF and the power of building an open-source community and ecosystem around the project,” said Luke Han, Vice President of Apache Kylin. “Our community is engaging the world’s biggest local developer community in alignment with the Apache Way.”

As an OLAP-on-Hadoop solution, Apache Kylin aims to fill the gap between Big Data exploration and human use, “enabling interactive analysis on massive datasets with sub-second latency for analysts, end users, developers, and data enthusiasts,” according to developers. “Apache Kylin brings back business intelligence (BI) to Apache Hadoop to unleash the value of Big Data,” they added.


Apache also recently announced that Apache Lens, an open source Big Data and analytics tool, has graduated from the Apache Incubator to become a Top-Level Project (TLP). According to the announcement: “Apache Lens is a Unified Analytics platform. It provides an optimal execution environment for analytical queries in the unified view. Apache Lens aims to cut the Data Analytics silos by providing a single view of data across multiple tiered data stores.”

“By providing an online analytical processing (OLAP) model on top of data, Lens seamlessly integrates Apache Hadoop with traditional data warehouses to appear as one. It also provides query history and statistics for queries running in the system along with query life cycle management.”

“Incubating Apache Lens has been an amazing experience at the ASF,” said Amareshwari Sriramadasu, Vice President of Apache Lens. “Apache Lens solves a very critical problem in Big Data analytics space with respect to end users. It enables business users, analysts, data scientists, developers and other users to do complex analysis with ease, without knowing the underlying data layout.”


The ASF has also announced that Apache Ignite has become a Top-Level Project. It’s an open source effort to build an in-memory data fabric.

“Apache Ignite is a high-performance, integrated and distributed In-Memory Data Fabric for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies,” according to Apache community members. “It is designed to easily power both existing and new applications in a distributed, massively parallel architecture on affordable, industry-standard hardware.”


The foundation announced that Apache Brooklyn is now a Top-Level Project (TLP), “signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.” Brooklyn is an application blueprint and management platform used for integrating services across multiple data centers as well as a wide range of software in the cloud.

According to the Brooklyn announcement: “With modern applications being composed of many components, and increasing interest in micro-services architecture, the deployment and ongoing evolution of deployed apps is an increasingly difficult problem. Apache Brooklyn’s blueprints provide a clear, concise way to model an application, its components and their configuration, and the relationships between components, before deploying to public Cloud or private infrastructure. Policy-based management, built on the foundation of autonomic computing theory, continually evaluates the running application and makes modifications to it to keep it healthy and optimize for metrics such as cost and responsiveness.”

Brooklyn is in use at some notable organizations. Cloud service providers Canopy and Virtustream have created product offerings built on Brooklyn. IBM has also made extensive use of Apache Brooklyn to migrate large workloads from AWS to IBM SoftLayer.


In April, the Apache Software Foundation elevated its Apex project to Top-Level status. It is billed as “a large scale, high throughput, low latency, fault tolerant, unified Big Data stream and batch processing platform for the Apache Hadoop ecosystem.” Apex works in conjunction with Apache Hadoop YARN, a resource management platform for working with Hadoop clusters.


Finally, Apache Tajo, an advanced open source data warehousing system for Apache Hadoop, is another new Big Data project to know about. Apache claims that Tajo provides the ability to rapidly extract more intelligence for Hadoop deployments, third-party databases, and commercial business intelligence tools.

Clearly, although Apache Spark draws the bulk of the headlines, it is not the only Big Data tool from Apache to keep your eyes on. As this year continues, Apache likely will graduate even more compelling Big Data projects to Top-Level status, where they will benefit from optimized development resources and more.