The Apache Software Foundation’s Two New Big Data Projects Tackle Science and Processing

The Apache Software Foundation is making a big commitment to Big Data. As reported in this post, in recent months the foundation has promoted a slew of open source Big Data projects to Top-Level status. This puts a number of them on the same kind of development fast track that catapulted the Spark project to success.

Doug Cutting, co-founder of Hadoop, recently said at the Apache Big Data conference, “The hallmark of this ecosystem that’s emerged is the way that it’s evolving. We’re seeing not just new projects added, but some of the old projects being replaced over time by things that are better. In the end, nothing is sacred. Any component can be replaced by something that is better.”

As cases in point, Apache has announced that two more Big Data projects, OODT and Bahir, have earned Top-Level status. With that status, both projects will benefit from active development and strong community support.

As background, countless organizations around the world now work with data sets so large and complex that traditional data processing applications can no longer analyze them effectively. That’s the problem that the new wave of Big Data applications aims to solve, and Apache has graduated more than 10 of these projects to Top-Level status in the past year.

OODT: NASA is Onboard

Originally created at NASA Jet Propulsion Laboratory in 1998 as a way to build a national framework for data sharing, OODT has also been instrumental to the National Cancer Institute’s Early Detection Research Network for managing distributed scientific data sets across 20+ institutions nationwide for more than a decade.

According to Apache:

“OODT is a grid middleware framework for science data processing, information integration, and retrieval. As ‘middleware for metadata’ (and vice versa), OODT is used for computer processing workflow, hardware and file management, information integration, and linking databases. The OODT architecture allows distributed computing and data resources to be searchable and utilized by any end user.”

“Apache OODT 1.0 is a great milestone in this project,” said Tom Barber, Vice President of Apache OODT. “Effectively managing data pools has historically been problematic for some users, and OODT addresses a number of the issues faced. v1.0 allows us to prepare for some big changes within the platform with new UI designs for user-facing apps and data flow processing under the hood. It’s an exciting time in the data management sector and we believe Apache OODT can be at the forefront of it.”

Apache OODT is in use in many scientific data system projects in Earth science, planetary science, and astronomy at NASA, such as the Lunar Mapping and Modeling Project (LMMP), NPOESS Preparatory Project (NPP) Sounder PEATE Testbed, the Orbiting Carbon Observatory-2 (OCO-2) project, and the Soil Moisture Active Passive mission testbed.

In addition, OODT is used for large-scale data management and data preparation tasks in the DARPA MEMEX and XDATA efforts, and for supporting research and data analysis within the pediatric intensive care domain in collaboration with Children’s Hospital Los Angeles (CHLA) and its Laura P. and Leland K. Whittier Virtual Pediatric Intensive Care Unit (VPICU), among many other applications.

Bahir and Big Data Processing

Apache Bahir has become a Top-Level Project (TLP), too, and Spark developers will want to take note. Bahir bolsters Big Data processing by serving as a home for existing connectors that originated under Apache Spark, and it provides additional extensions/plugins for other related distributed, storage, and query execution systems.

Bahir’s code was extracted from the Apache Spark project and spun out as a standalone project to provide implementations of different Spark-related extensions/plugins, connectors, and other pluggable components. Current extensions include:

  • streaming-akka (Akka: open source toolkit and runtime simplifying the construction of concurrent and distributed applications on the Java Virtual Machine)

  • streaming-mqtt (MQTT: lightweight messaging protocol for small sensors and mobile devices, optimized for high-latency or unreliable networks)

  • streaming-twitter (Twitter: online social networking service; Bahir allows the processing of social data from Twitter)

  • streaming-zeromq (ZeroMQ: a high-performance asynchronous messaging library, aimed at use in distributed or concurrent applications)

In addition, Apache Bahir has a strong relationship with different storage layers; the project intends to extend that relationship to a number of other ASF projects and Apache-licensed initiatives.

“Apache Bahir is a new community that aims to be a place to curate extensions related to distributed analytic platforms following the Apache Governance,” said Luciano Resende, Vice President of Apache Bahir and an Architect at IBM contributing to The Apache Software Foundation for over 10 years. “The project is initially offering a few Apache Spark extensions but it is definitely open for expanding to other platforms such as Apache Beam, Apache Flink and others.”

“We are very interested in streaming-mqtt for remote sensing applications and control/monitoring. We have a lot of Big Data needs in Earth science especially in remote and difficult to access environments and plugins such as streaming-mqtt from Bahir provide a readily accessible and Apache-based solution to that,” said Chris Mattmann, member of the Apache Bahir Project Management Committee, and Chief Architect, Instrument and Science Data Systems Section at NASA Jet Propulsion Laboratory.

“We are very motivated to increase the size and diversity of the Apache Bahir community,” added Resende. “We welcome feedback, use cases, bug reports, patch submissions, code contributions, documentation, new extension proposals, and other ways to participate.”

Are you interested in more cutting-edge Big Data projects that Apache is elevating to Top-Level? You can find a comprehensive collection of them in this post.