Sparkling Water: Bridging Open Source Machine Learning and Apache Spark


Although many people have experience with machine learning and artificial intelligence through applications in their pockets, such as Apple’s Siri and Microsoft’s Cortana, the scope of this technology extends well beyond the smartphone. H2O.ai, formerly known as 0xdata, has carved out a unique niche in the machine learning and artificial intelligence arena because its primary tools are free and open source, and because it is connecting them to other widely used data analytics tools. As a case in point, H2O.ai has now announced the availability of version 2.0 of its open source Sparkling Water tool. Sparkling Water, H2O.ai’s API for Apache Spark, gives Spark users access to H2O’s machine learning algorithms.

You can download Sparkling Water 2.0 for free now. New features include the ability to: interface with Apache Spark, Scala and MLlib via H2O’s Flow UI; build ensembles using algorithms from both H2O and MLlib; and give Spark users the power of H2O’s visual intelligence capabilities.

Sparkling Water includes a toolchain for building machine learning pipelines on Apache Spark.

In essence, Sparkling Water is an API that allows Spark users to leverage H2O’s open source machine learning platform instead of, or alongside, the algorithms that are included in Spark’s existing MLlib machine learning library. H2O.ai has published a number of use cases for how Sparkling Water and its other open tools are used in fields ranging from genomics to insurance.
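To make the "alongside" idea concrete, here is a minimal sketch of a weighted-average ensemble in plain Python. It does not use the Sparkling Water, H2O, or Spark APIs. The two scoring functions are hypothetical stand-ins for one model trained with H2O and one trained with MLlib, and every name in the block is an assumption made for illustration.

```python
import math
from typing import Callable, List, Optional, Sequence

# A scorer maps a feature row to a probability-like score in [0, 1].
Scorer = Callable[[Sequence[float]], float]


def stand_in_h2o_model(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a model trained with H2O (a logistic score)."""
    z = 0.8 * features[0] - 0.5 * features[1]
    return 1.0 / (1.0 + math.exp(-z))


def stand_in_mllib_model(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a model trained with MLlib (a simple rule)."""
    return 0.9 if features[0] > features[1] else 0.2


def average_ensemble(models: List[Scorer],
                     weights: Optional[List[float]] = None) -> Scorer:
    """Combine several scorers into one by taking a weighted average."""
    if not models:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0] * len(models)
    if len(weights) != len(models):
        raise ValueError("one weight per model is required")
    total = sum(weights)

    def predict(features: Sequence[float]) -> float:
        return sum(w * m(features) for w, m in zip(weights, models)) / total

    return predict


# Equal-weight ensemble of the two stand-in models.
ensemble = average_ensemble([stand_in_h2o_model, stand_in_mllib_model])
# Ensemble that trusts the first model three times as much as the second.
weighted = average_ensemble([stand_in_h2o_model, stand_in_mllib_model],
                            weights=[3.0, 1.0])
```

Calling `ensemble([0.0, 0.0])` averages the two stand-in scores for that row. In a real pipeline the stand-ins would be replaced by predictions from the actual trained models, and a stacked ensemble would typically learn the weights rather than fix them by hand.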

Analysts are beginning to realize that open source machine learning tools can be used in conjunction with tools like Spark, giving them flexibility as they focus on big data. “Enterprises are looking to take advantage of a variety of machine learning algorithms to address an increasingly complex set of use cases when determining how to best serve their customers,” said Matt Aslett, Research Director, Data Platforms and Analytics at 451 Research. “Sparkling Water is likely to be attractive to H2O and Spark users alike, enabling them to mix and match algorithms as required.”

Moreover, in an interview, H2O.ai’s Vinod Iyengar, who oversees product strategy at the company, noted that running H2O’s powerful, open tools on affordable clusters is now within anyone’s reach. “In the last five years the cost of storage has come down dramatically, as has the cost of memory,” he said. “Additionally, anyone can leverage an advanced computing cluster on, say, Amazon Web Services, for a few hundred dollars. All of this means that organizations or individuals can take a whole lot of data and produce powerful predictions and insights from the large data sets without facing huge costs.”

Tipping Point

What does this mean in simple terms? It means that we are at a tipping point where anyone can wield the same kind of machine learning and artificial intelligence muscle that is used for everything from drug discovery to deep data analytics.

Iyengar also sees the open source roots of Sparkling Water as powerful. “Code is truly getting commoditized and the only defensible asset is community,” he said. “The relationships we have with our customers are also deepened due to the open source nature of our products. Because H2O and Sparkling Water are open source, our customers are also our community. They take part in H2O not just as consumers, but as developers as well.”

Notably, H2O.ai is also working on a data science hub called Steam, which is intended to eliminate the DevOps work required to build and deploy machine learning and artificial intelligence models. With Steam, developers and data scientists will be encouraged to compare models across teams and take them into production without the need for heavy engineering work on the backend. We will follow up on Steam in a post to come soon.

To learn more about the promise of machine learning and artificial intelligence, watch a video featuring David Meyer, Chairman of the Board at OpenDaylight, a Collaborative Project at The Linux Foundation. And, to learn more about H2O.ai’s machine learning work, see this previous post.