5 Questions About the Open Source Spark Connector for Cloudant Data


Editor’s Note: This article is paid for by IBM as a Diamond-level sponsor of ApacheCon North America, and written by Linux.com.

Connectors make all our lives easier. In the case of the Spark-Cloudant connector, the connector’s easy-to-use syntax simplifies running Spark analytics on data stored in Cloudant. And with the large Spark ecosystem, you can conduct federated analytics across Cloudant and other disparate data sources. The days of analyzing only your own company’s data are long gone; piping in more data is essential these days.

We talked with Mike Breslin, offering manager at IBM Cloud Data Services, who focuses on IBM Cloudant, to explore the details of the Spark-Cloudant connector and pick up a few tips on using it.

Linux.com: What’s the one thing you want practitioners to know about the Cloudant-Spark connector? What advantages does it bring in real-world practice?

Mike Breslin: It’s an open source connector built by the IBM Cloudant team. It consists of easy-to-use APIs so you can leverage Spark analytics on your Cloudant data.    

Linux.com:  Where can we find the connector and how can it be used?

Breslin: The spark-cloudant connector comes pre-loaded in the Apache Spark-as-a-Service offering on IBM Bluemix, or it can be used in standalone Spark instances. Download it from the Spark Packages site or GitHub if you want to use it standalone, and include it in your environment. It’s available for anyone’s use under the Apache 2.0 license. As is common with most things Spark, it’s available for Python and Scala applications.
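For a standalone Spark instance, the connector can be pulled in at launch time through Spark’s packages mechanism. A minimal sketch, assuming the `cloudant-labs:spark-cloudant` coordinates from the Spark Packages site (the version string is illustrative; check the site for the release matching your Spark and Scala versions):

```shell
# Launch a PySpark shell with the spark-cloudant package resolved from
# the Spark Packages repository. The version shown is an example only.
pyspark --packages cloudant-labs:spark-cloudant:2.0.0-s_2.11
```

The same `--packages` flag works with `spark-shell` and `spark-submit`, so Scala applications can pull in the connector the same way.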

The connector just takes a few seconds to download; it isn’t a very big piece of software. Powerful, but not big.

Linux.com:  What’s the quickest way to start analyzing your Cloudant data in Spark?

Breslin: The best way to get started is to just jump right in. But you might want to check out the tutorials and walk-throughs first.     

Linux.com:  The connector offers several helpful capabilities. Which do you find the most useful yourself and why?

Breslin: The integration means leveraging fast Spark analytics on your Cloudant data for ad hoc querying and advanced analytics.  You can load whole databases into a Spark cluster for analysis.
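Loading a database into Spark is a one-line read once credentials are set. A hedged sketch in Python, where the host, credentials, and the database name (`crimes`) are placeholders for your own Cloudant account details:

```python
# Sketch: load an entire Cloudant database into a Spark DataFrame.
# ACCOUNT, USERNAME, PASSWORD, and the "crimes" database are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("cloudant-read") \
    .config("cloudant.host", "ACCOUNT.cloudant.com") \
    .config("cloudant.username", "USERNAME") \
    .config("cloudant.password", "PASSWORD") \
    .getOrCreate()

# The connector infers a schema from the documents in the database.
df = spark.read.format("com.cloudant.spark").load("crimes")
df.printSchema()
df.cache()  # keep the data in memory for repeated ad hoc queries
```

From there the full DataFrame API is available for filtering, aggregation, and Spark SQL queries.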

Because Spark can connect to a variety of data sources, it supports federated analytics, and Cloudant can be one of those sources. You can conduct federated analytics across Cloudant, the dashDB data warehouse, Object Storage, and other disparate data sources.
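A federated query boils down to registering each source as a temporary view and joining them in Spark SQL. A sketch, assuming a SparkSession already configured with Cloudant credentials; the database, table, URL, and column names are all hypothetical:

```python
# Sketch: join Cloudant documents with a table from a second source
# (a generic JDBC database standing in for dashDB). All names below
# are hypothetical placeholders.
orders = spark.read.format("com.cloudant.spark").load("orders")
orders.createOrReplaceTempView("orders")

customers = spark.read.jdbc(
    url="jdbc:db2://HOST:50000/BLUDB",
    table="CUSTOMERS",
    properties={"user": "USERNAME", "password": "PASSWORD"})
customers.createOrReplaceTempView("customers")

# One SQL query spanning both sources.
spark.sql("""
    SELECT c.name, COUNT(*) AS n_orders
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
""").show()
```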

You can also transform the data and then write it back to Cloudant to store it there.    
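Writing back uses the same data-source format in the other direction. A sketch, assuming `df` is a DataFrame already loaded from Cloudant and that the target database name and column are placeholders:

```python
# Sketch: transform a DataFrame in Spark, then write the result back
# to a Cloudant database. "status" and "resolved_cases" are placeholders.
result = df.filter(df.status == "resolved")
result.write.format("com.cloudant.spark").save("resolved_cases")
```

The connector has a `cloudant.createDBOnSave` option for creating the target database on save; otherwise the database should exist before you write to it.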

Linux.com:  As you know, time is of the essence in this type of work. Got any tips to share on making the work or outputs faster or better?

Breslin: If you’re not using Spark already, you’ll likely find it faster and easier to use Spark-as-a-Service. If you’re new to Spark, I recommend checking out the Spark Fundamentals classes on Big Data University and the tutorials on IBM developerWorks.

As for familiarizing yourself with the connector, I’d suggest checking out the README on GitHub and the video tutorials on our Learning Center, which show how to use the connector in both Scala and Python notebooks.