IBM loves Apache Spark. It’s training its engineers on it, it’s contributing to the project, and it’s building many of its big data products on top of the open source platform so IBM’s enterprise customers can use its powerful tools.
Luciano Resende, an architect at IBM’s Spark Technology Center, told the crowd at Apache Big Data in Vancouver that Spark’s ability to handle structured, unstructured, and streaming data in a single memory-efficient platform has led IBM to use the open source project wherever it can.
“We at IBM … have noted the power of Spark, and the other big data technologies that are coming in [from the Apache Software Foundation],” Resende said.
IBM is particularly invested in Spark’s machine-learning capabilities and is contributing back to the project with its work on SystemML, which helps create iterative machine-learning algorithms. The company also offers Spark as a service in the cloud, and it is building Spark into the next iteration of the Watson analytics platform. Basically anywhere it can, IBM is harnessing the efficient power of Apache Spark.
“We have our ETL platform, and we moved that to be on top of Spark,” Resende said. “By doing that it enabled us to go from 40 million lines of code to 4 million lines of code.”
Resende said Spark plays major roles in IBM’s Watson Health product, where doctors can query data lakes of internal and external data to better predict patient outcomes, and in helping a major telecom client create a 360-degree customer view to improve customer experience.
But perhaps the most impressive use of Spark at IBM is how it helps run one of the tech titan’s recent acquisitions: The Weather Company.
The Weather Company provides data for The Weather Channel and dozens of apps, and its data is used by Google, Apple, and several other companies. Resende said the database receives 30 billion API requests a day, more than 60 times the number of daily tweets, and serves a mobile user base of more than 120 million active users.
As a result, The Weather Company processes 360 petabytes of data every day, which has to be analyzed both in batch processes and through streaming.
“For this, they’ve chosen Apache Spark,” Resende said, “and for storage they use a lot of Apache Cassandra. This allows them to process all the data they have — 360 PB of data — as traffic daily, and it allows them to have a platform that can scale linearly and in a cost-efficient way.”