Apache Big Data Preview: Q&A with IBM’s Anjul Bhambhri


As a preview to the upcoming Apache Big Data Europe conference, we spoke with Anjul Bhambhri, Vice President, Big Data and Analytics, IBM Silicon Valley Lab, who will be giving a keynote presentation titled “Apache Spark — Making the Unthinkable Possible.” We talked with Bhambhri about IBM’s involvement with open source and what Big Data really means.

Q: Big Data seems to refer not so much to “a large amount of data” as “so much data that traditional methods can’t handle it.”  For IBM, what does Big Data mean?

Currently, Big Data means hundreds of terabytes, going up to tens of petabytes. This could be structured data (data in fixed records or fields, as in databases) as well as unstructured data like documents, images, and multimedia. Every industry has Big Data: financial, telco, insurance, healthcare, automotive, and retail. They want to combine data from multiple data sources of record with data sources of interaction, extract new information, derive actionable insights from the information, and deliver new and incremental value via products and services.


Q: What does IBM’s Big Data Products group do? What are some of the main products?

IBM’s Big Data products help customers across industries store, manage, and analyze all of their data using descriptive, predictive, and prescriptive analytics. The capabilities are available to data scientists as well as to Lines-of-Business (LOB). Products like BigInsights provide the ability to store and query all types of data using SQL, extract information from unstructured data using text analytics, and do large-scale analytics using R and machine learning. The SPSS suite of products provides the ability to build and score predictive models at scale on big data. Watson Analytics offers LOB the benefits of advanced analytics without the complexity. LOB users can get answers and new insights to make confident decisions in minutes, all on their own.

The InfoSphere Streams product lets you analyze Big Data in real time. As data continues to grow in the Big Data platforms, it is very easy for the platform to become a swamp as opposed to a source of well-curated information. This is where IBM tools such as BigQuality and BigIntegration help IT personnel strike a balance between allowing unfettered data exploration and analytics and ensuring that the data is appropriately governed in terms of use, lineage, and quality. These tools also make it easy for LOB to publish “good information” from data, allowing analysts and users to “shop for data” without having to curate the data themselves.

Q: How do these products, and the group’s activities, relate to Apache?

All of these products leverage Apache projects like Spark and Hadoop, along with their ecosystem of related Apache projects, to work with Big Data. In addition, these products are bringing SQL and large-scale analytics to the Hadoop and Spark community. ISVs that are building applications leveraging Apache projects can now use these value-add capabilities to build applications that better serve their customers.

Over the years, we’ve seen use cases that span the spectrum from log mining to fraud detection, from text analytics to machine learning. Let me give you a couple of examples of how our customers have leveraged Apache projects along with IBM products to derive valuable insights from big data.

The data integration space is one of Apache Hadoop’s strengths. We have a few large ISVs that are integrating customer interactions across multiple channels. They collect all the “touches” a customer has with a business, whether via phone, email, web chat, social media, or blogs (voice, video, web, chat, and telephony). This raw data is then turned into a trail, a “Customer Journey” if you will. To do this, they’re leveraging Hadoop in combination with IBM BigInsights value-adds around data transformation, SQL, text analytics, and machine learning.

The immediate benefit is that businesses are able to see their customers in a more complete light. The integrated channel views are leading to improved call-handling efficiencies and faster problem resolutions. For example, the end customers don’t have to repeat themselves to operators on the phone, and there are fewer gaps in customer service. Thanks to Hadoop, these ISVs have seen significant drops in operational costs. Crunching millions of interactions now takes minutes or hours rather than days or weeks.
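
To make the “Customer Journey” idea concrete, here is a minimal single-machine sketch in plain Python. It is not the ISVs’ or IBM’s implementation: the record fields (`customer_id`, `channel`, `ts`), the sample data, and the `build_journeys` helper are all hypothetical, and a production system would run the same group-and-sort step on a distributed platform such as Hadoop or Spark.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw "touches" collected from several channels, in arrival order.
touches = [
    {"customer_id": "c1", "channel": "web chat", "ts": "2015-09-01T10:05:00"},
    {"customer_id": "c2", "channel": "email", "ts": "2015-09-01T09:00:00"},
    {"customer_id": "c1", "channel": "phone", "ts": "2015-09-01T09:30:00"},
    {"customer_id": "c1", "channel": "social media", "ts": "2015-09-02T08:00:00"},
]


def build_journeys(records):
    """Group touches by customer and order each group by timestamp."""
    by_customer = defaultdict(list)
    for record in records:
        by_customer[record["customer_id"]].append(record)

    journeys = {}
    for customer_id, customer_records in by_customer.items():
        ordered = sorted(
            customer_records,
            key=lambda r: datetime.fromisoformat(r["ts"]),
        )
        # The journey is the time-ordered trail of channels the customer used.
        journeys[customer_id] = [r["channel"] for r in ordered]
    return journeys


journeys = build_journeys(touches)
```

With a trail like this per customer, an operator can see that a phone call preceded a web chat, which is the kind of context that spares the customer from repeating themselves.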

We also have a large healthcare provider who’s using Hadoop with IBM’s BigInsights text analytics. They’re faced with the problem of identifying patients that have recalled implants, such as hip replacements, spinal implants, and the like. Unfortunately, the information is buried in thousands of medical records in textual form. Can you imagine teams of nurses reading through patient medical records, trying to piece together the maker, model, serial number, etc. for these implants?  With text analytics running on a distributed platform like Hadoop and Spark, they are able to identify patients programmatically, more accurately, and quickly.
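
As a rough illustration of what identifying patients “programmatically” can look like, the sketch below pulls maker, model, and serial number out of free-text notes with regular expressions and checks them against a recall list. Everything here is an assumption made for illustration: the note wording, the patterns, the `extract_implant` and `find_recalled_patients` helpers, and the recalled model are invented, and real text analytics (in BigInsights or elsewhere) handles far messier language than a few labeled fields. It also runs on one machine; the same per-record function could be applied in parallel on Hadoop or Spark.

```python
import re

# Hypothetical patterns for fields written as "label: value" in a note.
FIELD_PATTERNS = {
    "maker": re.compile(
        r"(?:maker|manufacturer)\s*[:=]\s*([A-Za-z][\w ]*?)(?=[;,.\n]|$)",
        re.IGNORECASE,
    ),
    "model": re.compile(r"model\s*[:=]\s*([\w-]+)", re.IGNORECASE),
    "serial": re.compile(
        r"serial(?:\s+(?:no\.?|number))?\s*[:=#]\s*([\w-]+)",
        re.IGNORECASE,
    ),
}


def extract_implant(text):
    """Return the fields found in one free-text note (None where missing)."""
    found = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        found[field] = match.group(1).strip() if match else None
    return found


def find_recalled_patients(records, recalled_models):
    """Return (patient_id, serial) pairs whose maker and model are recalled."""
    hits = []
    for patient_id, text in records:
        implant = extract_implant(text)
        if (implant["maker"], implant["model"]) in recalled_models:
            hits.append((patient_id, implant["serial"]))
    return hits


# Invented sample notes and an invented recall list.
records = [
    ("p-001", "Pt underwent left hip replacement. Manufacturer: Acme Ortho; "
              "model: HX-200; serial no: SN-00123."),
    ("p-002", "Spinal implant placed 2012. Maker: Medix, Model: SP-9, "
              "Serial number: 77812"),
    ("p-003", "Routine check-up, no implants noted."),
]
recalled_models = {("Acme Ortho", "HX-200")}

hits = find_recalled_patients(records, recalled_models)
```

Because each note is processed independently, the work splits naturally across many machines, which is why a distributed platform helps with thousands of records.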

Q: Can you talk briefly about IBM’s involvement with Linux and with open source?

IBM has been at the forefront of the open source movement, starting with its contributions to the Apache HTTP Server and the Apache Software Foundation itself, continuing with the Eclipse Foundation, and its embrace of Linux with its own Linux Technology Center.

IBM’s contributions to Linux include ports to mainframe, the Power architecture, and the Cell Broadband Engine. These ports enable support for Linux on all modern IBM systems and allow Linux to be at the heart of IBM solutions regardless of hardware platform. More than 500 IBM software products now run natively on Linux, including WebSphere, DB2, and systems management products.

In addition to its involvement in Linux, IBM has been actively contributing to open source platforms including Hadoop, Node.js, and Docker. The latest in this trail of open source advancement by IBM has been the announcement of the Spark Technology Center to promote and enhance the Apache Spark platform.

Q: You’re scheduled to give a keynote called “Apache Spark — Making the Unthinkable Possible.” Can you tell us briefly what it’s about? How is IBM contributing to the stretching of boundaries of Big Data thinking?

The Watson initiative is an example of Big Data thinking. We are pushing the boundaries on assimilating all existing data and knowledge and using that to predict and provide guidance on a wide range of topics, from healthcare to financial advice to management of natural resources, and even to cooking up a new recipe from a set of ingredients. This is the new age of using data not only to predict outcomes but to change them based on ongoing events.