Apache Spark Creator Matei Zaharia Describes Structured Streaming in Spark 2.0 [Video]

June 27, 2016

667

Apache Spark has been an integral part of Mesos from its inception. Spark is one of the most widely used big data processing systems for clusters. Matei Zaharia, the CTO of Databricks and creator of Spark, talked about Spark’s advanced data analysis power and new features in its upcoming 2.0 release in his MesosCon 2016 keynote.

Spark’s Design Goals

Spark was created to meet two needs: to provide a unified engine for big data processing, and a concise high-level API for working with big data.

“A lot of data and analysis is exploratory and interactive. So, unlike things like high-performance computing, where you write a program and then you run it for many years, and you can afford to spend a few months optimizing it, in data science, what you really do is write a program and you run it once, and then you realize it was computing the wrong thing and you never run it again. So you can’t actually spend a lot of time sitting down and tuning your program. The solution is to have very high-level APIs that try to get you pretty good performance and are faster to iterate, so that you can actually explore your data,” said Zaharia.

Spark uses libraries for data processing, such as SQL and data frames for structured data, streaming libraries for incremental processing, and graphics processing. According to Zaharia, “These all build on top of the Resilient Distributed Dataset (RDD) API, and the cool thing is when we look at users, most users do use a mix of these. I think something like 75% of users use two libraries or more. It’s actually useful for people trying to build applications.”

Spark 2.0

Spark 2.0 has not yet been released, but you can try out the preview release. The most significant new feature is structured streaming, which greatly expands Spark’s real-time data analysis capabilities.

“It has event time, which means your records can have time stamps set from outside, and they can come in out of order, and you can still do aggregation and windowing by the original time in the data. It’s got windowing, sessions, sessionization, and a really nice API for plugging in data sources and syncs… With structured streaming, you’re able to take the data in a stream, build a table in Spark SQL, and serve the table through JBDC, and anything that docks SQL can query the real time state of your stream,” Zaharia said.

Watch Matei Zaharia’s full keynote presentation below to learn about other new 2.0 features, and see a live demonstration of structured streaming.

https://www.youtube.com/watch?v=L029ZNBG7bk?list=PLGeM09tlguZQVL7ZsfNMffX9h1rGNVqnC

Spark’s Design Goals

Spark 2.0

More Mesos Large-Scale Solutions

RELATED ARTICLESMORE FROM AUTHOR

Building Autonomous ML Experimentation with Tangle and Tangent

Score Big on Your Tech Career

Celebrating the Second Year of Linux Man-Pages Maintenance Sponsorship

How to Deploy Lightweight Language Models on Embedded Linux with LiteLLM

Automating Compliance Management with UTMStack’s Open Source SIEM & XDR

RELATED ARTICLES MORE FROM AUTHOR