Apache Spark Creator Matei Zaharia Describes Structured Streaming in Spark 2.0 [Video]


Apache Spark has been an integral part of Mesos from its inception. Spark is one of the most widely used big data processing systems for clusters. Matei Zaharia, the CTO of Databricks and creator of Spark, talked about Spark’s advanced data analysis power and new features in its upcoming 2.0 release in his MesosCon 2016 keynote.

Spark’s Design Goals

Spark was created to meet two needs: to provide a unified engine for big data processing, and to offer a concise, high-level API for working with big data.

“A lot of data and analysis is exploratory and interactive. So, unlike things like high-performance computing, where you write a program and then you run it for many years, and you can afford to spend a few months optimizing it, in data science, what you really do is write a program and you run it once, and then you realize it was computing the wrong thing and you never run it again. So you can’t actually spend a lot of time sitting down and tuning your program. The solution is to have very high-level APIs that try to get you pretty good performance and are faster to iterate, so that you can actually explore your data,” said Zaharia.

Spark ships with libraries for different kinds of data processing, such as SQL and DataFrames for structured data, streaming libraries for incremental processing, and graph processing. According to Zaharia, “These all build on top of the Resilient Distributed Dataset (RDD) API, and the cool thing is when we look at users, most users do use a mix of these. I think something like 75% of users use two libraries or more. It’s actually useful for people trying to build applications.”

Spark 2.0

Spark 2.0 has not yet been released, but you can try out the preview release. The most significant new feature is structured streaming, which greatly expands Spark’s real-time data analysis capabilities.

“It has event time, which means your records can have timestamps set from outside, and they can come in out of order, and you can still do aggregation and windowing by the original time in the data. It’s got windowing, sessions, sessionization, and a really nice API for plugging in data sources and sinks… With structured streaming, you’re able to take the data in a stream, build a table in Spark SQL, and serve the table through JDBC, and anything that talks SQL can query the real-time state of your stream,” Zaharia said.
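
To make the event-time idea concrete, here is a minimal plain-Python sketch of windowed counting by the timestamp carried in each record rather than by arrival order. It is not Spark code and does not use Spark's API: the function names, the 10-minute window size, and the sample records are all assumptions made for illustration. Spark 2.0 expresses the same kind of grouping declaratively and maintains the result incrementally as data arrives.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical tumbling-window size, chosen only for this illustration.
WINDOW = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1)


def window_start(event_time, window=WINDOW):
    """Return the start of the tumbling window that contains event_time."""
    whole_windows = (event_time - EPOCH) // window  # integer count of windows
    return EPOCH + whole_windows * window


def count_by_event_time(records, window=WINDOW):
    """Count records per (window start, key) using each record's own timestamp."""
    counts = defaultdict(int)
    for event_time, key in records:
        counts[(window_start(event_time, window), key)] += 1
    return dict(counts)


# Records in ARRIVAL order; the 12:03 event arrives last, out of order.
arrivals = [
    (datetime(2016, 6, 1, 12, 1), "click"),
    (datetime(2016, 6, 1, 12, 14), "click"),
    (datetime(2016, 6, 1, 12, 12), "view"),
    (datetime(2016, 6, 1, 12, 3), "click"),  # late record
]

result = count_by_event_time(arrivals)
for (start, key), n in sorted(result.items()):
    print(start.strftime("%H:%M"), key, n)
```

Because each record is assigned to a window by its own timestamp, the late 12:03 click is still counted in the 12:00 window, and the counts do not depend on the order in which the records arrive.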

Watch Matei Zaharia’s full keynote presentation below to learn about other new 2.0 features, and see a live demonstration of structured streaming.


More Mesos Large-Scale Solutions

You might enjoy these previous articles about MesosCon:

4 Unique Ways Uber, Twitter, PayPal, and Hubspot Use Apache Mesos

How Verizon Labs Built a 600 Node Bare Metal Mesos Cluster in Two Weeks

Running Distributed Applications at Scale on Mesos from Twitter and CloudBees

New Tools and Techniques for Managing and Monitoring Mesos

And watch this space for more posts on ingenious and creative ways to hack Mesos for large-scale tasks.

MesosCon Europe 2016 offers you the chance to learn from and collaborate with the leaders, developers and users of Apache Mesos. Don’t miss your chance to attend! Register by July 15 to save $100.


Apache, Apache Mesos, and Mesos are either registered trademarks or trademarks of the Apache Software Foundation (ASF) in the United States and/or other countries. MesosCon is run in partnership with the ASF.