NEXmark: A Benchmarking Framework for Processing Data Streams


ApacheCon North America is only a few weeks away, happening May 16-18 in Miami. This year is particularly exciting because ApacheCon will be set up a little differently to showcase the wide variety of Apache topics, technologies, and communities.

Apache: Big Data is part of the ApacheCon conference this year. Ismaël Mejía and Etienne Chauchot, of Talend, are giving a joint presentation on NEXmark, a unified framework to evaluate Big Data processing systems with Apache Beam. In this interview, they share some highlights from that talk and other thoughts on these topics.

LinuxCon: Who should attend your talk? Who will get the most out of it?

Etienne: Our talk is about NEXmark, which comes from a research paper that set out to evaluate streaming systems on their streaming semantics. Google turned this paper into a suite of jobs, or pipelines as we call them. The suite was contributed to the community, but it didn’t integrate well with the rest of the Apache ecosystem, so we took it over and improved it, and we’re going to present that story.

Ismaël: As for the audience, we will only define the concepts that are specific to Beam, so basic Big Data knowledge is required.

LinuxCon: Is it only focused on Apache Beam or is it on Big Data in general?

Etienne: In the Big Data world there are two big families: batch and streaming. We will cover both cases because Beam is a unified model for both. Many other Apache products are involved as well.

Apache Beam is an abstraction for expressing pipelines, or jobs. But we also need other Apache products to execute them, the runners as we call them, so we can run Beam code on Apache Flink, Apache Spark, or Apache Apex. We also integrate with Apache data stores, like Cassandra.

Ismaël: The main goal of this benchmark suite is to reproduce use cases that exercise the advanced semantics of Beam and that also cover the whole streaming space.

LinuxCon: So you are both involved in Apache Beam? How long have you been involved in that?

Etienne: Since December, myself.

Ismaël: I’ve been involved since June of last year. I’m already a committer, that’s the good news, as of two weeks ago.

LinuxCon: What are the main highlights? You mentioned runners. Is there any specific new technology or new logic that you are unveiling as part of your talk?

Etienne: The big thing is that there is a new unified solution to evaluate Big Data systems in both streaming and batch modes, and that’s quite new. Attendees will also learn the concepts of Beam and its API.

LinuxCon: So what’s your overall aim?

Etienne: The first aim is that people will know they can take this suite and use it to evaluate their own infrastructure, too. For example, you might want to use a Big Data framework from Apache such as Spark, maybe version one or version two, and you want to evaluate the differences. You can take this suite and run it, and then you will have some criteria to help you decide. The second thing that could be of interest is how to use the advanced semantics of Beam, things like timers and other new features.

LinuxCon: Is this the first time you’re presenting?

Etienne: I went to Apache: Big Data in Vancouver last year and Seville also. It was a really nice atmosphere. But this is the first time I’m going to present something, so it’s going to be cool.

Ismaël: This will be the second time I have attended ApacheCon. I’ve already been to the European one in Seville. I’ve noticed that it has a family atmosphere, which is why I feel very comfortable in this kind of environment, in addition to enjoying the very interesting technical talks. But this is my first time speaking at ApacheCon.

LinuxCon: When is your talk? What date and time is it?

Etienne: It will be on Wednesday, May 17 at 2:30 pm.

Learn first-hand from the largest collection of global Apache communities at ApacheCon 2017, May 16-18 in Miami, Florida. ApacheCon features 120+ sessions including five sub-conferences: Apache: IoT, Apache Traffic Server Control Summit, CloudStack Collaboration Conference, FlexJS Summit and TomcatCon. Secure your spot now! Readers get $30 off their pass to ApacheCon: select “attendee” and enter code LINUXRD5.