How eBay Uses Apache Software to Reach Its Big Data Goals

eBay’s ecommerce platform creates a huge amount of data. The site has more than 800 million active listings, with 8.8 million new listings added each week, along with 162 million active buyers and 25 million sellers.

“The data is the most important asset that we have,” said Seshu Adunuthula, eBay’s head of analytics infrastructure, during a keynote at Apache Big Data in Vancouver in May. “We don’t have inventory like other ecommerce platforms; what we’re doing is connecting buyers and sellers, and data plays an integral role in how we go about doing this.”

Inside eBay, hordes of hungry product teams want to make use of all the transactional and behavioral data the platform creates to do their jobs better, from surfacing the most interesting items to entice buyers to helping sellers understand the best way to get their stuff sold.

Adunuthula said that about five years ago, eBay made the conscious choice to go all-in with open source software to build its big data platform and to contribute back to the projects as the platform took shape.

“The idea was that we would not only use the components from Apache, but that we would also start contributing back,” he said. “That has been a key theme at eBay for years: how do we contribute back to the open source community?”

Repository, Streams, and Services

Adunuthula said there are three main components to eBay’s data platform: the data repository, data streams, and data services.

Starting with the data repository, eBay is making use of Hadoop and several of the surrounding projects, like Hive and HBase, along with hardware from Teradata, to store the data created by millions of daily transactions on eBay.

“A decade ago we used to call them data warehouses; now, for the last five years, because the type and the structure of the data are changing, we are calling them data lakes,” Adunuthula said. “Apache Hadoop is a big component of how we’re implementing the data lakes. It is essentially a place where you store your denormalized data, your aggregated data, and historical data.”
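
To make that concrete, here is a minimal sketch of what querying a Hive-backed data lake looks like with PySpark; the marketplace.listings table and its columns are hypothetical, not eBay's actual schema.

from pyspark.sql import SparkSession

# Connect with Hive support so tables registered in the Hive
# metastore are visible to SQL queries.
spark = (
    SparkSession.builder
    .appName("data-lake-query")
    .enableHiveSupport()
    .getOrCreate()
)

# Aggregate historical, denormalized listing data stored in the lake.
# (marketplace.listings is a hypothetical table.)
daily_listings = spark.sql("""
    SELECT category, dt, COUNT(*) AS new_listings
    FROM marketplace.listings
    WHERE dt >= '2016-05-01'
    GROUP BY category, dt
""")

daily_listings.show()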

The data streams are a key portion of the strategy; product teams and analysts want to see data as it comes in so they can pull out insights that much faster. eBay has built connectors into Hadoop, processes the streaming data with Storm and Spark clusters, and makes the streams available through Kafka.

“Today we have deployed 300-400 Kafka brokers,” he said. “LinkedIn probably has the biggest Kafka deployment, but we might get there soon. The amount of data that the product teams are requesting to be available in streams is high. We have a lot of Kafka topics with lots of data available; stream processing is happening on Storm, but Spark 2.0 looks very promising.”
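
As a rough illustration of that pipeline, the sketch below uses Spark 2.0's Structured Streaming to read events from Kafka; the broker addresses and topic name are made up, and the job assumes the spark-sql-kafka connector package is available to the cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to a Kafka topic (hypothetical brokers and topic name).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "listing-events")
    .load()
)

# Kafka delivers binary key/value pairs; cast them to strings and
# print each micro-batch to the console as it arrives.
query = (
    events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")
    .start()
)
query.awaitTermination()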

For data services, eBay has created its own distributed analytics engine with an SQL interface and multi-dimensional analysis on Hadoop, and has made it open source as the Apache Kylin project.

“The realization was: now we’ve got this commodity-scale computation platform, and I have MOLAP-style cubes that were never operational at scale before,” Adunuthula said. “You could never take a 100TB cube and keep scaling it at the rate at which the data is growing.

“But now, with all the components that are available to us (the raw data in Hive, the processing capabilities of MapReduce or Spark, and HBase to store the cubes), we were able to build out these MOLAP cubes with very limited effort. We have more than a dozen MOLAP cubes operational within eBay; the largest are around 100TB in size, with around 10 billion rows of data in them.”
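
Kylin exposes those cubes through an SQL REST API. Here is a minimal sketch of a query against it from Python; the host, project, and table names are hypothetical, and the credentials are the defaults from the Kylin documentation rather than anything a production deployment would use.

import requests

KYLIN_URL = "http://kylin-host:7070/kylin/api/query"

payload = {
    "sql": "SELECT category, SUM(price) FROM listings GROUP BY category",
    "project": "marketplace",  # hypothetical Kylin project
    "limit": 100,
}

# Kylin's REST API uses HTTP basic auth; ADMIN/KYLIN is the
# documented default, shown here for illustration only.
resp = requests.post(KYLIN_URL, json=payload, auth=("ADMIN", "KYLIN"))
resp.raise_for_status()

# The response contains column metadata plus the result rows.
for row in resp.json()["results"]:
    print(row)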

eBay’s latest work is making the Kylin cubes “completely streaming aware,” Adunuthula said.

“Instead of taking three hours to do daily refreshes on the data, these cubes refresh every few minutes, or even to the second,” he said. “So there is a lot of interesting work going into Kylin and we think this will be a valuable way of getting to the data.”
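
One way to picture that kind of refresh cycle is a periodic incremental build submitted over Kylin's REST API, sketched below; the cube name and host are hypothetical, and the rebuild endpoint is taken from the Kylin documentation.

import time
import requests

CUBE = "listings_cube"  # hypothetical cube name
URL = "http://kylin-host:7070/kylin/api/cubes/%s/rebuild" % CUBE

# Kylin segment boundaries are millisecond timestamps; build the
# last five minutes of data as a new incremental segment.
end = int(time.time() * 1000)
start = end - 5 * 60 * 1000

resp = requests.put(
    URL,
    json={"startTime": start, "endTime": end, "buildType": "BUILD"},
    auth=("ADMIN", "KYLIN"),  # documented defaults, for illustration only
)
resp.raise_for_status()
print("Build job submitted:", resp.json().get("uuid"))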

The final piece is creating “notebook” views of all that complex data with Apache Zeppelin, allowing analysts to collaborate quickly.

“The product teams love this a lot: there is always that one analyst among them who knows how to write the best query,” he said. “We can take that query and put it into this workspace so others can use it.”
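
In Zeppelin, that shared query becomes a notebook paragraph anyone on the team can re-run. A minimal sketch of such a paragraph follows, assuming a Spark 2.x %pyspark interpreter; the marketplace.orders table is hypothetical.

%pyspark
# A Zeppelin notebook paragraph: run the shared query and render the
# result. spark and z (the ZeppelinContext) are provided by Zeppelin.
top_sellers = spark.sql("""
    SELECT seller_id, COUNT(*) AS items_sold
    FROM marketplace.orders
    GROUP BY seller_id
    ORDER BY items_sold DESC
    LIMIT 10
""")
z.show(top_sellers)  # displays an interactive table/chart in the notebook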

Watch the complete presentation below:

https://www.youtube.com/watch?v=wKy9IRG4C2Q&list=PLGeM09tlguZQ3ouijqG4r1YIIZYxCKsLp
