Apache Foundation Crucial to Hadoop, Big Data’s Success


Looking back at 10 years of Hadoop, project co-founder and Cloudera Chief Architect Doug Cutting can see two primary factors in the success of open source big data technology: a heap of luck and the Apache Foundation’s unique support.

Cutting delivered a keynote at the Apache Big Data conference in Vancouver in May. In that talk, he said Hadoop was the right technology at the right time, but the reason it was able to capitalize on that position was the work from the Apache community.

“What really has made this happen is people using software, people contributing to software, people encouraging contributions from others; that core Apache capability is what drives things forward,” Cutting said.

Once Hadoop’s utility, flexibility and scalability became apparent to enterprise IT departments, the open source community expanded the ecosystem quickly. Cutting said that compared to the era of databases directly preceding it, when proprietary software and expensive hardware were controlled by a very small group of huge companies, the pace of innovation is rapidly accelerating.

“The hallmark of this ecosystem that’s emerged is the way that it’s evolving,” Cutting said. “We’re seeing not just new projects added, but some of the old projects being replaced over time by things that are better. In the end, nothing is sacred. Any component can be replaced by something that is better.

“This is really exciting,” Cutting continued. “The pace of change in the big data ecosystem is astronomically greater than we saw in the 20 preceding years. The way this change is happening is the key: It’s a decentralized change. There is no one organization, or handful of organizations, that are deciding what are the next components in the stack.

“Rather, we’ve got this process where there are random mutations sprouting up all over. Some make it into the incubator and become top level projects at Apache, but mostly what matters is that people start using them. They decide which ones work, and start to invest further in those, and there is this very organic process in selecting and improving the next thing.

“It’s leading not only to faster change, but change that is more directed towards the problems that people really care about, and where they need solutions.”

Cutting, who first worked with Apache as founder of the Lucene project, also acknowledged that Hadoop happened to be the beneficiary of being in the right place at the right time. Cutting was working on Hadoop’s predecessor, Nutch, when Google released papers about its filesystem, GFS, and MapReduce. This helped solve some of the Nutch project’s scalability issues, and soon Hadoop was born.

The world it was born into was begging for a way to better utilize data, Cutting said.

“Industry was ripe for harnessing the data it was generating,” said Cutting. “A lot of data was being just discarded. People saw the possibility of capturing it but they didn’t have the tools, so they were ready to jump on something which gave them the tools.”

Hadoop was a first mover, and once the Apache-backed project started to grow and prove itself, the old guard found they had lost their ability to lock clients in to their proprietary systems. The community gained momentum, and the rest is 10 years of history.

“It’s really hard to fight Apache open source with something that isn’t,” Cutting said. “It’s much easier to join than to fight.”
