Addressing the Complexity of Big Data with Open Source


Just like a zoo with hundreds of different species and exhibits, the big data stack is created from more than 20 different projects developed by committers and contributors of the Apache Software Foundation. Each project has its own complex dependencies structure, which, in turn, build on one another very much like the Russian stacking doll (matryoshka). Further, all of these projects have their own release trains where different forks might include different features or use different versions of the same library. When combined, there are a lot of incompatibilities, and many of the components rely on each other to work properly, such as the case of a software stack. For example, Apache HBase and Apache Hive depend on Apache Hadoop’s HDFS. In this environment, is it even possible to consistently produce software that would work when deployed to a hundred computers in a data center?…

All of these moving parts effectively serve one purpose: to create the packages from known building blocks and transfer them a different environment (dev, QA, staging, and production) so that no matter where they are deployed, they will work the same way. The deployment mechanism needs to control the state of the target system. Relying on a state machine like Puppet or Chef has many benefits. You can forget about messy shell or Python scripts to copyfiles, create symlinks, and set permissions. Instead, you define “the state” that you want the target system to be, and the state machine will execute the recipe and guarantee that the end state will be as you specified. The state machine controls the environment instead of assuming one. These properties are great for operations at scale, DevOps, developers, testers, and users, as they know what to expect.

Read more at DZone