October 22, 2008

Hands-on Hadoop for cluster computing

Author: Amit Kumar Saha

Hadoop is a distributed computing platform that provides a framework for storing and processing petabytes of data. Because it is Java-based, Hadoop runs on Linux, Windows, Solaris, BSD, and Mac OS X. Hadoop is widely used in organizations that demand a scalable, economical (read: commodity hardware), efficient, and reliable platform for processing vast amounts of data.

For storing such large amounts of data, Hadoop uses the Hadoop Distributed File System (HDFS). The master/slave architecture of HDFS is central to the way a Hadoop cluster functions. It supports one master server, called the NameNode, which manages the filesystem metadata, and many DataNodes that actually store the data.

The NameNode is a potential single point of failure, so Hadoop also provides a Secondary NameNode that performs periodic checkpoints of the namespace and helps keep the log of HDFS modifications on the NameNode within a certain size. The project documentation provides a diagram of the architecture, while the HDFS User Guide gives a comprehensive introduction to HDFS from a user's perspective.

Hadoop uses Google's MapReduce programming model for the distributed processing of data. In Hadoop's implementation, a JobTracker process on one node acts as the job scheduler and allocator for the cluster. It allocates jobs to the TaskTrackers on each cluster node. The JobTracker is the MapReduce master and each TaskTracker is a MapReduce slave.

The functional units of a Hadoop cluster -- NameNode, DataNodes, JobTracker, and TaskTracker -- are implemented as daemons. All the nodes have built-in Web servers that make it easy to check the current status of the cluster. The Web server on the NameNode gives access to the status of the DataNodes, reporting live nodes, dead nodes, capacity of the distributed filesystem, and miscellaneous other information. You can also track the status of the TaskTrackers and the tasks running on them via the Web server running on the JobTracker.

Hadoop was originally built as infrastructure for the Nutch project, which crawls the Web and builds a search engine index for the crawled pages. Tasks that require carrying out the same processing on huge amounts of data are typically well suited to Hadoop.

Hadoop supports three operating modes. By default, Hadoop is configured to run in non-distributed mode as a single Java process. While that's useless for any real work, it helps you learn how to run Hadoop without setting up multiple boxes. You can also run it in pseudo-distributed mode, in which all the Hadoop daemons are run on a single node but each daemon is implemented as a separate Java process. In fully distributed mode, Hadoop runs on multiple nodes with distributed NameNode, JobTracker, DataNodes, and TaskTrackers. A minimum of three nodes is required to set up Hadoop in this mode.

Setting up a fully distributed Hadoop cluster

To test Hadoop, I set up a three-node cluster on three commodity Linux boxes, each running Debian "Lenny" beta 2, Sun JDK 1.6, and Hadoop. Other required software includes an SSH client and server running on all your nodes.

Unpack the gzipped Hadoop tarball in any directory you have write permissions to. The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path. It is a good idea to export HADOOP_HOME in one of your login scripts, such as .bashrc.
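As a minimal sketch, the export could look like the line below in .bashrc. The path $HOME/hadoop is only an assumption for illustration; use whatever directory you unpacked the tarball into.

```shell
# Hypothetical install location: adjust to the directory where you
# unpacked the Hadoop tarball (ideally the same path on every node).
export HADOOP_HOME="$HOME/hadoop"
```

With this in place, the commands shown later as bin/... can be run after a cd "$HADOOP_HOME".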

Hadoop configuration is driven by two configuration files in the HADOOP_HOME/conf/ directory. The default configuration settings appear in the read-only hadoop-default.xml file, while node-specific configuration information is in hadoop-site.xml. The contents of the latter file depend on the role of the node in the cluster. Properties set in this file override the ones set in the default file. The original tarball contains an empty hadoop-site.xml that you must modify to suit your needs.

Another important file is conf/slaves. On the JobTracker this file lists all the hosts on which the TaskTracker daemon has to be started. On the NameNode it lists all the hosts on which the DataNode daemon has to be started. You must maintain this file manually, even if you are scaling up to a large number of nodes.
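The file is plain text with one host per line. As a rough sketch, if your worker nodes follow a naming pattern you could generate the list rather than type it. The hostnames node01 through node05 are hypothetical, and the file is written to the current directory so you can review it before copying it into conf/.

```shell
# Hypothetical worker hostnames node01 through node05; replace with your own.
# Writes one hostname per line, which is the format conf/slaves expects.
for i in 1 2 3 4 5; do
    printf 'node%02d\n' "$i"
done > slaves
```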

Finally, conf/hadoop-env.sh contains configuration options such as JAVA_HOME, the location of logs, and the directory where process IDs are stored.
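A sketch of the kind of settings that go in this file follows. The variable names are the ones hadoop-env.sh uses, but every value here is an assumption for illustration; in particular, the JDK path depends on how Java was installed on your system.

```shell
# Hypothetical values: adjust each path to match your own nodes.
export JAVA_HOME=/usr/lib/jvm/java-6-sun                     # where the JDK lives
export HADOOP_LOG_DIR="${HADOOP_HOME:-$HOME/hadoop}/logs"    # where daemon logs are written
export HADOOP_PID_DIR=/tmp                                   # where process ID files are stored
```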

In my test setup, I installed NameNode and JobTracker on two separate nodes, and DataNode and TaskTracker both on a third node. The conf/slaves files on the first two nodes contained the IP address of the third. All four daemons used the same conf/hadoop-site.xml file. For more information on what the various properties in conf/hadoop-site.xml mean, refer to the online documentation.
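For orientation, here is a sketch of what a minimal shared hadoop-site.xml for such a setup might contain. It is not the file I used: the hostnames and port numbers are hypothetical placeholders, and the replication factor of 1 simply reflects a cluster with a single DataNode. The snippet writes the file to the current directory so you can inspect it before copying it to conf/ on each node.

```shell
# Hypothetical hostnames; substitute your own NameNode and JobTracker hosts.
NAMENODE_HOST=namenode.example
JOBTRACKER_HOST=jobtracker.example

# Ports 9000 and 9001 are placeholder choices, not required values.
cat > hadoop-site.xml <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${NAMENODE_HOST}:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>${JOBTRACKER_HOST}:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```

Anything not listed here falls back to the value in hadoop-default.xml.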

You must also set up passphraseless SSH between all the nodes. If you want to use hostnames to refer to the nodes, you must also edit the /etc/hosts file on each node to reflect the proper mapping between the hostnames and IP addresses.
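One common way to get passphraseless logins is an SSH key pair with an empty passphrase whose public half is installed on every node. The sketch below wraps the two steps in helper functions; the function names are hypothetical, and it assumes OpenSSH's ssh-keygen and ssh-copy-id are available.

```shell
# Create an RSA key pair with an empty passphrase unless one already exists.
# $1 = private key path (defaults to ~/.ssh/id_rsa)
make_passphraseless_key() {
    key="${1:-$HOME/.ssh/id_rsa}"
    mkdir -p "$(dirname "$key")"
    [ -f "$key" ] || ssh-keygen -q -t rsa -N '' -f "$key"
}

# Install the public key on a remote node (asks for the password one last time).
# $1 = user@host, $2 = private key path (defaults to ~/.ssh/id_rsa)
authorize_on_node() {
    ssh-copy-id -i "${2:-$HOME/.ssh/id_rsa}.pub" "$1"
}
```

You would run make_passphraseless_key once on the node that launches the start scripts, then authorize_on_node for each host in the cluster, for example authorize_on_node amit@192.168.1.103 (a made-up address).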

Hadoop startup

To start a Hadoop cluster you need to start both the HDFS and MapReduce. First, on the NameNode, navigate to HADOOP_HOME and format a new distributed filesystem with a command like bin/hadoop namenode -format. You can then start the HDFS with the following command, run on the designated NameNode:

$ bin/start-dfs.sh

starting namenode, logging to /home/amit/hadoop/hadoop-
starting datanode, logging to /home/amit/hadoop/hadoop-
localhost: starting secondarynamenode, logging to /home/amit/hadoop/hadoop-

The bin/start-dfs.sh script also consults the conf/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.

Start MapReduce with the following command on the designated JobTracker:

$ bin/start-mapred.sh

starting jobtracker, logging to /home/amit/hadoop/hadoop-
starting tasktracker, logging to /home/amit/hadoop/hadoop-

The bin/start-mapred.sh script also consults the conf/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.

To cross-check whether the cluster is running properly, you can look at the processes running on each node, using jps. On the NameNode you should see the processes Jps, NameNode, and, if you have only a three-node cluster, SecondaryNameNode. On the JobTracker check for Jps and JobTracker. On the TaskTracker/DataNode node you should see Jps, DataNode, and TaskTracker.
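If you would rather not eyeball the jps listing on every node, a small helper can compare it against the daemons you expect. This is a sketch: missing_daemons is a hypothetical function, and it assumes jps prints one process per line as a process ID followed by the class name.

```shell
# Reads jps-style output ("<pid> <ClassName>") on standard input and prints
# every expected daemon (given as arguments) that is NOT in the listing.
missing_daemons() {
    running=$(awk '{print $2}')
    for daemon in "$@"; do
        printf '%s\n' "$running" | grep -Fqx "$daemon" || printf '%s\n' "$daemon"
    done
}
```

On the NameNode, for example, you could run jps | missing_daemons NameNode SecondaryNameNode; no output means both daemons are up.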

Running MapReduce jobs

Once you have a Hadoop cluster running, you can see it in action by executing one of the example MapReduce Java class files bundled in the hadoop- JAR file. As an example, we'll try Grep (yes, with an initial capital letter), which extracts matching strings from text files and counts how many times they occurred.

To begin, create an input set for Grep. In this case the input will be the set of files in the conf/ directory. Grep extracts the strings that match the regular expression you supply at execution time. Its parameters are the input directory, the output directory (on the DFS, where the results will be stored), and the regular expression that you want to match. Run the job and check that it completes properly.

$ bin/hadoop dfs -copyFromLocal conf input

$ bin/hadoop dfs -ls

Found 1 items
/user/amit/input <dir> 2008-10-12 00:38 rwxr-xr-x amit supergroup

$ bin/hadoop jar hadoop- grep input grep-output 'dfs[a-z.]+'

After the execution is over, copy the output into the local filesystem so you can examine it; data under Hadoop is stored as blocks in DFS and is not visible using normal Unix commands:

$ bin/hadoop dfs -get grep-output output
$ cd output
$ cat *
3 dfs.
3 dfs.class
3 dfs.replication
2 dfs.period
1 dfs.max.objects
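To get a feel for what the job computes, the same extract-and-count logic can be approximated on a single machine with standard Unix tools. This is only a sketch for comparison, not how Hadoop implements Grep: local_grep_count is a hypothetical helper, and it assumes a grep that supports the -o and -h options (GNU and BSD grep both do).

```shell
# Extract every match of an extended regex from the given files, then count
# identical matches and list them with the most frequent first.
# $1 = extended regular expression, remaining arguments = input files
local_grep_count() {
    pattern=$1
    shift
    grep -ohE -- "$pattern" "$@" | sort | uniq -c | sort -rn
}
```

Running local_grep_count 'dfs[a-z.]+' conf/* from HADOOP_HOME gives you something to compare against the job's output. Roughly speaking, Hadoop spreads the extraction over map tasks and the counting over reduce tasks, which is what lets the same job scale to inputs far larger than one machine can handle.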

You can monitor Hadoop's HDFS and JobTracker via bundled Web applications. You can view the HDFS administration panel at http://<NameNode IP address>:50070 (see screenshot) and view the currently running jobs and the history of completed and failed jobs at http://<JobTracker IP address>:50030 (see screenshot).

When you're done with the cluster, you can stop HDFS with the command bin/stop-dfs.sh run on the NameNode, and stop MapReduce with the command bin/stop-mapred.sh run on the JobTracker.

Hadoop is known to work with 4,000 nodes, which hints at the scalability of Hadoop. Start small and scale up as you go, and let us know how you make out!

