Thursday, March 31, 2016

Setup HDFS Cluster in CentOS VMs

This post summarizes my experience in setting up a test environment for HDFS cluster using CentOS VMs.

Before we start, we want to configure the VMs (refer to this post on how to set up and configure CentOS VMs in VirtualBox for HDFS: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html) as follows:

centos01/192.168.56.101: run namenode
centos02/192.168.56.102: run datanode
centos03/192.168.56.103: run datanode
centos04/192.168.56.104: run datanode
centos05/192.168.56.105: run datanode

where centos01 is the hostname and 192.168.56.101 is the IP address assigned to the host centos01.

1. Set up hostname for each computer

In this case, we assume that no DNS has been set up for the VMs to resolve one another, but we do not want to use raw IP addresses. Therefore, we need to configure the VMs to identify each other by the hostnames centos0x.

On each VM above, do the following (we use centos01/192.168.56.101 for illustration; DO replace them with each individual VM's hostname and IP address).

1.1. Modify /etc/hostname and /etc/sysconfig/network


Run the following command to edit /etc/hostname:

```bash
vi /etc/hostname
```

In /etc/hostname, put the following line (replace "centos01" accordingly):

centos01

Run the following command to edit /etc/sysconfig/network:

```bash
vi /etc/sysconfig/network
```

In /etc/sysconfig/network, put the following line (replace "centos01" accordingly):

HOSTNAME=centos01

Run the following command to restart the network service:

```bash
service network restart
```
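A quick check (my own addition, not in the original steps) is to confirm the new hostname took effect; on some CentOS releases you may need to log out and back in, or reboot, before it shows up:

```bash
# Should print the new hostname, e.g. centos01
hostname
```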

1.2. Modify /etc/hosts


Run the following command to edit /etc/hosts:

```bash
vi /etc/hosts
```

In /etc/hosts, add the following lines:

192.168.56.101 centos01
192.168.56.102 centos02
192.168.56.103 centos03
192.168.56.104 centos04
192.168.56.105 centos05
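As a sanity check (my own addition), ping each hostname once from every VM and make sure it resolves to the expected address:

```bash
# Each ping should report the IP listed in /etc/hosts for that host
for host in centos01 centos02 centos03 centos04 centos05; do
  ping -c 1 $host
done
```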

2. Set up passwordless ssh from the namenode to the datanodes


We need to set up passwordless SSH from the namenode (namely centos01) to itself and to the rest of the VMs, which serve as datanodes. To do this, run the following commands on centos01 to create the id_dsa.pub public key and copy it to each datanode:

```bash
mkdir ~/.ssh
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
scp ~/.ssh/id_dsa.pub root@centos02:/root
scp ~/.ssh/id_dsa.pub root@centos03:/root
scp ~/.ssh/id_dsa.pub root@centos04:/root
scp ~/.ssh/id_dsa.pub root@centos05:/root
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
```
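A side note not from the original post: newer OpenSSH releases (7.0 and later) reject DSA keys by default. If that applies to your VMs, an RSA key is a drop-in substitute (adjust the file name in the scp and cat commands accordingly):

```bash
# Only needed if the DSA key above is rejected by sshd
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
```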

Now that id_dsa.pub has been copied from centos01 to the root home directory of the other four VMs, we need to append it to /root/.ssh/authorized_keys. On each of centos02/3/4/5, run the following commands:

```bash
mkdir ~/.ssh
touch ~/.ssh/authorized_keys
cat ~/id_dsa.pub >> ~/.ssh/authorized_keys
```
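sshd is strict about the permissions on these files; if passwordless login still prompts for a password, tightening them on each datanode usually fixes it (this step is my own addition):

```bash
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```

Back on centos01, each of the following commands should print the remote hostname without asking for a password (you may still have to accept each host key fingerprint the first time):

```bash
for host in centos01 centos02 centos03 centos04 centos05; do
  ssh root@$host hostname
done
```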

3. Configure hadoop on each VM


Perform the following steps on each of the VMs centos01/2/3/4/5.

3.1 Configure $JAVA_HOME 

Before we start running hdfs, we need to set JAVA_HOME in the environment. Assuming that java-1.8.0-openjdk-devel is installed on each VM as the Java JDK, the installation is at /usr/lib/jvm/java-1.8.0-openjdk. To set JAVA_HOME, run the following command:

```bash
vi ~/.bashrc
```

In the .bashrc, add the following line:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
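Reload the file so the variable is visible in the current shell, and confirm it points at a real JDK (my own sanity check; the exact path may differ slightly depending on the openjdk package version):

```bash
source ~/.bashrc
echo $JAVA_HOME
$JAVA_HOME/bin/java -version
```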

3.2. Download hadoop 


Download the hadoop binary distribution and unpack it to ~/hadoop in the root user's home directory.
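For example, assuming Hadoop 2.7.2 from the Apache archive (substitute whatever release and mirror you are actually using), the download and unpacking look roughly like this:

```bash
cd ~
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
tar -xzf hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 ~/hadoop
```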

3.3 Configure ~/hadoop/etc/hadoop/slaves 


Run the following command to edit the slaves file:

```bash
vi ~/hadoop/etc/hadoop/slaves
```

In ~/hadoop/etc/hadoop/slaves, remove the "localhost" entry and add the following lines:

centos02
centos03
centos04
centos05

This slaves file specifies the datanodes.

3.4. Configure ~/hadoop/etc/hadoop/core-site.xml


Run the following command to edit core-site.xml:

```bash
vi ~/hadoop/etc/hadoop/core-site.xml
```

In ~/hadoop/etc/hadoop/core-site.xml, write the following (centos01 refers to the namenode):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://centos01:9000/</value>
  </property>
</configuration>

3.5. Configure ~/hadoop/etc/hadoop/hdfs-site.xml


Run the following commands to create the directories for hdfs data:

```bash
mkdir ~/hadoop_data
mkdir ~/hadoop_data/data
mkdir ~/hadoop_data/name
mkdir ~/hadoop_data/local
chmod -R 755 ~/hadoop_data
```

In ~/hadoop/etc/hadoop/hdfs-site.xml, write the following:

<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/root/hadoop_data/data</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/root/hadoop_data/name</value>
  </property>
</configuration>

The above specifies where the namenode and datanode store their data.
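Since the configuration must be consistent on every node, a shortcut (my own addition, relying on the passwordless ssh from step 2) is to edit the files once on centos01 and push them to the datanodes:

```bash
# Copy the finished config from centos01 to every datanode
for host in centos02 centos03 centos04 centos05; do
  scp ~/hadoop/etc/hadoop/slaves \
      ~/hadoop/etc/hadoop/core-site.xml \
      ~/hadoop/etc/hadoop/hdfs-site.xml \
      root@$host:~/hadoop/etc/hadoop/
done
```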

4. Start the hdfs cluster


On the namenode centos01, run the following command to format the namenode:

```bash
~/hadoop/bin/hdfs namenode -format
```

On the namenode centos01, start the hdfs cluster:

```bash
~/hadoop/sbin/start-dfs.sh
```

To check which Hadoop processes are running on each VM, run the following command on it:

```bash
jps
```
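If everything started correctly, you would typically see NameNode (and usually SecondaryNameNode) listed on centos01, and DataNode on each of centos02 to centos05, in addition to the Jps process itself.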

To check the reporting of the hdfs cluster, run the following command on the namenode centos01:

```bash
~/hadoop/bin/hdfs dfsadmin -report
```

Another way to check is to visit the web UI hosted by the namenode:

```bash
curl http://centos01:50070/
```
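As a final end-to-end check (my own addition; the paths used here are just examples), upload a small file to HDFS and list it back:

```bash
# Create a directory in HDFS, upload a file, and verify it is there
~/hadoop/bin/hdfs dfs -mkdir -p /test
~/hadoop/bin/hdfs dfs -put ~/hadoop/etc/hadoop/core-site.xml /test/
~/hadoop/bin/hdfs dfs -ls /test
```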

5. Stop the hdfs cluster


To stop the hdfs cluster, run the following command on the namenode centos01:

```bash
~/hadoop/sbin/stop-dfs.sh
```

