This post summarizes my experience setting up a test environment for an HDFS cluster using CentOS VMs.
Before we start, we want to configure the VMs (refer to this post on how to set up and configure CentOS VMs in VirtualBox for HDFS:
http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html) to be the following:
centos01/192.168.56.101: run namenode
centos02/192.168.56.102: run datanode
centos03/192.168.56.103: run datanode
centos04/192.168.56.104: run datanode
centos05/192.168.56.105: run datanode
where centos01 is the hostname and 192.168.56.101 is the IP address assigned to the host centos01.
1. Set up hostname for each computer
In this case, we assume that we have not set up a DNS server for the VMs to resolve each other, but we do not want to use raw IP addresses either. Therefore we need to configure the VMs to identify each other by the hostnames centos0x.
On each VM, do the following (we use centos01/192.168.56.101 for illustration below; DO replace them with the individual VM's hostname and IP address).
1.1. Modify /etc/hostname and /etc/sysconfig/network
Run the following command to edit /etc/hostname:
```bash
vi /etc/hostname
```
In /etc/hostname, put the following line (replace "centos01" accordingly):
centos01
Run the following command to edit /etc/sysconfig/network:
```bash
vi /etc/sysconfig/network
```
In /etc/sysconfig/network, put the following line (replace "centos01" accordingly):
HOSTNAME=centos01
Run the following command to restart the network service:
```bash
service network restart
```
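Note: on CentOS 7, the same result can also be achieved in one step with hostnamectl (a systemd tool); this is just an alternative to editing the files above:
```bash
# CentOS 7 alternative: sets /etc/hostname and the running hostname in one go
hostnamectl set-hostname centos01

# verify the change
hostname
```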
1.2. Modify /etc/hosts
Run the following command to edit /etc/hosts:
```bash
vi /etc/hosts
```
In /etc/hosts, add the following lines:
192.168.56.101 centos01
192.168.56.102 centos02
192.168.56.103 centos03
192.168.56.104 centos04
192.168.56.105 centos05
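After saving /etc/hosts, a quick way to confirm that the names resolve is to ping each host by name (run from any of the VMs):
```bash
ping -c 1 centos01
ping -c 1 centos02
ping -c 1 centos03
ping -c 1 centos04
ping -c 1 centos05
```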
2. Set up passwordless ssh from the namenode to the datanodes
We need to set up passwordless ssh from the namenode (namely centos01) to itself and to the rest of the VMs, which serve as datanodes. To do this, on centos01, run the following commands to create the id_dsa.pub public key and copy it to the other VMs:
```bash
mkdir ~/.ssh
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
scp ~/.ssh/id_dsa.pub root@centos02:/root
scp ~/.ssh/id_dsa.pub root@centos03:/root
scp ~/.ssh/id_dsa.pub root@centos04:/root
scp ~/.ssh/id_dsa.pub root@centos05:/root
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
```
Now that id_dsa.pub has been copied from centos01 to the root directory of the other 4 VMs, we need to append it to /root/.ssh/authorized_keys. On each of centos02/3/4/5, run the following commands:
```bash
mkdir ~/.ssh
touch ~/.ssh/authorized_keys
cat ~/id_dsa.pub >> ~/.ssh/authorized_keys
```
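If ssh still prompts for a password after this, the permissions on ~/.ssh are a common culprit (sshd rejects keys kept in group- or world-writable files; this is standard sshd behaviour, not hadoop-specific). A quick fix:
```bash
# on every VM: sshd refuses keys in group/world-writable files
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```
Then, from centos01, each of the following should print the remote hostname without asking for a password:
```bash
ssh centos01 hostname
ssh centos02 hostname
ssh centos03 hostname
ssh centos04 hostname
ssh centos05 hostname
```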
3. Configure hadoop on each VM
Perform the following steps on each of the VMs centos01/2/3/4/5.
3.1 Configure $JAVA_HOME
Before we start running hdfs, we need to specify JAVA_HOME in the environment. Assuming we installed java-1.8.0-openjdk-devel on each VM as the java jdk, the installation is at /usr/lib/jvm/java-1.8.0-openjdk. To specify JAVA_HOME in the environment, run the following command:
```bash
vi ~/.bashrc
```
In the .bashrc, add the following line:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
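To pick up the change in the current shell and sanity-check the setting (the path above assumes the openjdk package mentioned earlier):
```bash
source ~/.bashrc
echo $JAVA_HOME
$JAVA_HOME/bin/java -version
```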
3.2. Download hadoop
Download the hadoop binary distribution and unpack it to ~/hadoop in the root user's home directory, for example as sketched below.
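A minimal sketch, assuming the hadoop 2.7.2 release (the version and mirror here are assumptions; swap in whatever release you actually use):
```bash
cd ~
# version and mirror are assumptions; adjust to the release you want
curl -O http://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
tar -xzf hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 ~/hadoop
```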
3.3 Configure ~/hadoop/etc/hadoop/slaves
Run the following command to edit slaves:
```bash
vi ~/hadoop/etc/hadoop/slaves
```
In ~/hadoop/etc/hadoop/slaves, remove "localhost" and add the following lines:
centos02
centos03
centos04
centos05
The slaves file specifies the datanodes.
3.4. Configure ~/hadoop/etc/hadoop/core-site.xml
Run the following command to edit core-site.xml:
```bash
vi ~/hadoop/etc/hadoop/core-site.xml
```
In ~/hadoop/etc/hadoop/core-site.xml, write the following (centos01 refers to the namenode):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://centos01:9000/</value>
</property>
</configuration>
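As an optional sanity check, hdfs can echo back the value it actually picked up from the configuration:
```bash
~/hadoop/bin/hdfs getconf -confKey fs.defaultFS
# expected output: hdfs://centos01:9000/
```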
3.5. Configure ~/hadoop/etc/hadoop/hdfs-site.xml
Run the following commands to create the directories for hdfs data:
```bash
mkdir ~/hadoop_data
mkdir ~/hadoop_data/data
mkdir ~/hadoop_data/name
mkdir ~/hadoop_data/local
chmod -R 755 ~/hadoop_data
```
In ~/hadoop/etc/hadoop/hdfs-site.xml, write the following:
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/root/hadoop_data/data</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/root/hadoop_data/name</value>
</property>
</configuration>
The above specifies where the namenode and the datanodes store their data.
4. Start the hdfs cluster
On the namenode centos01, run the following command to format the namenode:
```bash
~/hadoop/bin/hdfs namenode -format
```
On the namenode centos01, start the hdfs cluster:
```bash
~/hadoop/sbin/start-dfs.sh
```
To check what is running on each VM, run the following command on each VM:
```bash
jps
```
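With the setup above, one would typically expect output like the following (process ids will differ; the SecondaryNameNode placement is the start-dfs.sh default):
```bash
# on centos01 (namenode)
#   1234 NameNode
#   1456 SecondaryNameNode
#   1678 Jps

# on centos02/3/4/5 (datanodes)
#   1234 DataNode
#   1456 Jps
```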
To check the status report of the hdfs cluster, run the following command on the namenode centos01:
```bash
~/hadoop/bin/hdfs dfsadmin -report
```
Another way to check is to visit the web UI hosted by the namenode:
```bash
curl http://centos01:50070/
```
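As an optional end-to-end smoke test (the directory name here is just an example), copy a file into hdfs and list it back:
```bash
~/hadoop/bin/hdfs dfs -mkdir /test
~/hadoop/bin/hdfs dfs -put ~/hadoop/etc/hadoop/core-site.xml /test/
~/hadoop/bin/hdfs dfs -ls /test
```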
5. Stop the hdfs cluster
To stop the hdfs cluster, run the following command on the namenode centos01:
```bash
~/hadoop/sbin/stop-dfs.sh
```