On a particular day when I did not sleep much and was running low on coffee, I made the following stupid error in my code:
if(model != null & model.getVocabulary().countWords() > 0) return;
What I meant to write was:
if(model != null && model.getVocabulary().countWords() > 0) return;
When I put this code in an Apache Spark job and ran it in the cluster, it threw a NullPointerException. It turns out the missing "&" is a serious mistake: "&&" short-circuits, skipping the right-hand side when model evaluates to null, whereas "&" always evaluates both sides. As a result, "model.getVocabulary()" was evaluated with model being null, causing the NullPointerException.
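To make the difference concrete, below is a minimal, self-contained Java sketch of the two operators (the Model and Vocabulary classes here are hypothetical stand-ins, just to illustrate the operator behaviour):
```java
public class ShortCircuitDemo {
    // Hypothetical stand-ins for the real model classes.
    static class Vocabulary { int countWords() { return 0; } }
    static class Model { Vocabulary getVocabulary() { return new Vocabulary(); } }

    public static void main(String[] args) {
        Model model = null;

        // "&&" short-circuits: the right-hand side is never evaluated
        // when model == null, so this is safe.
        if (model != null && model.getVocabulary().countWords() > 0) {
            System.out.println("never reached");
        }

        // "&" evaluates BOTH sides: model.getVocabulary() is called
        // even though model is null, throwing NullPointerException.
        try {
            if (model != null & model.getVocabulary().countWords() > 0) {
                System.out.println("never reached");
            }
        } catch (NullPointerException e) {
            System.out.println("\"&\" threw: " + e);
        }
    }
}
```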
Lesson learnt: I should get more sleeping hours instead of wasting time debugging funny errors like the one above :p
Wednesday, April 20, 2016
Thursday, March 31, 2016
Setup Hadoop YARN on CentOS VMs
This post summarizes my experience in setting up a test environment for YARN using CentOS VMs.
After setting up HDFS following this link (http://czcodezone.blogspot.sg/2016/03/setup-hdfs-cluster-in-centos-vms.html), we can go ahead and set up YARN to manage jobs in Hadoop.
Hadoop v2 uses an application master on the datanode, which works together with the node manager to manage a job, and uses the resource manager on the namenode to schedule resources for a job (a job refers to a particular application or driver from a distributed computation framework such as MapReduce or Spark).
1. Setup yarn configuration in hadoop
To set up YARN, perform the following steps on each VM (both namenode and datanodes):
1.1. Edit hadoop/etc/hadoop/mapred-site.xml
Run the following commands to edit mapred-site.xml:
```bash
cd hadoop/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
```
In mapred-site.xml, modify as follows:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://centos01:9000</value>
</property>
</configuration>
The "hdfs://centos01:9000" specify the master as the "fs.default.name". (It is important that this is not specified as "hdfs://localhost:9000", otherwise the "hadoop/bin/hdfs dfsadmin -report" will have "Connection refused" exception)
1.2 Edit hadoop/etc/hadoop/yarn-site.xml
Run the following command to edit the yarn-site.xml:
```bash
vi hadoop/etc/hadoop/yarn-site.xml
```
In the yarn-site.xml, modify as follows:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
2. Start YARN on the namenode
On the namenode centos01 (please refer to http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html), run the following commands to start HDFS and then YARN:
```bash
hadoop/sbin/start-dfs.sh
hadoop/sbin/start-yarn.sh
hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
jps
```
3. Stop hdfs and yarn
To stop HDFS and YARN, run the following commands:
```bash
hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver
hadoop/sbin/stop-yarn.sh
hadoop/sbin/stop-dfs.sh
```
Setup Spark Cluster in CentOS VMs
This post summarizes my experience in setting up a test environment for a Spark cluster using CentOS VMs.
After setting up the VMs, we designate the following topology (refer to this post for setting up and configuring CentOS VMs using VirtualBox: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html):
centos01: master
centos02: slave
centos03: slave
centos04: slave
centos05: slave
where centos01 refers to the hostname of the VM.
1. Configure the spark cluster
Download and unzip spark-1.6.0-bin-hadoop2.6.tgz to "/root/spark", then run the following commands on centos01 to specify the list of slaves:
```bash
cp spark/conf/slaves.template spark/conf/slaves
vi spark/conf/slaves
```
In spark/conf/slaves, add the following lines:
centos02
centos03
centos04
centos05
Make sure that the firewalls are turned off on centos01/2/3/4/5 and that passwordless ssh is set up from centos01 to centos02/3/4/5.
Run the following commands to copy "/root/spark" from centos01 to centos02/3/4/5:
```bash
rsync -a /root/spark/ root@centos02:/root/spark
rsync -a /root/spark/ root@centos03:/root/spark
rsync -a /root/spark/ root@centos04:/root/spark
rsync -a /root/spark/ root@centos05:/root/spark
```
2. Start and stop the spark cluster
Run the following command on centos01 to start the spark cluster:
```bash
spark/sbin/start-all.sh
```
To stop the spark cluster, run the following command on centos01:
```bash
spark/sbin/stop-all.sh
```
3. Run the spark shell
After the spark cluster is started, we can start the spark shell by running the following command:
```bash
spark/bin/spark-shell --master spark://centos01:7077
```
Port 7077 is the default port for the Spark master on centos01.
4. Submit a spark job to spark cluster
After the spark cluster has been setup, assuming the master is centos01, run the following command to submit a spark job:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master spark://centos01:7077 word-count.jar
```
Or, with more refined resource settings:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master spark://centos01:7077 --executor-memory 2G --total-executor-cores 8 word-count.jar 1000
```
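For reference, here is a rough sketch of what a word-count driver like the com.tutorials.spark.WordCountDriver used above could look like; the actual class is not shown in this post, so the body below (including the input/output paths) is an assumption for illustration, written against the Spark 1.6 Java API:
```java
package com.tutorials.spark;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCountDriver {
    public static void main(String[] args) {
        // The master is supplied by spark-submit via --master.
        SparkConf conf = new SparkConf().setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path on the HDFS cluster set up earlier.
        JavaRDD<String> lines = sc.textFile("hdfs://centos01:9000/input/words.txt");

        // Classic word count: split into words, pair with 1, sum by key.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs://centos01:9000/output/word-count");
        sc.stop();
    }
}
```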
Below are two other alternatives I tested, using a YARN cluster and a Mesos cluster.
4.1. Submit a spark job via YARN cluster
Suppose we have a resource management cluster such as Hadoop YARN set up; we can submit the Spark job to YARN for processing as well (YARN will act as the Spark master). To run an application in the YARN cluster, set up and configure HDFS and YARN following these links (http://czcodezone.blogspot.sg/2016/03/setup-hdfs-cluster-in-centos-vms.html and http://czcodezone.blogspot.sg/2016/03/setup-hadoop-yarn-on-centos-vms.html), then start HDFS and YARN.
Run the following command to edit .bashrc:
```bash
vi .bashrc
```
In the .bashrc of each VM centos01/2/3/4/5, add the following lines:
export HADOOP_HOME_DIR=/root/hadoop
export HADOOP_CONF_DIR=/root/hadoop/etc/hadoop
export HADOOP_YARN_DIR=/root/hadoop/etc/hadoop
Run the following command on each VM to update .bashrc:
```bash
source .bashrc
```
To submit a spark job, run the following command:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn --deploy-mode cluster word-count.jar
```
Or
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn-cluster word-count.jar
```
Or, with more refined resource settings:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn --deploy-mode cluster --executor-memory 2G --num-executors 20 word-count.jar 1000
```
4.2. Submit a spark job via mesos cluster
To run an application in a Mesos cluster instead, set up and configure HDFS and Mesos (link: http://czcodezone.blogspot.sg/2016/03/setup-mesos-cluster-in-centos-vms.html), then start HDFS and Mesos.
Put the spark bin package in the hdfs (run the following command on the hadoop namenode centos01):
```bash
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
hadoop/bin/hdfs dfs -mkdir /pkg
hadoop/bin/hdfs dfs -put spark-1.6.0-bin-hadoop2.6.tgz /pkg/spark-1.6.0-bin-hadoop2.6.tgz
```
Run the following command to modify spark-env.sh in spark/conf:
```bash
vi spark/conf/spark-env.sh
```
In the spark-env.sh, add the following lines:
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://centos01:9000/pkg/spark-1.6.0-bin-hadoop2.6.tgz
Here, centos01 is the hadoop namenode.
To submit a spark job, run the following command:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master mesos://mesos01:5050 word-count.jar
```
Important: mesos01 must be the current leader master node; otherwise, a command such as "spark-shell --master mesos://mesos01:5050" will cause the spark-shell to hang on the line "No credential provided. attempting to register without authentication". The solution is to find out which node is the active leader master by running the command "mesos-resolve `cat /etc/mesos/zk`" and then launch the spark shell with the active leader master in the --master option instead.
Setup HDFS Cluster in CentOS VMs
This post summarizes my experience in setting up a test environment for an HDFS cluster using CentOS VMs.
Before we start, we configure the VMs (refer to this post on how to set up and configure CentOS VMs using VirtualBox for HDFS: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html) as follows:
centos01/192.168.56.101: run namenode
centos02/192.168.56.102: run datanode
centos03/192.168.56.103: run datanode
centos04/192.168.56.104: run datanode
centos05/192.168.56.105: run datanode
where centos01 is the hostname and 192.168.56.101 is the IP address assigned to the host centos01.
1. Set up hostname for each computer
In this case, we assume that we have not set up a DNS for the VMs to know each other, but we do not want to use raw IP addresses. Therefore we need to configure the VMs to identify each other by hostname centos0x. On each VM above, do the following (we use centos01/192.168.56.101 for illustration; DO replace them with each individual VM's hostname and IP address).
1.1. Modify /etc/hostname and /etc/sysconfig/network
Run the following command to edit /etc/hostname:
```bash
vi /etc/hostname
```
In the /etc/hostname, put the following line (replace "centos01" accordingly)
centos01
Run the following command to edit /etc/sysconfig/network:
```bash
vi /etc/sysconfig/network
```
In the /etc/sysconfig/network, put the following line (replace "centos01" accordingly)
HOSTNAME=centos01
Run the following command to restart network service:
```bash
service network restart
```
1.2. Modify /etc/hosts
Run the following command to edit /etc/hosts:
```bash
vi /etc/hosts
```
In the /etc/hosts, add the following lines
192.168.56.101 centos01
192.168.56.102 centos02
192.168.56.103 centos03
192.168.56.104 centos04
192.168.56.105 centos05
2. Set up passwordless ssh from the namenode to the datanodes
We need to set up passwordless ssh from the namenode (namely centos01) to itself and to the rest of the VMs, which serve as datanodes. To do this, on centos01, run the following commands to create and distribute the id_dsa.pub public key:
```bash
mkdir ~/.ssh
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
scp ~/.ssh/id_dsa.pub root@centos02:/root
scp ~/.ssh/id_dsa.pub root@centos03:/root
scp ~/.ssh/id_dsa.pub root@centos04:/root
scp ~/.ssh/id_dsa.pub root@centos05:/root
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
```
Now that id_dsa.pub has been copied from centos01 to the other four VMs' root directories, we need to append it to /root/.ssh/authorized_keys. On each VM in centos02/3/4/5, run the following commands:
```bash
mkdir ~/.ssh
touch ~/.ssh/authorized_keys
cat ~/id_dsa.pub >> ~/.ssh/authorized_keys
```
3. Configure hadoop on each VM
Perform the following steps on each VM of centos01/2/3/4/5.
3.1 Configure $JAVA_HOME
Before we start running HDFS, we need to specify JAVA_HOME in the environment. Assuming we installed java-1.8.0-openjdk-devel on each VM for the Java JDK, the installation is at /usr/lib/jvm/java-1.8.0-openjdk. To specify JAVA_HOME in the environment, run the following command:
```bash
vi ~/.bashrc
```
In the .bashrc, add the following line:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
3.2. Download hadoop
Download the hadoop binary distribution, unzip it to the root directory ~/hadoop.
3.3 Configure ~/hadoop/etc/hadoop/slaves
Run the following command to edit slaves:
```bash
vi ~/hadoop/etc/hadoop/slaves
```
In ~/hadoop/etc/hadoop/slaves, remove the "localhost" entry and add the following lines:
centos02
centos03
centos04
centos05
This slaves file specifies the datanodes.
3.4. Configure ~/hadoop/etc/hadoop/core-site.xml
Run the following command to edit core-site.xml:
```bash
vi ~/hadoop/etc/hadoop/core-site.xml
```
In ~/hadoop/etc/hadoop/core-site.xml, write the following (centos01 refers to the namenode):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://centos01:9000/</value>
</property>
</configuration>
3.5. Configure ~/hadoop/etc/hadoop/hdfs-site.xml
Run the following commands to create the directories for HDFS data:
```bash
mkdir ~/hadoop_data
mkdir ~/hadoop_data/data
mkdir ~/hadoop_data/name
mkdir ~/hadoop_data/local
chmod -R 755 ~/hadoop_data
```
In ~/hadoop/etc/hadoop/hdfs-site.xml, write the following:
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/root/hadoop_data/data</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/root/hadoop_data/name</value>
</property>
</configuration>
The above specifies where the data for the namenode and datanodes are stored.
4. Start the hdfs cluster
On the namenode centos01, run the following command to format namenode:
```bash
~/hadoop/bin/hdfs namenode -format
```
On the namenode centos01, start the hdfs cluster:
```bash
~/hadoop/sbin/start-dfs.sh
```
To check what is running on each VM, run the following command on each VM:
```bash
jps
```
To check the reporting of the hdfs cluster, run the following command on the namenode centos01:
```bash
~/hadoop/bin/hdfs dfsadmin -report
```
Another way to check is to visit the web interface hosted by the namenode:
```bash
curl http://centos01:50070/
```
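As an optional extra check from application code, here is a minimal Java sketch (my own addition, not part of the original setup) that connects to the fs.defaultFS configured above and lists the root directory, assuming the hadoop-client libraries are on the classpath:
```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        // Same URI as fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(new URI("hdfs://centos01:9000/"), new Configuration());

        // Listing the root directory verifies connectivity to the namenode.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```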
5. Stop the hdfs cluster
To stop the hdfs cluster, run the following command on the namenode centos01:
```bash
~/hadoop/sbin/stop-dfs.sh
```
Wednesday, March 30, 2016
Setup Mesos Cluster in CentOS VMs
This post summarizes my experience in setting up a Mesos cluster for Spark and Hadoop cluster deployment.
First, before starting, set up a zookeeper cluster (link: http://czcodezone.blogspot.sg/2016/03/setup-zookeeper-cluster-in-centos-vms.html) to run on 3 nodes with hostnames zoo01, zoo02, and zoo03, as well as 3 mesos master-eligible nodes and 4 mesos slave nodes (refer to this post on how to set up CentOS VMs for a cluster: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html).
1. Disable Firewall and SELinux
1.1. Disable firewalld
Run the following commands to disable firewalld:
```bash
systemctl stop firewalld.service
systemctl disable firewalld.service
iptables -F
```
1.2. Disable SELinux
Run the following command to open /etc/selinux/config:
```bash
vi /etc/selinux/config
```
Edit the file to include the following lines:
SELINUX=disabled
#SELINUXTYPE=targeted
Run the following command to shut down SELinux immediately:
```bash
setenforce 0
```
2. Start the zookeepers
Start the zookeepers on zoo01/2/3.
2.1. Install Mesos
Run the following commands to install the development tools:
```bash
yum install -y java-1.8.0-openjdk-devel
yum install -y maven
```
Add the following line to the /root/.bashrc file:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
Run the following commands to add the mesosphere repository and install the mesos packages:
```bash
rpm -Uvh http://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-2.noarch.rpm
# to install master and/or slave
yum -y install mesos
# to install marathon
yum -y install marathon
# to install chronos
yum -y install chronos.x86_64
```
2.2. Configure Master with Zookeeper
Edit /etc/mesos/zk to contain:
zk://zoo01:2181,zoo02:2181,zoo03:2181/mesos
3. Start Mesos
```bash
# Mesos Master
systemctl enable mesos-master.service
systemctl start mesos-master.service
systemctl mask mesos-slave.service
systemctl stop mesos-slave.service
```
```bash
# Marathon
systemctl enable marathon.service
systemctl start marathon.service
```
```bash
# Mesos Slave: mask the master service and run the slave service
systemctl mask mesos-master.service
systemctl stop mesos-master.service
systemctl enable mesos-slave.service
systemctl start mesos-slave.service
```
```bash
# Chronos
systemctl enable chronos.service
systemctl start chronos.service
```
Personal note: I noticed that if the mesos services are started immediately after the zookeepers are started, it is possible that mesos won't be able to detect the master from zookeeper. The way to solve this is to restart the mesos services on each node again.
4. Web Interface
Mesos-Master:
```bash
curl http://mesos01:5050
```
Marathon:
```bash
curl http://mesos01:8080
```
5. Command line Interface
Getting the master from the CLI:
```bash
mesos-resolve `cat /etc/mesos/zk`
```
Killing a framework:
```bash
curl -XPOST http://mesos02:5050/api/v1/scheduler -d '{ "framework_id": { "value": "[task_id]" }, "type": "TEARDOWN"}' -H Content-Type:application/json
```
6. Submit a spark job via MESOS cluster
To run a Spark application in the Mesos cluster, set up and configure HDFS (link: http://czcodezone.blogspot.sg/2016/03/setup-hdfs-cluster-in-centos-vms.html) and Spark (link: http://czcodezone.blogspot.sg/2016/03/setup-spark-cluster-in-centos-vms.html), then start HDFS and Mesos.
Put the spark bin package in the hdfs (run the following command on the hadoop namenode centos01):
```bash
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
hadoop/bin/hdfs dfs -mkdir /pkg
hadoop/bin/hdfs dfs -put spark-1.6.0-bin-hadoop2.6.tgz /pkg/spark-1.6.0-bin-hadoop2.6.tgz
```
Run the following command to modify spark-env.sh in spark/conf:
```bash
vi spark/conf/spark-env.sh
```
In the spark-env.sh, add the following lines:
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://centos01:9000/pkg/spark-1.6.0-bin-hadoop2.6.tgz
Here, centos01 is the hadoop namenode.
To submit a spark job, run the following command:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master mesos://mesos01:5050 word-count.jar
```
Important: mesos01 must be the current leader master node; otherwise, a command such as "spark-shell --master mesos://mesos01:5050" will cause the spark-shell to hang on the line "No credential provided. attempting to register without authentication". The solution is to find out which node is the active leader master by running the command "mesos-resolve `cat /etc/mesos/zk`" and then launch the spark shell with the active leader master in the --master option instead.
Setup ZooKeeper cluster in CentOS VMs.
This post summarizes my experience in setting up a zookeeper cluster in CentOS VMs.
Before starting, set up three VMs with the following hostnames/IP addresses (refer to this post on how to set up and configure CentOS VMs using VirtualBox: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html):
zoo01/192.168.56.91
zoo02/192.168.56.92
zoo03/192.168.56.93
1. Edit /etc/hosts
On each VM, edit /etc/hosts to add the following lines:
192.168.56.91 zoo01
192.168.56.92 zoo02
192.168.56.93 zoo03
2. Download and configure zookeeper
2.1 Download zookeeper
On each VM, download and unzip the zookeeper to the folder "/root/zookeeper".
2.2. Configure zookeeper
On each VM, run the following commands to edit /root/zookeeper/conf/zoo.cfg:
```bash
cp zookeeper/conf/zoo_sample.cfg zookeeper/conf/zoo.cfg
vi zookeeper/conf/zoo.cfg
```
In the zookeeper/conf/zoo.cfg, add the following configuration:
dataDir=/root/zookeeper_data
server.1=zoo01:2222:2223
server.2=zoo02:2222:2223
server.3=zoo03:2222:2223
Run the following commands to create the directory holding the zookeeper data storage on each VM (the /root/zookeeper_data/myid file must contain a unique id on each VM):
```bash
ssh root@zoo01
mkdir /root/zookeeper_data
touch /root/zookeeper_data/myid
echo 1 >> /root/zookeeper_data/myid
ssh root@zoo02
mkdir /root/zookeeper_data
touch /root/zookeeper_data/myid
echo 2 >> /root/zookeeper_data/myid
ssh root@zoo03
mkdir /root/zookeeper_data
touch /root/zookeeper_data/myid
echo 3 >> /root/zookeeper_data/myid
```
3. Start zookeeper cluster
To start the zookeeper cluster, on each VM zoo01/2/3, run the following command:
```bash
zookeeper/bin/zkServer.sh start zookeeper/conf/zoo.cfg
```
(Note that for testing, you may want to use the command "start-foreground" instead of "start")
4. Test that zookeeper is running
To check whether the zookeeper cluster is running, run the following command:
```bash
zookeeper/bin/zkServer.sh status
```
To test the zookeeper cluster, run the following command:
```bash
zookeeper/bin/zkCli.sh -server 192.168.56.91:2181,192.168.56.92:2181,192.168.56.93:2181
```
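For a quick programmatic check, here is a small Java sketch against the ensemble configured above, using the zookeeper client library that ships with the distribution (the znode path and data are assumptions for illustration):
```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSmokeTest {
    public static void main(String[] args) throws Exception {
        // Connect to the three-node ensemble; 3000 ms session timeout, no watcher.
        ZooKeeper zk = new ZooKeeper(
                "192.168.56.91:2181,192.168.56.92:2181,192.168.56.93:2181", 3000, null);

        // Create an ephemeral test znode and read it back; closing the
        // session removes the ephemeral node automatically.
        String path = zk.create("/zk-smoke-test", "hello".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " => " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```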
Setup Cassandra Cluster on CentOS VMs
This post summarizes my experience in setting up a test environment for a cassandra cluster using CentOS VMs.
Suppose we are to set up the cassandra cluster on two VMs, cassandra01/192.168.56.131 and cassandra02/192.168.56.132 (link: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html).
Perform all the steps below on each VM:
Specify the /etc/hosts as follows:
192.168.56.131 cassandra01
192.168.56.132 cassandra02
Run the following command to install the required software:
```bash
yum install -y java-1.8.0-openjdk-devel
```
Add the following line in .bashrc:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
Download and unzip the cassandra package to "/root/cassandra", create a folder "/root/cassandra_data", and create the following subfolders:
/root/cassandra_data/data
/root/cassandra_data/log
/root/cassandra_data/cache
Run the following command to set the write permission for the "cassandra_data" folder:
```bash
chmod 755 -R /root/cassandra_data
```
Next, edit cassandra/conf/cassandra.yml. In the cassandra.yml, edit the following lines (e.g., change "rpc_address: localhost" to "rpc_address: [ipv4_address of the current VM]", and likewise for "listen_address"; otherwise comment out both "listen_address" and "rpc_address"):
#listen_address: localhost
#rpc_address: localhost
data_file_directories:
- /root/cassandra_data/data
commitlog_directory: /root/cassandra_data/log
saved_caches_directory: /root/cassandra_data/cache
cluster_name: 'myCluster'
seed_provider:
- seeds: "cassandra01,cassandra02"
Start cassandra by running the following command ("-R" allows running it as the root user):
```bash
cassandra/bin/cassandra -R -f
```
Or ("nohup" to run process which wont be interrupted by SIGNUP, and "&" put the job into the background):
```bash
`nohup cassandra/bin/cassandra -R -f &
Stop cassandra by running the following command:
```bash
kill -9 `ps -ef | grep cassandra | awk '{print $2}'`
```
To check if cassandra is running:
```bash
ps -ef | grep cassandra
```
Note that by default firewalld will block all the ports, including those used by cassandra. A simple (but not safe) solution is to turn off firewalld by running the following commands:
```bash
systemctl stop firewalld.service
systemctl disable firewalld.service
```
To monitor the cassandra cluster, run the following command:
```bash
cassandra/bin/cqlsh cassandra01
# or
cassandra/bin/nodetool cfstats
# or
cassandra/bin/nodetool ring
```
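As an optional programmatic check (my own addition, assuming the DataStax Java driver, e.g. cassandra-driver-core, is on the classpath), here is a minimal Java sketch connecting to the cluster:
```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraSmokeTest {
    public static void main(String[] args) {
        // Contact one of the nodes; the driver discovers the rest of the ring.
        try (Cluster cluster = Cluster.builder().addContactPoint("cassandra01").build();
             Session session = cluster.connect()) {
            Row row = session.execute("SELECT release_version FROM system.local").one();
            System.out.println("Connected, cassandra version: " + row.getString("release_version"));
        }
    }
}
```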
Run Spark on Windows 10 OS
In order for spark to run on Windows 10, we need to download winutils.exe from https://github.com/srccodes/hadoop-common-2.2.0-bin and put it in the hadoop/bin directory.
To do this, download the zip from https://github.com/srccodes/hadoop-common-2.2.0-bin and then put the files in the zip into the hadoop/bin directory (the full path on my computer is C:\hadoop\bin).
In your spark code, put this before invoking the JavaSparkContext:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("LeftJoin2");

// Point hadoop.home.dir at the directory containing bin\winutils.exe,
// unless it has already been set (e.g., via a system property).
String home = System.getProperty("hadoop.home.dir");
if (home == null) {
    System.setProperty("hadoop.home.dir", "C:\\hadoop");
}

// Fall back to a local master with 2 threads if no master was configured.
conf.setIfMissing("spark.master", "local[2]");
JavaSparkContext context = new JavaSparkContext(conf);
Install and Run MySQL on CentOS
To install mysql, run the following commands:
```bash
yum install mysql
rpm -Uvh http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
yum -y install mysql-community-server
```
Run the following command to start mysql:
```bash
service mysql start
```
Run the following command to enable mysql on VM restart:
```bash
chkconfig --levels 235 mysql on
```
Run the following command to access mysql:
```bash
mysql -u root
```
To configure mysql for security, run the following command:
```bash
/usr/bin/mysql_secure_installation
```
Set root password? [Y/n] Y
Remove anonymous users? [Y/n] Y
Disallow root login remotely? [Y/n] n
Remove test database and access to it? [Y/n] Y
Reload privilege tables now? [Y/n] Y
Run the following command to restart mysql:
```bash
systemctl restart mysql.service
```
After the above steps, "mysql -u root" will no longer be able to access mysql. Use the following command instead:
```bash
mysql -u root -p mysql
```
To allow java applications to access mysql from another computer, run the following query and then restart mysql:
> update user set host='%' where user='root' and host='localhost';
Furthermore, ports may be blocked by firewalld running in the VM that runs mysql, which will block access by a java application from another computer. This can be resolved by stopping and disabling firewalld:
```bash
systemctl stop firewalld.service
systemctl disable firewalld.service
```
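For completeness, here is a minimal JDBC sketch of accessing mysql from a java application on another computer (my own addition: the host IP and password are placeholders, and the MySQL Connector/J jar must be on the classpath):
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder host and credentials; adjust to your VM and root password.
        String url = "jdbc:mysql://192.168.56.101:3306/mysql";
        try (Connection conn = DriverManager.getConnection(url, "root", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT VERSION()")) {
            if (rs.next()) {
                System.out.println("Connected to MySQL " + rs.getString(1));
            }
        }
    }
}
```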
Setup CentOS VM in VirtualBox for Software Development and Distributed Computation on Spark and HDFS
This post summarizes my experience in setting up a simple CentOS VM development environment using VirtualBox, which I used to test-run distributed jobs on a Spark and HDFS cluster.
1. Create vbox centos VM
Launch virtualbox and create a VM using the downloaded centos iso image. In the "Network" tab of the VM settings, configure the network adapter as: Host-only adapter.
If you need to share a folder in the host computer with the centos VM, add the shared folder in the "Shared Folders" tab of the VM settings and configure it to have:
1. Full Access
2. Auto Mount
In this example, it is assumed the shared folder is named "git" and it is available on "C:\Users\xschen\git" on the host computer.
2. Install the development tools in the centos VM
Launch the centos VM and follow the standard installation steps. Make sure that the three network adapters are enabled when installing centos. After the installation is completed, run the following commands to install the necessary tools:
```bash
yum update
yum install -y java-1.8.0-openjdk-devel
yum install -y maven
yum install -y kernel-devel
yum install -y gcc
yum install -y bzip2
```
The java-1.8.0-openjdk-devel and maven packages are used for java development, and kernel-devel, gcc, and bzip2 can be used for compiling C or C++ source code (which will be needed later to install vboxsf).
3. Access shared folder on the host computer
In order to access the shared folder "git" on the host computer, we must first install the VirtualBox Linux Additions so that vboxsf is available in the centos VM.
3.1 Mount and install VBoxGuestAdditions
Click the "Device-->Insert Guest Addition CD Image" in the menu of the VM user display, and the VBoxGuestAdditions.iso will be mounted on the VM cdrom. To see the device that mounts the VBoxGuestAdditions.iso, run the following command in the centos VM:```bash
`ls /dev -l | grep cd
You should see something like /dev/sr0 which indicates mounted iso there. Run the following commands to access the mounted iso:
```bash
mkdir /mnt/dvd
mount -r -t iso9660 /dev/sr0 /mnt/dvd
```
Now run the following command to install the vboxsf:
```bash
cd /mnt/dvd
sh VBoxLinuxAdditions.run
```
If you encounter any errors, run the following command and retry:
```bash
yum groupinstall "Development Tools"
```
After the above commands are successfully executed, you should be able to see vboxsf is available in the centos VM by running the following command:
```bash
lsmod | grep vbox
```
3.2. Mount and access the shared folder
Run the following commands to mount and access the shared folder:
```bash
mkdir /mnt/git
mount -t vboxsf git /mnt/git
```
Now to access the shared folder, just enter the following commands:
```bash
cd /mnt/git
ls
```
4. Assign static ip address to a network adapter
In this example, we want to assign a static IP address to the network adapter. It is assumed that:
1. the host computer is in the subnet "192.168.56.*"
2. the static ip address to the host-only network adapter is "192.168.56.101" (which will be in the same subnet as the host computer)
In the centos VM, install and run the ifconfig tool:
```bash
yum -y install net-tools
ifconfig
```
On my computer, enp0s3 is the host-only network adapter. Its configuration file ifcfg-enp0s3 (if it does not exist, simply create a text file of the same name) can be found in the folder "/etc/sysconfig/network-scripts".
4.1 Configure static IP address in the network adapter
Run the following commands to open and edit ifcfg-enp0s3:
```bash
cd /etc/sysconfig/network-scripts
vi ifcfg-enp0s3
```
Add or modify the following settings in the ifcfg-enp0s3:
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.56.101
A sample of the ifcfg-enp0s3 is as shown below:
TYPE=Ethernet
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
NAME=enp0s3
DEVICE=enp0s3
ONBOOT=yes
IPADDR=192.168.56.101
PREFIX=24
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_PRIVACY=no
4.2. Restart the network service
Run the following commands to restart the network service and re-check the network configuration:
```bash
service network restart
ifconfig
```
You should see that the static IP address has been applied.
5. Security: Disable firewall and selinux
For the project I am working on, I need to run distributed computation jobs using an apache spark cluster and HDFS. In order for the spark cluster to work, the firewall and selinux must be properly configured. As a quick and dirty way, the firewall and selinux (optional) can be disabled on all VMs running the spark cluster (including master and slave nodes in spark, and namenodes and datanodes in HDFS).
5.1. Disable firewalld
To disable firewall, run the command:
```bash
systemctl stop firewalld.service
systemctl disable firewalld.service
```
To re-enable the firewall, run the following commands:
```bash
systemctl start firewalld.service
systemctl enable firewalld.service
```
5.2. Disable selinux (Optional)
Run the following command to edit /etc/selinux/config:
```bash
vi /etc/selinux/config
```
In the /etc/selinux/config, change the following line:
SELINUX=enforcing
SELINUXTYPE=targeted
to:
SELINUX=disabled
#SELINUXTYPE=targeted
Next, to turn off selinux immediately, run the following command:
```bash
setenforce 0
```
Labels: CentOS, Hadoop, Spark, VirtualBox, Virtualization
Monday, February 1, 2016
Elasticsearch Version Upgrade using Rolling Restart
This upgrade method updates one node at a time, restarting each node before moving to the next. The same method can also be used for restarting an elasticsearch cluster in a safe and efficient way.
Master-eligible nodes
In an elasticsearch production cluster, start with one master-eligible node: stop the elasticsearch service, perform the upgrade, and then restart the service. Repeat this until all master-eligible nodes have been upgraded. Since master-eligible nodes do not keep shards and replicas, the process is safe for the data so far.
Client nodes
Next, stop, upgrade, and restart the client nodes one at a time, just like the master-eligible nodes.
Data nodes
Next, before upgrading any data node in the elasticsearch cluster, dynamically add a setting to temporarily turn off shard reallocation, via a restful api call to the cluster (because if we restart a data node, the shards in the cluster will otherwise rebalance). The setting to temporarily disable is cluster.routing.allocation.enable, which can be done by issuing the following call to the elasticsearch cluster:
curl -X PUT -H "Content-Type: application/json" http://elastic-cluster:9200/_cluster/settings/ -d '
{ "transient": { "cluster.routing.allocation.enable": "none" } }
'
(Note that the "transient" allows the setting to be not permanent)
Now stop, upgrade, and restart a data node. At this point, we can reverse the setting for cluster.routing.allocation.enable by running the curl call below:
curl -X PUT -H "Content-Type: application/json" http://elastic-cluster:9200/_cluster/settings/ -d '
{ "transient": { "cluster.routing.allocation.enable": "all" } }
'
Once this is done, the shards originally on that data node will be up again.
Next, proceed to the second data node and repeat the process above until all data nodes are upgraded.
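If you prefer to drive these calls from java instead of curl, here is a minimal sketch using only the JDK's HttpURLConnection (my own addition; "elastic-cluster" is the same placeholder host as in the curl calls above):
```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AllocationToggle {
    // Sends the same transient cluster settings update as the curl calls above.
    static void setAllocation(String mode) throws Exception {
        URL url = new URL("http://elastic-cluster:9200/_cluster/settings");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        String body = "{ \"transient\": { \"cluster.routing.allocation.enable\": \"" + mode + "\" } }";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        setAllocation("none"); // before stopping a data node
        setAllocation("all");  // after the upgraded node rejoins
    }
}
```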
Sunday, January 31, 2016
Quorum and minimum_master_nodes setting for elasticsearch configuration
Theory behind elasticsearch recovery
Elasticsearch recovery works by having a master election satisfying a particular minimum_master_nodes criterion. That is, a master node will only be elected if there are N master-eligible nodes (nodes with node.master=true in their elasticsearch.yml file) to join it. N is specified by discovery.zen.minimum_master_nodes in the elasticsearch.yml file. For the quorum scheme, N should be set to:
discovery.zen.minimum_master_nodes = (number of master-eligible nodes) / 2 + 1
The recovery works like this: when the current master node dies, a new master node will only be elected if there are N master-eligible nodes to join it, where N = (number of master-eligible nodes) / 2 + 1.
Once the new master node is elected, if the originally dead master node later comes alive, it will have fewer than N master-eligible nodes to join it and therefore will have to step down. Thus this scheme ensures that there will never be two masters at the same time (the so-called split-brain scenario, which is not desirable since master nodes have the authority to update the cluster state on the data nodes and client nodes).
Minimum number of master eligible nodes required for an elasticsearch cluster
The minimum number of master-eligible nodes should be 3, and discovery.zen.minimum_master_nodes should be set to 3 / 2 + 1 = 2 (integer division). Reason: suppose we only have two master-eligible nodes. If the master node dies, there is only 1 master-eligible node left, and one of two things will happen depending on the value of discovery.zen.minimum_master_nodes:
Case 1: if discovery.zen.minimum_master_nodes is set to greater than 1, then no new master node will be elected, and the cluster will not operate.
Case 2: if we set discovery.zen.minimum_master_nodes=1, a new master node will be elected; however, when the originally dead master is brought alive again, it will not step down, since it also has one master-eligible node (itself) to join it, leading to the split-brain problem.
Therefore the recommended minimum setting is to increase the number of master-eligible nodes to 3 and set discovery.zen.minimum_master_nodes=2.
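To make the arithmetic concrete, here is a tiny Java sketch of the quorum formula (note the integer division):
```java
public class QuorumCalc {
    // minimum_master_nodes = (number of master-eligible nodes) / 2 + 1
    static int minimumMasterNodes(int masterEligible) {
        return masterEligible / 2 + 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 3, 5, 7}) {
            // e.g. 3 master-eligible nodes => minimum_master_nodes = 2
            System.out.println(n + " master-eligible nodes => " + minimumMasterNodes(n));
        }
    }
}
```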
Saturday, January 30, 2016
Change network mask on centos
This post shows how to change the network mask and other ifconfig information on centos. After logging in to centos, run the following command to open and edit the ifcfg-eth0 file:
vi /etc/sysconfig/network-scripts/ifcfg-eth0
Look for the line starting with PREFIX0 and change the network mask there (note that this is specified in bits, so "PREFIX0=22" means the network mask is 255.255.252.0).
After the change is completed, restart the network service by running the following command:
service network restart
Thursday, January 28, 2016
Message Priority in Akka
http://blog.knoldus.com/2014/03/13/how-to-create-a-priority-based-mailbox-for-an-actor/
http://doc.akka.io/docs/akka/2.4.1/java/mailboxes.html
http://stackoverflow.com/questions/15337329/akka-actor-post-a-message-to-head-of-the-mailbox
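The links above cover the details; as a rough sketch of the pattern they describe (based on akka's standard UnboundedPriorityMailbox, with hypothetical message values and priorities), a priority mailbox in Java looks like this:
```java
import akka.actor.ActorSystem;
import akka.dispatch.PriorityGenerator;
import akka.dispatch.UnboundedPriorityMailbox;
import com.typesafe.config.Config;

// Lower number = higher priority; messages are dequeued in priority order.
public class MyPriorityMailbox extends UnboundedPriorityMailbox {
    public MyPriorityMailbox(ActorSystem.Settings settings, Config config) {
        super(new PriorityGenerator() {
            @Override
            public int gen(Object message) {
                if ("urgent".equals(message)) return 0; // handled first
                if ("low".equals(message)) return 2;    // handled last
                return 1;                               // everything else
            }
        });
    }
}
```
The mailbox is then registered under a name in application.conf (with mailbox-type pointing at the class above) and attached to an actor via its Props; see the akka mailboxes documentation linked above for the exact configuration.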