Wednesday, April 20, 2016

Careful with the use of & bitwise operator in boolean evaluation

On a particular day when I had not slept much and was running low on coffee, I made the following stupid error in my code:

```java
if(model != null & model.getVocabulary().countWords() > 0) return;
```

What I meant to write is:

```java
if(model != null && model.getVocabulary().countWords() > 0) return;
```

When I put this code in an Apache Spark job and ran it on the cluster, it threw a NullPointerException. It turns out that the missing second "&" is a serious mistake: with "&&", if model evaluates to null, the right-hand side is skipped (short-circuit evaluation). With a single "&", however, both sides are always evaluated, so "model.getVocabulary()" is called even when model is null, causing the NullPointerException.
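The difference is easy to reproduce in a tiny self-contained example (the Model and Vocabulary classes below are simplified stand-ins for the real ones, not the actual Spark job's classes):

```java
public class ShortCircuitDemo {
    static class Vocabulary { int countWords() { return 0; } }
    static class Model {
        Vocabulary vocab = new Vocabulary();
        Vocabulary getVocabulary() { return vocab; }
    }

    static boolean check(Model model) {
        // && short-circuits: the right side never runs when model == null
        return model != null && model.getVocabulary().countWords() > 0;
    }

    static boolean checkBroken(Model model) {
        // & evaluates BOTH sides, so this throws NullPointerException when model == null
        return model != null & model.getVocabulary().countWords() > 0;
    }

    public static void main(String[] args) {
        System.out.println(check(null));        // false, no exception
        try {
            checkBroken(null);
        } catch (NullPointerException e) {
            System.out.println("NPE from '&'"); // both operands were evaluated
        }
    }
}
```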

Lesson learnt: I should get more hours of sleep instead of wasting time debugging funny errors like the one above :p


Thursday, March 31, 2016

Setup Hadoop YARN on CentOS VMs

This post summarizes my experience in setting up a testing environment for YARN using CentOS VMs.

After setting up hdfs by following this link (http://czcodezone.blogspot.sg/2016/03/setup-hdfs-cluster-in-centos-vms.html), we can go ahead and set up yarn to manage jobs in hadoop.

Hadoop v2 uses an application master on a datanode, working together with the nodemanagers, to manage a job, and uses the resource manager on the namenode to schedule resources for a job (a job here refers to a particular application or driver from a distributed computation framework such as mapreduce or spark).

1. Setup yarn configuration in hadoop


To set up yarn, perform the following steps on each VM (both namenode and datanodes):

1.1. Edit hadoop/etc/hadoop/mapred-site.xml


Run the following commands to edit the mapred-site.xml:

```bash
cd hadoop/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
```

In the mapred-site.xml, modify as follows so that MapReduce jobs run on YARN:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

(The "fs.default.name" property with value "hdfs://centos01:9000" belongs in core-site.xml rather than here. It is important that it points at the master hostname and not "hdfs://localhost:9000"; otherwise "hadoop/bin/hdfs dfsadmin -report" will fail with a "Connection refused" exception.)


1.2 Edit hadoop/etc/hadoop/yarn-site.xml


Run the following command to edit the yarn-site.xml:

```bash
vi hadoop/etc/hadoop/yarn-site.xml
```

In the yarn-site.xml, modify as follows:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
```

2. Start the yarn in the namenode.


On the namenode centos01 (please refer to http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html), run the following commands to start hdfs and then yarn:

```bash
hadoop/sbin/start-dfs.sh
hadoop/sbin/start-yarn.sh
hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
jps
```

3. Stop hdfs and yarn


To stop hdfs and yarn, run the following commands:

```bash
hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver
hadoop/sbin/stop-yarn.sh
hadoop/sbin/stop-dfs.sh
```

Setup Spark Cluster in CentOS VMs

This post summarizes my experience in setting up a test environment for spark cluster using CentOS VMs.

After setting up the VMs, we will use the following topology (refer to this post for setting up and configuring CentOS VMs using VirtualBox: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html):

centos01: master
centos02: slave
centos03: slave
centos04: slave
centos05: slave

where centos01 refers to the hostname of the VM.

1. Configure the spark cluster


Download and unzip spark-1.6.0-hadoop2.6.tgz to "/root/spark", then run the following commands on centos01 to specify the list of slaves:

```bash
cp spark/conf/slaves.template spark/conf/slaves
vi spark/conf/slaves
```

In spark/conf/slaves, add the following lines:

```
centos02
centos03
centos04
centos05
```

Make sure that the firewalls are turned off on centos01/2/3/4/5 and that passwordless ssh is set up from centos01 to centos02/3/4/5.

Run the following commands to copy "/root/spark" from centos01 to centos02/3/4/5:

```bash
rsync -a /root/spark/ root@centos02:/root/spark
rsync -a /root/spark/ root@centos03:/root/spark
rsync -a /root/spark/ root@centos04:/root/spark
rsync -a /root/spark/ root@centos05:/root/spark
```

2. Start and stop the spark cluster


Run the following command on centos01 to start the spark cluster:

```bash
spark/sbin/start-all.sh
```


To stop the spark cluster, run the following command on centos01:

```bash
spark/sbin/stop-all.sh
```

3. Run the spark shell


After the spark cluster is started, we can start the spark shell by running the following command:

```bash
spark/bin/spark-shell --master spark://centos01:7077
```

Port 7077 is the default port for the spark master centos01.

4. Submit a spark job to spark cluster


After the spark cluster has been set up, assuming the master is centos01, run the following command to submit a spark job:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master spark://centos01:7077 word-count.jar
```

Or, with the resources specified explicitly:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master spark://centos01:7077 --executor-memory 2G --total-executor-cores 8 word-count.jar 1000
```

Below are two other alternatives I tested, using a YARN cluster and a mesos cluster.

4.1. Submit a spark job via YARN cluster

Suppose we have a resource management cluster such as Hadoop YARN set up; we can submit the spark job to YARN for processing as well (YARN then acts as the spark master).

To run an application in a YARN cluster instead, set up and configure hdfs and yarn using these links (http://czcodezone.blogspot.sg/2016/03/setup-hdfs-cluster-in-centos-vms.html and http://czcodezone.blogspot.sg/2016/03/setup-hadoop-yarn-on-centos-vms.html), then start HDFS and YARN.

Run the following command to edit the .bashrc:

```bash
vi .bashrc
```

In the .bashrc of each VM centos01/2/3/4/5, add the following lines:

```bash
export HADOOP_HOME_DIR=/root/hadoop
export HADOOP_CONF_DIR=/root/hadoop/etc/hadoop
export HADOOP_YARN_DIR=/root/hadoop/etc/hadoop
```

Run the following command on each VM to update the .bashrc:

```bash
source .bashrc
```

To submit a spark job, run the following command:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn --deploy-mode cluster word-count.jar
```

Or:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn-cluster word-count.jar
```

Or, with the resources specified explicitly:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn --deploy-mode cluster --executor-memory 2G --num-executors 20 word-count.jar 1000
```

4.2. Submit a spark job via mesos cluster


To run an application in a mesos cluster instead, set up and configure hdfs and mesos (link: http://czcodezone.blogspot.sg/2016/03/setup-mesos-cluster-in-centos-vms.html), then start HDFS and MESOS.

Put the spark bin package into hdfs (run the following commands on the hadoop namenode centos01):

```bash
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
hadoop/bin/hdfs dfs -mkdir /pkg
hadoop/bin/hdfs dfs -put spark-1.6.0-bin-hadoop2.6.tgz /pkg/spark-1.6.0-bin-hadoop2.6.tgz
```

Run the following command to edit spark-env.sh in spark/conf:

```bash
vi spark/conf/spark-env.sh
```

In the spark-env.sh, add the following lines:

```bash
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://centos01:9000/pkg/spark-1.6.0-bin-hadoop2.6.tgz
```

Here, centos01 is the hadoop namenode.

To submit a spark job, run the following command:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master mesos://mesos01:5050 word-count.jar
```

Important: mesos01 must be the current leader master node; otherwise, a command such as "spark-shell --master mesos://mesos01:5050" will cause the spark-shell to hang on the line "No credential provided. attempting to register without authentication". The solution is to find out which node is the active leader master by running "mesos-resolve `cat /etc/mesos/zk`", and then launch the spark shell with the active leader master in the --master option instead.

Setup HDFS Cluster in CentOS VMs

This post summarizes my experience in setting up a test environment for HDFS cluster using CentOS VMs.

Before we start, we configure the VMs (refer to this post on how to set up and configure CentOS VMs using VirtualBox for HDFS: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html) to be the following:

centos01/192.168.56.101: run namenode
centos02/192.168.56.102: run datanode
centos03/192.168.56.103: run datanode
centos04/192.168.56.104: run datanode
centos05/192.168.56.105: run datanode

where centos01 is the hostname and 192.168.56.101 is the ip address assigned to it.

1. Set up hostname for each computer

In this case, we assume that we have not set up a DNS for the VMs to know each other, but we do not want to use raw ip addresses. Therefore we need to configure the VMs to identify each other by the hostnames centos0x.

On each VM above, do the following (we use centos01/192.168.56.101 for illustration; DO replace them with the individual VM's hostname and ip address).

1.1. Modify /etc/hostname and /etc/sysconfig/network


Run the following command to edit /etc/hostname:

```bash
vi /etc/hostname
```

In the /etc/hostname, put the following line (replace "centos01" accordingly):

```
centos01
```

Run the following command to edit /etc/sysconfig/network:

```bash
vi /etc/sysconfig/network
```

In the /etc/sysconfig/network, put the following line (replace "centos01" accordingly):

```
HOSTNAME=centos01
```

Run the following command to restart network service:

```bash
service network restart
```

1.2. Modify /etc/hosts


Run the following command to edit /etc/hosts:

```bash
vi /etc/hosts
```

In the /etc/hosts, add the following lines:

```
192.168.56.101 centos01
192.168.56.102 centos02
192.168.56.103 centos03
192.168.56.104 centos04
192.168.56.105 centos05
```

2. Set up passwordless ssh from the namenode to the datanodes


We need to set up passwordless ssh from the namenode (namely centos01) to itself and to the rest of the VMs, which serve as datanodes. To do this, on centos01, run the following commands to create and distribute the id_dsa.pub public key:

```bash
mkdir ~/.ssh
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
scp ~/.ssh/id_dsa.pub root@centos02:/root
scp ~/.ssh/id_dsa.pub root@centos03:/root
scp ~/.ssh/id_dsa.pub root@centos04:/root
scp ~/.ssh/id_dsa.pub root@centos05:/root
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
```

Now that id_dsa.pub has been copied from centos01 to the root directory of the other 4 VMs, we need to append it to /root/.ssh/authorized_keys on each of them. On each VM in centos02/3/4/5, run the following commands:

```bash
mkdir ~/.ssh
touch ~/.ssh/authorized_keys
cat ~/id_dsa.pub >> ~/.ssh/authorized_keys
```

3. Configure hadoop on each VM


Perform the following steps on each VM of centos01/2/3/4/5

3.1 Configure $JAVA_HOME 

Before we start running hdfs, we need to specify JAVA_HOME in the environment. Assume that we install java-1.8.0-openjdk-devel on each VM for the java jdk; the installation is then at /usr/lib/jvm/java-1.8.0-openjdk. To specify JAVA_HOME, run the following command:

```bash
vi ~/.bashrc
```

In the .bashrc, add the following line:

```bash
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
```

3.2. Download hadoop 


Download the hadoop binary distribution, unzip it to the root directory ~/hadoop.

3.3 Configure ~/hadoop/etc/hadoop/slaves 


Run the following command to edit the slaves file:

```bash
vi ~/hadoop/etc/hadoop/slaves
```

In ~/hadoop/etc/hadoop/slaves, remove "localhost" and add the following lines:

```
centos02
centos03
centos04
centos05
```

This slaves file specifies the datanodes.

3.4. Configure ~/hadoop/etc/hadoop/core-site.xml


Run the following command to edit core-site.xml:

```bash
vi ~/hadoop/etc/hadoop/core-site.xml
```

In ~/hadoop/etc/hadoop/core-site.xml, write the following (centos01 refers to the namenode):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://centos01:9000/</value>
  </property>
</configuration>
```

3.5. Configure ~/hadoop/etc/hadoop/hdfs-site.xml


Run the following commands to create the directories for hdfs data:

```bash
mkdir ~/hadoop_data
mkdir ~/hadoop_data/data
mkdir ~/hadoop_data/name
mkdir ~/hadoop_data/local
chmod -R 755 ~/hadoop_data
```

In ~/hadoop/etc/hadoop/hdfs-site.xml, write the following:

```xml
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/root/hadoop_data/data</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/root/hadoop_data/name</value>
  </property>
</configuration>
```

The above specifies where the namenode and datanode data are stored.

4. Start the hdfs cluster


On the namenode centos01, run the following command to format namenode:

```bash
~/hadoop/bin/hdfs namenode -format
```

On the namenode centos01, start the hdfs cluster:

```bash
~/hadoop/sbin/start-dfs.sh
```

To check what is running, run the following command on each VM:

```bash
jps
```

To check the reporting of the hdfs cluster, run the following command on the namenode centos01:

```bash
~/hadoop/bin/hdfs dfsadmin -report
```

Another way to check is to visit the web server hosted by namenode:

```bash
curl http://centos01:50070/
```

5. Stop the hdfs cluster


To stop the hdfs cluster, run the following command on the namenode centos01:

```bash
~/hadoop/sbin/stop-dfs.sh
```


Wednesday, March 30, 2016

Setup Mesos Cluster in CentOS VMs

This post summarizes my experiences in setting up mesos cluster for spark and hadoop cluster deployment.

Firstly, before starting, set up a zookeeper cluster (link: http://czcodezone.blogspot.sg/2016/03/setup-zookeeper-cluster-in-centos-vms.html) running on 3 nodes with hostnames zoo01, zoo02, zoo03, as well as 3 mesos master-eligible nodes and 4 mesos slave nodes (refer to this post on how to set up CentOS VMs for a cluster: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html).

1. Disable Firewall and SELinux

1.1. Disable firewalld


Run the following commands to disable firewalld:

```bash
systemctl stop firewalld.service
systemctl disable firewalld.service
iptables -F
```

1.2. Disable SELinux


Run the following command to open /etc/selinux/config:

```bash
vi /etc/selinux/config
```

Edit the file to include the following lines:

```
SELINUX=disabled
#SELINUXTYPE=targeted
```

Run the following command to shut down selinux immediately:

```bash
setenforce 0
```

2. Start the zookeepers


Start the zookeepers on zoo01/2/3.


2.1. Install Mesos


Run the following commands to install the development tools:

```bash
yum install -y java-1.8.0-openjdk-devel
yum install -y maven
```

Add the following line to the /root/.bashrc file:

```bash
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
```

Run the following commands to install the packages required to run mesos:

```bash
rpm -Uvh http://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-2.noarch.rpm
# to install master and/or slave
yum -y install mesos
# to install marathon
yum -y install marathon
# to install chronos
yum -y install chronos.x86_64
```

2.2. Configure Master with Zookeeper


Edit /etc/mesos/zk:

```
zk://zoo01:2181,zoo02:2181,zoo03:2181/mesos
```

3. Start Mesos


```bash
#
# Mesos Master
#
systemctl enable mesos-master.service
systemctl start mesos-master.service
systemctl mask mesos-slave.service
systemctl stop mesos-slave.service
```

```bash
#
# Marathon
#
systemctl enable marathon.service
systemctl start marathon.service
```

```bash
#
# Mesos Slave (on the slave nodes)
#
systemctl mask mesos-master.service
systemctl stop mesos-master.service
systemctl enable mesos-slave.service
systemctl start mesos-slave.service
```

```bash
#
# Chronos
#
systemctl enable chronos.service
systemctl start chronos.service
```

Personal Note: I noticed that if the mesos services are started immediately after the zookeepers are started, mesos may fail to detect the master from zookeeper. The way to solve this is to restart the mesos services on each node.

4. Web Interface


Mesos-Master:

```bash
curl http://mesos01:5050
```

Marathon:

```bash
curl http://mesos01:8080
```

5. Command line Interface


Getting the master from the CLI:

```bash
mesos-resolve `cat /etc/mesos/zk`
```

Killing a framework:

```bash
curl -XPOST http://mesos02:5050/api/v1/scheduler -d '{ "framework_id": { "value": "[task_id]" }, "type": "TEARDOWN"}' -H Content-Type:application/json
```

6. Submit a spark job via MESOS cluster


To run a spark application in a MESOS cluster, set up and configure HDFS (link: http://czcodezone.blogspot.sg/2016/03/setup-hdfs-cluster-in-centos-vms.html) and Spark (link: http://czcodezone.blogspot.sg/2016/03/setup-spark-cluster-in-centos-vms.html).

Put the spark bin package into hdfs (run the following commands on the hadoop namenode centos01):

```bash
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
hadoop/bin/hdfs dfs -mkdir /pkg
hadoop/bin/hdfs dfs -put spark-1.6.0-bin-hadoop2.6.tgz /pkg/spark-1.6.0-bin-hadoop2.6.tgz
```

Run the following command to edit spark-env.sh in spark/conf:

```bash
vi spark/conf/spark-env.sh
```

In the spark-env.sh, add the following lines:

```bash
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://centos01:9000/pkg/spark-1.6.0-bin-hadoop2.6.tgz
```

Here, centos01 is the hadoop namenode.

To submit a spark job, run the following command:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master mesos://mesos01:5050 word-count.jar
```

Important: mesos01 must be the current leader master node; otherwise, a command such as "spark-shell --master mesos://mesos01:5050" will cause the spark-shell to hang on the line "No credential provided. attempting to register without authentication". The solution is to find out which node is the active leader master by running "mesos-resolve `cat /etc/mesos/zk`", and then launch the spark shell with the active leader master in the --master option instead.


Setup ZooKeeper cluster in CentOS VMs.

This post summarizes my experience on setting up a zookeeper cluster in CentOS VMs.

Before starting, set up three VMs with the following hostname/ip address pairs (refer to this post on how to set up and configure CentOS VMs using VirtualBox: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html):

zoo01/192.168.56.91
zoo02/192.168.56.92
zoo03/192.168.56.93

1. Edit /etc/hosts

On each VM, edit /etc/hosts to add the following lines:

```
192.168.56.91 zoo01
192.168.56.92 zoo02
192.168.56.93 zoo03
```

2. Download and configure zookeeper

2.1 Download zookeeper


On each VM, download and unzip the zookeeper to the folder "/root/zookeeper".

2.2. Configure zookeeper


On each VM, run the following command to edit the /root/zookeeper/conf/zoo.cfg:

```bash
cp zookeeper/conf/zoo_sample.cfg zookeeper/conf/zoo.cfg
vi zookeeper/conf/zoo.cfg
```

In the zookeeper/conf/zoo.cfg, add the following configuration:

```
dataDir=/root/zookeeper_data

server.1=zoo01:2222:2223
server.2=zoo02:2222:2223
server.3=zoo03:2222:2223
```

Run the following commands to create the directory holding the zookeeper data storage on each VM (the /root/zookeeper_data/myid file must contain a unique id on each VM):

```bash
ssh root@zoo01
mkdir /root/zookeeper_data
touch /root/zookeeper_data/myid
echo 1 >> /root/zookeeper_data/myid

ssh root@zoo02
mkdir /root/zookeeper_data
touch /root/zookeeper_data/myid
echo 2 >> /root/zookeeper_data/myid

ssh root@zoo03
mkdir /root/zookeeper_data
touch /root/zookeeper_data/myid
echo 3 >> /root/zookeeper_data/myid
```

3. Start zookeeper cluster


To start the zookeeper cluster, on each VM zoo01/2/3, run the following command:

```bash
zookeeper/bin/zkServer.sh start zookeeper/conf/zoo.cfg
```

(Note that for testing, you may want to use the command "start-foreground" instead of "start")

4. Test zookeeper running


To check whether the zookeeper cluster is running, run the following command:

```bash
zookeeper/bin/zkServer.sh status
```

To test the zookeeper cluster, run the following command:

```bash
zookeeper/bin/zkCli.sh -server 192.168.56.91:2181,192.168.56.92:2181,192.168.56.93:2181
```




Setup Cassandra Cluster on CentOS VMs

This post summarizes my experience in setting up a test environment for cassandra cluster using CentOS VMs.

Suppose we are to set up the cassandra cluster on two VMs, cassandra01/192.168.56.131 and cassandra02/192.168.56.132 (link: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html).

Perform all the steps below on each VM:

Specify the /etc/hosts as follows:

```
192.168.56.131 cassandra01
192.168.56.132 cassandra02
```

Run the following command to install the required software:

```bash
yum install -y java-1.8.0-openjdk-devel
```

Add the following line in .bashrc:

```bash
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
```

Download and unzip the cassandra package to "/root/cassandra", then create a folder "/root/cassandra_data" with the following subfolders:

```
/root/cassandra_data/data
/root/cassandra_data/log
/root/cassandra_data/cache
```

Run the following command to set the write permission for the "cassandra_data" folder:

```bash
chmod 755 -R /root/cassandra_data
```

Next, edit cassandra/conf/cassandra.yaml. In it, change the following lines (e.g., change "rpc_address: localhost" to "rpc_address: [ipv4_address of the current VM]", and likewise "listen_address"; alternatively, comment out both "listen_address" and "rpc_address"):

```yaml
#listen_address: localhost
#rpc_address: localhost
data_file_directories:
    - /root/cassandra_data/data
commitlog_directory: /root/cassandra_data/log
saved_caches_directory: /root/cassandra_data/cache
cluster_name: 'myCluster'
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "cassandra01,cassandra02"
```

Start cassandra by running the following command ("-R" allows it to run as the root user, "-f" keeps it in the foreground):

```bash
cassandra/bin/cassandra -R -f
```

Or ("nohup" to run process which wont be interrupted by SIGNUP, and "&" put the job into the background):

```bash
`nohup cassandra/bin/cassandra -R -f &

Stop cassandra by running the following command:

```bash
kill -9 `ps -ef | grep cassandra | awk '{print $2}'`
```

To check if cassandra is running:

```bash
ps -ef | grep cassandra
```


Note that by default firewalld will block all ports, including those used by cassandra. A simple (but not safe) solution is to turn off firewalld by running the following commands:

```bash
systemctl stop firewalld.service
systemctl disable firewalld.service
```

To monitor the cassandra cluster, run one of the following commands:

```bash
cassandra/bin/cqlsh cassandra01
```

Or:

```bash
cassandra/bin/nodetool cfstats
```

Or:

```bash
cassandra/bin/nodetool ring
```


Run Spark on Windows 10 OS

In order for spark to run on Windows 10, we need to download winutils.exe from https://github.com/srccodes/hadoop-common-2.2.0-bin and put it in the hadoop\bin directory.

To do this, download the zip from https://github.com/srccodes/hadoop-common-2.2.0-bin and put the files it contains into the hadoop\bin directory (the full path on my computer is C:\hadoop\bin).

In your spark code, put this before invoking the JavaSparkContext:

```java
SparkConf conf = new SparkConf().setAppName("LeftJoin2");

// point hadoop.home.dir at the folder containing bin\winutils.exe, if not already set
String home = System.getProperty("hadoop.home.dir");
if (home == null) {
    System.setProperty("hadoop.home.dir", "C:\\hadoop");
}

conf.setIfMissing("spark.master", "local[2]");

JavaSparkContext context = new JavaSparkContext(conf);
```

Install and Run MySQL on CentOS

To install mysql, run the following commands:

```bash
yum install mysql
rpm -Uvh http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
yum -y install mysql-community-server
```

Run the following command to start mysql:

```bash
service mysql start
```

Run the following command to enable mysql on VM restart:

```bash
chkconfig --levels 235 mysql on
```

Run the following command to access mysql:

```bash
mysql -u root
```

To configure mysql for security, run the following command:

```bash
/usr/bin/mysql_secure_installation
```

Answer the prompts as follows:

```
Set root password? [Y/n] Y
Remove anonymous users? [Y/n] Y
Disallow root login remotely? [Y/n] n
Remove test database and access to it? [Y/n] Y
Reload privilege tables now? [Y/n] Y
```

Run the following command to restart mysql:

```bash
systemctl restart mysql.service
```

After the above steps, "mysql -u root" will no longer be able to access mysql. Use the following command instead:

```bash
mysql -u root -p mysql
```

To allow java applications to access mysql from another computer, run the following statements in the mysql client and then restart mysql:

```sql
use mysql;
update user set host='%' where user='root' and host='localhost';
```

Furthermore, ports may be blocked by firewalld running in the VM that hosts mysql, which will block access by a java application from another computer. This can be resolved by stopping and disabling firewalld:

```bash
systemctl stop firewalld.service
systemctl disable firewalld.service
```

Setup CentOS VM in VirtualBox for Software Development and Distributed Computation on Spark and HDFS

This post summarizes my experience setting up a simple CentOS VM development environment using VirtualBox, which I used to test-run distributed jobs on a spark and HDFS cluster.

1. Create vbox centos VM

Launch virtualbox and create a VM using the downloaded centos iso image. In the "Network" tab of the VM settings, configure the network adapter as:

Host-only adapter

If you need to share a folder in the host computer with the centos VM, add the shared folder in the "Shared Folders" tab of the VM settings and configure it to have:

1. Full Access
2. Auto Mount

In this example, it is assumed the shared folder is named "git" and it is available on "C:\Users\xschen\git" on the host computer.

2. Install the development tools in the centos VM

Launch the centos VM and follow the standard installation steps. Make sure that the network adapters are enabled when installing centos.

After the installation is completed, run the following commands to install the necessary tools:

```bash
yum update
yum install -y java-1.8.0-openjdk-devel
yum install -y maven
yum install -y kernel-devel
yum install -y gcc
yum install -y bzip2
```


The java-1.8.0-openjdk-devel and maven packages are used for java development, while kernel-devel, gcc, and bzip2 are used for compiling C or C++ source code (which will be needed later to install vboxsf).

3. Access shared folder on the host computer

In order to access the shared folder "git" on the host computer, we must first install the VirtualBoxLinuxAdditions so that vboxsf is available in centos VM.

3.1 Mount and install VBoxGuestAddition

Click the "Device-->Insert Guest Addition CD Image" in the menu of the VM user display, and the VBoxGuestAdditions.iso will be mounted on the VM cdrom. To see the device that mounts the VBoxGuestAdditions.iso, run the following command in the centos VM:

```bash
ls /dev -l | grep cd
```

You should see something like /dev/sr0, which indicates the iso is mounted there. Run the following commands to access the mounted iso:

```bash
mkdir /mnt/dvd
mount -r -t iso9660 /dev/sr0 /mnt/dvd
```

Now run the following commands to install vboxsf:

```bash
cd /mnt/dvd
sh VBoxLinuxAdditions.run
```

If you encounter any errors, run the following command:

```bash
yum groupinstall "Development Tools"
```

After the above commands execute successfully, you should be able to see that vboxsf is available in the centos VM by running the following command:

```bash
lsmod | grep vbox
```

3.2. Mount and access the shared folder

Run the following commands to mount and access the shared folder:

```bash
mkdir /mnt/git
mount -t vboxsf git /mnt/git
```

Now to access the shared folder, just enter the following commands:

```bash
cd /mnt/git
ls
```

4. Assign static ip address to a network adapter 

In this example, we want to assign a static ipaddress to the network adapter. It is assumed that:

1. the host computer is in the subnet "192.168.56.*"
2. the static ip address to the host-only network adapter is "192.168.56.101" (which will be in the same subnet as the host computer)

In the centos VM, install and run the ifconfig tool:

```bash
yum -y install net-tools
ifconfig
```

On my computer, enp0s3 is the host-only network adapter. Its configuration file ifcfg-enp0s3 (if it does not exist, simply create a text file of the same name) can be found in the folder "/etc/sysconfig/network-scripts".

4.1 Configure static IP address in the network adapter

Run the following command to open and edit the ifcfg-enp0s3:

```bash
cd /etc/sysconfig/network-scripts
vi ifcfg-enp0s3
```

Add or modify the following settings in the ifcfg-enp0s3:

```
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.56.101
```

A sample ifcfg-enp0s3 is shown below:

```
TYPE=Ethernet
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
NAME=enp0s3
DEVICE=enp0s3
ONBOOT=yes
IPADDR=192.168.56.101
PREFIX=24
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_PRIVACY=no
```


4.2. Restart the network service

Run the following command to restart the network service and re-check the network configuration:

```bash
service network restart
ifconfig
```

You should see the static ip address assigned to the adapter.

5. Security: Disable firewall and selinux

For the project I am working on, I need to run distributed computation jobs using an apache spark cluster and HDFS. In order for the spark cluster to work, the firewall and selinux must be properly configured. As a quick and dirty way, the firewall and selinux (optional) can be disabled on all VMs running the spark cluster (including the master and slave nodes in spark, and the namenodes and datanodes in HDFS).

5.1. Disable firewalld


To disable the firewall, run the following commands:

```bash
systemctl stop firewalld.service
systemctl disable firewalld.service
```

To restart the firewall, run the following commands:

```bash
systemctl start firewalld.service
systemctl enable firewalld.service
```

5.2. Disable selinux (Optional)

Run the command to edit /etc/selinux/config:

```bash
vi /etc/selinux/config
```

In the /etc/selinux/config, change the following lines:

```
SELINUX=enforcing
SELINUXTYPE=targeted
```

to:

```
SELINUX=disabled
#SELINUXTYPE=targeted
```

Next, to turn off selinux immediately, run the following command:

```bash
setenforce 0
```

Monday, February 1, 2016

Elasticsearch Version Upgrade using Rolling Restart

This upgrade method updates one node at a time, restarting each before moving on. The same method can also be used for restarting an elasticsearch cluster in a safe and efficient way.

Master-eligible nodes


In an elasticsearch production cluster, start with one master-eligible node: stop the elasticsearch service, perform the upgrade, and then restart the service. Repeat this until all master-eligible nodes have been upgraded. Since master-eligible nodes do not keep shards and replicas, the process so far is safe for the data.

Client nodes


Next, stop, upgrade, and restart the client nodes one at a time, just as with the master-eligible nodes.

Data nodes


Next, before upgrading any data node in the elasticsearch cluster, temporarily turn off shard allocation via a restful api call to the cluster (because if we restart a data node, the shards in the cluster will otherwise rebalance). The setting to be temporarily disabled is cluster.routing.allocation.enable, which can be done by issuing the following call to the elasticsearch cluster:

```bash
curl -X PUT -H "Content-Type: application/json" http://elastic-cluster:9200/_cluster/settings/ -d '
{ "transient": { "cluster.routing.allocation.enable": "none" } }
'
```

(Note that a "transient" setting is not permanent.)

Now stop, upgrade, and restart a data node. Once it is back, we can reverse the cluster.routing.allocation.enable setting by running the curl call below:

```bash
curl -X PUT -H "Content-Type: application/json" http://elastic-cluster:9200/_cluster/settings/ -d '
{ "transient": { "cluster.routing.allocation.enable": "all" } }
'
```
Once this is done, the shards on that data node will come back up.

Next, proceed to the second data node and repeat the process above until all data nodes are upgraded.



Sunday, January 31, 2016

Quorum and minimum_master_nodes setting for elasticsearch configuration

Theory behind elasticsearch recovery

Elasticsearch recovery works by having a master election that satisfies a particular minimum_master_nodes criterion. That is, a master node will only be elected if there are N master-eligible nodes (nodes with node.master=true in their elasticsearch.yml file) to join it. N is specified by discovery.zen.minimum_master_nodes in the elasticsearch.yml file.

For the quorum scheme, N should be set to:

```
discovery.zen.minimum_master_nodes = (number of master-eligible nodes) / 2 + 1
```

The recovery works like this: when the current master node dies, a new master node will only be elected if there are N master-eligible nodes to join it, where N = (number of master-eligible nodes) / 2 + 1.

Once the new master node is elected, if the originally dead master node later comes alive again, it will have fewer than N master-eligible nodes to join it, and therefore it will have to step down. Thus this scheme ensures that there will never be two masters at the same time (the so-called split-brain scenario, which is undesirable since each master has the authority to update the cluster state on the data nodes and client nodes).

Minimum number of master eligible nodes required for an elasticsearch cluster

The minimum number of master-eligible nodes should be 3, and discovery.zen.minimum_master_nodes should be set to 3 / 2 + 1 = 2.

Reason: suppose we only have two master-eligible nodes. If the master node dies, there is only 1 master-eligible node left, and one of two things will happen, depending on the value of discovery.zen.minimum_master_nodes:

Case 1: if discovery.zen.minimum_master_nodes is set to greater than 1, then no new master node will be elected, and the cluster will not operate.


Case 2: if we set discovery.zen.minimum_master_nodes=1, a new master node will be elected; however, when the originally dead master is brought alive again, it will not step down, since it now also has one master-eligible node to join it (itself), leading to the split-brain problem.

Therefore the recommended minimum setting is to increase the number of master-eligible nodes to 3 and set discovery.zen.minimum_master_nodes=2.
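The quorum formula is plain integer division; a small sketch (the class and method names here are mine, not from elasticsearch) shows the resulting setting for a few cluster sizes:

```java
public class Quorum {
    // Quorum scheme: minimum_master_nodes = (master-eligible nodes) / 2 + 1
    // (integer division, matching the formula above)
    static int minimumMasterNodes(int masterEligible) {
        return masterEligible / 2 + 1;
    }

    public static void main(String[] args) {
        // 2 eligible -> 2 (losing one node halts the cluster)
        // 3 eligible -> 2 (can safely lose one master-eligible node)
        // 5 eligible -> 3
        for (int n : new int[]{2, 3, 5}) {
            System.out.println(n + " master-eligible -> minimum_master_nodes = "
                    + minimumMasterNodes(n));
        }
    }
}
```

Note how 2 eligible nodes still yields a quorum of 2, which is exactly why at least 3 master-eligible nodes are recommended.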

Saturday, January 30, 2016

Change network mask on centos

This post shows how to change the network mask and other ifconfig information on centos. After logging in to centos, run the following command to open and edit the ifcfg-eth0 file:

```bash
vi /etc/sysconfig/network-scripts/ifcfg-eth0
```

Look for the line starting with PREFIX0 and change the network mask there (note that this is specified in bits, so "PREFIX0=22" means the network mask is 255.255.252.0, while "PREFIX0=30" would mean 255.255.255.252).
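The prefix is just the number of leading one-bits in the mask, so the conversion can be sketched in a few lines of Java (class and method names are mine, for illustration only):

```java
public class PrefixToNetmask {
    // Convert a CIDR prefix length (0-32) to a dotted-decimal netmask:
    // shift a run of 32 one-bits left so only `prefix` of them remain set.
    static String netmask(int prefix) {
        long mask = prefix == 0 ? 0 : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return String.format("%d.%d.%d.%d",
                (mask >> 24) & 0xFF, (mask >> 16) & 0xFF,
                (mask >> 8) & 0xFF, mask & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(netmask(22)); // 255.255.252.0
        System.out.println(netmask(24)); // 255.255.255.0
        System.out.println(netmask(30)); // 255.255.255.252
    }
}
```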

After the change is completed, restart the network service by running the following command:

```bash
service network restart
```