Thursday, March 31, 2016

Setup Spark Cluster in CentOS VMs

This post summarizes my experience in setting up a test environment for a Spark cluster using CentOS VMs.

After setting up the VMs (refer to this post for setting up and configuring CentOS VMs using VirtualBox: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html), we will designate the following topology:

centos01: master
centos02: slave
centos03: slave
centos04: slave
centos05: slave

where centos01, centos02, etc. refer to the hostnames of the VMs.
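For the hostnames to resolve, every VM needs the name-to-IP mappings in its /etc/hosts. A minimal sketch, assuming a VirtualBox host-only network; the 192.168.56.x addresses below are placeholders for your actual VM IPs:

```bash
# append the cluster hostnames to /etc/hosts on every VM;
# replace the 192.168.56.x addresses with your actual VM IPs
cat >> /etc/hosts <<'EOF'
192.168.56.101 centos01
192.168.56.102 centos02
192.168.56.103 centos03
192.168.56.104 centos04
192.168.56.105 centos05
EOF
```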

1. Configure the spark cluster


Download and unzip spark-1.6.0-bin-hadoop2.6.tgz to "/root/spark".
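A minimal sketch of the download step, assuming the archive.apache.org mirror carries the 1.6.0 release:

```bash
# download the prebuilt Spark package and unpack it to /root/spark
cd /root
wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
tar -xzf spark-1.6.0-bin-hadoop2.6.tgz
mv spark-1.6.0-bin-hadoop2.6 spark
```

Next, run the following commands on centos01 to specify the list of slaves: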

```bash
cp spark/conf/slaves.template spark/conf/slaves
vi spark/conf/slaves
```

In spark/conf/slaves, add the following lines:

centos02
centos03
centos04
centos05

Make sure that the firewalls are turned off on centos01/2/3/4/5 and that passwordless ssh is configured from centos01 to centos02/3/4/5, as sketched below.
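A sketch of those two prerequisites, assuming CentOS 7's firewalld (on CentOS 6, stop the iptables service instead):

```bash
# on every VM: stop and disable the firewall (CentOS 7 / firewalld)
systemctl stop firewalld
systemctl disable firewalld

# on centos01: generate a key pair once (accept the defaults),
# then copy the public key to every slave
ssh-keygen -t rsa
for host in centos02 centos03 centos04 centos05; do
  ssh-copy-id root@${host}
done
```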

Run the following commands to copy "/root/spark" from centos01 to centos02/3/4/5:

```bash
rsync -a /root/spark/ root@centos02:/root/spark
rsync -a /root/spark/ root@centos03:/root/spark
rsync -a /root/spark/ root@centos04:/root/spark
rsync -a /root/spark/ root@centos05:/root/spark
```
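Equivalently, the copy can be written as a loop over the slave hostnames:

```bash
# same copy as above, expressed as a loop
for host in centos02 centos03 centos04 centos05; do
  rsync -a /root/spark/ root@${host}:/root/spark
done
```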

2. Start and stop the spark cluster


Run the following command on centos01 to start the spark cluster:

```bash
spark/sbin/start-all.sh
```


To stop the spark cluster, run the following command on centos01:

```bash
spark/sbin/stop-all.sh
```
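One way to verify that the cluster came up is to check the daemon processes and the master's web UI (8080 is the default port of the standalone master web UI):

```bash
# on centos01 a Master process should be listed by jps,
# and on each slave a Worker process
jps

# the master web UI also lists the registered workers
curl -s http://centos01:8080 | grep -c Worker
```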

3. Run the spark shell


After the spark cluster is started, we can start the spark shell by running the following command:

```bash
spark/bin/spark-shell --master spark://centos01:7077
```

Port 7077 is the default port of the Spark master on centos01.
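As a quick end-to-end sanity check, a one-off expression can be piped into the shell; if the cluster is healthy, the job is distributed to the workers and the sum (5050.0) is printed:

```bash
# run a single expression non-interactively; spark-shell reads from stdin
echo 'sc.parallelize(1 to 100).sum()' | spark/bin/spark-shell --master spark://centos01:7077
```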

4. Submit a spark job to spark cluster


After the spark cluster has been set up, assuming the master is centos01, run the following command to submit a spark job:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master spark://centos01:7077 word-count.jar
```

Or, with more fine-grained resource options:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master spark://centos01:7077 --executor-memory 2G --total-executor-cores 8 word-count.jar 1000
```

Below are two other alternatives I tested, using a YARN cluster and a Mesos cluster.

4.1. Submit a spark job via YARN cluster

Suppose we have a resource management cluster such as Hadoop YARN set up; we can then submit the Spark job to YARN for processing as well (YARN will act as the Spark master).

To run an application on the YARN cluster instead, set up and configure HDFS and YARN using these posts: http://czcodezone.blogspot.sg/2016/03/setup-hdfs-cluster-in-centos-vms.html and http://czcodezone.blogspot.sg/2016/03/setup-hadoop-yarn-on-centos-vms.html, then start HDFS and YARN.

Run the following command on each VM to edit the .bashrc:

```bash
vi .bashrc
```

In the .bashrc of each VM centos01/2/3/4/5, add the following lines:

export HADOOP_HOME=/root/hadoop
export HADOOP_CONF_DIR=/root/hadoop/etc/hadoop
export YARN_CONF_DIR=/root/hadoop/etc/hadoop

Run the following command on each VM to update .bashrc:

```bash
source .bashrc
```
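Before submitting, it is worth confirming that YARN sees the NodeManagers (assuming Hadoop is installed under /root/hadoop as in the linked posts):

```bash
# list the NodeManagers registered with the ResourceManager
hadoop/bin/yarn node -list
```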

To submit a spark job, run the following command:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn --deploy-mode cluster word-count.jar
```

Or:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn-cluster word-count.jar
```

Or, with more fine-grained resource options:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn --deploy-mode cluster --executor-memory 2G --num-executors 20 word-count.jar 1000
```
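In cluster deploy mode the driver output ends up in the YARN container logs rather than on the local console. Assuming log aggregation is enabled in yarn-site.xml, they can be fetched with the application id that spark-submit prints (the id below is a placeholder):

```bash
# <application-id> is a placeholder such as application_1459400000000_0001;
# spark-submit prints it, and the ResourceManager UI (port 8088) shows it too
hadoop/bin/yarn logs -applicationId <application-id>
```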

4.2. Submit a spark job via Mesos cluster


To run an application on the Mesos cluster instead, set up and configure HDFS and Mesos using this post: http://czcodezone.blogspot.sg/2016/03/setup-mesos-cluster-in-centos-vms.html, then start HDFS and Mesos.

Put the Spark binary package into HDFS (run the following commands on the Hadoop namenode centos01):

```bash
wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
hadoop/bin/hdfs dfs -mkdir /pkg
hadoop/bin/hdfs dfs -put spark-1.6.0-bin-hadoop2.6.tgz /pkg/spark-1.6.0-bin-hadoop2.6.tgz
```
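A quick check that the package landed in HDFS:

```bash
# the uploaded package should now be listed
hadoop/bin/hdfs dfs -ls /pkg
```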

Run the following commands to create and modify spark-env.sh in spark/conf:

```bash
cp spark/conf/spark-env.sh.template spark/conf/spark-env.sh
vi spark/conf/spark-env.sh
```

In the spark-env.sh, add the following lines:

export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://centos01:9000/pkg/spark-1.6.0-bin-hadoop2.6.tgz

Here centos01:9000 refers to the Hadoop namenode.

To submit a spark job, run the following command:

```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master mesos://mesos01:5050 word-count.jar
```

Important: mesos01 must be the current leader master node; otherwise a command such as "spark-shell --master mesos://mesos01:5050" will cause the spark-shell to hang on the line "No credential provided. attempting to register without authentication". The solution is to find out which node is the active leader master by running the command "mesos-resolve `cat /etc/mesos/zk`", and then launch the spark shell with the active leader master in the --master option instead, as sketched below.
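A sketch of that workaround, resolving the leader through ZooKeeper and pointing the shell at it:

```bash
# mesos-resolve prints the current leader as host:port
LEADER=$(mesos-resolve `cat /etc/mesos/zk`)
spark/bin/spark-shell --master mesos://${LEADER}
```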
