After setting up the VMs, we will use the following topology (refer to this guide for setting up and configuring CentOS VMs in VirtualBox: http://czcodezone.blogspot.sg/2016/03/setup-centos-vm-in-virtualbox-for.html):
centos01: master
centos02: slave
centos03: slave
centos04: slave
centos05: slave
where each name refers to the hostname of the corresponding VM.
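For these hostnames to resolve, each VM needs matching entries in /etc/hosts. A minimal sketch; the IP addresses below are placeholders, substitute the addresses of your own host-only network:
```bash
# Append to /etc/hosts on every VM; the IPs are assumptions for illustration.
cat >> /etc/hosts <<'EOF'
192.168.56.101 centos01
192.168.56.102 centos02
192.168.56.103 centos03
192.168.56.104 centos04
192.168.56.105 centos05
EOF
```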
1. Configure the spark cluster
Download and unpack spark-1.6.0-bin-hadoop2.6.tgz to "/root/spark", then run the following commands on centos01 to specify the list of slaves:
```bash
cp spark/conf/slaves.template spark/conf/slaves
vi spark/conf/slaves
```
In spark/conf/slaves, add the following lines:
centos02
centos03
centos04
centos05
Make sure that the firewalls are turned off on centos01/2/3/4/5 and that passwordless SSH is set up from centos01 to centos02/3/4/5, as sketched below.
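A minimal sketch of those two prerequisites, assuming CentOS 7 with firewalld and root access (adjust to your distribution and security policy):
```bash
# Run on every VM: stop and disable the firewall (assumes firewalld on CentOS 7).
systemctl stop firewalld
systemctl disable firewalld

# Run on centos01 only: generate a key once, then push it to each slave.
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
for host in centos02 centos03 centos04 centos05; do
  ssh-copy-id root@$host
done
```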
Run the following commands to copy "/root/spark" from centos01 to centos02/3/4/5:
```bash
rsync -a /root/spark/ root@centos02:/root/spark
rsync -a /root/spark/ root@centos03:/root/spark
rsync -a /root/spark/ root@centos04:/root/spark
rsync -a /root/spark/ root@centos05:/root/spark
```
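As a quick sanity check that the copy succeeded, you can list one of the Spark scripts on a slave:
```bash
# The start scripts should now exist on the slaves as well.
ssh root@centos02 'ls /root/spark/sbin/start-all.sh'
```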
2. Start and stop the spark cluster
Run the following command on centos01 to start the spark cluster:
```bash
spark/sbin/start-all.sh
```
To stop the spark cluster, run the following command on centos01:
```bash
spark/sbin/stop-all.sh
```
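To verify that the cluster came up after start-all.sh, you can check the Java processes on each node and the master web UI (8080 is the default master web UI port; the grep is just an illustration):
```bash
# On centos01 a Master process should be listed; on each slave, a Worker.
jps
# The master web UI on port 8080 lists the registered workers.
curl -s http://centos01:8080 | grep -i worker
```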
3. Run the spark shell
After the spark cluster is started, we can start the spark shell by running the following command:
```bash
spark/bin/spark-shell --master spark://centos01:7077
```
Port 7077 is the default port of the spark master on centos01.
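As a quick smoke test that the shell is really talking to the cluster, you can pipe a one-line job into it (the computation itself is just an arbitrary example):
```bash
# Sums the numbers 1..1000 on the cluster and prints the result.
echo 'println(sc.parallelize(1 to 1000).sum())' | \
  spark/bin/spark-shell --master spark://centos01:7077
```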
4. Submit a spark job to the spark cluster
After the spark cluster has been set up, assuming the master is centos01, run the following command to submit a spark job:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master spark://centos01:7077 word-count.jar
```
Or, with finer-grained resource settings:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master spark://centos01:7077 --executor-memory 2G --total-executor-cores 8 word-count.jar 1000
```
Below are two other alternatives I tested, using a YARN cluster and a Mesos cluster.
4.1. Submit a spark job via YARN cluster
Suppose we have a resource management cluster such as Hadoop YARN set up; we can submit the spark job to YARN for processing as well (YARN will then act as the spark master). To run an application on a YARN cluster instead, set up and configure HDFS and YARN using these guides: http://czcodezone.blogspot.sg/2016/03/setup-hdfs-cluster-in-centos-vms.html and http://czcodezone.blogspot.sg/2016/03/setup-hadoop-yarn-on-centos-vms.html, then start HDFS and YARN.
Run the following command to edit .bashrc:
```bash
vi .bashrc
```
In the .bashrc of each VM centos01/2/3/4/5, add the following lines:
export HADOOP_HOME=/root/hadoop
export HADOOP_CONF_DIR=/root/hadoop/etc/hadoop
export YARN_CONF_DIR=/root/hadoop/etc/hadoop
Run the following command on each VM to reload .bashrc:
```bash
source .bashrc
```
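A quick check that the variables are now visible in the shell that will run spark-submit:
```bash
# Both paths should print; empty output means .bashrc was not reloaded.
echo $HADOOP_CONF_DIR $YARN_CONF_DIR
```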
To submit a spark job, run the following command:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn --deploy-mode cluster word-count.jar
```
Or, equivalently, using the older master URL syntax:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn-cluster word-count.jar
```
Or, with finer-grained resource settings:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master yarn --deploy-mode cluster --executor-memory 2G --num-executors 20 word-count.jar 1000
```
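In cluster deploy mode the driver runs inside YARN, so its output ends up in the YARN logs rather than your terminal. You can fetch the logs with the application id printed by spark-submit (the id below is a placeholder):
```bash
# Replace the application id with the one reported by spark-submit.
hadoop/bin/yarn logs -applicationId application_1458000000000_0001
```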
4.2. Submit a spark job via mesos cluster
To run an application on a Mesos cluster instead, set up and configure HDFS and Mesos (link: http://czcodezone.blogspot.sg/2016/03/setup-mesos-cluster-in-centos-vms.html), then start HDFS and Mesos.
Put the spark binary package into HDFS (run the following commands on the hadoop namenode centos01):
```bash
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
hadoop/bin/hdfs dfs -mkdir /pkg
hadoop/bin/hdfs dfs -put spark-1.6.0-bin-hadoop2.6.tgz /pkg/spark-1.6.0-bin-hadoop2.6.tgz
```
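To confirm the upload:
```bash
# The tgz should be listed under /pkg in HDFS.
hadoop/bin/hdfs dfs -ls /pkg
```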
Run the following command to modify spark-env.sh in spark/conf:
```bash
vi spark/conf/spark-env.sh
```
In the spark-env.sh, add the following lines:
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://centos01:9000/pkg/spark-1.6.0-bin-hadoop2.6.tgz
Here centos01 is the hadoop namenode.
To submit a spark job, run the following command:
```bash
spark/bin/spark-submit --class com.tutorials.spark.WordCountDriver --master mesos://mesos01:5050 word-count.jar
```
Important: mesos01 must be the current leader master node; otherwise, a command such as "spark-shell --master mesos://mesos01:5050" will cause the spark-shell to hang on the line "No credential provided. attempting to register without authentication". The solution is to find out which node is the active leader master by running the command "mesos-resolve `cat /etc/mesos/zk`" and then launch the spark shell with the active leader master in the --master option instead.
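A minimal sketch of that workflow; the ZooKeeper address in the second command is an assumption based on the linked Mesos setup (adjust it to the contents of your /etc/mesos/zk):
```bash
# Resolve the current Mesos leader (host:port) and use it as the master.
LEADER=$(mesos-resolve `cat /etc/mesos/zk`)
spark/bin/spark-shell --master mesos://$LEADER

# Alternatively, let Spark resolve the leader through ZooKeeper directly
# (the zk host below is an assumed placeholder).
spark/bin/spark-shell --master mesos://zk://centos01:2181/mesos
```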