Setting up a Standalone Apache Spark Cluster

A few people have asked me how to set up a small standalone Spark cluster for testing. Here are the scripts for Ubuntu 15.10 to install Apache Spark 1.6.0, which should have you up and running very quickly. This guide assumes you have done a fresh installation of Ubuntu 15.10 Server or equivalent.

This example uses two Intel NUC machines named nucluster0 and nucluster1. You will need to execute these commands on both machines.

1. Environment

1.1. Configure your hosts

You need to ensure all the boxes in the cluster are able to connect to each other.

sudo nano /etc/hosts  

You will need to add all the hosts including the current machine. Once you have done this you should be able to ping between the boxes.

127.0.0.1       localhost  
172.16.10.200   nucluster0  
172.16.10.201   nucluster1  

1.2. Setting up the environment.

Install the core packages: Java, Maven (a build manager) and a Basic Linear Algebra Subprograms (BLAS) library.

sudo apt-get install openjdk-7-jdk  
sudo apt-get install maven  
sudo apt-get install libatlas3-base libopenblas-base  

1.3. Download Apache Spark 1.6.0

Go to the Apache Spark downloads page and select:
Spark release: 1.6.0 (Jan 04 2016)
Package type: Pre-built for Hadoop 2.6 and later
Download type: Select Apache Mirror

Then, by clicking on the spark-1.6.0.tgz link, you will be able to see your local mirror; in my case it is an Australian mirror.

Download, unpack and move to /opt:

wget http://apache.mirror.digitalpacific.com.au/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz  
tar xvf spark-1.6.0-bin-hadoop2.6.tgz  
rm spark-1.6.0-bin-hadoop2.6.tgz  
sudo mv spark-1.6.0-bin-hadoop2.6 /opt  

1.4. Install Scala

This installs Scala 2.10.x, since the pre-built Apache Spark binaries target Scala 2.10; Apache Spark needs to be built from source if you want to use Scala 2.11.x.

A single apt-get install -f will report some dependency issues, but executing the command twice resolves them.

wget http://downloads.typesafe.com/scala/2.10.6/scala-2.10.6.deb  
sudo dpkg -i scala-2.10.6.deb  
sudo apt-get install -f  
sudo apt-get install -f  
sudo apt-get autoremove  
rm scala-2.10.6.deb  

1.5. Install the Scala Build Tool (sbt)

Add the sbt package source and key, then install.

echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list  
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823  
sudo apt-get update  
sudo apt-get install sbt  

1.6. Configure password-less SSH

You will need the machines in the cluster to be able to SSH to each other without a password. Run ssh-keygen on each machine to create a key (accept the prompts), then copy the public key to each host with ssh-copy-id.

ssh-keygen  
ssh-copy-id mike@nucluster0  
ssh-copy-id mike@nucluster1  

1.7. (Optional) Check network speed

Because Apache Spark, like most cluster computing, relies on a fast network, it is worth confirming you don't have any strange network issues.
On the first machine (nucluster0):

sudo apt-get install iperf  
iperf -s  

On the second machine (nucluster1):

sudo apt-get install iperf  
iperf -c nucluster0  

You should see output like this, confirming that the gigabit networking is performing well (937 Mbits/sec ≈ 117 MB/sec):

------------------------------------------------------------
Client connecting to nucluster0, TCP port 5001  
TCP window size: 85.0 KByte (default)  
------------------------------------------------------------
[  3] local 172.16.10.201 port 55882 connected with 172.16.10.200 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   937 Mbits/sec

2. Apache Spark Configuration

There are quite a few files which need to be configured, all of which should be located in the /opt/spark-1.6.0-bin-hadoop2.6/conf folder if you followed the guide above.

2.1. spark-env.sh

The main exports here configure where to find the Spark files and how to connect to the master.

Make sure you configure SPARK_WORKER_CORES and SPARK_WORKER_MEMORY correctly for your servers.

export SPARK_HOME=/opt/spark-1.6.0-bin-hadoop2.6/  
export SPARK_MASTER_IP=nucluster0  
export SPARK_MASTER_PORT=7077  
export SPARK_MASTER_WEBUI_PORT=9080  
export SPARK_WORKER_CORES=4  
export SPARK_WORKER_MEMORY=8G  
export SPARK_WORKER_INSTANCES=1  
export SPARK_DAEMON_MEMORY=512M  
export SPARK_LOCAL_DIRS=/tmp/spark  

2.2. spark-defaults.conf

This file contains the Spark job defaults. Each of my two boxes has 16GB RAM, and Databricks recommends not exceeding 75% of total system memory for spark.executor.memory + spark.driver.memory (here 8G + 4G = 12G, which is 75% of 16GB). If you have more than 32GB RAM then remove the -XX:+UseCompressedOops flag, since compressed object pointers only help for heaps smaller than about 32GB.

spark.master                      spark://nucluster0:7077  
spark.executor.memory             8G  
spark.driver.maxResultSize        4G  
spark.driver.memory               4G  
spark.sql.shuffle.partitions      5000  
spark.serializer                  org.apache.spark.serializer.KryoSerializer  
spark.kryoserializer.buffer.max   128m  
spark.executor.extraJavaOptions   -XX:+UseCompressedOops  
spark.executor.instances          2  
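
Once the cluster is running (section 3), you can sanity-check that these defaults are actually being picked up by querying the SparkContext from spark-shell. A minimal sketch (sc is the SparkContext the shell provides):

// Run inside spark-shell to confirm spark-defaults.conf is being applied
println(sc.master)                               // expect spark://nucluster0:7077
println(sc.getConf.get("spark.executor.memory")) // expect 8G
println(sc.getConf.get("spark.serializer"))      // expect the Kryo serializer class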

2.3. slaves

My slaves file is simple but it could list many more machines.

# A Spark Worker will be started on each of the machines listed below.
nucluster0  
nucluster1  

2.4. (Optional) hive-site.xml

If you want to try running a standalone SQL server on Spark, there is a component called the thriftserver. To use it you will need to configure a metastore location in hive-site.xml (this file does not exist by default, so you will need to create it). You will probably want to set a more permanent data location than /tmp if you want to use this component more seriously.

<configuration>  
<property>  
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>  
<property>  
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>  
<property>  
  <name>hive.metastore.warehouse.dir</name>
  <value>/tmp/hive-warehouse</value>
  <description>Where to store metastore data</description>
</property>  
</configuration>  
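
The thriftserver is not the only consumer of this file: any HiveContext started on the cluster will use the same metastore. As a quick check from spark-shell (a sketch assuming the Hive support bundled in the pre-built package, where sqlContext is a HiveContext; the table name smoke_test is just an example):

// sqlContext in spark-shell is a HiveContext backed by the Derby metastore above
sqlContext.sql("CREATE TABLE IF NOT EXISTS smoke_test (key INT, value STRING)")
sqlContext.sql("SHOW TABLES").show()  // smoke_test should appear in the listing

Note that the embedded Derby metastore only allows a single connection, so stop the thriftserver before running this kind of check.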

3. Running Apache Spark

Now that you have installed Spark, you will want to start it.

3.1. Start the Spark Master

You can do that by executing the following on the master (nucluster0), which starts the master and the workers listed in the slaves file:

/opt/spark-1.6.0-bin-hadoop2.6/sbin/start-all.sh

Once that has run you should be able to see the Spark Master UI at http://localhost:9080/ on the master.

The master and all workers can be stopped by calling:

/opt/spark-1.6.0-bin-hadoop2.6/sbin/stop-all.sh

3.2. Submitting jobs

There are two ways you can submit jobs: interactively via spark-shell, or as a packaged JAR via spark-submit.

3.2.1 spark-shell

To submit a file interactively.

/opt/spark-1.6.0-bin-hadoop2.6/bin/spark-shell -i [filename].scala
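
The file passed with -i is ordinary Scala evaluated inside the shell, where sc (the SparkContext) and sqlContext are already defined. A minimal sketch (the file name test-job.scala is just an example):

// test-job.scala - run with: spark-shell -i test-job.scala
// sc is provided by spark-shell, so no SparkContext setup is needed
val data = sc.parallelize(1 to 1000000)
val evens = data.filter(_ % 2 == 0).count()
println(s"Even numbers counted across the cluster: $evens")
sys.exit(0)  // optional: quit the shell once the script has finished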

3.2.2 spark-submit

To submit a packaged JAR (and because we installed Maven and sbt, you should be able to build one).

/opt/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class "SimpleApp" target/scala-2.10/simple-project_2.10-1.0.jar
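
For reference, a JAR matching the path and class name above can be produced with sbt. This is a sketch following the standard Spark quick start layout rather than anything specific to this cluster; the README path is only used because it exists on every node.

build.sbt (produces target/scala-2.10/simple-project_2.10-1.0.jar):

name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"

src/main/scala/SimpleApp.scala:

// SimpleApp.scala - counts lines containing "a" and "b" in the Spark README
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // the master URL and memory settings come from spark-defaults.conf
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    val logFile = "/opt/spark-1.6.0-bin-hadoop2.6/README.md"
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")

    sc.stop()
  }
}

Running sbt package in the project root then produces the JAR used in the spark-submit command above.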

3.3. Running the thriftserver

The thriftserver is an application which runs on the cluster once the Spark Master has been started. It will show up in the Master UI as org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.

Start:

/opt/spark-1.6.0-bin-hadoop2.6/sbin/start-thriftserver.sh

Stop:

/opt/spark-1.6.0-bin-hadoop2.6/sbin/stop-thriftserver.sh
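
Clients talk to the thriftserver over the HiveServer2 JDBC protocol, on port 10000 by default. As a connectivity check, here is a sketch using the Hive JDBC driver (ThriftCheck is a hypothetical name, and it assumes the org.apache.hive.jdbc driver is on your classpath):

// ThriftCheck.scala - hypothetical smoke test against the running thriftserver
import java.sql.DriverManager

object ThriftCheck {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // 10000 is the default HiveServer2/thriftserver port
    val conn = DriverManager.getConnection("jdbc:hive2://nucluster0:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SHOW TABLES")
    while (rs.next()) println(rs.getString(1))
    rs.close(); stmt.close(); conn.close()
  }
}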

With these instructions you should be good to go.
