A few people have asked me how to set up a small Standalone Spark Cluster for testing. Here are the scripts for Ubuntu 15.10 to install Apache Spark 1.6.0, which should have you up and running very quickly. This guide assumes you have done a new installation of Ubuntu 15.10 Server or equivalent.
This example uses two Intel NUC machines with the machine names nucluster0 and nucluster1. You will need to execute these commands on both machines.
1.1. Configure your hosts
You need to ensure all the boxes in the cluster are able to connect to each other.
sudo nano /etc/hosts
You will need to add all the hosts including the current machine. Once you have done this you should be able to ping between the boxes.
127.0.0.1       localhost
172.16.10.200   nucluster0
172.16.10.201   nucluster1
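Once the hosts file is in place, a quick connectivity check from either box should succeed (plain ping, using the hostnames above):

ping -c 3 nucluster0
ping -c 3 nucluster1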
1.2. Setting up the environment.
Install your core packages of Java, Maven (a build manager) and a Basic Linear Algebra Subprograms library.
sudo apt-get install openjdk-7-jdk
sudo apt-get install maven
sudo apt-get install libatlas3-base libopenblas-base
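If you want to confirm the core packages installed correctly, these version checks are a simple sanity test:

java -version
mvn -version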
1.3. Download Apache Spark 1.6.0
Go to the Apache Spark downloads page and select:
Spark release: 1.6.0 (Jan 04 2016)
Package type: Pre-built for Hadoop 2.6 and later
Download type: Select Apache Mirror
Then by clicking on the spark-1.6.0-bin-hadoop2.6.tgz link you will be able to see your local mirror. In my case it is an Australian mirror.
Download, unpack and move to /opt:
wget http://apache.mirror.digitalpacific.com.au/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
tar xvf spark-1.6.0-bin-hadoop2.6.tgz
rm spark-1.6.0-bin-hadoop2.6.tgz
sudo mv spark-1.6.0-bin-hadoop2.6 /opt
1.4. Install Scala
This code installs Scala 2.10.x, as Apache Spark needs to be built from source if you want to use Scala 2.11.x.
If you do a single apt-get install -f you will see some dependency issues, but executing the command twice will resolve them.
wget http://downloads.typesafe.com/scala/2.10.6/scala-2.10.6.deb
sudo dpkg -i scala-2.10.6.deb
sudo apt-get install -f
sudo apt-get install -f
sudo apt-get autoremove
rm scala-2.10.6.deb
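A quick check that Scala is on the path and at the expected version:

scala -version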
1.5. Install Scala Build Tools (sbt)
Add the Scala Build Tools source and install.
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823 sudo apt-get update sudo apt-get install sbt
1.6. Configure password-less SSH
You will need the machines in the cluster to be able to SSH to each other without requiring a password. Run ssh-keygen to create a key on each machine (there will be a few questions), then copy it to the other machines.
ssh-keygen
ssh-copy-id mike@nucluster0
ssh-copy-id mike@nucluster1
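To confirm password-less SSH is working (using the same mike user as above), this should print the remote hostname without prompting for a password:

ssh mike@nucluster1 hostname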
1.7. (Optional) Check network speed
Because Apache Spark, like most other cluster computing, requires a fast network, it is worth confirming you don't have any strange network issues.
On the first machine (nucluster0):
sudo apt-get install iperf
iperf -s
On the second machine:
sudo apt-get install iperf
iperf -c nucluster0
You should see output like this; in my case it confirms the gigabit networking is performing well (937 Mbits/sec = ~117 MB/sec):
------------------------------------------------------------
Client connecting to nucluster0, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 172.16.10.201 port 55882 connected with 172.16.10.200 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   937 Mbits/sec
2. Apache Spark Configuration
There are quite a few files which need to be configured, all of which should be located in the /opt/spark-1.6.0-bin-hadoop2.6/conf folder if you followed the guide above.
2.1. spark-env.sh
The main exports here configure where to find the Spark files and how to connect to the master.
You will need to make sure that you configure the right amount of SPARK_WORKER_MEMORY for your servers.
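Spark ships only templates in its conf directory, so if spark-env.sh does not exist yet you can create it from the bundled template first (assuming the install path used in this guide) and then add the exports below:

cd /opt/spark-1.6.0-bin-hadoop2.6/conf
cp spark-env.sh.template spark-env.sh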
export SPARK_HOME=/opt/spark-1.6.0-bin-hadoop2.6/
export SPARK_MASTER_IP=nucluster0
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=9080
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8G
export SPARK_WORKER_INSTANCES=1
export SPARK_DAEMON_MEMORY=512M
export SPARK_LOCAL_DIRS=/tmp/spark
2.2. spark-defaults.conf
This file contains the Spark job defaults. Each of my two boxes has 16GB RAM but Databricks recommends not exceeding 75% of total system memory for spark.driver.memory. If you have more than 32GB RAM then remove the spark.executor.extraJavaOptions -XX:+UseCompressedOops line, as compressed object pointers only help with heaps under 32GB.
spark.master                     spark://nucluster0:7077
spark.executor.memory            8G
spark.driver.maxResultSize       4G
spark.driver.memory              4G
spark.sql.shuffle.partitions     5000
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  128m
spark.executor.extraJavaOptions  -XX:+UseCompressedOops
spark.executor.instances         2
2.3. slaves
My slaves file is simple but it could list many more machines.
# A Spark Worker will be started on each of the machines listed below.
nucluster0
nucluster1
2.4. (Optional) hive-site.xml
If you want to try running a standalone SQL server on Spark there is a component called the thriftserver. To use it you will need to configure a metastore location in hive-site.xml (this file does not exist by default so you will need to create it). You will probably want to set a more permanent data location than /tmp if you want to use this component more seriously.
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp/hive-warehouse</value>
    <description>Where to store metastore data</description>
  </property>
</configuration>
3. Running Apache Spark
Now that you have installed Spark, you will want to start it, right?
3.1. Start the Spark Master
You can do that by executing the command:
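A minimal sketch, assuming the install location and slaves file from this guide (Spark's bundled start-all.sh launches the master plus the workers listed in conf/slaves; start-master.sh starts only the master):

/opt/spark-1.6.0-bin-hadoop2.6/sbin/start-all.sh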
Once that has run you should be able to see the Spark Master UI at http://localhost:9080/ on the master.
The Spark Master can be stopped by calling:
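Again assuming the paths from this guide (stop-all.sh stops the master and the workers):

/opt/spark-1.6.0-bin-hadoop2.6/sbin/stop-all.sh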
3.2. Submitting jobs
There are two ways you can submit jobs: interactively via the spark-shell, or by submitting a packaged JAR with spark-submit.
To run a file interactively:
/opt/spark-1.6.0-bin-hadoop2.6/bin/spark-shell -i [filename].scala
To submit a JAR (and because we installed sbt you should be able to build one):
/opt/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class "SimpleApp" target/scala-2.10/simple-project_2.10-1.0.jar
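If you are building that JAR yourself, a minimal sketch of the build step from the project root (the simple-project directory name is just an example; any standard sbt layout targeting Scala 2.10 will do):

cd simple-project
sbt package

The packaged JAR ends up under target/scala-2.10/, matching the path used in the spark-submit command above.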
3.3. Running the thriftserver
The thriftserver is an application which runs after the Spark Master has been started. It will show up as a running application in the Spark Master UI.
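A sketch of starting it, assuming the paths and master address used throughout this guide (start-thriftserver.sh accepts the same options as spark-submit):

/opt/spark-1.6.0-bin-hadoop2.6/sbin/start-thriftserver.sh --master spark://nucluster0:7077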
With these instructions you should be good to go.