At SVDS, we’ll often run Spark on YARN in production. Add some artful tuning, and this works pretty well. However, developers typically build and test Spark applications in standalone mode, not on YARN.
Rather than get bitten by the idiosyncrasies involved in running Spark on YARN vs. standalone when you go to deploy, here’s a way to set up a development environment for Spark that more closely mimics how it’s used in the wild.
A simple YARN “cluster” on your laptop
Run a Docker image for a standalone CDH instance:
docker run -d --name=mycdh svds/cdh
When the logs
docker logs -f mycdh
stop going wild, you can run the usual Hadoop-isms to set up a workspace:
docker exec -it mycdh hadoop fs -ls /
docker exec -it mycdh hadoop fs -mkdir -p /tmp/blah
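If you want something sitting in HDFS to poke at, a quick sanity check might look like this (copying the container’s /etc/hosts is just an arbitrary example, not part of the original setup):
docker exec -it mycdh hadoop fs -put /etc/hosts /tmp/blah/
docker exec -it mycdh hadoop fs -ls /tmp/blah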
Run Spark
Then, it’s pretty straightforward to run Spark against YARN:
docker exec -it mycdh \
spark-submit \
--master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
/usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.3-hadoop2.6.0-cdh5.4.3.jar \
1000
Note that you can submit a Spark job to run in either “yarn-client” or “yarn-cluster” mode.
- In “yarn-client” mode, the Spark driver runs outside of YARN and logs to the console (see the example below), while all Spark executors run as YARN containers.
- In “yarn-cluster” mode, all Spark executors run as YARN containers, and the Spark driver runs as a YARN container too; YARN manages all the logs.
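For example, here’s the same SparkPi submission in yarn-client mode; everything except the --master flag is unchanged from the example above, and the driver output (SparkPi’s result line) shows up directly in your terminal:
docker exec -it mycdh \
spark-submit \
--master yarn-client \
--class org.apache.spark.examples.SparkPi \
/usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.3-hadoop2.6.0-cdh5.4.3.jar \
1000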
You can also run the Spark shell so that any workers spawned run in YARN:
docker exec -it mycdh spark-shell --master yarn-client
or
docker exec -it mycdh pyspark --master yarn-client
Your Application
Ok, so SparkPi is all fine and dandy, but how do I run a real application?
Let’s make up an example. Say you build your Spark project on your laptop in the /Users/myname/mysparkproject/ directory. When you build with Maven or SBT, it typically builds and leaves jars under a /Users/myname/mysparkproject/target/ directory; for SBT, it’ll look like /Users/myname/mysparkproject/target/scala-2.10/.
The idea here is to make these jars directly accessible both from your laptop’s build process and from inside the cdh container.
When you start up the cdh container, map this local host directory up and into the container:
docker run -d -v ~/mysparkproject/target:/target --name=mycdh svds/cdh
where the -v option will make ~/mysparkproject/target available as /target within the container.
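A quick way to confirm the mount is working (just a sanity check, not part of the original walkthrough) is to list the directory from inside the container:
docker exec -it mycdh ls /target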
So, sbt clean assembly leaves a jar under ~/mysparkproject/target, which the container sees as /target, and you can run jobs using something like:
docker exec -it mycdh \
spark-submit \
--master yarn-cluster \
--name MyFancySparkJob-name \
--class org.markmims.MyFancySparkJob \
/target/scala-2.10/My-assembly-1.0.1.20151013T155727Z.c3c961a51c.jar \
myarg
The --name arg makes your job easier to find in the midst of multiple YARN jobs.
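For example, you can filter the YARN application list on that name (the yarn application -list command is covered in the next section; piping it through grep on your laptop is just a convenience):
docker exec mycdh yarn application -list | grep MyFancySparkJob-name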
Logs
While a Spark job is running, you can get its YARN “applicationId” from:
docker exec -it mycdh yarn application -list
Or, if it finished already, just list things out with more conditions:
docker exec -it mycdh yarn application -list -appStates FINISHED
You can dig through the YARN-consolidated logs after the job is done by using:
docker exec -it mycdh yarn logs -applicationId <applicationId>
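Since docker exec streams the command’s output back to your laptop, you can also dump the consolidated logs into a local file for easier searching (the filename here is arbitrary):
docker exec mycdh yarn logs -applicationId <applicationId> > myapp-yarn.log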
Consoles
Web consoles are critical for application development. Spend time up front getting ports open or forwarded correctly for all environments. Don’t wait until you’re actually trying to debug something critical to figure out how to forward ports to see the staging UI.
YARN ResourceManager UI
YARN gives you quite a bit of info about the system right from the ResourceManager on its IP address and web GUI port (usually 8088):
open http://<resource-manager-ip>:<resource-manager-port>/
Spark Staging UI
YARN also conveniently proxies access to the Spark staging UI for a given application. This looks like:
open http://<resource-manager-ip>:<resource-manager-port>/proxy/<applicationId>
For example:
open http://localhost:8088/proxy/application_1444330488724_0005/
Ports and Docker
There are a few ways to deal with accessing port 8088 of the YARN ResourceManager from outside of the Docker container. I typically use SSH for everything, and just forward ports out to localhost on the host. However, most people will expect to access ports directly on the docker-machine ip address. To do that, you have to map each port when you first spin up the cdh container using the -p option:
docker run -d -v ~/mysparkproject/target:/target -p 8088:8088 --name=mycdh svds/cdh
Then you should be good to go with something like:
open http://`docker-machine ip`:8088/
to access the YARN console.
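If you’d rather take the SSH route mentioned above, a minimal sketch looks like the following, assuming you can SSH into whatever machine is running the Docker daemon and that the container’s port 8088 is mapped there; <user> and <docker-host> are placeholders:
ssh -N -L 8088:localhost:8088 <user>@<docker-host>
open http://localhost:8088/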
Tips and Gotchas
- The Docker image svds/cdh is quite large (2GB). I like to do a separate docker pull apart from any docker run commands, just to isolate the download. In fact, I recommend pinning the CDH version for the same reason: docker pull svds/cdh:5.4.0, for instance, then refer to it that way throughout (docker run -d --name=mycdh svds/cdh:5.4.0, as shown after this list). That ensures you’re not littering your laptop’s filesystem with Docker layers from multiple CDH versions. The bare svds/cdh (equivalent to svds/cdh:latest) floats with the most recent Cloudera versions.
- I’m using a CDH container here… but there’s an HDP one on the way as well. Keep an eye out for it on SVDS’s Docker Hub page.
- Web consoles and forwarding ports through SSH: get these working early (see the Consoles section above).
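For example, the pinned-version workflow from the first tip looks like this (5.4.0 is just the tag used above as an example; the -v and -p flags are the same mount and port mapping described earlier):
docker pull svds/cdh:5.4.0
docker run -d -v ~/mysparkproject/target:/target -p 8088:8088 --name=mycdh svds/cdh:5.4.0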
Bonus
Ok, so the downside here is that the image is fat. The upside is that it lets you play with the full suite of CDH-based tools. I’ve tested out (besides the Spark variations above):
Impala shell
docker exec -it mycdh impala-shell
HBase shell
docker exec -it mycdh hbase shell
Hive
echo "show tables;" | docker exec mycdh beeline -u jdbc:hive2://localhost:10000 -n username -p password -d org.apache.hive.jdbc.HiveDriver