I am engaged on a gig where SparkR will be used to run R jobs, and I am currently working on the configuration. Once I have troubleshot all the issues, I will post the steps to get the Spark cluster working.
This is a follow-up to my initial frustration with SparkR. I was pretty close to giving up on figuring out why SparkR would not work on the cluster I was working with. After much back and forth with Shivaram (the SparkR package author), we were finally able to get SparkR working as a cluster job.
SparkR can be downloaded from https://github.com/amplab-extras/SparkR-pkg
SparkR configuration
Install R
The instructions below are for Ubuntu.
root@host:~$ nano /etc/apt/sources.list
root@host:~$ apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
root@host:~$ apt-get install python-software-properties
root@host:~$ add-apt-repository ppa:marutter/rdev
root@host:~$ apt-get update
root@host:~$ apt-get install r-base
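To confirm the install worked, you can check the version from the shell (the exact version reported will depend on which repository you added):
root@host:~$ R --version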
Configure Java for R
wget http://cran.cnr.berkeley.edu/src/contrib/rJava_0.9-6.tar.gz
sudo R CMD INSTALL rJava_0.9-6.tar.gz
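If the rJava build complains that it cannot find Java, running R CMD javareconf first usually sorts out the JDK paths; afterwards you can do a quick sanity check that the package loads and can start a JVM. A minimal check, assuming a JDK is already installed on the box:
sudo R CMD javareconf
Rscript -e 'library(rJava); .jinit(); print("rJava OK")'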
Modify spark-env.sh
#!/usr/bin/env bash
export STANDALONE_SPARK_MASTER_HOST=hostname.domain.com
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
export SPARK_LOCAL_IP=xxx.xxx.xxx.xxx
### Let's run everything with the JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOG_DIR=/var/log/spark
if [ -n "$HADOOP_HOME" ]; then
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi
### Comment out the two lines above and uncomment the following if
### you want to run with the Scala version that is included with the package
#export SCALA_HOME=${SCALA_HOME:-/usr/lib/spark/scala}
#export PATH=$PATH:$SCALA_HOME/bin
Note: This will need to be done on the worker nodes as well.
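On small worker machines it can also be worth pinning down how many cores and how much memory each worker advertises; otherwise jobs may sit waiting for resources. A hedged sketch of additional spark-env.sh entries; the variable names are standard Spark settings, but the values here are only placeholders to adjust for your hardware:
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=1g
export SPARK_DAEMON_MEMORY=512m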
Switch user to hdfs
su - hdfs
Git Clone
git clone https://github.com/amplab-extras/SparkR-pkg
Building SparkR
cd SparkR-pkg
SPARK_HADOOP_VERSION=2.2.0-cdh5.0.0-beta-2 ./install-dev.sh
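If you intend to submit against YARN rather than the standalone master, the SparkR-pkg build takes extra environment variables; the exact variable names below are an assumption on my part, so double-check the SparkR-pkg README before using them:
USE_YARN=1 SPARK_YARN_VERSION=2.2.0 SPARK_HADOOP_VERSION=2.2.0-cdh5.0.0-beta-2 ./install-dev.sh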
Copy SparkR-pkg to worker nodes
Example: scp -r SparkR-pkg hdfs@worker1:
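With more than a couple of workers, a small loop saves some typing (worker1 and worker2 are placeholder hostnames; substitute your own):
for host in worker1 worker2; do scp -r SparkR-pkg hdfs@$host: ; done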
Execute Test Job
cd SparkR-pkg/
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
source /etc/spark/conf/spark-env.sh
./sparkR examples/pi.R spark://hostname.domain.com:7077
Sample results
hdfs@xxxx:~/SparkR-pkg$ ./sparkR examples/pi.R spark://xxxx.xxxxx.com:7077
./sparkR: line 13: /tmp/sparkR.profile: Permission denied
Loading required package: SparkR
Loading required package: methods
Loading required package: rJava
[SparkR] Initializing with classpath /var/lib/hadoop-hdfs/SparkR-pkg/lib/SparkR/sparkr-assembly-0.1.jar
14/02/27 16:29:09 INFO Slf4jLogger: Slf4jLogger started
Pi is roughly 3.14018
Num elements in RDD 200000
hdfs@xxxx:~/SparkR-pkg$
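If the job hangs instead of producing output like the above, it can help to first confirm the same example runs in local mode before pointing it at the cluster master; this assumes the example script accepts any Spark master URL, including a local one, as its first argument:
./sparkR examples/pi.R local[2]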
Hi, Abdul! Thanks for such a great post! I have been working through the installation in the same way, but when I try to execute the Pi example on YARN, I get the error:
“Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory”
Could you point me to where I should look to solve it? It seems that I missed some parameters or something like that.
I should mention that I'm trying to run this on a commodity cluster with pretty weak machines. The issue arises even when I just try to get a Spark context from R.
Thanks in advance
Hi there,
Sorry I haven't gotten to this until now. Do you mind sharing what error you are getting?
I can look at my logs and tell you exactly what I did to make it work.
Thanks,
Abdul