Spark has become the main big data tool, both easy to use and very powerful. Built on top of some Hadoop classes, Spark lets you use distributed memory (RDDs) as if you were working on a single machine, and offers 3 REPL shells, spark-shell, pyspark and sparkR, for the Scala, Python and R languages respectively. It is possible to submit a script with the spark-submit command and to develop or test locally with the --master local option, before launching on a cluster of hundreds of instances such as EMR.
Here I recompile Spark on Windows, since this avoids problems one could encounter with prebuilt Windows binaries, such as software version mismatches. Downloading and compiling all the required dependencies on a slow network and with standard hardware may take a day or so. The steps are the following:
- Download and install Java Development Kit 7 in a path such as C:\Java (it has to be a folder without spaces).
- Download and install Python 2.7.11.
- Add C:\Python27\;C:\Java to your Path environment variable, and set C:\Java as your JAVA_HOME environment variable.
- Check that everything works well and that our install arrives first on the Path.
Java should be 64bit to increase memory above 2G. Try javac
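These two checks can also be scripted. The following is only a minimal sketch: the helper names `find_on_path` and `is_64bit_jvm` are hypothetical, and the bitness test assumes a HotSpot JVM, whose `java -version` banner contains the string "64-Bit" on 64-bit builds.

```python
import os


def find_on_path(names, path=None):
    """Return every file on the search path matching one of `names`, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for folder in path.split(os.pathsep):
        if not folder:
            continue
        for name in names:
            candidate = os.path.join(folder, name)
            if os.path.isfile(candidate):
                hits.append(candidate)
    return hits


def is_64bit_jvm(version_output):
    """Guess the JVM bitness from the text printed by `java -version`.

    Assumption: a HotSpot JVM, which prints "64-Bit" in its banner.
    """
    return "64-Bit" in version_output


if __name__ == "__main__":
    # The first hit for each name is the one the shell will actually run.
    for exe in ("java.exe", "javac.exe", "python.exe"):
        print("%s -> %s" % (exe, find_on_path([exe])))
```

If the first entry listed for java.exe is not under C:\Java, another Java install is shadowing ours on the Path. Note that `java -version` writes its banner to stderr, so capture that stream if you want to feed it to `is_64bit_jvm`.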
I also had to change the memory options to -Xmx2048m (instead of 516m) in C:\Program Files (x86)\sbt\conf\sbtconfig.txt.
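The same change can be made with a few lines of Python rather than by hand. This is a sketch under assumptions: `set_max_heap` is a hypothetical helper, and it assumes the config file lists one JVM option per line with `#` marking comments.

```python
import re

# Matches a JVM max-heap option such as -Xmx512M or -Xmx2g.
XMX = re.compile(r"-Xmx\d+[kKmMgG]?")


def set_max_heap(config_text, new_size="2048m"):
    """Return config_text with every active -Xmx option set to new_size.

    Commented lines (starting with '#') are left alone. If no active
    -Xmx option is found, one is appended at the end.
    """
    lines = config_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if not line.lstrip().startswith("#") and XMX.search(line):
            lines[i] = XMX.sub("-Xmx" + new_size, line)
            replaced = True
    if not replaced:
        lines.append("-Xmx" + new_size)
    return "\n".join(lines) + "\n"
```

To apply it, read sbtconfig.txt, pass its content to `set_max_heap`, and write the result back. Files under C:\Program Files (x86) usually require an elevated prompt to modify.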
- Download and compile Spark :
that will create the Spark assembly JAR.
- There is a small bug for Pyspark
In the main function in python/pyspark/worker.py, add the two last lines to the process function :
Rebuild the pyspark.zip in the python/lib folder or download it here.
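Rebuilding the archive can be done with Python's standard zipfile module. The sketch below assumes that pyspark.zip holds the pyspark package at its top level (entries starting with `pyspark/`); the function name `rebuild_pyspark_zip` and its `python_dir` argument (the python folder of the Spark source tree) are hypothetical, and skipping `.pyc` files is my own choice.

```python
import os
import zipfile


def rebuild_pyspark_zip(python_dir):
    """Zip <python_dir>/pyspark into <python_dir>/lib/pyspark.zip and return its path."""
    package_dir = os.path.join(python_dir, "pyspark")
    lib_dir = os.path.join(python_dir, "lib")
    if not os.path.isdir(lib_dir):
        os.makedirs(lib_dir)
    zip_path = os.path.join(lib_dir, "pyspark.zip")
    archive = zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED)
    try:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".pyc"):
                    continue  # compiled files are not needed in the archive
                full = os.path.join(root, name)
                # Store paths relative to python_dir so entries start with "pyspark/".
                archive.write(full, os.path.relpath(full, python_dir))
    finally:
        archive.close()
    return zip_path
```

Call it with the path of the python folder inside your Spark checkout after editing worker.py.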
Now it’s done. Launch PySpark: