Spark has become the main big data tool: it is easy to use and very powerful. Built on top of some Hadoop classes, Spark lets you work with distributed memory (through RDDs) as if you were on a single machine, and ships three REPL shells, spark-shell, pyspark and sparkR, for Scala, Python and R respectively. You can submit a script with the spark-submit command and develop or test locally with the --master local[1] option before launching on a cluster of hundreds of instances, such as on EMR.

Here I recompile Spark on Windows, since this avoids problems one could encounter with the Windows binaries, such as software version mismatches. Downloading and compiling all the required dependencies on a slow network with standard hardware may take a day or so. The steps are the following:

• Download and install Java Development Kit 7 in a path such as C:\Java (it has to be a folder without spaces)

• Add C:\Python27\;C:\Java to your Path environment variable, and set your JAVA_HOME environment variable to C:\Java.

Check that everything works well:

Your install should come first.
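One way to see which install comes first is to walk the Path in order and list every match for each tool. The script below is only a sketch: the helper names find_on_path and report are mine, and the list of extensions is an assumption about what is worth probing on Windows. It uses nothing beyond the standard library, so it should run on the Python 2.7 installed above as well as on Python 3.

```python
import os


def find_on_path(name, path_value, extensions=("", ".exe", ".bat", ".cmd")):
    """Return every match for `name`, in the order the Path lists its folders."""
    matches = []
    for folder in path_value.split(os.pathsep):
        folder = folder.strip('"')
        if not folder:
            continue
        for ext in extensions:
            candidate = os.path.join(folder, name + ext)
            if os.path.isfile(candidate):
                matches.append(candidate)
    return matches


def report(names=("python", "java", "javac")):
    """Print the first match for each tool, plus the current JAVA_HOME."""
    path_value = os.environ.get("PATH", "")
    for name in names:
        matches = find_on_path(name, path_value)
        first = matches[0] if matches else "not found"
        print("%-6s -> %s" % (name, first))
    print("JAVA_HOME = %s" % os.environ.get("JAVA_HOME", "(not set)"))


if __name__ == "__main__":
    report()
```

If the first match for a tool is not the folder you just installed, another install earlier in the Path is shadowing it.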

Java should be 64-bit to allow increasing memory above 2 GB. Try javac
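To check whether the JVM on the Path is 64-bit, one option is to look at the banner printed by java -version, which mentions "64-Bit" for 64-bit HotSpot builds. This is a rough sketch, with helper names of my own, and it assumes a HotSpot-style banner; other JVMs may word it differently.

```python
import subprocess


def is_64bit_banner(version_text):
    """True if a `java -version` banner advertises a 64-bit VM."""
    return "64-Bit" in version_text


def java_version_banner(java="java"):
    """Run `java -version` and return its banner as text."""
    # The banner is written to stderr, so merge stderr into stdout.
    process = subprocess.Popen(
        [java, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    output, _ = process.communicate()
    return output.decode("utf-8", "replace")


if __name__ == "__main__":
    try:
        banner = java_version_banner()
    except OSError:
        print("java was not found on the Path")
    else:
        print(banner)
        print("64-bit JVM: %s" % is_64bit_banner(banner))
```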

I also had to change the memory options to -Xmx2048m (instead of 516m) in C:\Program Files (x86)\sbt\conf\sbtconfig.txt.
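To confirm that the memory change was picked up, a few lines of Python can read the -Xmx value back from the file. This is a sketch under assumptions: the function name max_heap_mb is mine, and I assume the file holds one JVM option per line, with # starting a comment. Adjust it if your sbtconfig.txt looks different.

```python
import re


def max_heap_mb(config_text):
    """Return the last -Xmx value found, in megabytes, or None if absent."""
    value = None
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # assumed comment syntax
        match = re.search(r"-Xmx(\d+)([mMgG])", line)
        if match:
            number = int(match.group(1))
            unit = match.group(2).lower()
            value = number * {"m": 1, "g": 1024}[unit]
    return value


if __name__ == "__main__":
    path = r"C:\Program Files (x86)\sbt\conf\sbtconfig.txt"
    try:
        with open(path) as handle:
            print("Max heap: %s MB" % max_heap_mb(handle.read()))
    except IOError:
        print("Could not read %s" % path)
```

A result below 2048 means the edit did not take effect, or was made in a different file from the one sbt reads.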