Which notebooks for my computations?
IPython was the first shell to introduce the great “notebook” feature, which displays your computations nicely in a web browser instead of a standard shell:
This allows you to share your computations with others, who can understand, modify, and adapt them to their needs live.
Some specific notebooks appeared for other languages, such as Spark Notebook.
But the most promising one is Zeppelin, from the Apache Foundation. Zeppelin has many advantages:
simplicity: for the beginner or the marketer in the company, it is easier to manipulate the data, in particular thanks to SparkSQL queries and a nice display widget;
language-agnosticism, with a real plugin architecture named “interpreters”. The “cluster” function of IPython or Spark Notebook is quite difficult to understand and customize. Scala and Python are the first two main languages available.
Let’s launch a Spark cluster on EC2 and do some computations in our Zeppelin notebook.
Launching the Spark Cluster on EC2
You need an AWS account with an EC2 key pair, and credentials with
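As a sketch, launching the cluster with the spark-ec2 script shipped in the ec2/ directory of the Spark distribution could look like this (the cluster name, key pair name, identity file, and credential values below are placeholders, not values from this article):

```shell
# Export your AWS credentials (placeholder values).
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."

# Launch a cluster named "my-spark-cluster" with 2 slaves in eu-west-1.
# --key-pair and --identity-file must match your EC2 key pair.
./ec2/spark-ec2 \
  --key-pair=my-keypair \
  --identity-file=~/.ssh/my-keypair.pem \
  --region=eu-west-1 \
  --slaves=2 \
  launch my-spark-cluster
```

The same script also accepts `login` and `destroy` actions with the same cluster name, which is handy to connect to the master or tear the cluster down when you are done.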
Your cluster master’s hostname should appear in the logs:
Generating cluster's SSH key on master... Warning: Permanently added 'ec2-XX-XX-XX-XXX.eu-west-1.compute.amazonaws.com,XX.XX.XX.XXX' (RSA) to the list of known hosts.
Be sure to have the following ports open in the master’s EC2 security group (the master security group’s name is the cluster name with ‘-master’ appended; in our case
8080: the Spark master web interface, where jobs (as well as Spark shells, which are long-running jobs) are displayed,
7077: the TCP interface to submit jobs,
both open to access from the instance on which the Zeppelin notebook will be installed.
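If you manage security groups from the command line, opening these two ports might look like this with the AWS CLI (a sketch only: the group name assumes a cluster named my-spark-cluster, and 10.0.0.5 stands in for your Zeppelin instance’s IP):

```shell
# Allow the Zeppelin instance (here 10.0.0.5) to reach the Spark master web UI.
aws ec2 authorize-security-group-ingress \
  --group-name my-spark-cluster-master \
  --protocol tcp --port 8080 \
  --cidr 10.0.0.5/32

# Allow it to reach the job submission port as well.
aws ec2 authorize-security-group-ingress \
  --group-name my-spark-cluster-master \
  --protocol tcp --port 7077 \
  --cidr 10.0.0.5/32
```

Restricting the source to a /32 CIDR keeps the master closed to the rest of the internet.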
Download and compile Zeppelin:
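A minimal sketch of this step, assuming you build from the Apache git repository with Maven (the repository URL and build options may differ depending on your Zeppelin and Spark versions):

```shell
# Fetch the Zeppelin sources.
git clone https://github.com/apache/zeppelin.git
cd zeppelin

# Compile, skipping tests to speed up the build.
mvn clean package -DskipTests
```

The Maven build can take a while on a small instance, as it compiles all the bundled interpreters.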
Add the line:
Now it’s time to start (or restart) the Zeppelin web server:
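Zeppelin ships a daemon script for this; from the Zeppelin directory:

```shell
# Start the Zeppelin web server.
./bin/zeppelin-daemon.sh start

# Or, after a configuration change, restart it.
./bin/zeppelin-daemon.sh restart
```

The same script accepts `stop` and `status` to shut the server down or check whether it is running.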
The Zeppelin interface is available at
Configure your EC2 Spark Cluster in Zeppelin
Go to the Interpreter menu.
- Edit your ‘spark’ interpreter
- In the master property, replace local[*] with your master hostname, prefixed with spark:// and with the port appended; in our example this would be
Now you’re ready for computation.
Create a new Note and open it.
Add a few lines
Click on start.
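For example, a first paragraph could sum the first million integers with the default Scala interpreter. This is a minimal sketch, not from the article: in a Zeppelin note, `sc` (the SparkContext) is already defined by the spark interpreter, so the paragraph needs no setup.

```scala
// `sc` is the SparkContext pre-defined by Zeppelin's spark interpreter.
val rdd = sc.parallelize(1 to 1000000)

// Sum as Long to avoid Int overflow.
val total = rdd.map(_.toLong).reduce(_ + _)

println(s"total = $total")  // total = 500000500000
```

Running the paragraph sends the job to the EC2 cluster configured in the interpreter settings, not to a local Spark.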
You can see your Zeppelin shell running as an application in the Spark cluster at