To persist the data across cluster shutdowns, you can for example attach a 30 GB EBS volume to each instance with the option --ebs-vol-size=30, provided the data you need to persist requires less than 150 GB (5 × 30). You'll also need to switch HDFS to persistent mode (see below).
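As a sketch, a spark-ec2 launch with the EBS option might look like this (key pair, identity file and cluster name are placeholders):

```shell
# Launch 5 slaves, each with a 30 GB EBS volume attached.
# -k / -i: your AWS key pair name and private key file (placeholders here).
./spark-ec2 -k my-keypair -i ~/my-keypair.pem \
  -s 5 --ebs-vol-size=30 \
  launch my-spark-cluster
```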
The Spark master web interface is available on the master node on port 8080. The web UI lists all jobs: completed, failed, or running.
Clicking on a running job, for example a spark-shell, leads to its "application UI", which runs on ports 4040, 4041, … and is served by the job's driver process, not by the Spark master process.
We’re now ready!
Analyze the top ranking pages from October 2008 to February 2010 with Wikipedia statistics
en Barack_Obama 997 123091092
en Barack_Obama%27s_first_100_days 8 850127
en Barack_Obama,_Jr 1 144103
en Barack_Obama,_Sr. 37 938821
en Barack_Obama_%22HOPE%22_poster 4 81005
en Barack_Obama_%22Hope%22_poster 5 102081
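The ranking logic itself is simple; here is a minimal sketch of the per-line parsing and sorting, run locally on the sample lines above (the HDFS path in the comment is hypothetical):

```scala
// Each Wikipedia pagecount line is "<project> <page> <count> <bytes>".
case class PageCount(project: String, page: String, count: Long)

def parse(line: String): Option[PageCount] = line.split(" ") match {
  case Array(project, page, count, _) => Some(PageCount(project, page, count.toLong))
  case _ => None // skip malformed lines
}

// A few lines from the sample above.
val sample = Seq(
  "en Barack_Obama 997 123091092",
  "en Barack_Obama,_Sr. 37 938821",
  "en Barack_Obama_%22HOPE%22_poster 4 81005")

// Rank by descending view count.
val ranked = sample.flatMap(parse).sortBy(-_.count)

// On the cluster, the same logic runs over the full dataset, e.g.:
// sc.textFile("hdfs:///wikistats").flatMap(parse).sortBy(-_.count).take(10)
```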
That’s it! 63 seconds to get all the data ranked!
Analyze the principal concepts in the Wikipedia pages
By default, it is the ephemeral HDFS that is started, with 3.43 TB of capacity and a web interface available on port 50070.
Persistent HDFS offers 31.5 GB of space by default, which is small. If you have not chosen the EBS option, persistent HDFS is not worth using. If you did choose the EBS option, you presumably sized the EBS volumes according to your needs, and you can switch to persistent HDFS so that you can stop the cluster to save money while keeping the data:
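The switch is done from the master node; the paths below are the spark-ec2 defaults, where the two HDFS installs live side by side:

```shell
# On the master node: stop the ephemeral HDFS and start the persistent one.
/root/ephemeral-hdfs/bin/stop-dfs.sh
/root/persistent-hdfs/bin/start-dfs.sh
```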
The persistent HDFS web interface will be available on port 60070 by default.
Depending on your choice, export the path to the ephemeral or persistent HDFS:
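For example (the variable name and master hostname are placeholders; 9000 and 9010 are the spark-ec2 default namenode ports for ephemeral and persistent HDFS — check core-site.xml on your master to confirm):

```shell
# Ephemeral HDFS (default namenode port 9000 on spark-ec2):
export HDFS_ROOT=hdfs://<master-hostname>:9000
# Or persistent HDFS (default namenode port 9010):
# export HDFS_ROOT=hdfs://<master-hostname>:9010
```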
The FR database, 12.8 GB in size, is divided into 103 blocks, each replicated 3 times, thus using 38.63 GB of our 3.43 TB total cluster capacity, i.e. around 10 GB on each 826 GB datanode.
The EN database, 48.4 GB in size, is divided into 388 blocks, also replicated 3 times. The data represents 7% of the cluster capacity, which is fine.
and check that everything works well, in particular that the files can be read:
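A quick sanity check from the master node (the /wikistats path is a hypothetical upload location):

```shell
# List the uploaded files and read a few lines back through HDFS.
/root/ephemeral-hdfs/bin/hadoop fs -ls /wikistats
/root/ephemeral-hdfs/bin/hadoop fs -text /wikistats/* | head
```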
During the first step, memory usage on each executor was 378.4 MB out of 3.1 GB, holding around 15 blocks of RDD data.
By the last step, it was 56 blocks and 609.6 MB per node.
The whole computation took 6.3 hours, including 1.8 h for the first mapping step and 4.3 h for the RowMatrix computation.
Here is an example of a concept we get:
Concept terms: commune, district, province, espace, powiat, voïvodie, bavière, située, française, région
Concept docs: Saint-Laurent-des-Bois, District autonome de Iamalo-Nenetsie, Saint-Aubin-sur-Mer, Bourbach, Saint-Créac, Saint-Remimont, Pont-la-Ville, Saint-Amans, Lavau, Districts de l'île de Man
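Concept terms are read off the V matrix of the SVD (M ≈ U S Vᵀ): each column of V weights every term's contribution to one concept, and the top-weighted terms describe it. A minimal sketch with a hypothetical tiny V (4 terms × 2 concepts):

```scala
// Hypothetical vocabulary and term-weight matrix V (rows = terms, cols = concepts).
val terms = Array("commune", "district", "province", "river")
val V = Array(
  Array(0.9, 0.1),
  Array(0.7, 0.2),
  Array(0.6, 0.0),
  Array(0.1, 0.8))

// The k terms with the largest absolute weight describe the concept.
def topTerms(concept: Int, k: Int): Seq[String] =
  terms.indices.sortBy(i => -math.abs(V(i)(concept))).take(k).map(i => terms(i))
```

With Spark MLlib, V comes out of `new RowMatrix(rows).computeSVD(k, computeU = true).V`; the extraction of top terms per column is the same idea.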
Index Wikipedia pages with Elasticsearch to search the points of interests by category around a position
Let’s launch an Elasticsearch cluster with AWS OpsWorks, creating a very small Chef repository and a layer with awscli, apt, ark, elasticsearch, java, scala and sbt-extras as recipes.
I’ll begin with a minimal mapping, in particular to prevent dynamic mapping from inferring wrong types, in mapping.json:
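As an illustration, a minimal ES 1.x-style mapping sketch with hypothetical field names, keeping the infobox type unanalyzed for exact matching and declaring the coordinates as a geo_point:

```json
{
  "page": {
    "properties": {
      "title":    { "type": "string" },
      "infobox":  { "type": "string", "index": "not_analyzed" },
      "location": { "type": "geo_point" }
    }
  }
}
```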
Once Elasticsearch is installed, let’s create an index and an alias pointing at it, so that we can later put multiple indexes behind the same alias…
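Something along these lines (index and alias names are hypothetical; the mapping file is the one above):

```shell
# Create a versioned index, apply the mapping for the "page" type,
# then point a stable alias at it.
curl -XPUT 'localhost:9200/wikipedia_v1'
curl -XPUT 'localhost:9200/wikipedia_v1/_mapping/page' -d @mapping.json
curl -XPOST 'localhost:9200/_aliases' -d '{
  "actions": [ { "add": { "index": "wikipedia_v1", "alias": "wikipedia" } } ]
}'
```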
Let’s download a Wikipedia XML export and launch the Spark shell:
and parse the data, filter the pages with images and coordinates, and send them to Elasticsearch for bulk indexing
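The core of the parsing is extracting the infobox type from the raw wikitext; a minimal sketch (the regex and function name are illustrative):

```scala
// Extract the infobox type from raw wikitext, e.g. "{{Infobox settlement|...}}".
val Infobox = """\{\{Infobox ([^|}\n]+)""".r.unanchored

def infoboxType(wikitext: String): Option[String] = wikitext match {
  case Infobox(name) => Some(name.trim)
  case _             => None
}

// The full job would keep only pages with an image and coordinates, then
// bulk-index the resulting RDD with the elasticsearch-spark connector, e.g.:
// import org.elasticsearch.spark._
// pages.saveToEs("wikipedia/page")
```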
Let’s see what kinds of infoboxes we have and how many are geolocalized:
We can see that we have 10610 communes, 2148 railway stations, 2112 islands…
To find the relevant points of interest around Paris :
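A geo_distance filter does the job; a sketch in the ES 1.x query syntax, assuming the geo_point field is named "location" as in the mapping above:

```shell
# Hypothetical search: geolocalized pages within 5 km of the centre of Paris.
curl -XPOST 'localhost:9200/wikipedia/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "5km",
          "location": { "lat": 48.8566, "lon": 2.3522 }
        }
      }
    }
  }
}'
```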