Hadoop in pseudodistributed mode

Pseudodistributed mode lets you run a one-node Hadoop cluster on your PC, with each daemon running as a separate local process. It is the last step before moving to a real distributed cluster.

To install Hadoop 1.2.1 :

wget http://mirrors.ircam.fr/pub/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar xvzf hadoop-1.2.1.tar.gz
export HADOOP_INSTALL=~/hadoop-1.2.1
export PATH=$PATH:$HADOOP_INSTALL/bin

To check that Hadoop is correctly installed, type hadoop version.

By default, Hadoop runs in standalone mode: it uses the local filesystem (file:///) and a local job runner, without starting any daemon.

Let’s go further with pseudodistributed mode.

To enable password-less SSH to localhost (the start scripts use SSH to launch the daemons) :

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

To check, type ssh localhost.

(On Mac OS activate Settings > Sharing > Remote Login)


To use HDFS as the default filesystem, in conf/core-site.xml :

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

and in conf/hdfs-site.xml, set the replication factor to 1 (there is only one datanode) :

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

To format the HDFS namenode :

hadoop namenode -format

To start HDFS :

start-dfs.sh

The namenode will be accessible at http://localhost:50070/.
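
With the daemons up, you can check that HDFS answers with a few fs shell commands. The file copied here (Hadoop's own README.txt) and the paths are just examples :

```shell
# Create your home directory on HDFS, copy a local file in,
# list it, and read it back
hadoop fs -mkdir /user/$USER
hadoop fs -put $HADOOP_INSTALL/README.txt /user/$USER/
hadoop fs -ls /user/$USER
hadoop fs -cat /user/$USER/README.txt
```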

MapReduce 1

In conf/mapred-site.xml :

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

To start MapReduce 1 :

start-mapred.sh

The jobtracker will be available at http://localhost:50030/.

MapReduce 2 (YARN)

Note that YARN ships with Hadoop 2.x, not with the 1.2.1 tarball above. In yarn-site.xml (under conf/ or etc/hadoop/, depending on the distribution) :

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

To start YARN :

start-yarn.sh

The resource manager will be available at http://localhost:8088/.

Now you’re ready to submit your first MapReduce job!
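
The logic of a wordcount job can first be simulated locally with plain Unix pipes; the hadoop jar line in the comment assumes the examples jar shipped in the 1.2.1 tarball :

```shell
# A MapReduce wordcount, simulated locally with Unix pipes:
# map = emit one word per line, shuffle = sort, reduce = count each key
echo "hello hadoop hello" | tr ' ' '\n' | sort | uniq -c

# The real thing, submitted to the cluster (input/ must exist on HDFS):
# hadoop jar $HADOOP_INSTALL/hadoop-examples-1.2.1.jar wordcount input output
```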


Hive

Hive is the SQL-like engine built on Hadoop HDFS + MapReduce.

To install Hive, first install Hadoop as above, then :

wget http://apache.crihan.fr/dist/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
tar xzf apache-hive-1.1.0-bin.tar.gz
export HIVE_INSTALL=~/apache-hive-1.1.0-bin
export PATH=$PATH:$HIVE_INSTALL/bin

To launch the Hive shell, type hive.

Check that everything works by creating your first table :

hive -e "create table dummy (value STRING);"
hive -e "show tables;"

Now you’re ready to query high volumes of data as if you were in MySQL!
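
To go one step further, you can load a small local file into the dummy table and query it; the file path and contents below are just an example :

```shell
# Write three rows into a local file, load it into the single-column
# table created above, then count the rows (should print 3)
printf 'a\nb\nc\n' > /tmp/dummy.txt
hive -e "LOAD DATA LOCAL INPATH '/tmp/dummy.txt' OVERWRITE INTO TABLE dummy;"
hive -e "SELECT COUNT(*) FROM dummy;"
```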


Hbase

Hbase is a great NoSQL database built on Hadoop HDFS. To simplify, it is a database with a single, ordered key, split into regions that are distributed over the cluster of nodes in a redundant way. It is particularly useful when you have millions of writes to perform simultaneously on billions of documents, where no traditional database can do the job, for example a social application where many users like and comment on a lot of user-generated content.

To install Hbase :

wget http://mirrors.ircam.fr/pub/apache/hbase/hbase-1.0.0/hbase-1.0.0-bin.tar.gz
tar xzf hbase-1.0.0-bin.tar.gz
export HBASE_HOME=~/hbase-1.0.0
export JAVA_HOME=/usr

(on Mac OS you cannot set JAVA_HOME to /usr/bin/java ; if you set it to /usr, Hbase will look for the java binary under /usr/bin)

In conf/hbase-site.xml, point Hbase at HDFS :

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost/hbase</value>
  </property>
</configuration>

To start the database :

$HBASE_HOME/bin/start-hbase.sh

To launch the Hbase shell, type $HBASE_HOME/bin/hbase shell. You can then run commands such as :

create 'table1', 'columnfamily'
put 'table1', 'row1', 'columnfamily:a', 'value1'
scan 'table1'
get 'table1', 'row1'
disable 'table1'; drop 'table1'


Sqoop

Sqoop is a great connector to import / export data between a relational database and HDFS.

To install Sqoop :

wget http://apache.crihan.fr/dist/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
tar xzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
export HADOOP_COMMON_HOME=~/hadoop-1.2.1
export HADOOP_MAPRED_HOME=~/hadoop-1.2.1
export HCAT_HOME=$HIVE_INSTALL/hcatalog
export SQOOP_HOME=~/sqoop-1.4.5.bin__hadoop-1.0.0
export PATH=$PATH:$SQOOP_HOME/bin

To check it works, type sqoop help.
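
An import then looks like the following sketch; the JDBC URL, credentials and table name are placeholders to adapt to your own database (the MySQL JDBC driver jar must also be dropped into $SQOOP_HOME/lib) :

```shell
# Import the MySQL table "users" of database "mydb" into HDFS,
# with a single map task (all names here are placeholders)
sqoop import \
  --connect jdbc:mysql://localhost/mydb \
  --username root \
  --table users \
  --target-dir /user/$USER/users \
  -m 1
```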


Spark

To install Spark :

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop1.tgz
tar xvzf spark-1.3.0-bin-hadoop1.tgz

To start the Spark master and slaves, from the spark-1.3.0-bin-hadoop1 directory :

sbin/start-all.sh

The Spark web interface will be available at http://localhost:8080/.
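
To check the install, you can run one of the examples bundled with the 1.3.0 tarball, or open the interactive shell (paths relative to the spark-1.3.0-bin-hadoop1 directory) :

```shell
# Compute an approximation of pi with the bundled SparkPi example
./bin/run-example SparkPi 10

# Or open the interactive Scala shell
./bin/spark-shell
```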

To conclude, here is a nice classification of the different levels of interaction, from @Hortonworks :

Big data technologies

Don’t forget to stop the processes when you’re done. List the running Java processes with the jps command and stop them with :

stop-hbase.sh
stop-mapred.sh (or stop-yarn.sh for YARN)
stop-dfs.sh
sbin/stop-all.sh (for Spark, from the Spark directory)