Hadoop in pseudodistributed mode
Pseudodistributed mode enables you to run a one-node Hadoop cluster on your PC. It is the step before moving to a real distributed cluster.
To install Hadoop 1.2.1:
wget http://mirrors.ircam.fr/pub/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar xvzf hadoop-1.2.1.tar.gz
export HADOOP_INSTALL=~/hadoop-1.2.1
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
To check Hadoop is correctly installed, type
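hadoop version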
By default, Hadoop runs in standalone mode: it uses the local file system (file:///) and a local job runner.
Let’s go further with pseudodistributed mode.
To enable password-less SSH login (needed by the start scripts):
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
To check, type
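ssh localhost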
(On Mac OS, activate Settings > Sharing > Remote Login.)
To use HDFS as the default file system, edit conf/core-site.xml:
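A minimal configuration, following the standard Hadoop 1.x pseudodistributed setup:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>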
To format the HDFS namenode:
hadoop namenode -format
To start HDFS:
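start-dfs.sh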
The namenode will be accessible at http://localhost:50070/.
To start MapReduce 1:
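start-mapred.sh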
The jobtracker will be available at http://localhost:50030/.
MapReduce 2 (YARN)
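Note that YARN ships with Hadoop 2.x; with a Hadoop 2 installation, you would start it with
start-yarn.sh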
The resource manager will be available at http://localhost:8088/.
Now you’re ready to submit your first MapReduce job!
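For example, you can estimate pi with the examples jar bundled in the Hadoop 1.2.1 tarball (jar name assumed from the 1.2.1 layout):
hadoop jar $HADOOP_INSTALL/hadoop-examples-1.2.1.jar pi 10 100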
Hive is the SQL-like engine based on Hadoop HDFS + MapReduce.
To install Hive, you should have installed Hadoop first, and then:
wget http://apache.crihan.fr/dist/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
tar xzf apache-hive-1.1.0-bin.tar.gz
export HIVE_INSTALL=~/apache-hive-1.1.0-bin
export PATH=$PATH:$HIVE_INSTALL/bin
To launch the Hive shell, type
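hive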
Just check everything works by creating your first table:
hive -e "create table dummy (value STRING);"
hive -e "show tables;"
Now you’re ready to query high volumes of data as if you were using MySQL!
HBase is a great NoSQL database, built on Hadoop HDFS. To simplify, it is a database with a single, ordered key, split into regions that are distributed over the cluster nodes in a redundant way. It is particularly useful when you have millions of writes to perform simultaneously on billions of documents, where no traditional database can do the job, such as a social application with many users liking and commenting on a lot of user-generated content.
To install HBase:
wget http://mirrors.ircam.fr/pub/apache/hbase/hbase-1.0.0/hbase-1.0.0-bin.tar.gz
tar xzf hbase-1.0.0-bin.tar.gz
export HBASE_HOME=~/hbase-1.0.0
export PATH=$PATH:$HBASE_HOME/bin
export JAVA_HOME=/usr
(On Mac you cannot set JAVA_HOME to /usr/bin/java; set it to /usr so that $JAVA_HOME/bin/java resolves to /usr/bin/java.)
To start the database:
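start-hbase.sh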
To launch the HBase shell, type
hbase shell
and you can run your commands:
version
status
create 'table1', 'columnfamily'
put 'table1', 'row1', 'columnfamily:a', 'value1'
list
scan 'table1'
get 'table1', 'row1'
disable 'table1'
drop 'table1'
Sqoop is a great connector to perform imports and exports between a database and HDFS. To install Sqoop:
wget http://apache.crihan.fr/dist/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
tar xzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
export HADOOP_COMMON_HOME=~/hadoop-1.2.1
export HADOOP_MAPRED_HOME=~/hadoop-1.2.1
export HCAT_HOME=~/apache-hive-1.1.0-bin/hcatalog
export SQOOP_HOME=~/sqoop-1.4.5.bin__hadoop-1.0.0
export PATH=$PATH:$SQOOP_HOME/bin
To check it works, type
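sqoop version
As a sketch of an import, assuming a local MySQL database named mydb with a table users (hypothetical names):
sqoop import --connect jdbc:mysql://localhost/mydb --username root --table users --target-dir /user/$USER/users

To install Spark: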
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop1.tgz
tar xvzf spark-1.3.0-bin-hadoop1.tgz
To start the Spark master and slaves:
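~/spark-1.3.0-bin-hadoop1/sbin/start-all.sh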
The Spark interface will be available at http://localhost:8080/.
To conclude, here is a nice classification of the different levels of interaction, from @Hortonworks:
Don’t forget to stop the processes. List running processes with the
jps
command, and stop them with:
stop-dfs.sh
stop-mapred.sh
stop-hbase.sh
~/spark-1.3.0-bin-hadoop1/sbin/stop-all.sh