1. Launch an "Ubuntu Server 16.04 LTS (HVM), SSD Volume Type - ami-7c803d1c" instance -> continue with the free tier eligible type (selected by default) -> edit the security group and under Type select "All traffic" for inbound and outbound on all ports.

Windows users: don't forget to use PuTTYgen to create the .ppk file from your key pair.

To log in as ubuntu@(ip) and NOT ec2-user: in PuTTY select Connection -> SSH -> Auth, click Browse and select your .ppk file, then click Open and Yes.


2. Log in as the root user to install the base packages (Java 8) **

sudo su
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Note: Any other version of Java is fine as long as you keep the directory paths consistent in the steps below.

NOTE**: I got an "unable to resolve host ip..." error when I typed sudo su; I solved it with this info: https://forums.aws.amazon.com/message.jspa?messageID=495274

3. Check the java version

java -version


4. Download the latest stable Hadoop using wget from one of the Apache mirrors (either of the two URLs below works; you only need one).

wget http://apache.claz.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

wget http://apache.mirrors.pair.com/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xzf hadoop-2.7.3.tar.gz



5. Create a directory where Hadoop will store its data. We will set this directory path in hdfs-site.xml.

mkdir hadoopdata


6. Add the Hadoop-related environment variables to your bash file.

vi ~/.bashrc

Copy and paste these environment variables (the PATH line lets you run the Hadoop scripts from any directory later on).

export HADOOP_HOME=/home/ubuntu/hadoop-2.7.3
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Save and exit, then use this command to refresh the bash settings.

source ~/.bashrc


7. Set up the Hadoop environment for passwordless SSH access. Passwordless SSH configuration is a mandatory installation requirement. It is even more useful in a distributed environment.

ssh-keygen -t rsa -P ''
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
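If the `ssh localhost` check below still prompts for a password, file permissions are the usual culprit: sshd silently ignores keys kept in group- or world-writable files. A quick sketch of tightening them (the mkdir/touch lines are no-ops when the files already exist):

```shell
# sshd refuses to use authorized_keys when these are group/world-writable
mkdir -p ~/.ssh                     # create the directory if it is missing
touch ~/.ssh/authorized_keys        # create the key file if it is missing
chmod 700 ~/.ssh                    # owner-only access to the directory
chmod 600 ~/.ssh/authorized_keys    # owner-only read/write on the key file
```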

## check passwordless ssh access to localhost

ssh localhost

# then exit from the inner localhost shell

exit



8. Set the Hadoop config files. We need to edit the files below in order for Hadoop to function properly.

# go to the directory where all the config files are present

cd /home/ubuntu/hadoop-2.7.3/etc/hadoop

## Edit core-site.xml and add the following text between the <configuration> tags (the hadoop.tmp.dir property carries the description "A base for other temporary directories.").
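The original snippet is not preserved here; a typical minimal core-site.xml body for a single-node 2.7.3 setup looks like the following. The hadoop.tmp.dir value assumes the hadoopdata directory created in step 5, and the port 9000 default-FS address is the common single-node convention; adjust both if your setup differs.

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/ubuntu/hadoopdata</value>
  <description>A base for other temporary directories.</description>
</property>
```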

# Before you edit hadoop-env.sh below, get the Java home directory using readlink in case it is not java-8-oracle:

readlink -f `which java`

Example output: /usr/lib/jvm/java-8-oracle/jre/bin/java (NOTE THE JAVA_HOME PATH: give just the base directory, i.e. /usr/lib/jvm/java-8-oracle, without the trailing jre/bin/java)

## Set JAVA_HOME in hadoop-env.sh, replacing the export command that is already there:


export JAVA_HOME=/usr/lib/jvm/java-8-oracle


# copy mapred-site.xml from mapred-site.xml.template

cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml

# Add the following text between the <configuration> tags.
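The snippet itself is missing here; the standard mapred-site.xml entry that tells MapReduce to run on YARN (which the start-yarn step below assumes) is:

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```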



## Edit hdfs-site.xml and add the following text between the <configuration> tags.
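The text to add is not shown; assuming this section covers hdfs-site.xml, a typical single-node body uses replication factor 1 and puts the NameNode/DataNode storage under the hadoopdata directory from step 5 (the hdfs/namenode and hdfs/datanode subpaths are illustrative choices, not requirements):

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/ubuntu/hadoopdata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/ubuntu/hadoopdata/hdfs/datanode</value>
</property>
```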



## Edit yarn-site.xml and add the following text between the <configuration> tags.
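Again the snippet is missing; assuming this section covers yarn-site.xml, the one property a pseudo-distributed setup needs so that MapReduce jobs can shuffle is:

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```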


9. Format the HDFS file system via the NameNode (after installing Hadoop, we have to format the HDFS file system once before it will work). Move back up to the Hadoop home directory first:

cd $HADOOP_HOME

bin/hdfs namenode -format

10. Issue the following commands to start Hadoop:

$ cd sbin/
$ ./start-dfs.sh
$ ./start-yarn.sh

# If you have properly done step 6 (the .bashrc environment variables), you can start Hadoop from any directory. (Note: run as the user that installed Hadoop.)

$ start-all.sh

#OR you can separately start required services as below:

# Name node:

$ hadoop-daemon.sh start namenode

# Data node:

$ hadoop-daemon.sh start datanode

# Resource Manager:

$ yarn-daemon.sh start resourcemanager

# Node Manager:

$ yarn-daemon.sh start nodemanager


## Running the pi example (from the Hadoop home directory):


$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 16 1100



## Running wordcount

# Download the bible+shakes dataset (either source works; you only need one):

$ wget http://lintool.github.com/Cloud9/data/bible+shakes.nopunc.gz

$ wget https://raw.githubusercontent.com/umddb/datascience-fall14/master/lab6/bible%2Bshakes.nopunc

# If you downloaded the .gz version, unzip it first:

$ gunzip bible+shakes.nopunc.gz


Now create a directory in Hadoop’s Distributed File System using:

$ hdfs dfs -ls /
$ hdfs dfs -mkdir /input

Go to the folder where bible+shakes.nopunc was downloaded and from that folder run the command:

$ hdfs dfs -copyFromLocal bible+shakes.nopunc /input


$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /out1


$ bin/hdfs dfs -cat /out1/part-r-00000


# After creating your own WordCount.java file (source linked at the bottom), set the classpath and compile it:


$ export HADOOP_CLASSPATH=/home/ubuntu/hadoop-2.7.3/etc/hadoop:/home/ubuntu/hadoop-2.7.3/share/hadoop/common/lib/*:/home/ubuntu/hadoop-2.7.3/share/hadoop/common/*:/home/ubuntu/hadoop-2.7.3/share/hadoop/hdfs:/home/ubuntu/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/home/ubuntu/hadoop-2.7.3/share/hadoop/hdfs/*:/home/ubuntu/hadoop-2.7.3/share/hadoop/yarn/lib/*:/home/ubuntu/hadoop-2.7.3/share/hadoop/yarn/*:/home/ubuntu/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/home/ubuntu/hadoop-2.7.3/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar


$ mkdir wordcount_classes

$ javac -d wordcount_classes WordCount.java -classpath $HADOOP_CLASSPATH

$ jar -cvf WCN.jar -C wordcount_classes  .



$ bin/hadoop jar WCN.jar WordCount /input /out2


Source of WordCount.java: http://hadoop.apache.org/docs/r3.0.0-alpha2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html