We will install and configure HDFS & YARN. This tutorial assumes that:
- You have 3 machines (1 master, 2 slaves).
- You have installed Ubuntu 18.04.4 LTS on your local (master) machine and on the slaves.
- You have updated and upgraded all packages on all machines.
- You have a working internet connection on the master and slaves.
- There is a user with the same name on all machines, such as spark-user.
- You have passwordless SSH access from the master to the slaves (a quick check is sketched after the note below).
- The hosts file (/etc/hosts) is configured on all machines, as shown below:
Host Name | IP Address | Info
---|---|---
master | 192.168.10.107 | Local Ubuntu 18 Machine
slave-1 | 192.168.10.140 | Virtual Ubuntu 18 Machine
slave-2 | 192.168.10.141 | Virtual Ubuntu 18 Machine
Note: If any of these criteria are not met, please check the first guide, Virtual Machine Installation & Configuration Guide.
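Before continuing, you can quickly verify the host-resolution and passwordless-SSH prerequisites from the master. This is a minimal check sketch, assuming the host names and the spark-user account listed above:
getent hosts slave-1 slave-2 # Should print the IP addresses taken from /etc/hosts
ssh slave-1 hostname # Should print "slave-1" without asking for a password
ssh slave-2 hostname # Should print "slave-2" without asking for a password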
This section applies to all machines (1 master, 2 slaves). Perform all steps on all machines.
Note #1: This approach is valid for Java 8 (1.8). For Java 11 or other versions, the steps are similar but not identical.
Note #2: OpenJDK, a free and open-source implementation of the Java Platform, Standard Edition, is used here.
- Install the Java Runtime Environment (JRE) and the Java Development Kit (JDK) on all machines:
sudo apt install openjdk-8-jre
sudo apt install openjdk-8-jdk
java -version
- Your JVM (Java Virtual Machine) directory should be located under the /usr/lib/jvm directory. We will create a symbolic link to your JVM for abstraction:
cd /usr/lib/jvm
sudo ln -sf java-8-openjdk* current-java
Now we can access our JVM using the path /usr/lib/jvm/current-java.
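To double-check where the link points before moving on (optional):
ls -l /usr/lib/jvm/current-java # Shows the link and its target, e.g. java-8-openjdk-amd64
readlink -f /usr/lib/jvm/current-java # Prints the fully resolved target path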
- Now we will add the Java home path, i.e. $JAVA_HOME:
su spark-user # You must be spark-user, if you are already, ignore this command
echo 'export JAVA_HOME=/usr/lib/jvm/current-java' >> ~/.bashrc
Attention: If Java will also be used by other users, do the same for those users.
- Confirm that the $JAVA_HOME variable is set correctly on all machines:
source ~/.bashrc # Reload the changed bashrc file
echo $JAVA_HOME
You should see /usr/lib/jvm/current-java on the command line.
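As an optional extra sanity check, run Java through the link itself to confirm the path resolves to a working JDK:
$JAVA_HOME/bin/java -version # Should report an OpenJDK 1.8 version
$JAVA_HOME/bin/javac -version # Confirms the compiler (JDK) is present, not just the JRE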
- Go to http://apache.claz.org/hadoop/common and find the latest 2.7 release. At the time this document was written (June 23, 2020), the latest 2.7 release was 2.7.7.
- We will download Hadoop on the master first and then transfer the Hadoop files to the slaves. On the master machine, download and untar the Hadoop files from the command line, then copy them to the remote slaves:
cd ~ # Go to home directory
wget http://apache.claz.org/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar -xvf hadoop-2.7.7.tar.gz # Untar Hadoop files
rm hadoop-2.7.7.tar.gz # Delete redundant archive file
scp -r ./hadoop-2.7.7 spark-user@slave-1:~/ # Copy files to slave-1
scp -r ./hadoop-2.7.7 spark-user@slave-2:~/ # Copy files to slave-2
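If you want to confirm the copy succeeded before continuing, a quick remote listing is enough (a sketch using the host names assumed above):
ssh slave-1 'ls ~/hadoop-2.7.7' # Should list bin, etc, lib, sbin, share, ...
ssh slave-2 'ls ~/hadoop-2.7.7' # Same listing expected on slave-2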
- On all machines, we will move the Hadoop files under the /opt directory for maintainability and set permissions:
# For master Machine
sudo mv hadoop-2.7.7 /opt/ # Move Hadoop files to /opt/ directory
sudo ln -sf /opt/hadoop-2.7.7 /opt/hadoop # Create symbolic link for abstraction
sudo chown spark-user:root /opt/hadoop* -R # Change user:spark-user, group:root
sudo chmod g+rwx /opt/hadoop* -R # Allow group to read-write-execute
# For slave-1 Machine
ssh slave-1
sudo mv hadoop-2.7.7 /opt/ # Move Hadoop files to /opt/ directory
sudo ln -sf /opt/hadoop-2.7.7 /opt/hadoop # Create symbolic link for abstraction
sudo chown spark-user:root /opt/hadoop* -R # Change user:spark-user, group:root
sudo chmod g+rwx /opt/hadoop* -R # Allow group to read-write-execute
exit # Logout from slave-1
# For slave-2 Machine
ssh slave-2
sudo mv hadoop-2.7.7 /opt/ # Move Hadoop files to /opt/ directory
sudo ln -sf /opt/hadoop-2.7.7 /opt/hadoop # Create symbolic link for abstraction
sudo chown spark-user:root /opt/hadoop* -R # Change user:spark-user, group:root
sudo chmod g+rwx /opt/hadoop* -R # Allow group to read-write-execute
exit # Logout from slave-2
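The per-slave block above repeats the same commands on each slave; if you prefer, it can be scripted from the master. This is a sketch; the -t flag keeps a terminal attached so sudo can prompt for a password on each slave:
for h in slave-1 slave-2; do
  ssh -t $h 'sudo mv hadoop-2.7.7 /opt/ && sudo ln -sf /opt/hadoop-2.7.7 /opt/hadoop && sudo chown spark-user:root /opt/hadoop* -R && sudo chmod g+rwx /opt/hadoop* -R'
done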
- On all machines, append the Hadoop paths to the $PATH variable:
echo 'export PATH=$PATH:/opt/hadoop/bin:/opt/hadoop/sbin' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $PATH # Confirm that $PATH variable is changed properly
Output should be: <OTHER_PATHS>:/opt/hadoop/bin:/opt/hadoop/sbin
- On all machines, export the Hadoop variables:
# Set Hadoop Variables
echo '
# Bash Variables for Hadoop
export HADOOP_HOME="/opt/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $HADOOP_HOME $HADOOP_COMMON_HOME $HADOOP_CONF_DIR $HADOOP_HDFS_HOME $HADOOP_MAPRED_HOME
Output should be: /opt/hadoop /opt/hadoop /opt/hadoop/etc/hadoop /opt/hadoop /opt/hadoop
We will first configure the master machine and then transfer the configuration files to the remote slaves.
- On the master machine, we will edit $HADOOP_HOME/etc/hadoop/core-site.xml. The "configuration" tag in this file should look like this:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
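Note that fs.default.name is the older spelling of fs.defaultFS; both resolve to the same setting in Hadoop 2.x. As an optional check on the master, you can read the effective value back once the file is saved:
hdfs getconf -confKey fs.defaultFS # Should print hdfs://master:9000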
- On the master machine, we will edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml. The "configuration" tag in this file should look like this:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
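Hadoop creates these directories itself when the NameNode is formatted and the DataNodes start, as long as spark-user can write under /opt/hadoop (which the chown/chmod steps above ensure). If you prefer to create them explicitly up front, something like this works:
mkdir -p /opt/hadoop/data/nameNode # Used by the NameNode on the master
mkdir -p /opt/hadoop/data/dataNode # Used by the DataNodes on slave-1 and slave-2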
- On the master machine, we will edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh. Find the line containing export JAVA_HOME and change it to:
export JAVA_HOME=/usr/lib/jvm/current-java
After editing, the relevant lines of the file should look like:
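If you prefer to make this edit non-interactively (for example when scripting the setup), a sed one-liner over hadoop-env.sh gives the same result. This is a sketch; back the file up first if you are unsure:
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/current-java|' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
grep '^export JAVA_HOME' $HADOOP_HOME/etc/hadoop/hadoop-env.sh # Confirm the change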
- On the master machine, we will edit $HADOOP_HOME/etc/hadoop/slaves. If a localhost line is present in the file, delete it. Add only the following lines:
slave-1
slave-2
Note: the slaves file was renamed to workers in later Hadoop versions. If you are following this tutorial with a later version of Hadoop and a file named workers exists under $HADOOP_HOME/etc/hadoop/, edit workers instead.
- Transfer the configuration files directly to the slave machines:
scp $HADOOP_HOME/etc/hadoop/* slave-1:$HADOOP_HOME/etc/hadoop/ # For slave-1
scp $HADOOP_HOME/etc/hadoop/* slave-2:$HADOOP_HOME/etc/hadoop/ # For slave-2
- On the master machine, format the HDFS file system:
hdfs namenode -format
- On the master machine, start HDFS:
start-dfs.sh # or $HADOOP_HOME/sbin/start-dfs.sh
- Validate that everything started correctly by running the jps command as spark-user on all machines.
On the master node, you should see SecondaryNameNode and NameNode, as shown below:
On the slave nodes (slave-1, slave-2), you should see DataNode, as shown below:
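Rather than logging into each machine by hand, the jps check can be scripted from the master (a sketch, assuming passwordless SSH to the slaves as set up earlier):
echo "== master =="; jps # Expect NameNode and SecondaryNameNode
for h in slave-1 slave-2; do echo "== $h =="; ssh $h jps; done # Expect DataNode on each slave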
- If everything is OK, you can now access the HDFS web UI on your Hadoop master server at port 50070. Go to http://master:50070/:
If you see a web page like the one shown above and the Live Nodes attribute is 2, then everything is OK. This indicates that you have 1 name node (the one this web page runs on) and 2 live data nodes.
Note: in other versions of Hadoop, the port number may differ. For example, port 9870 is used for Hadoop 3.1.1.
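If you prefer the command line to the web UI, the same live-node information is available from dfsadmin on the master:
hdfs dfsadmin -report | grep 'Live datanodes' # Should report 2 live datanodes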
- On the master machine, if you want to stop HDFS, run the following command:
stop-dfs.sh # or $HADOOP_HOME/sbin/stop-dfs.sh
HDFS is now installed and configured, but it lacks job scheduling. We address this with the YARN job scheduler.
- On all machines, export the $HADOOP_YARN_HOME variable:
echo '# Bash Variables for Yarn
export HADOOP_YARN_HOME=$HADOOP_HOME' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $HADOOP_YARN_HOME
Output should be: /opt/hadoop
- On the master machine, we will edit $HADOOP_HOME/etc/hadoop/yarn-site.xml. The "configuration" tag in this file should look like this:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
</configuration>
- On the master machine, transfer the yarn-site.xml file to the slave machines:
scp $HADOOP_HOME/etc/hadoop/yarn-site.xml slave-1:$HADOOP_HOME/etc/hadoop/yarn-site.xml # Copy yarn config to slave-1
scp $HADOOP_HOME/etc/hadoop/yarn-site.xml slave-2:$HADOOP_HOME/etc/hadoop/yarn-site.xml # Copy yarn config to slave-2
- On the master machine, start YARN:
start-yarn.sh # or $HADOOP_HOME/sbin/start-yarn.sh
- Validate that everything started correctly by running the jps command as spark-user on all machines.
On the master node, you should see ResourceManager, as shown below:
On the slave nodes (slave-1, slave-2), you should see NodeManager, as shown below:
- Make sure that YARN is running on the 2 slave machines:
yarn node --list
You should see that 2 nodes are running, as shown below:
- On the master machine, if you want to stop YARN, run the following command:
stop-yarn.sh # or $HADOOP_HOME/sbin/stop-yarn.sh
- On the master machine, start HDFS & YARN:
su spark-user # Login as spark-user, if you are already, ignore this
start-dfs.sh && start-yarn.sh
- Validate that everything is OK so far with the Hadoop web UI. The web page should look like this:
If you see a web page like the one shown above and the Active Nodes attribute is 2, then everything is OK.
- Run the Hadoop example job that calculates Pi:
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar pi 16 1000
The output should look as shown below:
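As a further end-to-end smoke test that also exercises HDFS, you can run the wordcount example. This is a sketch; the sample input and the wc-out output directory name are arbitrary choices, not part of the original tutorial:
hdfs dfs -mkdir -p /user/spark-user # Create the user's home directory in HDFS
echo "hello hadoop hello yarn" > sample.txt # A tiny local input file
hdfs dfs -put sample.txt /user/spark-user/ # Upload it to HDFS
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount sample.txt wc-out
hdfs dfs -cat wc-out/part-r-00000 # Prints each word with its count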