A straightforward guide to setting up a pseudo-distributed Hadoop installation, intended for educational purposes. It is not meant for a production environment, given this guide's writer's lack of production experience.
The best documentation can always be found at the Official Source.
- Only YOU shall be responsible for any damages that are caused while getting things done. Any contributors to this guide shall not be liable for the careless actions of the user.
- It is recommended to try out this guide in a Virtual Machine first.
- This may contain incorrect commands, insecure methods, hacky fixes, or may satisfy your requirements partially or not at all.
- DO NOT FOLLOW BLINDLY and proceed at your own RISK.
If you find mistakes or would like to improve any part, see the Contribution section.
- Working GNU/Linux installation.
- root privilege.
- Internet connection.
I shall be using Debian 12 XFCE in a VM installed using VirtualBox. You can use any distribution of your choice (Ubuntu, Linux Mint etc.)
To install SSH on Debian-based distributions.
sudo apt install ssh pdsh openssh-server openssh-client
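- (Optional sanity check.) Assuming a systemd-based distribution where the service is named ssh (as on Debian), you can confirm that the SSH daemon is running, and enable it if it is not:
sudo systemctl status ssh
sudo systemctl enable --now ssh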
- One way to install java is from official repos of your distribution.
- On Debian 12, OpenJDK and JRE can be installed through the meta-packages. This will install version 17 of the respective packages.
sudo apt install default-jdk default-jre
- To verify installation
java -version
javac -version
But, as of August 2023, Hadoop version 3.3.6 fully supports only Java 8 (Java 11 is supported as a runtime only; check here)
- Now the procedure to install Java 8 will vary depending on your distribution. For example, on Ubuntu (official guide)
sudo apt install openjdk-8-jdk openjdk-8-jre
will get the job done.
But in my case, as I am using Debian 12 (Bookworm), let us check the official docs for Java.
It can be seen that the current default version is 17 whereas Java 8 was the default in Debian Stretch.
One way to solve this is to add a backports repository and then install it the Ubuntu way, but I could not get that to work.
So I settled for a hacky fix after referring to this Stack Overflow answer.
Go to the Eclipse Temurin Project website and choose OS type, Architecture, and Java version
- https://adoptium.net/temurin/releases/?version=8&os=linux&arch=x64 and download both JDK and JRE.
- Unpack the archives to get folders, rename them to something sane, and then copy them to the desired location; these will act as our JDK installation. Execute the command for both the jdk and jre folders (a fuller sketch of this step follows after the copy command).
sudo cp <folder-name> /usr/lib/jvm/
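For reference, the whole unpack-rename-copy step might look roughly like this; the archive and folder names are placeholders based on a typical Temurin 8 release and will differ depending on the exact build you downloaded:
cd ~/Downloads
tar -xzf OpenJDK8U-jdk_x64_linux_hotspot_<build>.tar.gz   # placeholder archive name
tar -xzf OpenJDK8U-jre_x64_linux_hotspot_<build>.tar.gz   # placeholder archive name
mv <extracted-jdk-folder> temurin-8-jdk                   # rename to something sane
mv <extracted-jre-folder> temurin-8-jre
sudo cp -r temurin-8-jdk temurin-8-jre /usr/lib/jvm/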
- To find jdk folder, do
ls /usr/lib/jvm/
- The folder present here, either named by you or named something like java-1.xx-openjdk-amd64, is your jdk folder. (If you have more than one Java installation, choose the one for Java 8.)
- Open .bashrc using nano or gedit
nano ~/.bashrc
- Add the following at the end of .bashrc (we will repeat this step later for the hadoop user).
- ENTER CORRECT FOLDER NAME.
export JAVA_HOME=/usr/lib/jvm/<myopenjdk-foldername>
export PATH=$PATH:$JAVA_HOME/bin
- Reopen the terminal, or run
source ~/.bashrc
With this, the Java setup is complete.
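To double-check that JAVA_HOME now points at the Java 8 installation (the exact build string will vary), a quick check from a new terminal:
echo $JAVA_HOME
"$JAVA_HOME"/bin/java -version   # should report something like openjdk version "1.8.0_xxx"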
Generally, we create a separate user to run daemons as a security practice, isolating them from other users' data. Hence, we will create another user for Hadoop with no root privileges (or you can add that user to the sudo group if you wish to do so).
- From main user
sudo adduser hduser
- (optionally) provide sudo permission to hduser
sudo adduser hduser sudo
- Remember this command to switch to another user
su - hduser
- Verify
whoami
Note that the prompt in the terminal should now show hduser@hostname
- We will now configure SSH for passwordless login to localhost.
These commands should be executed as the hduser created earlier.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
- Copy the public key from id_rsa.pub to authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Setting the permissions
chmod 600 ~/.ssh/authorized_keys
It is very important to set appropriate permissions on SSH key files; mode 600 allows read and write access only to the OWNER.
- Now try to connect
ssh localhost
- You may get a prompt asking to add this machine to known hosts. Answering 'yes' will accept the connection request and add localhost to the list of known hosts.
Note that this prompt appears only on the first connection to a host, or if its host key changes.
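If the passwordless login does not work, one quick thing to check is the permissions on the key files; typically ~/.ssh should be 700 and authorized_keys and id_rsa should be 600 (-rw-------):
ls -ld ~/.ssh
ls -l ~/.ssh/id_rsa ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys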
- Download hadoop tar file from official website.
- I have downloaded version 3.3.6 which is the latest stable version as of August 2023.
- Make sure to download the binary (~690 MB) and not the source, unless you wish to compile hadoop from scratch
- Open the file manager and extract the tar file present in the Downloads folder.
- Now it is time to move the hadoop folder to the hduser home directory. Execute commands from main user
sudo mv /home/<main-user>/Downloads/hadoop-3.3.6/ /home/hduser/
- Change ownership to hduser.
sudo chown -R hduser:hduser /home/hduser
- Rename hadoop folder. Execute command from hadoop user
mv hadoop-3.3.6 hadoop
- While in the hadoop user's home directory, open .bashrc
nano ~/.bashrc
- Add the environment variables at the end
export HADOOP_HOME=/home/hduser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
- It is also time to add another variable after this: we will add the same JAVA_HOME for the hadoop user that we previously added for the main user.
- ENTER CORRECT FOLDER NAME.
export JAVA_HOME=/usr/lib/jvm/<myopenjdk-foldername>
export PATH=$PATH:$JAVA_HOME/bin
- Now reopen the terminal, or run
source ~/.bashrc
If you reopen the terminal, remember to switch back to the hadoop user.
- Add JAVA_HOME in hadoop-env.sh
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
- Scroll to find the JAVA_HOME comment in hadoop-env.sh, uncomment it, and (ENTER CORRECT FOLDER NAME) change it to:
export JAVA_HOME=/usr/lib/jvm/<myopenjdk-foldername>
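At this point the Hadoop binaries should already be on the PATH of the hadoop user (via the .bashrc changes above). As a quick sanity check from a freshly sourced shell:
hadoop version   # should print Hadoop 3.3.6 plus build details
which hadoop     # should point to /home/hduser/hadoop/bin/hadoop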
- Now we will add content in the configuration files of Hadoop. (Official Docs)
- Open core-site.xml and make the changes between the configuration tags.
nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- Open hdfs-site.xml. The namenode and datanode directories will be kept under /home/hduser/hadoopdata
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hduser/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
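Hadoop will normally create these directories on its own (the namenode directory when you format it, the datanode directory when the DataNode first starts), but if you prefer to create them up front, run this as hduser:
mkdir -p ~/hadoopdata/hdfs/namenode ~/hadoopdata/hdfs/datanode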
- Add the following configuration to change the default MapReduce framework name value to yarn:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
- Add configuration for the NodeManager, ResourceManager, containers, and the Application Master.
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8050</value>
</property>
</configuration>
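Once all four files are saved, one way to confirm that Hadoop is actually picking them up is to ask it to echo a configured value back with the hdfs getconf tool:
hdfs getconf -confKey fs.defaultFS      # expect hdfs://localhost:9000
hdfs getconf -confKey dfs.replication   # expect 1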
- Now it is time to format the NameNode. This is normally needed only once, when setting up HDFS for the first time; reformatting erases all HDFS data, so it is only needed again if you change storage-related settings (such as the namenode/datanode directories) and are prepared to start over. Changes to hadoop-env.sh alone never require a reformat.
- The following command will initialize the namenode directory under /home/hduser/hadoopdata (the datanode directory is created when the DataNode first starts).
cd ~/hadoop/sbin
hdfs namenode -format
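If the format succeeds, the output should contain a line similar to the following (the path will match the namenode directory configured in hdfs-site.xml):
INFO common.Storage: Storage directory /home/hduser/hadoopdata/hdfs/namenode has been successfully formatted.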
- While testing the setup, if you get any errors with the DataNode after changing a config file, you can delete the ~/hadoopdata folder (losing the data) and reformat.
- Navigate to ~/hadoop/sbin and execute the following commands as the hadoop user.
./start-dfs.sh
./start-yarn.sh
If you get an error saying connection refused or permission denied, then, as explained in this answer:
- Add
export PDSH_RCMD_TYPE=ssh
to the .bashrc file, then reopen the terminal or source it and try once again. If you reopen the terminal, remember to switch back to the hadoop user.
- Check the started processes with the jps command (a sample of the expected output is shown after the command below)
jps
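If everything started correctly, jps should list roughly the following daemons (the process IDs shown are placeholders and will differ on your machine):
12081 NameNode
12237 DataNode
12462 SecondaryNameNode
12711 ResourceManager
12874 NodeManager
13210 Jps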
- To stop the processes use:
./stop-dfs.sh
./stop-yarn.sh
or
./stop-all.sh
- Hadoop NameNode: http://localhost:9870/
- YARN ResourceManager (cluster information): http://localhost:8088/
- NodeManager (node information): http://localhost:8042/
- DataNode: http://localhost:9864/
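As an optional smoke test that HDFS and YARN are working together, you can run one of the MapReduce example jobs that ship with Hadoop; the jar name below matches version 3.3.6, so adjust it if you installed a different version:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 5
The job should end by printing an estimated value of Pi.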
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch
- Commit your Changes
- Push to the Branch
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
Useful Resources