A straightforward guide to setting up a pseudo-distributed Hadoop installation, intended for educational purposes. It is not meant for a production environment, given this guide's writer's lack of production experience.
The best documentation can always be found at the Official Source.
- Only YOU shall be responsible for any damages that are caused while getting things done. Any contributors to this guide shall not be liable for the careless actions of the user.
- It is recommended to try out this guide in a Virtual Machine first.
- This may contain incorrect commands, insecure methods, hacky fixes, or may satisfy your requirements partially or not at all.
- DO NOT FOLLOW BLINDLY and proceed at your own RISK.
If you find mistakes or would like to improve any part, see the Contribution section.
- Working GNU/Linux installation.
- root privilege.
- Internet connection.
I shall be using Debian 12 XFCE in a VM installed using VirtualBox. You can use any distribution of your choice (Ubuntu, Linux Mint etc.)
To install SSH on Debian-based distributions.
sudo apt install ssh pdsh openssh-server openssh-client
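- (Optional sanity check.) Assuming a systemd-based distribution where the service is named ssh (as on Debian), you can confirm that the SSH daemon is running, and enable it if it is not:
sudo systemctl status ssh
sudo systemctl enable --now ssh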
- One way to install java is from official repos of your distribution.
- On Debian 12, OpenJDK and JRE can be installed through the meta-packages. This will install version 17 of the respective packages.
sudo apt install default-jdk default-jre
- To verify installation
java -version
javac -version
But, as of August 2023, Hadoop version 3.3.6 fully supports only Java 8 (Java 11 is supported as a runtime only; check here)
- Now the procedure to install Java 8 will vary depending on your distribution. For example, on Ubuntu (official guide)
sudo apt install openjdk-8-jdk openjdk-8-jre
will get the job done.
But in my case, as I am using Debian 12 (Bookworm), let us check the official docs for Java.
It can be seen that the current default version is 17 whereas Java 8 was the default in Debian Stretch.
One way to solve this is to add a backports repository and then install it the Ubuntu way, but I could not get that to work.
So I settled for a hacky fix after referring to this Stack Overflow answer.
Go to the Eclipse Temurin Project website and choose OS type, Architecture, and Java version
- https://adoptium.net/temurin/releases/?version=8&os=linux&arch=x64 and download both JDK and JRE.
- Unpack the archives to get folders, rename them to something sane, and then copy them to the desired location; these will act as our JDK installation. Execute the command for both the jdk and jre folders (a fuller sketch of this step follows after the copy command).
sudo cp <folder-name> /usr/lib/jvm/
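For reference, the whole unpack-rename-copy step might look roughly like this; the archive and folder names are placeholders based on a typical Temurin 8 release and will differ depending on the exact build you downloaded:
cd ~/Downloads
tar -xzf OpenJDK8U-jdk_x64_linux_hotspot_<build>.tar.gz   # placeholder archive name
tar -xzf OpenJDK8U-jre_x64_linux_hotspot_<build>.tar.gz   # placeholder archive name
mv <extracted-jdk-folder> temurin-8-jdk                   # rename to something sane
mv <extracted-jre-folder> temurin-8-jre
sudo cp -r temurin-8-jdk temurin-8-jre /usr/lib/jvm/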
- To find jdk folder, do
ls /usr/lib/jvm/
- The folder present here, either named by you or named something like java-1.xx-openjdk-amd64, is your jdk folder. (If you have more than one Java installation, choose the one for Java 8.)
- Open .bashrc using nano or gedit
nano ~/.bashrc
- Add the following at the end of .bashrc (we will repeat this step later for the hadoop user).
- ENTER CORRECT FOLDER NAME.
export JAVA_HOME=/usr/lib/jvm/<myopenjdk-foldername>
export PATH=$PATH:$JAVA_HOME/bin
- Reopen the terminal, or run
source ~/.bashrc
With this, the Java setup is complete.
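To double-check that JAVA_HOME now points at the Java 8 installation (the exact build string will vary), a quick check from a new terminal:
echo $JAVA_HOME
"$JAVA_HOME"/bin/java -version   # should report something like openjdk version "1.8.0_xxx"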
Generally, we create a separate user to run daemons as a security practice, isolating them from other users' data. Hence, we will create another user for Hadoop with no root privileges (or you can add that user to the sudo group if you wish to do so).
- From main user
sudo adduser hduser
- (optionally) provide sudo permission to hduser
sudo adduser hduser sudo
- Remember this command to switch to another user
su - hduser
- Verify
whoami
Note that the prompt in the terminal should now show hduser@hostname
- We will now configure SSH for passwordless login to localhost.
These commands should be executed as the hduser created earlier.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
- Copy the public key from id_rsa.pub to authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Setting the permissions
chmod 600 ~/.ssh/authorized_keys
It is very important to set appropriate permissions on SSH key files; mode 600 allows read and write access only to the OWNER.
- Now try to connect
ssh localhost
- You may get a prompt asking to add this machine to known hosts. Answering 'yes' will accept the connection request and add localhost to the list of known hosts.
Note that this prompt appears only on the first connection to a host, or if its host key changes.
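If the passwordless login does not work, one quick thing to check is the permissions on the key files; typically ~/.ssh should be 700 and authorized_keys and id_rsa should be 600 (-rw-------):
ls -ld ~/.ssh
ls -l ~/.ssh/id_rsa ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys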
- Download hadoop tar file from official website.
- I have downloaded version 3.3.6 which is the latest stable version as of August 2023.
- Make sure to download the binary (~690 MB) and not the source, unless you wish to compile hadoop from scratch
- Open the file manager and extract the tar file present in the Downloads folder.
- Now it is time to move the hadoop folder to the hduser home directory. Execute commands from main user
sudo mv /home/<main-user>/Downloads/hadoop-3.3.6/ /home/hduser/
- Change ownership to hduser.
sudo chown -R hduser:hduser /home/hduser
- Rename hadoop folder. Execute command from hadoop user
mv hadoop-3.3.6 hadoop
- While in the hadoop user's home directory, open .bashrc
nano ~/.bashrc
- Add the environment variables at the end
export HADOOP_HOME=/home/hduser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
- It is also time to add another variable after this: we will add the same JAVA_HOME for the hadoop user that we previously added for the main user.
- ENTER CORRECT FOLDER NAME.
export JAVA_HOME=/usr/lib/jvm/<myopenjdk-foldername>
export PATH=$PATH:$JAVA_HOME/bin
- Now reopen the terminal, or run
source ~/.bashrc
If you reopen the terminal, remember to switch back to the hadoop user.
- Add JAVA_HOME in hadoop-env.sh
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
- Scroll to find the JAVA_HOME comment in hadoop-env.sh, uncomment it, and (ENTER CORRECT FOLDER NAME) change it to:
export JAVA_HOME=/usr/lib/jvm/<myopenjdk-foldername>
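At this point the Hadoop binaries should already be on the PATH of the hadoop user (via the .bashrc changes above). As a quick sanity check from a freshly sourced shell:
hadoop version   # should print Hadoop 3.3.6 plus build details
which hadoop     # should point to /home/hduser/hadoop/bin/hadoop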
- Now we will add content in the configuration files of Hadoop. (Official Docs)
- Open core-site.xml and make the changes between the configuration tags.
nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- Open hdfs-site.xml. The namenode and datanode directories will be kept under /home/hduser/hadoopdata
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hduser/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
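Hadoop will normally create these directories on its own (the namenode directory when you format it, the datanode directory when the DataNode first starts), but if you prefer to create them up front, run this as hduser:
mkdir -p ~/hadoopdata/hdfs/namenode ~/hadoopdata/hdfs/datanode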
- Add the following configuration to change the default MapReduce framework name value to yarn:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
- Add configuration for the NodeManager, ResourceManager, containers, and the Application Master.
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8050</value>
</property>
</configuration>
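Once all four files are saved, one way to confirm that Hadoop is actually picking them up is to ask it to echo a configured value back with the hdfs getconf tool:
hdfs getconf -confKey fs.defaultFS      # expect hdfs://localhost:9000
hdfs getconf -confKey dfs.replication   # expect 1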
- Now it is time to format the NameNode. This is normally needed only once, when setting up HDFS for the first time; reformatting erases all HDFS data, so it is only needed again if you change storage-related settings (such as the namenode/datanode directories) and are prepared to start over. Changes to hadoop-env.sh alone never require a reformat.
- The following command will initialize the namenode directory under /home/hduser/hadoopdata (the datanode directory is created when the DataNode first starts).
cd ~/hadoop/sbin
hdfs namenode -format
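If the format succeeds, the output should contain a line similar to the following (the path will match the namenode directory configured in hdfs-site.xml):
INFO common.Storage: Storage directory /home/hduser/hadoopdata/hdfs/namenode has been successfully formatted.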
- While testing the setup, if you get any errors with the DataNode after changing a config file, you can delete the ~/hadoopdata folder (losing the data) and reformat.
- Navigate to ~/hadoop/sbin and execute the following commands as the hadoop user.
./start-dfs.sh
./start-yarn.sh
If you get an error saying connection refused or permission denied, then, as explained in this answer:
- Add
export PDSH_RCMD_TYPE=ssh
to the .bashrc file, then reopen the terminal or source it and try once again. If you reopen the terminal, remember to switch back to the hadoop user.
- Check the started processes with the jps command (a sample of the expected output is shown after the command below)
jps
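If everything started correctly, jps should list roughly the following daemons (the process IDs shown are placeholders and will differ on your machine):
12081 NameNode
12237 DataNode
12462 SecondaryNameNode
12711 ResourceManager
12874 NodeManager
13210 Jps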
- To stop the processes use:
./stop-dfs.sh
./stop-yarn.sh
or
./stop-all.sh
- Hadoop NameNode: http://localhost:9870/
- YARN ResourceManager (cluster information): http://localhost:8088/
- NodeManager (node information): http://localhost:8042/
- DataNode: http://localhost:9864/
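As an optional smoke test that HDFS and YARN are working together, you can run one of the MapReduce example jobs that ship with Hadoop; the jar name below matches version 3.3.6, so adjust it if you installed a different version:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 5
The job should end by printing an estimated value of Pi.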
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch
- Commit your Changes
- Push to the Branch
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
Useful Resources