
How to install Hadoop in 2023? Solution: a get-shit-done type Hadoop install guide. Install Hadoop 3.3.6 with Java 8

chimms1/Hadoop-Install-Guide


Table of Contents
  1. About The Project
  2. Getting Started
  3. Installation
  4. Contributing
  5. License
  6. Acknowledgments

About

A straightforward guide to setting up a pseudo-distributed Hadoop installation. It is intended for educational purposes and should not be used in a production environment, given this guide's writer's lack of production experience.

The best documentation can always be found at the Official Source.

Getting Started

WARNING

  • Only YOU shall be responsible for any damages that are caused while getting things done. Any contributors to this guide shall not be liable for the careless actions of the user.
  • It is recommended to try out this guide in a Virtual Machine first.
  • This may contain incorrect commands, insecure methods, hacky fixes, or may satisfy your requirements partially or not at all.
  • DO NOT FOLLOW BLINDLY and proceed at your own RISK.

    If you find mistakes or would like to improve any part, see the Contributing section.

Prerequisites

  • Working GNU/Linux installation.
  • root privilege.
  • Internet connection.

I shall be using Debian 12 XFCE in a VM installed using VirtualBox. You can use any distribution of your choice (Ubuntu, Linux Mint, etc.).

(back to top)

Installation

Installing SSH

To install SSH on Debian-based distributions:

sudo apt install ssh pdsh openssh-server openssh-client  
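To confirm the SSH service is running (an optional check; on Debian the service is named ssh):
systemctl status ssh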

Installing Java

  • One way to install Java is from the official repositories of your distribution.
  • On Debian 12, the OpenJDK JDK and JRE can be installed through the default-jdk and default-jre meta-packages. This will install version 17 of both.
sudo apt install default-jdk default-jre
  • To verify the installation:
java -version
javac -version

But, as of August 2023, Hadoop version 3.3.6 supports only Java 8 (check here)

(screenshot: Hadoop's supported Java versions page)

  • Now the procedure to install Java 8 will vary depending on your distribution. For example, on Ubuntu (official guide)
sudo apt install openjdk-8-jdk openjdk-8-jre

will get the job done.

But in my case, as I am using Debian 12 (Bookworm), let us check the official docs for Java.

(screenshot: the official Debian documentation on Java versions)

It can be seen that the current default version is 17, whereas Java 8 was the default in Debian Stretch.
One way to solve this is to add a backports repository and then install it the Ubuntu way, but I could not get that to work.

So I settled for a hacky fix after referring to this Stack Overflow answer.

Go to the Eclipse Temurin project website and choose the OS type, architecture, and Java version (8). Download the archive, extract it, and copy the extracted folder to /usr/lib/jvm/:
sudo cp -r <folder-name> /usr/lib/jvm/
  • To find the JDK folder, do ls /usr/lib/jvm/
  • The folder present there, either named by you or named something like java-1.xx-openjdk-amd64, is your JDK folder. (If you have more than one Java installation, choose the one for Java 8.)
  • Open .bashrc using nano or gedit
nano ~/.bashrc
  • Add the following at the end of .bashrc (we will be doing this process once again later.)
  • ENTER CORRECT FOLDER NAME.
export JAVA_HOME=/usr/lib/jvm/<myopenjdk-foldername>
export PATH=$PATH:$JAVA_HOME/bin
  • Reopen the terminal or source ~/.bashrc. With this, the Java setup is complete.
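  • To confirm the setup took effect (the exact version string depends on the Temurin build you chose):
echo $JAVA_HOME
java -version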

(back to top)

Creating a new user

Generally, we create a separate user to run the daemons, as a security practice, to isolate them from other users' data. Hence we will create another user for Hadoop with no root privileges (or you can add that user to the sudo group if you wish to do so).

  • From main user
sudo adduser hduser
  • (optionally) provide sudo permission to hduser
sudo adduser hduser sudo
  • Remember this command to switch to another user
su - hduser
  • Verify
whoami

Note that the prompt in the terminal should change to hduser@hostname.

(back to top)

Configure SSH

  • We will configure SSH for passwordless login.
    These commands should be executed as the hduser created earlier.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  • Copy the public keys from id_rsa.pub to authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • Setting the permissions
chmod 600 ~/.ssh/authorized_keys

It is very important to set appropriate permissions on SSH keys; mode 600 gives read and write access to the OWNER only.

  • Now try to connect
ssh localhost
  • You may get a prompt asking to add this machine to the known hosts. Answering 'yes' will accept the connection request and add localhost to the list of known hosts.
    Note that this prompt appears on the first connection and whenever the host key changes.
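  • A quick end-to-end check (it should print the message without asking for a password):
ssh localhost exit && echo "passwordless SSH works"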

(back to top)

Hadoop Setup

  • Download hadoop tar file from official website.
  • I have downloaded version 3.3.6 which is the latest stable version as of August 2023.
  • Make sure to download the binary (~690 MB) and not the source, unless you wish to compile Hadoop from scratch.
  • Open the file manager and extract the tar file present in the Downloads folder.
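  • Alternatively, you can extract from the terminal (assuming the archive is in ~/Downloads):
tar -xzf ~/Downloads/hadoop-3.3.6.tar.gz -C ~/Downloads/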
  • Now it is time to move the hadoop folder to the hduser home directory. Execute the command as the main user:
sudo mv /home/<main-user>/Downloads/hadoop-3.3.6/ /home/hduser/
  • Change ownership to hduser (run as the main user, hence sudo):
sudo chown -R hduser:hduser /home/hduser
  • Rename the hadoop folder. Execute the command as the hadoop user:
mv hadoop-3.3.6 hadoop
  • While in the hadoop user's home directory, open .bashrc:
nano ~/.bashrc
  • Add the environment variables at the end
export HADOOP_HOME=/home/hduser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
  • It is also time to add another variable after this. We will add JAVA_HOME for the hadoop user, just as we previously did for the main user.
  • ENTER CORRECT FOLDER NAME.
export JAVA_HOME=/usr/lib/jvm/<myopenjdk-foldername>
export PATH=$PATH:$JAVA_HOME/bin
  • Now reopen the terminal or source ~/.bashrc. If you reopen the terminal, remember to switch to the hadoop user.
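  • At this point the hadoop command should be on the PATH; a quick sanity check:
hadoop version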
  • Add JAVA_HOME in hadoop-env.sh
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
  • Scroll to the JAVA_HOME comment in hadoop-env.sh, uncomment it, and change it to (ENTER CORRECT FOLDER NAME.):
export JAVA_HOME=/usr/lib/jvm/<myopenjdk-foldername>
  • Now we will edit the Hadoop configuration files. (Official Docs)
  • Open core-site.xml and make the changes between the <configuration> tags. (fs.default.name is the older, deprecated name for fs.defaultFS; both still work.)
nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
</configuration>
  • Open hdfs-site.xml. The namenode and datanode directories will be created in /home/hduser/hadoopdata.
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>file:///home/hduser/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
  • Add the following configuration to change the default MapReduce framework name value to yarn:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
 <property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
 </property>
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
  • Add configurations for the NodeManager, ResourceManager, containers, and Application Master:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>localhost:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>localhost:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>localhost:8050</value>
</property>
</configuration>
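  • To confirm Hadoop picks up the configuration (run as the hadoop user; this should print hdfs://localhost:9000):
hdfs getconf -confKey fs.defaultFS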
  • Now it is time to format the NameNode. This is required once, before starting HDFS for the first time.
    Reformatting wipes all HDFS metadata, so you normally repeat it only if you change the storage directories or the NameNode data gets corrupted.
  • The following command will create the namenode and datanode directories in /home/hduser/hadoopdata.
cd ~/hadoop/sbin
hdfs namenode -format
  • While testing the setup, if you get any errors with the DataNode after changing a config file, you can delete the ~/hadoopdata folder (losing the data) and reformat, as sketched below.
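A minimal sketch of that reset (run as the hadoop user; this permanently deletes all HDFS data):
rm -rf ~/hadoopdata
hdfs namenode -format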

(back to top)

Start Hadoop Cluster

  • Navigate to ~/hadoop/sbin and execute the commands as the hadoop user:
./start-dfs.sh
./start-yarn.sh

Error

If you get an error saying "connection refused" or "permission denied", then, as explained in this answer:

  • Add export PDSH_RCMD_TYPE=ssh to the .bashrc file, reopen the terminal or source the file, and try once again. If you reopen the terminal, remember to switch to the hadoop user.

  • Check the processes started with the jps command

jps

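If everything started correctly, jps should list the five daemons plus itself, something like (the PIDs will differ):
2210 NameNode
2345 DataNode
2540 SecondaryNameNode
2705 ResourceManager
2860 NodeManager
3010 Jps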

  • To stop the processes use:
./stop-dfs.sh
./stop-yarn.sh

or

./stop-all.sh

(back to top)

Accessing Hadoop frontends
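Once the daemons are running, the web UIs should be reachable in a browser (default ports for Hadoop 3.x):

  • NameNode: http://localhost:9870
  • ResourceManager: http://localhost:8088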

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch
  3. Commit your Changes
  4. Push to the Branch
  5. Open a Pull Request
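A minimal sketch of steps 2-4 on the command line (the branch name and commit message are just examples):
git checkout -b feature/amazing-improvement
git add .
git commit -m "Add some amazing improvement"
git push origin feature/amazing-improvement
Then open the pull request from the GitHub web interface.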

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

Acknowledgments

Useful Resources

(back to top)
