How to write, test, and run Hadoop programs locally with IntelliJ and Maven

The following instructions allow you to write, test, and run a Hadoop program locally in IntelliJ, without configuring the Hadoop environment on your own machine or using a cluster.

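This tutorial is based on three articles: "Hadoop: IntelliJ结合Maven本地运行和调试MapReduce程序 (无需搭载Hadoop和HDFS环境)" (in English: running and debugging MapReduce programs locally with IntelliJ and Maven, without setting up a Hadoop and HDFS environment), "How-to: Create an IntelliJ IDEA Project for Apache Hadoop", and "Developing Hadoop Mapreduce Application within IntelliJ IDEA on Windows 10".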

Requirements

Instructions

Warning: Some steps and interface details may differ slightly in your version of IntelliJ, since the IDE keeps changing between releases. The main ideas presented below should still apply.

Create a new project

In IntelliJ, go to File > New > Project, then select Maven on the left of the pop-up window, select your JDK, and hit Next.

Set the Project name and Project location. In this tutorial, we will be "creating" the popular Hadoop example of the WordCount application from the original Hadoop MapReduce Tutorial, so use WordCount as the project name. If required, fill in the GroupId (e.g., with your name) and the ArtifactId (e.g., with the name of your project, i.e., WordCount in our case), then hit Finish.

Configure dependencies

A file called pom.xml should open automatically in the IntelliJ editor. If it does not, find it in the Project browser on the left, and double-click on it to open it.

Paste the following two blocks just before the closing </project> tag.

<repositories>
    <repository>
        <id>apache</id>
        <url>http://maven.apache.org</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-minicluster</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.0</version>
    </dependency>
</dependencies>

A newer version of Hadoop may have come out by the time you read these instructions. Check the latest versions available in the Maven repository for hadoop-minicluster, hadoop-mapreduce-client-core, and hadoop-common, and update the version numbers above accordingly. Use the same version number for all three artifacts.

The full pom.xml is the following:

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>yourname</groupId>
    <artifactId>WordCount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>14</maven.compiler.source>
        <maven.compiler.target>14</maven.compiler.target>
    </properties>
    <repositories>
        <repository>
            <id>apache</id>
            <url>http://maven.apache.org</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-minicluster</artifactId>
            <version>3.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>3.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.3.0</version>
        </dependency>
    </dependencies>
</project>

Create the WordCount class

Select the Project > src > main > java folder in the left pane, then go to File > New > Java Class and use WordCount as the name of the class.

Paste the following Java code into WordCount.java (this code is taken from the original Hadoop MapReduce Tutorial).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Prepare to run

The WordCount program scans all text files in the folder specified by the first command-line argument, and writes the number of occurrences of each word to a folder specified by the second command-line argument.
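To make the counting behaviour concrete, the following plain-Java sketch reproduces what the mapper and reducer compute together, without any Hadoop classes. It is an illustration only: the class name LocalWordCount and the method count are hypothetical and are not part of the tutorial project.

```java
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Plain-Java sketch of the word-count logic; no Hadoop classes involved. */
class LocalWordCount {

    /** Counts how many times each whitespace-separated token occurs in the given lines. */
    static Map<String, Integer> count(Iterable<String> lines) {
        // A TreeMap keeps the words sorted, similar to the sorted keys in the job output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: split the line on whitespace, as TokenizerMapper does.
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                // "Reduce" step: add 1 to the running total of this token.
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(List.of("to be or not to be", "To be."));
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Because StringTokenizer splits on whitespace only, tokens such as "be" and "be." or "To" and "to" are counted as different words. The same holds for the Hadoop version above.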

Create a folder named input under the project's root folder (so, at the same level as the src folder), and drag/copy some text files into this folder.

Then set the two command-line arguments. Select Run > Edit Configurations.

Add a new Application configuration, set the Name to WordCount, set the Main class to WordCount, and set Program arguments to input output. This way, the program reads its input from the input folder and saves the results to the output folder. Do not create the output folder yourself: Hadoop creates it automatically and raises an exception if it already exists (so you have to delete the output folder manually before every run).
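If deleting the output folder by hand becomes tedious, the clean-up can be scripted. The sketch below uses only the standard java.nio.file API and assumes the output folder is on the local file system, as in this tutorial. The class name OutputCleaner and the method deleteIfExists are hypothetical helpers, not part of Hadoop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Sketch: remove a stale local output folder before re-running the job. */
class OutputCleaner {

    /** Recursively deletes dir if it exists; returns true if something was deleted. */
    static boolean deleteIfExists(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts files and subfolders before their parent folders.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the "output" folder used in this tutorial.
        Path output = Paths.get(args.length > 0 ? args[0] : "output");
        System.out.println(deleteIfExists(output) ? "Deleted " + output : "Nothing to delete");
    }
}
```

One option is to call OutputCleaner.deleteIfExists(Paths.get(args[1])) at the start of WordCount.main. Be careful with this on anything but scratch folders, since the deletion is permanent.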

Run

Select Run > Run 'WordCount' to run the Hadoop program. If you re-run the program, delete the output folder before each run.

Results are saved in the file output/part-r-00000.
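With the default text output format, each line of part-r-00000 holds a word, a tab character, and its count. As a rough sketch of how the results could be loaded back into Java for further processing, the following uses only the standard library. The class name ResultReader and the method parse are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: read "word<TAB>count" lines produced by the WordCount job. */
class ResultReader {

    /** Parses the given lines into a map, preserving the order found in the file. */
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // The count follows the last tab on the line.
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the result file location used in this tutorial.
        Path file = Paths.get(args.length > 0 ? args[0] : "output/part-r-00000");
        parse(Files.readAllLines(file))
                .forEach((word, n) -> System.out.println(n + " x " + word));
    }
}
```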

Build Runnable JAR with Dependencies

You can build a single jar file with your program and all necessary dependencies (e.g., Hadoop libraries) so you can transfer the jar file to another machine to run it.

Add the following build block to pom.xml, at the same level as the repositories block and the dependencies block.

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.8.1</version>
            <configuration>
                <source>14</source>
                <target>14</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.4</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <!-- Path to your main class, include package path if needed -->
                                <mainClass>WordCount</mainClass>
                            </transformer>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Then in a terminal, cd to the directory containing the pom.xml file, and run the following command:

mvn package

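This command builds a single jar containing your program and its dependencies, and saves it in the target directory. With the shade configuration above, the jar is named after your ArtifactId and version, e.g., WordCount-1.0-SNAPSHOT.jar; check the target directory for the exact file name. To run your program, execute the following command:

java -jar target/WordCount-1.0-SNAPSHOT.jar input output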
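When the jar is launched from a terminal, it is easy to forget an argument or to leave a stale output folder behind. In that case WordCount.main fails with an ArrayIndexOutOfBoundsException or a Hadoop exception. A small guard along the following lines could be called at the top of main before the job is configured. This is only a sketch for local paths: the class name ArgsCheck and the method validate are hypothetical helpers, not part of Hadoop or of the tutorial code.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;

/** Sketch: check the two WordCount arguments before starting the job. */
class ArgsCheck {

    /** Returns a message describing the problem, or an empty Optional if the arguments look usable. */
    static Optional<String> validate(String[] args) {
        if (args.length != 2) {
            return Optional.of("Usage: WordCount <input folder> <output folder>");
        }
        if (!Files.exists(Paths.get(args[0]))) {
            return Optional.of("Input path not found: " + args[0]);
        }
        if (Files.exists(Paths.get(args[1]))) {
            return Optional.of("Output folder already exists, delete it first: " + args[1]);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> problem = validate(args);
        if (problem.isPresent()) {
            System.err.println(problem.get());
            System.exit(2);
        }
        System.out.println("Arguments look fine");
    }
}
```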

Sample Project

See WordCount.