This repository has been archived by the owner on Aug 4, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path04-streaming-mapreduce.html
201 lines (180 loc) · 9.88 KB
/
04-streaming-mapreduce.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Introduction to Hadoop</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://citi.clemson.edu" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/paw.gif" width="100px" height="auto" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Introduction to Hadoop</h1></a>
<h2 class="subtitle">Streaming MapReduce</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Understand how to run Hadoop MapReduce programs</li>
<li>Create Hadoop MapReduce commands to run external programs as mappers and reducers</li>
</ul>
</div>
</section>
<p>Any Application-related interactions between users, the Hadoop MapReduce framework, and HDFS are done though YARN, Hadoop’ default resource manager. The most common interactions include submitting applications to YARN and inquiring about the status of the applications. The generic syntax is:</p>
<pre><code>yarn COMMAND --loglevel loglevel [generic_options] [command_options]</code></pre>
<p>Starting from the least to the most verbose, we have FATAL, ERROR, WARN, INFO, DEBUG, and TRACE. Default level is INFO.</p>
<table style="width:43%;">
<colgroup>
<col width="23%" />
<col width="19%" />
</colgroup>
<thead>
<tr class="header">
<th>generic_options</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>-archives <comma separated list of archives></code></td>
<td>Specify archives to be unarchived on the compute machines. Applies only to job</td>
</tr>
<tr class="even">
<td><code>-conf <configuration file></code></td>
<td>Specify an application configuration file</td>
</tr>
<tr class="odd">
<td><code>-D <property>=<value></code></td>
<td>Use value for a given property</td>
</tr>
<tr class="even">
<td><code>-files <comma separated list of files></code></td>
<td>Specify files to be copied to the cluster. Applies only to job</td>
</tr>
<tr class="odd">
<td><code>-jt <local> or <resourcemanager:port></code></td>
<td>Specify a ResourceManager. Applies only to job</td>
</tr>
<tr class="even">
<td><code>-libjars <comma separated list of jars></code></td>
<td>Specify the jar files to include in the classpath. Applies only to job.</td>
</tr>
</tbody>
</table>
<h4 id="command-jar">COMMAND: jar</h4>
<p>Run a jar file as an application on Cypress. Usage:</p>
<pre><code>yarn jar <jar file> [mainClass] args ...</code></pre>
<p>Typically, YARN applications are written in Java and bundled into jar files to be executed. However, Hadoop also supports the execution of non-Java applications via the Hadoop Streaming utility. This utility allows you to use any executable or scripts as the mapper and/or the reducer for a YARN application. Usage:</p>
<pre><code>yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar [generic_options] [streaming_options]</code></pre>
<table>
<thead>
<tr class="header">
<th>streaming_options</th>
<th>Optional/Required</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>-input directoryname or filename</code></td>
<td>Required</td>
<td>Input location for mapper</td>
</tr>
<tr class="even">
<td><code>-output directoryname</code></td>
<td>Required</td>
<td>Output location for reducer</td>
</tr>
</tbody>
</table>
<p><code>-mapper executable or JavaClassName</code> | Required | Mapper executable | <code>-reducer executable or JavaClassName</code> | Required | Reducer executable | <code>-file filename</code> | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes | <code>-inputformat JavaClassName</code> | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default | <code>-outputformat JavaClassName</code> | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputformat is used as the default | <code>-partitioner JavaClassName</code> | Optional | Class that determines which reduce a key is sent to | <code>-combiner executable or JavaClassName</code> | Optional | Combiner executable for map output | <code>-cmdenv name=value</code> | Optional | Pass environment variable to streaming commands | <code>-inputreader JavaClassName</code> | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) | <code>-verbose</code> | Optional | Verbose output | <code>-lazyOutput</code> | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to Context.write | <code>-numReduceTasks</code> | Optional | Specify the number of reducers | <code>-mapdebug</code> | Optional | Script to call when map task fails | <code>-reducedebug</code> | Optional | Script to call when reduce task fails |</p>
<p>To demonstrate Hadoop Streaming functionality, let’s look at the problem of counting how many instances of “thou” there are in Gutenberg’s complete work of Shakespeare.</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -input intro-to-hadoop/gutenberg-shakespeare.txt -output thou-count -mapper <span class="st">"grep -e thou"</span> -reducer <span class="st">"wc -w"</span></code></pre></div>
<pre class="output"><code>...
2016-06-29 14:37:14,692 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1356)) - Job job_1466735278048_0010 completed successfully
2016-06-29 14:37:14,818 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1363)) - Counters: 43
File System Counters
FILE: Number of bytes read=335881
FILE: Number of bytes written=920423
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5678917
HDFS: Number of bytes written=7
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=173408
Total time spent by all reduces in occupied slots (ms)=43664
Map-Reduce Framework
Map input records=124796
Map output records=6051
Map output bytes=323773
Map output materialized bytes=335887
Input split bytes=236
Combine input records=0
Combine output records=0
Reduce input groups=6049
Reduce shuffle bytes=335887
Reduce input records=6051
Reduce output records=1
Spilled Records=12102
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=312
CPU time spent (ms)=5150
Physical memory (bytes) snapshot=1229393920
Virtual memory (bytes) snapshot=3850330112
Total committed heap usage (bytes)=1543569408
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=5678681
File Output Format Counters
Bytes Written=7
2016-06-29 14:37:14,819 INFO [main] streaming.StreamJob (StreamJob.java:submitAndMonitorJob(1022)) - Output directory: thou-count</code></pre>
<p>Several items need to be paid attention to in the above example. First, the location of the input and output directories are relative. That is, without an initial backslash (<strong>/</strong>), YARN assumes that the path starts from inside user’s home directory on HDFS, which is <strong>/user/user-name</strong>. Second, the output directory must not exist prior to the <em>yarn jar</em> call, otherwise, the command will fail with an <em>output directory already exists</em> error.</p>
</div>
</div>
</article>
<div class="footer">
<a class="label clemson-orange" href="http://citi.clemson.edu">CITI</a>
<a class="label clemson-orange" href="https://github.com/clemsoncoe/hpc-workshop">Source</a>
<a class="label clemson-orange" href="mailto:atrikut@clemson.edu">Contact</a>
<a class="label clemson-orange" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>