This repository has been archived by the owner on Aug 4, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-integrating-mapreduce.html
296 lines (269 loc) · 18.2 KB
/
07-integrating-mapreduce.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Introduction to Hadoop</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://citi.clemson.edu" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/paw.gif" width="100px" height="auto" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Introduction to Hadoop</h1></a>
<h2 class="subtitle">Integrating Python Mapper and Reducer in Hadoop</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Run the combination of Python-based mapper and reducer on the Hadoop infrastructure</li>
<li>Customize reducer for questions that require global access to KEYS</li>
</ul>
</div>
</section>
<p>With the mapper and reducer created and tested, the final step is to run this combination on the Hadoop infrastructure.</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -input /repository/movielens/ratings.csv -output ratings -file /home/<span class="kw"><</span>username<span class="kw">></span>/mapreduce/mapper03.py -mapper mapper03.py -file /home/<span class="kw"><</span>username<span class="kw">></span>/mapreduce/mapper03.py -reducer reducer01.py -file /home/<span class="kw"><</span>username<span class="kw">></span>/mapreduce/movies.csv</code></pre></div>
<pre class="output"><code>16/07/06 12:30:13 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/lngo/mapreduce/mapper03.py, /home/lngo/mapreduce/reducer01.py, /home/lngo/mapreduce/movies.csv] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob1141304963588025269.jar tmpDir=null
16/07/06 12:30:15 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/06 12:30:15 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/06 12:30:16 INFO mapred.FileInputFormat: Total input paths to process : 1
16/07/06 12:30:16 INFO mapreduce.JobSubmitter: number of splits:5
16/07/06 12:30:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1467819539655_0009
16/07/06 12:30:16 INFO impl.YarnClientImpl: Submitted application application_1467819539655_0009
16/07/06 12:30:17 INFO mapreduce.Job: The url to track the job: http://dscim001.palmetto.clemson.edu:8088/proxy/application_1467819539655_0009/
16/07/06 12:30:17 INFO mapreduce.Job: Running job: job_1467819539655_0009
16/07/06 12:30:26 INFO mapreduce.Job: Job job_1467819539655_0009 running in uber mode : false
16/07/06 12:30:26 INFO mapreduce.Job: map 0% reduce 0%
16/07/06 12:30:39 INFO mapreduce.Job: map 3% reduce 0%
16/07/06 12:30:40 INFO mapreduce.Job: map 5% reduce 0%
16/07/06 12:30:42 INFO mapreduce.Job: map 8% reduce 0%
16/07/06 12:30:43 INFO mapreduce.Job: map 10% reduce 0%
16/07/06 12:30:45 INFO mapreduce.Job: map 11% reduce 0%
16/07/06 12:30:46 INFO mapreduce.Job: map 14% reduce 0%
16/07/06 12:30:48 INFO mapreduce.Job: map 15% reduce 0%
16/07/06 12:30:49 INFO mapreduce.Job: map 20% reduce 0%
16/07/06 12:30:51 INFO mapreduce.Job: map 21% reduce 0%
16/07/06 12:30:52 INFO mapreduce.Job: map 25% reduce 0%
16/07/06 12:30:55 INFO mapreduce.Job: map 30% reduce 0%
16/07/06 12:30:58 INFO mapreduce.Job: map 35% reduce 0%
16/07/06 12:31:01 INFO mapreduce.Job: map 39% reduce 0%
16/07/06 12:31:02 INFO mapreduce.Job: map 40% reduce 0%
16/07/06 12:31:04 INFO mapreduce.Job: map 44% reduce 0%
16/07/06 12:31:05 INFO mapreduce.Job: map 46% reduce 0%
16/07/06 12:31:07 INFO mapreduce.Job: map 48% reduce 0%
16/07/06 12:31:08 INFO mapreduce.Job: map 50% reduce 0%
16/07/06 12:31:10 INFO mapreduce.Job: map 51% reduce 0%
16/07/06 12:31:11 INFO mapreduce.Job: map 54% reduce 0%
16/07/06 12:31:13 INFO mapreduce.Job: map 55% reduce 0%
16/07/06 12:31:14 INFO mapreduce.Job: map 58% reduce 0%
16/07/06 12:31:16 INFO mapreduce.Job: map 65% reduce 0%
16/07/06 12:31:17 INFO mapreduce.Job: map 67% reduce 0%
16/07/06 12:31:20 INFO mapreduce.Job: map 70% reduce 0%
16/07/06 12:31:23 INFO mapreduce.Job: map 72% reduce 0%
16/07/06 12:31:26 INFO mapreduce.Job: map 73% reduce 0%
16/07/06 12:31:29 INFO mapreduce.Job: map 73% reduce 7%
16/07/06 12:31:32 INFO mapreduce.Job: map 80% reduce 7%
16/07/06 12:31:35 INFO mapreduce.Job: map 87% reduce 13%
16/07/06 12:31:38 INFO mapreduce.Job: map 87% reduce 20%
16/07/06 12:31:39 INFO mapreduce.Job: map 100% reduce 20%
16/07/06 12:31:41 INFO mapreduce.Job: map 100% reduce 35%
16/07/06 12:31:44 INFO mapreduce.Job: map 100% reduce 41%
16/07/06 12:31:48 INFO mapreduce.Job: map 100% reduce 45%
16/07/06 12:31:51 INFO mapreduce.Job: map 100% reduce 49%
16/07/06 12:31:54 INFO mapreduce.Job: map 100% reduce 54%
16/07/06 12:31:57 INFO mapreduce.Job: map 100% reduce 58%
16/07/06 12:32:00 INFO mapreduce.Job: map 100% reduce 62%
16/07/06 12:32:03 INFO mapreduce.Job: map 100% reduce 67%
16/07/06 12:32:06 INFO mapreduce.Job: map 100% reduce 68%
16/07/06 12:32:10 INFO mapreduce.Job: map 100% reduce 70%
16/07/06 12:32:13 INFO mapreduce.Job: map 100% reduce 71%
16/07/06 12:32:16 INFO mapreduce.Job: map 100% reduce 73%
16/07/06 12:32:19 INFO mapreduce.Job: map 100% reduce 74%
16/07/06 12:32:22 INFO mapreduce.Job: map 100% reduce 76%
16/07/06 12:32:25 INFO mapreduce.Job: map 100% reduce 77%
16/07/06 12:32:28 INFO mapreduce.Job: map 100% reduce 79%
16/07/06 12:32:31 INFO mapreduce.Job: map 100% reduce 80%
16/07/06 12:32:35 INFO mapreduce.Job: map 100% reduce 82%
16/07/06 12:32:38 INFO mapreduce.Job: map 100% reduce 84%
16/07/06 12:32:42 INFO mapreduce.Job: map 100% reduce 85%
16/07/06 12:32:45 INFO mapreduce.Job: map 100% reduce 87%
16/07/06 12:32:48 INFO mapreduce.Job: map 100% reduce 89%
16/07/06 12:32:51 INFO mapreduce.Job: map 100% reduce 90%
16/07/06 12:32:54 INFO mapreduce.Job: map 100% reduce 92%
16/07/06 12:32:57 INFO mapreduce.Job: map 100% reduce 93%
16/07/06 12:33:00 INFO mapreduce.Job: map 100% reduce 95%
16/07/06 12:33:03 INFO mapreduce.Job: map 100% reduce 96%
16/07/06 12:33:06 INFO mapreduce.Job: map 100% reduce 98%
16/07/06 12:33:09 INFO mapreduce.Job: map 100% reduce 99%
16/07/06 12:33:13 INFO mapreduce.Job: map 100% reduce 100%
16/07/06 12:33:14 INFO mapreduce.Job: Job job_1467819539655_0009 completed successfully
16/07/06 12:33:14 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=729355490
FILE: Number of bytes written=1459573425
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=620729398
HDFS: Number of bytes written=1546431
HDFS: Number of read operations=18
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=5
Launched reduce tasks=1
Data-local map tasks=1
Rack-local map tasks=4
Total time spent by all maps in occupied slots (ms)=323066
Total time spent by all reduces in occupied slots (ms)=228592
Total time spent by all map tasks (ms)=323066
Total time spent by all reduce tasks (ms)=114296
Total vcore-seconds taken by all map tasks=323066
Total vcore-seconds taken by all reduce tasks=114296
Total megabyte-seconds taken by all map tasks=2646556672
Total megabyte-seconds taken by all reduce tasks=1872625664
Map-Reduce Framework
Map input records=22884378
Map output records=22884377
Map output bytes=683586345
Map output materialized bytes=729355514
Input split bytes=480
Combine input records=0
Combine output records=0
Reduce input groups=33647
Reduce shuffle bytes=729355514
Reduce input records=22884377
Reduce output records=33647
Spilled Records=45768754
Shuffled Maps =5
Failed Shuffles=0
Merged Map outputs=5
GC time elapsed (ms)=3993
CPU time spent (ms)=553480
Physical memory (bytes) snapshot=14627205120
Virtual memory (bytes) snapshot=61926924288
Total committed heap usage (bytes)=16984834048
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=620728918
File Output Format Counters
Bytes Written=1546431
16/07/06 12:33:14 INFO streaming.StreamJob: Output directory: ratings</code></pre>
<p>The content of the ratings directory includes an empty file serves as a flag to indicate whether the operation was successful or not, and the output files. The number of output files depends on how many reducers we use.</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs -ls ratings <span class="kw">2></span>/dev/null</code></pre></div>
<pre class="output"><code>Found 2 items
-rw-r--r-- 2 lngo hdfs 0 2016-07-06 12:33 ratings/_SUCCESS
-rw-r--r-- 2 lngo hdfs 1546431 2016-07-06 12:33 ratings/part-00000</code></pre>
<p>We can <strong>cat</strong> for the content of the output file</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs -cat ratings/part-00000 <span class="kw">2></span>/dev/null <span class="kw">|</span> <span class="kw">head</span></code></pre></div>
<pre class="output"><code>"Great Performances" Cats (1998) 2.77536231884 574.5 207
#1 Cheerleader Camp (2010) 2.5 12.5 5
#chicagoGirl: The Social Network Takes on a Dictator (2013) 3.66666666667 11.0 3
$ (Dollars) (1971) 2.74074074074 74.0 27
$5 a Day (2008) 2.98 149.0 50
$9.99 (2008) 3.15873015873 199.0 63
$ellebrity (Sellebrity) (2012) 2.25 13.5 6
'49-'17 (1917) 2.5 2.5 1
'71 (2014) 3.64705882353 496.0 136
'Hellboy': The Seeds of Creation (2004) 3.08878504673 330.5 107</code></pre>
<p>It is also possible to increase number of reducers</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -D mapreduce.job.reduces=4 -input /repository/movielens/ratings.csv -output ratings3 -file /home/lngo/mapreduce/mapper03.py -mapper /home/lngo/mapreduce/mapper03.py -file /home/lngo/mapreduce/reducer01.py -reducer /home/lngo/mapreduce/reducer01.py -file /home/lngo/mapreduce/movies.csv</code></pre></div>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs -ls ratings3 <span class="kw">2></span>/dev/null</code></pre></div>
<pre class="output"><code>Found 5 items
-rw-r--r-- 2 lngo hdfs 0 2016-07-06 23:22 ratings3/_SUCCESS
-rw-r--r-- 2 lngo hdfs 392511 2016-07-06 23:22 ratings3/part-00000
-rw-r--r-- 2 lngo hdfs 384436 2016-07-06 23:21 ratings3/part-00001
-rw-r--r-- 2 lngo hdfs 383097 2016-07-06 23:22 ratings3/part-00002
-rw-r--r-- 2 lngo hdfs 386387 2016-07-06 23:21 ratings3/part-00003</code></pre>
<p>Aside from performance implication, an important difference between using one and many reducers is demonstrated in cases where we want to perform operations that require a global examination of the data. Let’s say the movie company wishes to identify the movie with highest rating average.</p>
<p>Create a file called reduce02.py with the following content</p>
<pre class="output"><code>#!/usr/bin/env python
import sys
current_movie = None
current_rating_sum = 0
current_rating_count = 0
max_movie = ""
max_average = 0
for line in sys.stdin:
line = line.strip()
movie, rating = line.split("\t", 1)
try:
rating = float(rating)
except ValueError:
continue
if current_movie == movie:
current_rating_sum += rating
current_rating_count += 1
else:
if current_movie:
rating_average = current_rating_sum / current_rating_count
if rating_average > max_average:
max_movie = current_movie
max_average = rating_average
current_movie = movie
current_rating_sum = rating
current_rating_count = 1
if current_movie == movie:
rating_average = current_rating_sum / current_rating_count
if rating_average > max_average:
max_movie = current_movie
max_average = rating_average
print ("%s\t%s" % (max_movie, max_average))</code></pre>
<p>Rerun the Hadoop program using one and four reducers, respectively:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -input /repository/movielens/ratings.csv -output maxRating -file /home/lngo/mapreduce/mapper03.py -mapper /home/lngo/mapreduce/mapper03.py -file /home/lngo/mapreduce/reducer02.py -reducer /home/lngo/mapreduce/reducer02.py -file /home/lngo/mapreduce/movies.csv</code></pre></div>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -D mapreduce.job.reduces=4 -input /repository/movielens/ratings.csv -output maxRating4R -file /home/lngo/mapreduce/mapper03.py -mapper /home/lngo/mapreduce/mapper03.py -file /home/lngo/mapreduce/reducer02.py -reducer /home/lngo/mapreduce/reducer02.py -file /home/lngo/mapreduce/movies.csv</code></pre></div>
<p>In the case of one reducer, there is only a single answer for the movie with highest rating average. With four reducers, we have four possible answers. On the other hand, it is quite feasible to infer the final single answer from a set of four possible choices.</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs -cat maxRating/part-00000 <span class="kw">2></span>/dev/null</code></pre></div>
<pre class="output"><code>A Job to Kill For (2006) 5.0</code></pre>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs -cat maxRating4R/part-00000 <span class="kw">2></span>/dev/null</code></pre></div>
<pre class="output"><code>A Job to Kill For (2006) 5.0</code></pre>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs -cat maxRating4R/part-00001 <span class="kw">2></span>/dev/null</code></pre></div>
<pre class="output"><code>10 Attitudes (2001) 5.0</code></pre>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs -cat maxRating4R/part-00002 <span class="kw">2></span>/dev/null</code></pre></div>
<pre class="output"><code>A Gentle Spirit (1987) 5.0</code></pre>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs -cat maxRating4R/part-00003 <span class="kw">2></span>/dev/null</code></pre></div>
<pre class="output"><code>2 (2007) 5.0</code></pre>
<h2 id="check-your-understanding-additional-conditions-on-the-reduce-side" class="challenge">Check your understanding: Additional conditions on the reduce side</h2>
<p>The previous results do not make sense intuitively, as these movies are not well known. It is possible that our results are skewed by movies having too few reviews. Modify the reducer so that we only consider movies that have more than one thousand ratings totally. Name this reducer reducer03.py. Run the Hadoop MapReduce program again with mapper03.py and reducer03.py using one and four reducers respectively. Report the outcome.</p>
<h2 id="check-your-understanding-user-study" class="challenge">Check your understanding: User Study</h2>
<p>User feedback plays an important role in marketing strategies. Implement a Hadoop MapReduce program that identifies the user that rates the most movies over time. Identify the genre that this user rates most favorably.</p>
</div>
</div>
</article>
<div class="footer">
<a class="label clemson-orange" href="http://citi.clemson.edu">CITI</a>
<a class="label clemson-orange" href="https://github.com/clemsoncoe/hpc-workshop">Source</a>
<a class="label clemson-orange" href="mailto:atrikut@clemson.edu">Contact</a>
<a class="label clemson-orange" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>