
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="generator" content="pandoc">
    <title>Introduction to Hadoop</title>
    <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
    <link rel="stylesheet" type="text/css" href="css/swc.css" />
    <link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
    <!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body class="lesson">
    <div class="container card">
      <div class="banner">
        <a href="http://citi.clemson.edu" title="Software Carpentry">
          <img alt="Software Carpentry banner" src="img/paw.gif" width="100px" height="auto" />
        </a>
      </div>
      <article>
      <div class="row">
        <div class="col-md-10 col-md-offset-1">
                    <a href="index.html"><h1 class="title">Introduction to Hadoop</h1></a>
          <h2 class="subtitle">Introducing the MapReduce Programming Paradigm</h2>
          <section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Understand the MapReduce programming paradigm</li>
<li>Understand why the MapReduce programming paradigm is a natural approach to handle HDFS data with the Hadoop MapReduce implementation</li>
<li>Understand key/value pairs</li>
</ul>
</div>
</section>
<p>MapReduce is a programming model that has its roots in functional programming. The ideal targets for MapReduce are collections of data elements (lists, arrays, sets …). There are two core functions in MapReduce: Map and Reduce.</p>
<p>Map operates on all data elements of a collection by applying the same operation (or same set of operations) to each individual element of this collection. The outcome of Map is another collection of new data elements:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">input_data</span> = [0,1,2,3,4]

<span class="kw">def</span> squared (x)<span class="kw">:</span>
    <span class="kw">return</span> x * x

<span class="kw">map_output_1</span> = list(map(squared, input_data))
<span class="kw">print</span> (map_output_1)</code></pre></div>
<pre class="output"><code>[0, 1, 4, 9, 16]</code></pre>
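<p>Map operation that returns a value only for some elements. Map still produces exactly one output per input, so elements that fail the condition appear as <code>None</code> instead of being removed:</p>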
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">input_data</span> = [0,1,2,3,4]

<span class="kw">def</span> squared (x)<span class="kw">:</span>
    <span class="kw">tmp</span> = x * x
    <span class="kw">if</span> <span class="kw">tmp</span> <span class="kw">&gt;</span> 4:
        <span class="kw">return</span> x * x

<span class="kw">map_output_2</span> = list(map(squared, input_data))
<span class="kw">print</span> (map_output_2)</code></pre></div>
<pre class="output"><code>[None, None, None, 9, 16]</code></pre>
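<p>To actually remove the <code>None</code> placeholders, a second pass over the map output is needed. The following is a minimal sketch of one way to do that with a list comprehension; the names <code>squared_if_large</code> and <code>filtered_output</code> are introduced here for illustration only.</p>

```python
input_data = [0, 1, 2, 3, 4]

def squared_if_large(x):
    # Return the square only when it exceeds 4; otherwise return None.
    tmp = x * x
    if tmp > 4:
        return tmp
    return None

# Map step: one output per input, including None for rejected elements.
map_output = list(map(squared_if_large, input_data))

# Filter step: keep only the elements for which the map step produced a value.
filtered_output = [value for value in map_output if value is not None]
print(filtered_output)
```

<pre class="output"><code>[9, 16]</code></pre>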
<p>Map operation to break up sentences into individual words:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">input_data</span> = [<span class="st">&quot;Ask not what your data can do for you&quot;</span>, <span class="st">&quot;ask what you can do for your data&quot;</span>]

<span class="kw">def</span> parse_words (x)<span class="kw">:</span>
    <span class="kw">return</span> x.split()

<span class="kw">map_output_3</span> = list(map(parse_words, input_data))
<span class="kw">print</span> (map_output_3)</code></pre></div>
<pre class="output"><code>[[&#39;Ask&#39;, &#39;not&#39;, &#39;what&#39;, &#39;your&#39;, &#39;data&#39;, &#39;can&#39;, &#39;do&#39;, &#39;for&#39;, &#39;you&#39;],
[&#39;ask&#39;, &#39;what&#39;, &#39;you&#39;, &#39;can&#39;, &#39;do&#39;, &#39;for&#39;, &#39;your&#39;, &#39;data&#39;]]</code></pre>
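<p>The result above is a list of lists, one per sentence. If a later step needs a single stream of words, the nested lists can be flattened. This is a small sketch using <code>itertools.chain.from_iterable</code> from the Python standard library; the variable names are assumptions made for this example.</p>

```python
import itertools

input_data = ["Ask not what your data can do for you",
              "ask what you can do for your data"]

# Map step: one list of words per sentence.
words_per_sentence = list(map(str.split, input_data))

# Flatten the nested lists into a single list of words.
all_words = list(itertools.chain.from_iterable(words_per_sentence))
print(len(all_words))
print(all_words[:4])
```

<pre class="output"><code>17
[&#39;Ask&#39;, &#39;not&#39;, &#39;what&#39;, &#39;your&#39;]</code></pre>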
<p>A Reduce function operates on the outcome of a Map operation, collapsing or combining the new data elements into a single value or a subset of elements.</p>
<p>Reduce function that accumulates values of a list using Python’s built-in <strong><em>reduce</em></strong> function:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">import</span> functools
<span class="kw">def</span> add(tmp, x)<span class="kw">:</span>
    # Named add rather than sum to avoid shadowing the built-in sum function
    <span class="kw">return</span> tmp + x

<span class="kw">reduce_output_1</span> = functools.reduce(add, map_output_1)
<span class="kw">print</span> (reduce_output_1)</code></pre></div>
<pre class="output"><code>30</code></pre>
<p>A user-defined reduce function using a standard for loop:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">def</span> sum_reduce(x)<span class="kw">:</span>
    <span class="kw">total</span> = 0
    <span class="kw">for</span> <span class="kw">data_element</span> in x:
        <span class="kw">total</span> = total + data_element
    <span class="kw">return</span> total

<span class="kw">reduce_output_2</span> = sum_reduce(map_output_1)
<span class="kw">print</span>(reduce_output_2)</code></pre></div>
<pre class="output"><code>30</code></pre>
<h2 id="check-your-understanding-write-a-mapreduce-procedure" class="challenge">Check your understanding: Write a MapReduce procedure</h2>
<p>Given the following data set [-2,5,-10,4,7,9,1], find the largest squared value.</p>
<h2 id="hadoop-mapreduce">Hadoop MapReduce</h2>
<p>We recall from lessons <a href="00-intro.html">1</a> and <a href="02-filedir.html">3</a> that HDFS divides big data files into small blocks and distributes these blocks across a network of computers. In order to support the <strong><em>data locality</em></strong> concept, we need to bring the required computation to these data blocks. The MapReduce programming paradigm lends itself naturally to this concept.</p>
<p>The Map operation can be thought of as the same operation being applied to each data element in a collection. Therefore, in an HDFS setting, the same Map operation can be applied to the individual data blocks of a file. As these blocks are distributed across computers, the processors on these computers can execute the operations in parallel, significantly improving performance.</p>
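<p>The idea of applying one Map operation to many blocks at once can be sketched on a single machine. In the example below, list slices stand in for HDFS blocks and worker threads stand in for separate computers; <code>block_size</code> and <code>map_block</code> are hypothetical names chosen for illustration. The sketch shows the structure of the computation only, not Hadoop's API or its real performance characteristics.</p>

```python
from concurrent.futures import ThreadPoolExecutor

input_data = [0, 1, 2, 3, 4, 5, 6, 7]
block_size = 3  # hypothetical; real HDFS blocks are far larger

# Split the data into blocks, mimicking how HDFS splits a file.
blocks = [input_data[i:i + block_size]
          for i in range(0, len(input_data), block_size)]

def map_block(block):
    # The same Map operation is applied to every element of one block.
    return [x * x for x in block]

# Each block is handed to its own worker, standing in for a separate computer.
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    mapped_blocks = list(pool.map(map_block, blocks))

print(blocks)
print(mapped_blocks)
```

<pre class="output"><code>[[0, 1, 2], [3, 4, 5], [6, 7]]
[[0, 1, 4], [9, 16, 25], [36, 49]]</code></pre>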
<p>After the Map operation is completed, since the blocks are located on different computers, the output data of the Map operation is naturally also distributed across these computers. For the Reduce operation, a number of issues must be addressed, such as:</p>
<ol style="list-style-type: decimal">
<li>How can we gather the map output data for reduction?</li>
<li>How can we also speed up the Reduce process?</li>
</ol>
<p>Hadoop MapReduce uses several mechanisms to resolve these issues.</p>
<p><strong>Key/Value Pair</strong>: For Hadoop MapReduce, data are represented not as a single data value, but as a tuple of Key and Value. The key could be a unique identifier or a representative attribute of the data value. The key enables the Hadoop MapReduce framework to group data values of the same type or characteristics together.</p>
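<p>As an illustration of key/value pairs, a Map function for counting words might emit each word as the key and the number 1 as the value. This is a plain Python sketch, not Hadoop code, and <code>map_to_pairs</code> is a name made up for this example.</p>

```python
input_data = ["Ask not what your data can do for you",
              "ask what you can do for your data"]

def map_to_pairs(sentence):
    # Emit one (key, value) tuple per word: the lowercased word is the key,
    # and 1 is the value, meaning "this word was seen once".
    return [(word.lower(), 1) for word in sentence.split()]

pairs_per_sentence = list(map(map_to_pairs, input_data))
print(pairs_per_sentence[1][:3])
```

<pre class="output"><code>[(&#39;ask&#39;, 1), (&#39;what&#39;, 1), (&#39;you&#39;, 1)]</code></pre>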
<p><strong>Shuffle</strong>: Hadoop MapReduce <strong><em>shuffles</em></strong> map output data across computers to group data with the same key into collections. The Reduce operation is then applied to these collections. As the collections are themselves distributed, the Reduce process also happens in parallel.</p>
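<p>The grouping step can be imitated in plain Python. The sketch below assumes map output sitting on two computers (the lists <code>map_output_node_a</code> and <code>map_output_node_b</code> are invented sample data), groups the values by key, and then reduces each group independently. It only mirrors the logic; in Hadoop the framework performs the shuffle across the network.</p>

```python
from collections import defaultdict

# Map output as it might sit on two different computers (illustrative data).
map_output_node_a = [("ask", 1), ("not", 1), ("data", 1)]
map_output_node_b = [("ask", 1), ("data", 1), ("you", 1)]

# Shuffle: bring all values that share a key into one collection.
grouped = defaultdict(list)
for key, value in map_output_node_a + map_output_node_b:
    grouped[key].append(value)

# Reduce: collapse each collection independently of the others.
word_counts = {key: sum(values) for key, values in grouped.items()}
print(word_counts)
```

<pre class="output"><code>{&#39;ask&#39;: 2, &#39;not&#39;: 1, &#39;data&#39;: 2, &#39;you&#39;: 1}</code></pre>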
<p><strong>Partition</strong>: Hadoop MapReduce will <strong><em>partition</em></strong> the placement of these collections such that they are balanced across the computers and minimal data transfer is needed.</p>
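<p>One common way to decide where a key goes is to hash the key and take the remainder modulo the number of reducers, so that every occurrence of a key lands on the same reducer. The sketch below illustrates that general idea in Python; it is not Hadoop's implementation. The function <code>partition</code>, the value of <code>num_reducers</code>, and the use of <code>zlib.crc32</code> as a stable hash are all assumptions made for this example.</p>

```python
import zlib

num_reducers = 3  # hypothetical number of reducers
keys = ["ask", "not", "what", "your", "data", "can", "do", "for", "you"]

def partition(key, num_partitions):
    # Stable hash of the key, reduced to a partition index in [0, num_partitions).
    # crc32 is used because Python's built-in hash() of strings varies between runs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Record which keys each reducer would receive.
assignment = {}
for key in keys:
    assignment.setdefault(partition(key, num_reducers), []).append(key)

for reducer_index in sorted(assignment):
    print(reducer_index, assignment[reducer_index])
```

<p>Because the partition index depends only on the key, pairs with the same key that were produced on different computers are always routed to the same reducer.</p>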
<p>Hadoop MapReduce provides default implementations of the <strong><em>Shuffle</em></strong> and <strong><em>Partition</em></strong> functions. Together with the management of data distribution via HDFS, this leaves users with only the task of developing the Map and Reduce operations, in which determining the Key and Value is a critical step.</p>
        </div>
      </div>
      </article>
      <div class="footer">
        <a class="label clemson-orange" href="http://citi.clemson.edu">CITI</a>
        <a class="label clemson-orange" href="https://github.com/clemsoncoe/hpc-workshop">Source</a>
        <a class="label clemson-orange" href="mailto:atrikut@clemson.edu">Contact</a>
        <a class="label clemson-orange" href="LICENSE.html">License</a>
      </div>
    </div>
    <!-- Javascript placed at the end of the document so the pages load faster -->
    <script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
    <script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
  </body>
</html>