<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<title>Software Development for Engineering Research - Parallel Computing</title>
<meta name="description" content="
Lecture on parallel computing for ME 599: Software Development for Engineering Research
">
<meta name="author" content="Kyle Niemeyer">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<link rel="stylesheet" href="dist/reset.css">
<link rel="stylesheet" href="dist/reveal.css">
<link rel="stylesheet" href="dist/theme/simple.css" id="theme">
<!-- For syntax highlighting -->
<link rel="stylesheet" href="plugin/highlight/monokai.css">
<!-- If the query includes 'print-pdf', include the PDF print sheet -->
<script>
if( window.location.search.match( /print-pdf/gi ) ) {
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = 'css/print/pdf.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
}
</script>
<!--[if lt IE 9]>
<script src="lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="slides">
<section>
<h2>Intro. to Parallel Computing</h2>
<h3>Software Development for Engineering Research</h3>
<br/>
<h3>Kyle Niemeyer, 17 February 2022</h3>
<h3>ME 599, Corvallis, OR</h3>
</section>
<section>
<section>
<h2>Today: Introduction to parallel computing (in Python)</h2>
<br/>
<p class="fragment">If you want more, check out <a href="http://web.engr.oregonstate.edu/~mjb/cs575/">CS 475/575 Parallel Programming</a>.
</p>
</section>
<section>
<h3>Approaching parallelism:</h3>
<br/>
<ul>
<li class="fragment">How do I do many things at once?
</li>
<li class="fragment">How do I do <em>one thing</em> faster?
</li>
</ul>
<br/>
<p class="fragment">Computer: perform many simultaneous tasks</p>
<p class="fragment">Developer: determine dependencies between code and data</p>
</section>
<section>
<p class="fragment">Up until now: serial thinking, like "x then y then z"</p>
<br/>
<p class="fragment">Benefit of parallelism: problems execute faster—sometimes <em>much</em> faster.</p>
<p class="fragment">Downside to parallelism: harder to program, debug, open files, print to screen.</p>
</section>
<section>
<h3>Why go parallel?</h3><br/>
<p class="fragment">First: wait until your code works and you genuinely need the extra performance.</p>
<br/>
<ul>
<li class="fragment">The problem creates or requires too much data for a normal machine.
</li>
<li class="fragment">The sun would explode before the computation would finish.
</li>
<li class="fragment">The algorithm is easy to parallelize.
</li>
<li class="fragment">The physics itself cannot be simulated at all with smaller resources.
</li>
</ul>
</section>
</section>
<section>
<section>
<h3>Scale and Scalability</h3>
<br/>
<img src="./images/sense-and-sensibility.jpg" height="500" alt="Scale and Scalability 'book' image">
</section>
<section>
<h3>Scale and Scalability</h3>
<br/>
<p class="fragment">
<strong>Scale:</strong> size of problem
</p>
<p class="fragment">
Scale is proportional to number of processes <code>P</code> used,
and thus the maximum <em>degree of parallelism</em> possible.
</p>
<p class="fragment">
<strong>FLOPS:</strong> number of floating-point operations per second
</p>
</section>
<section>
<h3>Scaling up code</h3>
<br/>
<p class="fragment">
Scale up slowly—start with one processor, then 10, 100, etc.
</p>
<p class="fragment">
<strong>scalability:</strong> how easy or hard it is to scale code
</p>
</section>
<section>
<p>
<strong>strong scaling:</strong> how runtime changes as a function of processor number
for a fixed total problem size
</p>
<p class="fragment">
speedup \(s\): ratio of time on one processor, \(t_1\), to time on \(P\) processors, \(t_P\):
</p>
<div class="fragment">
\[\begin{equation}
s(P) = \frac{t_1}{t_P}
\end{equation} \]
</div>
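<p class="fragment">
For example (hypothetical numbers): if \(t_1 = 100\) s and \(t_{16} = 8\) s, then \(s(16) = 100/8 = 12.5\), below the ideal speedup of 16.
</p>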
</section>
<section>
<h3>Strong scaling</h3><br/>
<p class="fragment">
Efficient system: linear strong scaling speedup. For example, <a href="http://www.pyfr.org">PyFR</a> CFD code:
</p>
<img class="fragment" src="./images/pyfr-strongscaling.png" height="300" alt="pyFR strong scaling performance">
<div class="fragment">
<small>
Ref: P Vincent et al. "PyFR: Next-Generation High-Order Computational Fluid Dynamics on Many-Core Hardware." 22nd AIAA CFD Conference. 2015.
</small>
</div>
</section>
<section>
<p class="fragment">
<strong>weak scaling:</strong> how runtime changes as a function of processor number
for a fixed problem size per processor
</p>
<p class="fragment">
sizeup \(z\), for a problem size \(N\):
</p>
<div class="fragment">
\[\begin{equation}
z(P) = \frac{t_1}{t_P} \times \frac{N_P}{N_1}
\end{equation} \]
</div>
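<p class="fragment">
For example (hypothetical numbers): doubling both the processor count and the problem size while the runtime stays constant gives \(z(2) = 1 \times 2 = 2\), the ideal linear sizeup.
</p>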
</section>
<section>
<h3>Weak scaling</h3>
<p class="fragment">
Efficient system: linear sizeup, i.e., constant runtime as processors and problem size grow together.
For example, <a href="http://www.pyfr.org">PyFR</a> CFD code:
</p>
<img class="fragment" src="./images/pyfr-weakscaling.png" height="300" alt="pyFR weak scaling performance">
<div class="fragment">
<small>
Ref: P Vincent et al. "PyFR: Next-Generation High-Order Computational Fluid Dynamics on Many-Core Hardware." 22nd AIAA CFD Conference. 2015.
</small>
</div>
</section>
<section>
<h3>Amdahl's Law</h3>
<p class="fragment">
Some fraction of an algorithm, \(\alpha\), cannot be parallelized.
Then, the maximum speedup/sizeup for \(P\) processors is:
</p>
<div class="fragment">
\[\begin{equation}
\max (s(P)) = \frac{1}{\alpha + \frac{1 - \alpha}{P}}
\end{equation} \]
</div>
<p class="fragment">
Then, the theoretically max speedup is:
</p>
<div class="fragment">
\[\begin{equation}
\max (s) = \frac{1}{\alpha}
\end{equation} \]
</div>
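<p class="fragment">
For example: if \(\alpha = 0.1\) (10% serial), then \(\max(s(16)) = 1/(0.1 + 0.9/16) = 6.4\), and no number of processors can push the speedup past \(\max(s) = 10\).
</p>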
</section>
</section>
<section>
<h3>Types of problems</h3>
<br/>
<div style="font-size:30px">
<p class="fragment">
<strong>embarrassingly parallel:</strong> algorithms with a high degree
of independence and little communication between parts
</p>
<p class="fragment">
Examples: summing large arrays, matrix multiplication, Monte Carlo simulations,
some optimization approaches (e.g., stochastic and genetic algorithms)
</p>
</div>
<div style="font-size:30px">
<p class="fragment">
Other algorithms have unavoidable bottlenecks: inverting a matrix, ODE integration, etc.
</p>
<p class="fragment">
Not all hope is lost! Some parts may still benefit from parallelization.
</p>
</div>
</section>
<section>
<section>
<h3>Example: N-body problem</h3>
<br/>
<p class="fragment" style="font-size:30px">generalization of the classic 2-body problem
that governs the equations of motion for two masses. From Newton's law of gravity:
</p>
<div class="fragment">
\[\begin{equation}
\frac{dp}{dt} = G \frac{m_1 m_2}{r^2}
\end{equation} \]
</div>
</section>
<section>
<h3>Equations of motion:</h3>
<br/>
<div class="fragment" style="font-size:24px">
\[\begin{aligned}
\mathbf{x}_{i,s} & = G \sum_{j=1, i \neq j} \frac{m_j (\mathbf{x}_{j,s-1} - \mathbf{x}_{i,s-1})}{||\mathbf{x}_{j,s-1} - \mathbf{x}_{i,s-1}||^3} \Delta t^2 + \mathbf{v}_{i,s-1} \Delta t + \mathbf{x}_{i,s-1} \\
\mathbf{v}_{i,s} & = G \sum_{j=1, i \neq j} \frac{m_j (\mathbf{x}_{j,s-1} - \mathbf{x}_{i,s-1})}{||\mathbf{x}_{j,s-1} - \mathbf{x}_{i,s-1}||^3} \Delta t + \mathbf{v}_{i,s-1}
\end{aligned} \]
</div>
</section>
<section>
<h3>No parallelism</h3>
<div style="font-size:15px">
<pre><code data-trim class="python">
import numpy as np

def remove_i(x, i):
    """Drops the ith element of an array."""
    shape = (x.shape[0]-1,) + x.shape[1:]
    y = np.empty(shape, dtype=float)
    y[:i] = x[:i]
    y[i:] = x[i+1:]
    return y

def a(i, x, G, m):
    """The acceleration of the ith mass."""
    x_i = x[i]
    x_j = remove_i(x, i) # don't compute on itself
    m_j = remove_i(m, i)
    diff = x_j - x_i
    mag3 = np.sum(diff**2, axis=1)**1.5
    # compute acceleration on ith mass
    result = G * np.sum(diff * (m_j / mag3)[:,np.newaxis], axis=0)
    return result
</code></pre>
<pre><code data-trim class="python">
def timestep(x0, v0, G, m, dt):
    """Computes the next position and velocity for all masses given
    initial conditions and a time step size.
    """
    N = len(x0)
    x1 = np.empty(x0.shape, dtype=float)
    v1 = np.empty(v0.shape, dtype=float)
    for i in range(N): # update locations for all masses each step
        a_i0 = a(i, x0, G, m)
        v1[i] = a_i0 * dt + v0[i]
        x1[i] = a_i0 * dt**2 + v0[i] * dt + x0[i]
    return x1, v1

def initial_cond(N, D):
    """Generates initial conditions for N unity masses at rest
    starting at random positions in D-dimensional space.
    """
    x0 = np.random.rand(N, D) # use random initial locations
    v0 = np.zeros((N, D), dtype=float)
    m = np.ones(N, dtype=float)
    return x0, v0, m
</code></pre>
</div>
</section>
<section>
<div>
<pre><code data-trim class="python">
x0, v0, m = initial_cond(10, 2)
x1, v1 = timestep(x0, v0, 1.0, m, 1.0e-3)
</code></pre>
<p class="fragment">Driver function that simulates \(S\) time steps:</p>
<pre><code data-trim class="python fragment">
def simulate(N, D, S, G, dt):
    x0, v0, m = initial_cond(N, D)
    for s in range(S):
        x1, v1 = timestep(x0, v0, G, m, dt)
        x0, v0 = x1, v1
</code></pre>
</div>
</section>
<section>
<h3>Measure performance:</h3>
<br/>
<div>
<pre><code data-trim class="python">
import time

Ns = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
runtimes = []
for N in Ns:
    start = time.time()
    simulate(N, 3, 300, 1.0, 1e-3)
    stop = time.time()
    runtimes.append(stop - start)
</code></pre>
</div>
<p class="fragment">Does the problem scale quadratically?</p>
</section>
<section>
<h3>Threads</h3>
<br/>
<p class="fragment">Threads perform work and are not blocked by work of other threads.</p>
<p class="fragment">Threads can communicate with each other through state</p>
</section>
<section>
<h3>Threads: probably don't use in Python.</h3>
<br/>
<p class="fragment">All threads execute in the same process as Python itself.</p>
<p class="fragment">Where to use? If you have high latency tasks where you can use spare time for another task (e.g., downloading a large file)</p>
</section>
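<section>
<h3>Example: threads for I/O-bound work</h3>
<p class="fragment" style="font-size:24px">
A sketch (not from the original slides) using <code>concurrent.futures.ThreadPoolExecutor</code> to overlap high-latency downloads; the URLs are placeholders.
</p>
<div style="font-size:20px">
<pre><code data-trim class="python fragment">
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

def download(url):
    """Fetch a URL and return the number of bytes read."""
    with urlopen(url) as response:
        return len(response.read())

# threads overlap the time spent waiting on the network
with ThreadPoolExecutor(max_workers=4) as executor:
    sizes = list(executor.map(download, urls))
print(sizes)
</code></pre>
</div>
</section>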
<section>
<h3>N-body with threading</h3>
<div style="font-size:20px">
<pre><code data-trim class="python">
from threading import Thread

class Worker(Thread):
    """Computes x, v, and a of the ith body."""
    def __init__(self, *args, **kwargs):
        super(Worker, self).__init__(*args, **kwargs)
        self.inputs = [] # buffer of work for threads
        self.results = [] # buffer of return values
        self.running = True # setting to False causes run() to end safely
        self.daemon = True # allow thread to be terminated with Python
        self.start()

    def run(self):
        while self.running:
            if len(self.inputs) == 0: # check for work
                continue
            i, x0, v0, G, m, dt = self.inputs.pop(0)
            a_i0 = a(i, x0, G, m) # body of original timestep()
            v_i1 = a_i0 * dt + v0[i]
            x_i1 = a_i0 * dt**2 + v0[i] * dt + x0[i]
            result = (i, x_i1, v_i1)
            self.results.append(result)
</code></pre>
</div>
</section>
<section>
<h3>Thread pools</h3>
<p class="fragment">Expensive to create and start threads, so create pool that persists</p>
<div style="font-size:20px">
<pre><code data-trim class="python fragment">
class Pool(object):
    """A collection of P worker threads that distributes tasks
    evenly across them.
    """
    def __init__(self, size):
        self.size = size
        # create new workers based on size
        self.workers = [Worker() for p in range(size)]

    def do(self, tasks):
        for p in range(self.size):
            self.workers[p].inputs += tasks[p::self.size] # evenly distribute tasks
        while any([len(worker.inputs) != 0 for worker in self.workers]):
            pass # wait for all workers to finish
        results = []
        for worker in self.workers: # get results back from workers
            results += worker.results
            worker.results.clear()
        return results # return complete list of results for all inputs

    def __del__(self): # stop workers when pool is shut down
        for worker in self.workers:
            worker.running = False
</code></pre>
</div>
</section>
<section>
<div style="font-size:20px">
<pre><code data-trim class="python">
def timestep(x0, v0, G, m, dt, pool):
    """Computes the next position and velocity for all masses given
    initial conditions and a time step size.
    """
    N = len(x0)
    tasks = [(i, x0, v0, G, m, dt) for i in range(N)] # create task for each body
    results = pool.do(tasks) # run tasks
    x1 = np.empty(x0.shape, dtype=float)
    v1 = np.empty(v0.shape, dtype=float)
    for i, x_i1, v_i1 in results: # rearrange results (probably not in order)
        x1[i] = x_i1
        v1[i] = v_i1
    return x1, v1
</code></pre>
<p class="fragment">How does simulation perform as function of \(P\)?</p>
<pre><code data-trim class="python fragment">
Ps = [1, 2, 4, 8]
runtimes = []
for P in Ps:
    start = time.time()
    # threaded simulate assumed to take the pool size P as its first argument
    simulate(P, 64, 3, 300, 1.0, 1e-3)
    stop = time.time()
    runtimes.append(stop - start)
</code></pre>
</div>
</section>
<section>
<h3>Multiprocessing</h3>
<br/>
<p class="fragment">More like threading available in other languages...</p>
<p class="fragment">Can't be used directly from interactive interpreter; main module must be importable by fork processes</p>
<p class="fragment">Provides <code>Pool</code> class for us, used with powerful <code>map()</code> function.</p>
</section>
<section>
<div style="font-size:20px">
<pre><code data-trim class="python">
from multiprocessing import Pool

# function needs one argument
def timestep_i(args):
    """Computes the next position and velocity for the ith mass."""
    i, x0, v0, G, m, dt = args # unpack arguments to original function
    a_i0 = a(i, x0, G, m) # body of original timestep()
    v_i1 = a_i0 * dt + v0[i]
    x_i1 = a_i0 * dt**2 + v0[i] * dt + x0[i]
    return i, x_i1, v_i1
</code></pre>
<pre><code data-trim class="python fragment">
def timestep(x0, v0, G, m, dt, pool):
    """Computes the next position and velocity for all masses given
    initial conditions and a time step size.
    """
    N = len(x0)
    tasks = [(i, x0, v0, G, m, dt) for i in range(N)]
    results = pool.map(timestep_i, tasks) # replace old do() with Pool.map()
    x1 = np.empty(x0.shape, dtype=float)
    v1 = np.empty(v0.shape, dtype=float)
    for i, x_i1, v_i1 in results:
        x1[i] = x_i1
        v1[i] = v_i1
    return x1, v1
</code></pre>
</div>
</section>
<section>
<h3>How does <code>multiprocessing</code> scale?</h3>
<pre><code data-trim class="python">
import time

Ps = [1, 2, 4, 8]
runtimes = []
for P in Ps:
    start = time.time()
    # simulate assumed to take the pool size P as its first argument
    simulate(P, 256, 3, 300, 1.0, 1e-3)
    stop = time.time()
    runtimes.append(stop - start)
</code></pre>
</section>
</section>
<section>
<section>
<h3>MPI: Message-Passing Interface</h3>
<p class="fragment">Supercomputing is built on MPI.</p>
<p class="fragment">MPI scales from hundreds of thousands to millions of processors.</p>
<p class="fragment">Processes are called ranks and given integer identifiers; rank 0 is usually the "master" that controls other ranks. All about communication.</p>
<p class="fragment">Can be used in Python and NumPy arrays with <code>mpi4py</code></p>
</section>
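<section>
<h3>mpi4py sketch</h3>
<p class="fragment" style="font-size:24px">
A minimal sketch (not from the original slides) of point-to-point communication of a NumPy array with <code>mpi4py</code>, assuming it is installed; run with <code>mpiexec -n 2 python script.py</code>.
</p>
<div style="font-size:20px">
<pre><code data-trim class="python fragment">
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank() # integer identifier of this process

if rank == 0:
    data = np.arange(10, dtype=float)
    comm.Send(data, dest=1, tag=11) # send NumPy array to rank 1
elif rank == 1:
    data = np.empty(10, dtype=float)
    comm.Recv(data, source=0, tag=11) # receive into preallocated buffer
    print(rank, data)
</code></pre>
</div>
</section>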
</section>
<section>
<section>
<h3>numba: just-in-time compiler for Python/NumPy</h3>
<p class="fragment">translates Python functions to optimized machine code at runtime using the LLVM compiler</p>
<p class="fragment">does not require a separate compiler, just decorators</p>
<p class="fragment">can automatically parallelize loops</p>
</section>
<section>
<div style="font-size:28px">
<pre><code data-trim class="python">
from numba import jit
import random

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
</code></pre>
</div>
</section>
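<section>
<p class="fragment" style="font-size:24px">
A sketch (not from the original slides) of automatic loop parallelization with <code>parallel=True</code> and <code>prange</code>:
</p>
<div style="font-size:28px">
<pre><code data-trim class="python fragment">
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum_of_squares(x):
    total = 0.0
    for i in prange(x.shape[0]): # prange iterations run across threads
        total += x[i] ** 2 # reduction handled automatically by numba
    return total

x = np.random.rand(10_000_000)
print(parallel_sum_of_squares(x))
</code></pre>
</div>
</section>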
</section>
<section>
<section data-markdown>
<textarea data-template>
### Takeaways
If your problem calls for it, consider parallelizing.
That said, don't jump straight to parallelism when smart NumPy array operations,
appropriate data structures, or existing solutions may suffice.
</textarea>
</section>
</section>
</div>
</div>
<script src="dist/reveal.js"></script>
<script src="plugin/notes/notes.js"></script>
<script src="plugin/markdown/markdown.js"></script>
<script src="plugin/highlight/highlight.js"></script>
<script src="plugin/math/math.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes, RevealMath.KaTeX ],
});
</script>
</body>
</html>