<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<title>Software Development for Engineering Research - Parallel Computing</title>
<meta name="description" content="
Lecture on parallel computing for ME 599: Software Development for Engineering Research
">
<meta name="author" content="Kyle Niemeyer">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<link rel="stylesheet" href="dist/reset.css">
<link rel="stylesheet" href="dist/reveal.css">
<link rel="stylesheet" href="dist/theme/simple.css" id="theme">
<!-- For syntax highlighting -->
<link rel="stylesheet" href="plugin/highlight/monokai.css">
<!-- If the query includes 'print-pdf', include the PDF print sheet -->
<script>
if( window.location.search.match( /print-pdf/gi ) ) {
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = 'css/print/pdf.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
}
</script>
<!--[if lt IE 9]>
<script src="lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="slides">
<section>
<h2>Intro. to Parallel Computing</h2>
<h3>Software Development for Engineering Research</h3>
<br/>
<h3>Kyle Niemeyer, 17 February 2022</h3>
<h3>ME 599, Corvallis, OR</h3>
</section>
<section>
<section>
<h2>Today: Introduction to parallel computing (in Python)</h2>
<br/>
<p class="fragment">If you want more, check out <a href="http://web.engr.oregonstate.edu/~mjb/cs575/">CS 475/575 Parallel Programming</a>.
</p>
</section>
<section>
<h3>Approaching parallelism:</h3>
<br/>
<ul>
<li class="fragment">How do I do many things at once?
</li>
<li class="fragment">How do I do <em>one thing</em> faster?
</li>
</ul>
<br/>
<p class="fragment">Computer: perform many simultaneous tasks</p>
<p class="fragment">Developer: determine dependencies between code and data</p>
</section>
<section>
<p class="fragment">Up until now: serial thinking, like "x then y then z"</p>
<br/>
<p class="fragment">Benefit of parallelism: problems execute faster—sometimes <em>much</em> faster.</p>
<p class="fragment">Downside to parallelism: harder to program, debug, open files, print to screen.</p>
</section>
<section>
<h3>Why go parallel?</h3><br/>
<p class="fragment">First: wait until your code works and you genuinely need the extra performance.</p>
<br/>
<ul>
<li class="fragment">The problem creates or requires too much data for a normal machine.
</li>
<li class="fragment">The sun would explode before the computation would finish.
</li>
<li class="fragment">The algorithm is easy to parallelize.
</li>
<li class="fragment">The physics itself cannot be simulated at all with smaller resources.
</li>
</ul>
</section>
</section>
<section>
<section>
<h3>Scale and Scalability</h3>
<br/>
<img src="./images/sense-and-sensibility.jpg" height="500" alt="Scale and Scalability 'book' image">
</section>
<section>
<h3>Scale and Scalability</h3>
<br/>
<p class="fragment">
<strong>Scale:</strong> size of problem
</p>
<p class="fragment">
Scale is proportional to number of processes <code>P</code> used,
and thus the maximum <em>degree of parallelism</em> possible.
</p>
<p class="fragment">
<strong>FLOPS:</strong> number of floating-point operations per second
</p>
</section>
<section>
<h3>Scaling up code</h3>
<br/>
<p class="fragment">
Scale up slowly—start with one processor, then 10, 100, etc.
</p>
<p class="fragment">
<strong>scalability:</strong> how easy or hard it is to scale code
</p>
</section>
<section>
<p>
<strong>strong scaling:</strong> how runtime changes as a function of processor number
for a fixed total problem size
</p>
<p class="fragment">
speedup \(s\): ratio of time on one processor, \(t_1\), to time on \(P\) processors, \(t_P\):
</p>
<div class="fragment">
\[\begin{equation}
s(P) = \frac{t_1}{t_P}
\end{equation} \]
</div>
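<p class="fragment">
For example (hypothetical numbers): if \(t_1 = 100\) s and \(t_{16} = 8\) s, then \(s(16) = 100/8 = 12.5\), below the ideal speedup of 16.
</p>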
</section>
<section>
<h3>Strong scaling</h3><br/>
<p class="fragment">
Efficient system: linear strong scaling speedup. For example, <a href="http://www.pyfr.org">PyFR</a> CFD code:
</p>
<img class="fragment" src="./images/pyfr-strongscaling.png" height="300" alt="pyFR strong scaling performance">
<div class="fragment">
<small>
Ref: P Vincent et al. "PyFR: Next-Generation High-Order Computational Fluid Dynamics on Many-Core Hardware." 22nd AIAA CFD Conference. 2015.
</small>
</div>
</section>
<section>
<p class="fragment">
<strong>weak scaling:</strong> how runtime changes as a function of processor number
for a fixed problem size per processor
</p>
<p class="fragment">
sizeup \(z\), for a problem size \(N\):
</p>
<div class="fragment">
\[\begin{equation}
z(P) = \frac{t_1}{t_P} \times \frac{N_P}{N_1}
\end{equation} \]
</div>
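<p class="fragment">
For example (hypothetical numbers): doubling both the processor count and the problem size while the runtime stays constant gives \(z(2) = 1 \times 2 = 2\), the ideal linear sizeup.
</p>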
</section>
<section>
<h3>Weak scaling</h3>
<p class="fragment">
Efficient system: linear sizeup, i.e., constant runtime as processors and problem size grow together.
For example, <a href="http://www.pyfr.org">PyFR</a> CFD code:
</p>
<img class="fragment" src="./images/pyfr-weakscaling.png" height="300" alt="pyFR weak scaling performance">
<div class="fragment">
<small>
Ref: P Vincent et al. "PyFR: Next-Generation High-Order Computational Fluid Dynamics on Many-Core Hardware." 22nd AIAA CFD Conference. 2015.
</small>
</div>
</section>
<section>
<h3>Amdahl's Law</h3>
<p class="fragment">
Some fraction of an algorithm, \(\alpha\), cannot be parallelized.
Then, the maximum speedup/sizeup for \(P\) processors is:
</p>
<div class="fragment">
\[\begin{equation}
\max (s(P)) = \frac{1}{\alpha + \frac{1 - \alpha}{P}}
\end{equation} \]
</div>
<p class="fragment">
Then, the theoretically max speedup is:
</p>
<div class="fragment">
\[\begin{equation}
\max (s) = \frac{1}{\alpha}
\end{equation} \]
</div>
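<p class="fragment">
For example: if \(\alpha = 0.1\) (10% serial), then \(\max(s(16)) = 1/(0.1 + 0.9/16) = 6.4\), and no number of processors can push the speedup past \(\max(s) = 10\).
</p>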
</section>
</section>
<section>
<h3>Types of problems</h3>
<br/>
<div style="font-size:30px">
<p class="fragment">
<strong>embarrassingly parallel:</strong> algorithms with a high degree
of independence and little communication between parts
</p>
<p class="fragment">
Examples: summing large arrays, matrix multiplication, Monte Carlo simulations,
some optimization approaches (e.g., stochastic and genetic algorithms)
</p>
</div>
<div style="font-size:30px">
<p class="fragment">
Other algorithms have unavoidable bottlenecks: inverting a matrix, ODE integration, etc.
</p>
<p class="fragment">
Not all hope is lost! Some parts may still benefit from parallelization.
</p>
</div>
</section>
<section>
<section>
<h3>Example: N-body problem</h3>
<br/>
<p class="fragment" style="font-size:30px">generalization of the classic 2-body problem
that governs the equations of motion for two masses. From Newton's law of gravity:
</p>
<div class="fragment">
\[\begin{equation}
\frac{dp}{dt} = G \frac{m_1 m_2}{r^2}
\end{equation} \]
</div>
</section>
<section>
<h3>Equations of motion:</h3>
<br/>
<div class="fragment" style="font-size:24px">
\[\begin{aligned}
\mathbf{x}_{i,s} & = G \sum_{j=1, i \neq j} \frac{m_j (\mathbf{x}_{j,s-1} - \mathbf{x}_{i,s-1})}{||\mathbf{x}_{j,s-1} - \mathbf{x}_{i,s-1}||^3} \Delta t^2 + \mathbf{v}_{i,s-1} \Delta t + \mathbf{x}_{i,s-1} \\
\mathbf{v}_{i,s} & = G \sum_{j=1, i \neq j} \frac{m_j (\mathbf{x}_{j,s-1} - \mathbf{x}_{i,s-1})}{||\mathbf{x}_{j,s-1} - \mathbf{x}_{i,s-1}||^3} \Delta t + \mathbf{v}_{i,s-1}
\end{aligned} \]
</div>
</section>
<section>
<h3>No parallelism</h3>
<div style="font-size:15px">
<pre><code data-trim class="python">
import numpy as np

def remove_i(x, i):
    """Drops the ith element of an array."""
    shape = (x.shape[0]-1,) + x.shape[1:]
    y = np.empty(shape, dtype=float)
    y[:i] = x[:i]
    y[i:] = x[i+1:]
    return y

def a(i, x, G, m):
    """The acceleration of the ith mass."""
    x_i = x[i]
    x_j = remove_i(x, i) # don't compute on itself
    m_j = remove_i(m, i)
    diff = x_j - x_i
    mag3 = np.sum(diff**2, axis=1)**1.5
    # compute acceleration on ith mass
    result = G * np.sum(diff * (m_j / mag3)[:,np.newaxis], axis=0)
    return result
</code></pre>
<pre><code data-trim class="python">
def timestep(x0, v0, G, m, dt):
    """Computes the next position and velocity for all masses given
    initial conditions and a time step size.
    """
    N = len(x0)
    x1 = np.empty(x0.shape, dtype=float)
    v1 = np.empty(v0.shape, dtype=float)
    for i in range(N): # update locations for all masses each step
        a_i0 = a(i, x0, G, m)
        v1[i] = a_i0 * dt + v0[i]
        x1[i] = a_i0 * dt**2 + v0[i] * dt + x0[i]
    return x1, v1

def initial_cond(N, D):
    """Generates initial conditions for N unity masses at rest
    starting at random positions in D-dimensional space.
    """
    x0 = np.random.rand(N, D) # use random initial locations
    v0 = np.zeros((N, D), dtype=float)
    m = np.ones(N, dtype=float)
    return x0, v0, m
</code></pre>
</div>
</section>
<section>
<div>
<pre><code data-trim class="python">
x0, v0, m = initial_cond(10, 2)
x1, v1 = timestep(x0, v0, 1.0, m, 1.0e-3)
</code></pre>
<p class="fragment">Driver function that simulates \(S\) time steps:</p>
<pre><code data-trim class="python fragment">
def simulate(N, D, S, G, dt):
    x0, v0, m = initial_cond(N, D)
    for s in range(S):
        x1, v1 = timestep(x0, v0, G, m, dt)
        x0, v0 = x1, v1
</code></pre>
</div>
</section>
<section>
<h3>Measure performance:</h3>
<br/>
<div>
<pre><code data-trim class="python">
import time

Ns = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
runtimes = []
for N in Ns:
    start = time.time()
    simulate(N, 3, 300, 1.0, 1e-3)
    stop = time.time()
    runtimes.append(stop - start)
</code></pre>
</div>
<p class="fragment">Does the problem scale quadratically?</p>
</section>
<section>
<h3>Threads</h3>
<br/>
<p class="fragment">Threads perform work and are not blocked by work of other threads.</p>
<p class="fragment">Threads can communicate with each other through state</p>
</section>
<section>
<h3>Threads: probably don't use in Python.</h3>
<br/>
<p class="fragment">All threads execute in the same process as Python itself.</p>
<p class="fragment">Where to use? If you have high latency tasks where you can use spare time for another task (e.g., downloading a large file)</p>
</section>
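<section>
<h3>Example: threads for I/O-bound work</h3>
<p class="fragment" style="font-size:24px">
A sketch (not from the original slides) using <code>concurrent.futures.ThreadPoolExecutor</code> to overlap high-latency downloads; the URLs are placeholders.
</p>
<div style="font-size:20px">
<pre><code data-trim class="python fragment">
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

def download(url):
    """Fetch a URL and return the number of bytes read."""
    with urlopen(url) as response:
        return len(response.read())

# threads overlap the time spent waiting on the network
with ThreadPoolExecutor(max_workers=4) as executor:
    sizes = list(executor.map(download, urls))
print(sizes)
</code></pre>
</div>
</section>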
<section>
<h3>N-body with threading</h3>
<div style="font-size:20px">
<pre><code data-trim class="python">
from threading import Thread

class Worker(Thread):
    """Computes x, v, and a of the ith body."""
    def __init__(self, *args, **kwargs):
        super(Worker, self).__init__(*args, **kwargs)
        self.inputs = [] # buffer of work for threads
        self.results = [] # buffer of return values
        self.running = True # setting to False causes run() to end safely
        self.daemon = True # allow thread to be terminated with Python
        self.start()

    def run(self):
        while self.running:
            if len(self.inputs) == 0: # check for work
                continue
            i, x0, v0, G, m, dt = self.inputs.pop(0)
            a_i0 = a(i, x0, G, m) # body of original timestep()
            v_i1 = a_i0 * dt + v0[i]
            x_i1 = a_i0 * dt**2 + v0[i] * dt + x0[i]
            result = (i, x_i1, v_i1)
            self.results.append(result)
</code></pre>
</div>
</section>
<section>
<h3>Thread pools</h3>
<p class="fragment">Expensive to create and start threads, so create pool that persists</p>
<div style="font-size:20px">
<pre><code data-trim class="python fragment">
class Pool(object):
    """A collection of P worker threads that distributes tasks
    evenly across them.
    """
    def __init__(self, size):
        self.size = size
        # create new workers based on size
        self.workers = [Worker() for p in range(size)]

    def do(self, tasks):
        for p in range(self.size):
            self.workers[p].inputs += tasks[p::self.size] # evenly distribute tasks
        while any([len(worker.inputs) != 0 for worker in self.workers]):
            pass # wait for all workers to finish
        results = []
        for worker in self.workers: # get results back from workers
            results += worker.results
            worker.results.clear()
        return results # return complete list of results for all inputs

    def __del__(self): # stop workers when pool is shut down
        for worker in self.workers:
            worker.running = False
</code></pre>
</div>
</section>
<section>
<div style="font-size:20px">
<pre><code data-trim class="python">
def timestep(x0, v0, G, m, dt, pool):
    """Computes the next position and velocity for all masses given
    initial conditions and a time step size.
    """
    N = len(x0)
    tasks = [(i, x0, v0, G, m, dt) for i in range(N)] # create task for each body
    results = pool.do(tasks) # run tasks
    x1 = np.empty(x0.shape, dtype=float)
    v1 = np.empty(v0.shape, dtype=float)
    for i, x_i1, v_i1 in results: # rearrange results (probably not in order)
        x1[i] = x_i1
        v1[i] = v_i1
    return x1, v1
</code></pre>
<p class="fragment">How does simulation perform as function of \(P\)?</p>
<pre><code data-trim class="python fragment">
Ps = [1, 2, 4, 8]
runtimes = []
for P in Ps:
    start = time.time()
    # threaded simulate assumed to take the pool size P as its first argument
    simulate(P, 64, 3, 300, 1.0, 1e-3)
    stop = time.time()
    runtimes.append(stop - start)
</code></pre>
</div>
</section>
<section>
<h3>Multiprocessing</h3>
<br/>
<p class="fragment">More like threading available in other languages...</p>
<p class="fragment">Can't be used directly from interactive interpreter; main module must be importable by fork processes</p>
<p class="fragment">Provides <code>Pool</code> class for us, used with powerful <code>map()</code> function.</p>
</section>
<section>
<div style="font-size:20px">
<pre><code data-trim class="python">
from multiprocessing import Pool

# function needs one argument
def timestep_i(args):
    """Computes the next position and velocity for the ith mass."""
    i, x0, v0, G, m, dt = args # unpack arguments to original function
    a_i0 = a(i, x0, G, m) # body of original timestep()
    v_i1 = a_i0 * dt + v0[i]
    x_i1 = a_i0 * dt**2 + v0[i] * dt + x0[i]
    return i, x_i1, v_i1
</code></pre>
<pre><code data-trim class="python fragment">
def timestep(x0, v0, G, m, dt, pool):
    """Computes the next position and velocity for all masses given
    initial conditions and a time step size.
    """
    N = len(x0)
    tasks = [(i, x0, v0, G, m, dt) for i in range(N)]
    results = pool.map(timestep_i, tasks) # replace old do() with Pool.map()
    x1 = np.empty(x0.shape, dtype=float)
    v1 = np.empty(v0.shape, dtype=float)
    for i, x_i1, v_i1 in results:
        x1[i] = x_i1
        v1[i] = v_i1
    return x1, v1
</code></pre>
</div>
</section>
<section>
<h3>How does <code>multiprocessing</code> scale?</h3>
<pre><code data-trim class="python">
import time

Ps = [1, 2, 4, 8]
runtimes = []
for P in Ps:
    start = time.time()
    # simulate assumed to take the pool size P as its first argument
    simulate(P, 256, 3, 300, 1.0, 1e-3)
    stop = time.time()
    runtimes.append(stop - start)
</code></pre>
</section>
</section>
<section>
<section>
<h3>MPI: Message-Passing Interface</h3>
<p class="fragment">Supercomputing is built on MPI.</p>
<p class="fragment">MPI scales from hundreds of thousands to millions of processors.</p>
<p class="fragment">Processes are called ranks and given integer identifiers; rank 0 is usually the "master" that controls other ranks. All about communication.</p>
<p class="fragment">Can be used in Python and NumPy arrays with <code>mpi4py</code></p>
</section>
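<section>
<h3>mpi4py sketch</h3>
<p class="fragment" style="font-size:24px">
A minimal sketch (not from the original slides) of point-to-point communication of a NumPy array with <code>mpi4py</code>, assuming it is installed; run with <code>mpiexec -n 2 python script.py</code>.
</p>
<div style="font-size:20px">
<pre><code data-trim class="python fragment">
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank() # integer identifier of this process

if rank == 0:
    data = np.arange(10, dtype=float)
    comm.Send(data, dest=1, tag=11) # send NumPy array to rank 1
elif rank == 1:
    data = np.empty(10, dtype=float)
    comm.Recv(data, source=0, tag=11) # receive into preallocated buffer
    print(rank, data)
</code></pre>
</div>
</section>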
</section>
<section>
<section>
<h3>numba: just-in-time compiler for Python/NumPy</h3>
<p class="fragment">translates Python functions to optimized machine code at runtime using the LLVM compiler</p>
<p class="fragment">does not require a separate compiler, just decorators</p>
<p class="fragment">can automatically parallelize loops</p>
</section>
<section>
<div style="font-size:28px">
<pre><code data-trim class="python">
from numba import jit
import random

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
</code></pre>
</div>
</section>
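<section>
<p class="fragment" style="font-size:24px">
A sketch (not from the original slides) of automatic loop parallelization with <code>parallel=True</code> and <code>prange</code>:
</p>
<div style="font-size:28px">
<pre><code data-trim class="python fragment">
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum_of_squares(x):
    total = 0.0
    for i in prange(x.shape[0]): # prange iterations run across threads
        total += x[i] ** 2 # reduction handled automatically by numba
    return total

x = np.random.rand(10_000_000)
print(parallel_sum_of_squares(x))
</code></pre>
</div>
</section>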
</section>
<section>
<section data-markdown>
<textarea data-template>
### Takeaways
If your problem calls for it, consider parallelizing.
That said, don't jump straight to parallelism when smart NumPy array operations,
appropriate data structures, or existing solutions may suffice.
</textarea>
</section>
</section>
</div>
</div>
<script src="dist/reveal.js"></script>
<script src="plugin/notes/notes.js"></script>
<script src="plugin/markdown/markdown.js"></script>
<script src="plugin/highlight/highlight.js"></script>
<script src="plugin/math/math.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes, RevealMath.KaTeX ],
});
</script>
</body>
</html>