A few improvements (#114)
* add julia for hpc course

* mention array slices and column major order

* improve section about at-threads

* drop comment about at-spawn

* drop course (separate PR)

* mention ThreadPinning.jl

* mention a few popular external profilers
carstenbauer authored Jul 15, 2024
1 parent ff73c90 commit 8c541f4
Showing 1 changed file with 13 additions and 7 deletions: optimizing/index.md
@@ -146,6 +146,10 @@ No matter which tool you use, if your code is too fast to collect samples, you m
Inspecting the call graph can help identify which types are responsible for the allocations.
}

### External profilers

Apart from the built-in `Profile` standard library, there are a few external profilers that you can use, including [Intel VTune](https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html) (in combination with [IntelITT.jl](https://github.com/JuliaPerf/IntelITT.jl)), [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) (in combination with [NVTX.jl](https://github.com/JuliaGPU/NVTX.jl)), and [Tracy](https://docs.julialang.org/en/v1/devdocs/external_profilers/#Tracy-Profiler).
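
For instance, with NVTX.jl you can mark regions of your code so that they show up as labelled ranges in the Nsight Systems timeline. A minimal sketch using the package's `@range` macro (the function name is just illustrative, and the program would be launched under `nsys profile`):

```julia
using NVTX

# This region appears as a named range in the Nsight Systems timeline
# when the script is run under `nsys profile julia script.jl`.
function expensive_step(x)
    NVTX.@range "expensive_step" begin
        sum(abs2, x)
    end
end

expensive_step(rand(10^6))
```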

## Type stability

\tldr{Use JET.jl to automatically detect type instabilities in your code, and `@code_warntype` or Cthulhu.jl to do so manually. DispatchDoctor.jl can help prevent them altogether.}
@@ -220,7 +224,7 @@ A more direct approach is to error whenever a type instability occurs: the macro

After ensuring type stability, one should try to reduce the number of heap allocations a program makes.
Again, the Julia manual has a series of tricks related to [arrays and allocations](https://docs.julialang.org/en/v1.12-dev/manual/performance-tips/#Memory-management-and-arrays) which you should take a look at.
In particular, try to modify existing arrays instead of allocating new objects.
In particular, try to modify existing arrays instead of allocating new objects (be careful with array slices, which allocate copies by default) and try to access arrays in the right order (column-major, i.e. with the first index varying fastest).
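
Both points are illustrated in the small sketch below: slices copy by default unless you use views, and loops should traverse the first (row) index innermost.

```julia
A = rand(1000, 1000)

# A slice like A[:, 1] allocates a copy; a view does not.
sum_slice(A) = sum(A[:, 1])       # allocates a new vector
sum_view(A) = sum(@view A[:, 1])  # allocation-free

# Julia arrays are stored in column-major order,
# so the first index should vary fastest.
function sum_colmajor(A)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)  # rows (i) in the inner loop
        s += A[i, j]
    end
    return s
end
```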

And again, you can also choose to error whenever an allocation occurs, with the help of [AllocCheck.jl](https://github.com/JuliaLang/AllocCheck.jl).
By annotating a function with `@check_allocs`, if the function is run and the compiler detects that it might allocate, it will throw an error.
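
A minimal sketch of that workflow (the function names here are just illustrative):

```julia
using AllocCheck

# In-place update: the fused broadcast does not allocate, so calls succeed.
@check_allocs addmul!(y, a, x) = (y .+= a .* x; y)

x, y = rand(100), rand(100)
addmul!(y, 2.0, x)  # runs fine

# A variant that builds a new array errors at call time:
@check_allocs addmul(y, a, x) = y .+ a .* x
# addmul(y, 2.0, x)  # throws, since broadcasting here allocates the result
```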
@@ -315,8 +319,8 @@ The README of StaticCompiler.jl contains a more [detailed guide](https://github.

\tldr{Use `Threads` or OhMyThreads.jl on a single machine, `Distributed` or MPI.jl on a computing cluster. GPU-compatible code is easy to write and run.}

Code can be made to run faster through parallel execution with [multithreading](https://docs.julialang.org/en/v1/manual/multi-threading/) or [multiprocessing / distributed computing](https://docs.julialang.org/en/v1/manual/distributed-computing/).
Many common operations such as maps and reductions can be trivially parallelised through either method by using their respective Julia packages.
Code can be made to run faster through parallel execution with [multithreading](https://docs.julialang.org/en/v1/manual/multi-threading/) (shared-memory parallelism) or [multiprocessing / distributed computing](https://docs.julialang.org/en/v1/manual/distributed-computing/).
Many common operations such as maps and reductions can be trivially parallelised through either method by using their respective Julia packages (e.g. `pmap` from Distributed.jl and `tmap` from OhMyThreads.jl).
Multithreading is available on almost all modern hardware, whereas distributed computing is most useful to users of high-performance computing clusters.
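
As a small taste of how little code this requires, here is a multithreaded map with OhMyThreads.jl (a minimal sketch):

```julia
using OhMyThreads: tmap

# Multithreaded analogue of `map`: the elements are processed
# concurrently on the available Julia threads.
squares = tmap(i -> i^2, 1:100)
```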

### Multithreading
@@ -342,10 +346,9 @@ Once Julia is running, you can check if this was successful by calling `Threads.
}

Regardless of the number of threads, you can parallelise a for loop with the macro `Threads.@threads`.
The macros `@spawn` and `@async` function similarly, but require more manual management of the results, which can result in bugs and performance footguns.
For this reason `@threads` is recommended for those who do not wish to use third-party packages.
The macros `@spawn` and `@async` function similarly, but require more manual management of tasks and their results. For this reason, `@threads` is recommended for those who do not wish to use third-party packages.
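
To give a flavor of that manual management, here is a small task-based sketch computing four squares:

```julia
# Threads.@spawn returns a Task immediately; you have to keep track of
# the tasks yourself and fetch their results explicitly.
tasks = map(1:4) do i
    Threads.@spawn i^2
end
results = fetch.(tasks)
```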

When you design multithreaded code, you need to be careful to avoid "race conditions", i.e. situations when competing threads try to write different things to the same memory location.
When designing multithreaded code, you should generally try to write to shared memory as rarely as possible. Where it cannot be avoided, you need to be careful to avoid "race conditions", i.e. situations when competing threads try to write different things to the same memory location.
It is usually a good idea to separate memory accesses with loop indices, as in the example below:

```julia @threads-forloop
@@ -354,12 +357,15 @@ Threads.@threads for i in 1:4
results[i] = i^2
end
```
Almost always, it is [**not** a good idea to use `threadid()`](https://julialang.org/blog/2023/07/PSA-dont-use-threadid/).

Managing threads and their memory use is made much easier by [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl), which provides a user-friendly alternative to `Threads`.
Even if you manage to avoid any race conditions in your multithreaded code, it is very easy to run into subtle performance issues (like [false sharing](https://en.wikipedia.org/wiki/False_sharing)). For these reasons, you might want to consider using a high-level package like [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl), which provides a user-friendly alternative to `Threads` and makes managing threads and their memory use much easier.
The helpful [translation guide](https://juliafolds2.github.io/OhMyThreads.jl/stable/translation/) will get you started in a jiffy.
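
For instance, the `@threads` loop from above could be written as follows (a sketch using OhMyThreads' `@tasks` macro):

```julia
using OhMyThreads: @tasks

results = zeros(Int, 4)
@tasks for i in 1:4
    results[i] = i^2
end
```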

If the latency of spinning up new threads becomes a bottleneck, check out [Polyester.jl](https://github.com/JuliaSIMD/Polyester.jl) for very lightweight threads that are quicker to start.
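
Its `@batch` macro acts as a lightweight substitute for `Threads.@threads` (a minimal sketch):

```julia
using Polyester: @batch

results = zeros(Int, 4)
@batch for i in 1:4
    results[i] = i^2
end
```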

If you're on Linux, you should consider using [ThreadPinning.jl](https://github.com/carstenbauer/ThreadPinning.jl) to pin your Julia threads to CPU cores to obtain stable and optimal performance. The package can also be used to visualize where the Julia threads are running on your system (see `threadinfo()`).
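
A minimal sketch of its two main entry points:

```julia
using ThreadPinning

pinthreads(:cores)  # pin the Julia threads to distinct CPU cores
threadinfo()        # visualize where the threads are running
```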

\advanced{
Some widely used parallel programming packages like [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) (which also powers [Octavian.jl](https://github.com/JuliaLinearAlgebra/Octavian.jl)) or [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) are no longer maintained.
}
