
$$\text{\color{red} \Huge LEGACY CODEBASE}$$

Note: This repository holds the legacy cuFINUFFT codebase. Further development will take place in the FINUFFT repository. Please direct any issues or pull requests to that repository.

cuFINUFFT v1.3

cuFINUFFT is a very efficient GPU implementation of the 1-, 2-, and 3-dimensional nonuniform FFT of types 1 and 2, in single and double precision, based on the CPU code FINUFFT.

cuFINUFFT introduces several algorithmic innovations, including load-balancing, bin-sorting for cache-aware access, and use of fast shared memory. Our tests show a speedup of up to 10x over FINUFFT on modern hardware, and of up to 100x over other established GPU NUFFT codes.

The linear transforms it can perform may be summarized as follows. Type 1 maps nonuniform data (locations and corresponding strengths) to the uniformly spaced coefficients of a Fourier series (or its bi- or tri-variate generalization, according to dimension). Type 2 performs the adjoint operation of type 1, ie it maps uniformly spaced Fourier coefficients to values at the nonuniform locations. However, note that type 1 and type 2 are not generally each other's inverse, unlike in the FFT case! These transforms are performed to a user-prescribed tolerance, at close-to-FFT speeds; under the hood, this involves detailed kernel design, custom spreading/interpolation stages, and plain FFTs performed by cuFFT. See the documentation for FINUFFT for a full mathematical description of the transforms and their applications to signal processing, imaging, and scientific computing.
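To make the two transform types concrete, here is a minimal direct-summation sketch in 1D. It is a slow O(NM) reference and not how cuFINUFFT computes anything. It assumes the FINUFFT conventions as we understand them: points x[j] in radians, mode indices k running over -N/2 <= k <= (N-1)/2, and a sign iflag in the exponent. The function names are hypothetical and are not part of the cuFINUFFT API.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Type 1 (nonuniform -> uniform): f[m] = sum_j c[j] * exp(i * s * k * x[j]),
// where k = m - N/2 and s = +1 or -1 according to the sign of iflag.
// (Mode ordering and sign convention are assumptions, see the text above.)
std::vector<cplx> nufft1d1_direct(const std::vector<double>& x,
                                  const std::vector<cplx>& c,
                                  int iflag, int N) {
    assert(x.size() == c.size());
    const double s = (iflag >= 0) ? 1.0 : -1.0;
    const int kmin = -(N / 2);
    std::vector<cplx> f(static_cast<std::size_t>(N), cplx(0.0, 0.0));
    for (int m = 0; m < N; ++m) {
        const double k = static_cast<double>(kmin + m);
        for (std::size_t j = 0; j < x.size(); ++j)
            f[static_cast<std::size_t>(m)] += c[j] * std::polar(1.0, s * k * x[j]);
    }
    return f;
}

// Type 2 (uniform -> nonuniform): c[j] = sum_k f[k] * exp(i * s * k * x[j]),
// with the same mode ordering as above.
std::vector<cplx> nufft1d2_direct(const std::vector<double>& x,
                                  const std::vector<cplx>& f,
                                  int iflag) {
    const double s = (iflag >= 0) ? 1.0 : -1.0;
    const int N = static_cast<int>(f.size());
    const int kmin = -(N / 2);
    std::vector<cplx> c(x.size(), cplx(0.0, 0.0));
    for (std::size_t j = 0; j < x.size(); ++j) {
        for (int m = 0; m < N; ++m) {
            const double k = static_cast<double>(kmin + m);
            c[j] += f[static_cast<std::size_t>(m)] * std::polar(1.0, s * k * x[j]);
        }
    }
    return c;
}

// Inner product <a, b> = sum_i a[i] * conj(b[i]).
cplx inner(const std::vector<cplx>& a, const std::vector<cplx>& b) {
    assert(a.size() == b.size());
    cplx sum(0.0, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * std::conj(b[i]);
    return sum;
}
```

With the same points, type 2 with the opposite iflag is the adjoint of type 1, so inner(nufft1d1_direct(x, c, +1, N), f) agrees with inner(c, nufft1d2_direct(x, f, -1)) up to rounding error. Applying one after the other does not, in general, return the original c.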

Note: We are currently in the process of adapting the cuFINUFFT interface to more closely match that of FINUFFT. This will likely break code depending on the current interface once the next release is published. At that point we will publish a migration guide detailing the exact changes to the interfaces.

Main developer: Yu-hsuan Melody Shih (NYU). Main other contributors: Garrett Wright (Princeton), Joakim Andén (KTH/Flatiron), Johannes Blaschke (LBNL), Alex Barnett (Flatiron). See GitHub for the full list of contributors. This project came out of Melody's 2018 and 2019 summer internships at the Flatiron Institute, advised by CCM project leader Alex Barnett.

Installation

Note: most Python users may skip to the Python package section first, and consider installing from source only if that solution is not adequate for their needs. Note that 1D is not available in Python yet. Here's the C++ install process:

  • Make sure you have the prerequisites: a C++ compiler (eg g++) and a recent CUDA installation (nvcc).
  • Get the code: git clone https://github.com/flatironinstitute/cufinufft.git
  • Review the Makefile: If you need to customize build settings, create and edit a make.inc. Example:
    • To override the standard CUDA /usr/local/cuda location your make.inc should contain: CUDA_ROOT=/your/path/to/cuda.
    • For examples, see one for IBM machines (targets/make.inc.power9), and another for the Courant Institute cluster (sites/make.inc.CIMS).
  • Compile: make all -j (this takes several minutes)
  • Run test codes: make check, which should complete in less than a minute without error.
  • You may then want to try individual test drivers, such as bin/cufinufft2d1_test_32 2 1e3 1e3 1e7 1e-3 which tests the single-precision 2D type 1. Most such executables document their usage when called with no arguments.

Basic usage and interface

Please see the codes in examples/ to see how to call and link to cuFINUFFT from C++/CUDA, and how to call it from Python.

The default use of the cuFINUFFT API has four stages that match those of the plan interface to FINUFFT (in turn modeled on those of, eg, FFTW or NFFT). Here they are from C++:

  1. Plan one transform, or a set of transforms sharing nonuniform points, specifying overall dimension, numbers of Fourier modes, etc:

    ier = cufinufft_makeplan(type, dim, nmodes, iflag, ntransf, tol, maxbatchsize, &plan, NULL);
  2. Set the locations of nonuniform points from the arrays x, y, and possibly z:

    ier = cufinufft_setpts(M, x, y, z, 0, NULL, NULL, NULL, plan);

    (Note that here arguments 5-8 are reserved for a future type 3 implementation, to match the FINUFFT interface.)

  3. Perform the transform(s) using these nonuniform point arrays, which reads strengths c and writes into modes fk for type 1, or vice versa for type 2:

    ier = cufinufft_execute(c, fk, plan);
  4. Destroy the plan (clean up):

    ier = cufinufft_destroy(plan);

In each case the returned integer ier is a status indicator. Here is the full C++ documentation.

It is also possible to change advanced options by replacing the last NULL argument of the cufinufft_makeplan call with a pointer to an options struct, opts. This struct should first be initialized via cufinufft_default_opts(type, dim, &opts); before the user changes any fields. For examples of this advanced usage, see test/cufinufft*.cu.

Library installation

It is up to the user to decide how exactly to link or otherwise install the libraries produced in lib. If you plan to use the Python wrapper you will minimally need to extend your LD_LIBRARY_PATH, such as with export LD_LIBRARY_PATH=${PWD}/lib:${LD_LIBRARY_PATH} or a more permanent installation path of your choosing.

If you would like to always have this installation in your library path, you can add to your shell rc with something like the following:

printf '\n# cufinufft library path\nexport LD_LIBRARY_PATH=%s/lib:${LD_LIBRARY_PATH}\n' "${PWD}" >> ~/.bashrc

Because CUDA itself has similar library path requirements, we expect users to be somewhat familiar with this setup. If not, please ask; we might be able to help.

Python wrapper

For those installing from source, this code comes with a Python wrapper module cufinufft, which depends on pycuda. Once you have successfully installed and tested the CUDA library, you may run make python to manually install the additional Python package.

Python package

General Python users, or Python software packages that would like to depend on cufinufft automatically using setuptools, may use a precompiled binary distribution. On supported systems this avoids installing from source and managing libraries altogether.

Binary distributions are specific to both hardware and software. We currently provide binary wheels targeting Linux systems covered by manylinux2010 for CUDA 10 forward with compatible GPUs. If you have such a system, you may run:

pip install cufinufft

For other cases, it should be possible to build the Python wrapper from source.

Advanced topics

Advanced Makefile Usage

If you want to test/benchmark the spreader and interpolator (the performance-critical components of the NUFFT algorithm), without building the whole library, do this with make checkspread.
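For readers unfamiliar with these two components, the following is a minimal 1D CPU sketch of what spreading and interpolation do on a periodic grid. It is not the cuFINUFFT GPU implementation. The kernel shape exp(beta*(sqrt(1-z^2)-1)) and a choice such as beta = 2.3*w are assumptions borrowed from the FINUFFT paper rather than values read from this codebase, and all names (Kernel, for_each_weight, spread, interp) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

constexpr double kTwoPi = 6.283185307179586;

// Assumed kernel: phi(z) = exp(beta * (sqrt(1 - z^2) - 1)) on |z| <= 1, else 0.
struct Kernel {
    int w;        // support width, in grid points
    double beta;  // shape parameter
    double operator()(double z) const {
        if (std::abs(z) > 1.0) return 0.0;
        return std::exp(beta * (std::sqrt(1.0 - z * z) - 1.0));
    }
};

// Calls visit(grid_index, weight) for the w grid points under the kernel
// centred at x, on a periodic grid of n points covering [0, 2*pi).
template <class Visit>
void for_each_weight(double x, int n, const Kernel& ker, Visit visit) {
    const double xs = (x - kTwoPi * std::floor(x / kTwoPi)) * n / kTwoPi;  // grid units
    const double half = 0.5 * ker.w;
    const long i0 = static_cast<long>(std::ceil(xs - half));  // leftmost index in support
    for (int d = 0; d < ker.w; ++d) {
        const long l = i0 + d;
        const double weight = ker((static_cast<double>(l) - xs) / half);
        const long wrapped = ((l % n) + n) % n;  // periodic wrap into [0, n)
        visit(static_cast<std::size_t>(wrapped), weight);
    }
}

// Spreading: each strength c[j] is smeared onto the nearby grid points.
std::vector<cplx> spread(const std::vector<double>& x, const std::vector<cplx>& c,
                         int n, const Kernel& ker) {
    assert(x.size() == c.size());
    std::vector<cplx> grid(static_cast<std::size_t>(n), cplx(0.0, 0.0));
    for (std::size_t j = 0; j < x.size(); ++j)
        for_each_weight(x[j], n, ker,
                        [&](std::size_t l, double wgt) { grid[l] += c[j] * wgt; });
    return grid;
}

// Interpolation: each output c[j] is a kernel-weighted sum of nearby grid values.
std::vector<cplx> interp(const std::vector<double>& x, const std::vector<cplx>& grid,
                         const Kernel& ker) {
    const int n = static_cast<int>(grid.size());
    std::vector<cplx> c(x.size(), cplx(0.0, 0.0));
    for (std::size_t j = 0; j < x.size(); ++j)
        for_each_weight(x[j], n, ker,
                        [&](std::size_t l, double wgt) { c[j] += grid[l] * wgt; });
    return c;
}
```

Because both routines use the same real weights, interp is the transpose of spread. In a NUFFT, spreading is the first stage of a type 1 transform and interpolation is the last stage of a type 2 transform.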

In general for make tasks, it's possible to specify the target architecture using the target variable, eg:

make target=power9 -j

By default, the makefile assumes the x86_64 architecture. We've included site-specific configurations -- such as Cori at NERSC, or Summit at OLCF -- which can be accessed using the site variable, eg:

make site=olcf_summit

The currently supported targets and sites are:

  1. Sites
    1. NERSC Cori (site=nersc_cori)
    2. NERSC Cori GPU (site=nersc_cgpu)
    3. OLCF Summit (site=olcf_summit) -- automatically sets target=power9
    4. CIMS (site=CIMS)
    5. Flatiron Institute, rusty cluster GPU node (site=FI)
  2. Targets
    1. Default (x86_64) -- do not specify target variable
    2. IBM power9 (target=power9)

A general note about expanding the platform support: targets should contain settings that are specific to a compiler/hardware architecture, whereas sites should contain settings that are specific to an HPC facility's software environment. The site-specific script is loaded before the target-specific settings, hence it is possible to specify a target in a site make.inc.* (but not the other way around).

Makefile preprocessor macros

  • TIME - timing for each stage. Enable by adding "-DTIME" to NVCCFLAGS.
  • SPREADTIME - more detailed timing of spreading and interpolation.
  • DEBUG - debug mode; outputs the results of all intermediate stages.

Other notes

  • If you are interested in optimizing for GPU Compute Capability, you may want to specify NVARCH=-arch=sm_XX in your make.inc to reduce compile times, or for other performance reasons. See Matching SM Architectures.

Tasks for developers

  • 1D version is close to finished (needs vectorized testers and Py interfaces)
  • Type 3 transforms (which are quite tricky) as in FINUFFT are in progress (at least in 3D) on a PR, thanks to Simon Frasch; please go and test!
  • We need some more tutorial examples in C++ and Python
  • Please help us to write MATLAB (gpuArray) and Julia interfaces
  • There are various TensorFlow and related interfaces in progress (please help with them or test them): https://github.com/mrphys/tensorflow-nufft https://github.com/dfm/jax-finufft
  • Please see Issues and PRs for other things you can help fix or test

References

  • cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs, Yu-hsuan Shih, Garrett Wright, Joakim Andén, Johannes Blaschke, Alex H. Barnett, PDSEC2021 conference (best paper prize). https://arxiv.org/abs/2102.08463