-
Notifications
You must be signed in to change notification settings - Fork 102
QUDA's eigensolvers
QUDA currently offers two types of eigensolver: A Thick Restarted Lanczos Method (TRLM) for hermitian operators, and an Implicitly Restarted Arnoldi (IRAM) for non-hermitian ones. We make good use of the TRLM throughout the library for solvers that require deflation. Eigen decomposition of normal operators (M^{dag}M of MM^{dag}) can be used in CG type solvers, and we take advantage of the fact that an eigenvector of M^{dag}M(MM^{dag}) is a right(left) singular vector of M when we need to deflate a GCR type solve. Computed eigenvectors and singular vectors can be saved using QIO with flags described below; more details are available on QIO and QMP.
QUDA's Lanczos implementation is located in lib/eig_trlm.cpp
and an example of how to call it from the host is given in tests/eigensolver_test.cpp
. We display here an exhaustive list of command line variables one can pass to the test executable eigensolver_test
.
[Option Group: Eigensolver]
Options controlling eigensolver
Options:
--eig-amax FLOAT:POSITIVE The maximum in the polynomial acceleration
--eig-amin FLOAT:POSITIVE The minimum in the polynomial acceleration
--eig-ARPACK-logfile TEXT The filename storing the log from arpack
--eig-arpack-check BOOL Cross check the device data against ARPACK (requires ARPACK, default false)
--eig-use-eigen-qr BOOL Use Eigen to eigensolve the upper Hessenberg in IRAM, else use QUDA's QR code. (default true)
--eig-compute-svd BOOL Solve the MdagM problem, use to compute SVD of M (default false)
--eig-max-restarts INT Perform n iterations of the restart in the eigensolver
--eig-block-size INT The block size to use in the block variant eigensolver
--eig-n-ev-deflate INT The number of converged eigenpairs that will be used in the deflation routines (default eig_n_conv)
--eig-n-conv INT The number of converged eigenvalues requested (default eig_n_ev)
--eig-n-ev INT The size of eigenvector search space in the eigensolver
--eig-n-kr INT The size of the Krylov subspace to use in the eigensolver
--eig-batched-rotate INT The maximum number of extra eigenvectors the solver may allocate to perform a Ritz rotation.
--eig-poly-deg INT TODO
--eig-require-convergence BOOL
If true, the solver will error out if convergence is not attained. If false, a warning will be given (default true)
--eig-save-vec TEXT Save eigenvectors to <file> (requires QIO)
--eig-load-vec TEXT:FILE Load eigenvectors to <file> (requires QIO)
--eig-save-partfile BOOL
If saving eigenvectors, save in partfile format instead of singlefile (default false)
--eig-save-prec ENUM:value in {double,single,half,quarter}
If saving eigenvectors, use this precision to save. No-op if eig-save-prec is greater than or equal to precision of eigensolver (default = double)
--eig-io-parity-inflate BOOL
Whether to inflate single-parity eigenvectors onto dual parity full fields for file I/O (default = false)
--eig-spectrum ENUM:value in {SR,LR,SM,LM,SI,LI}
The spectrum part to be calulated. S=smallest L=largest R=real M=modulus I=imaginary
--eig-tol FLOAT The tolerance to use in the eigensolver (default 1e-6)
--eig-qr-tol FLOAT The tolerance to use in the qr (default 1e-11)
--eig-type ENUM:value in {trlm,blktrlm,iram,blkiram}
The type of eigensolver to use (default trlm)
--eig-use-dagger BOOL Solve the Mdag problem instead of M (MMdag if eig-use-normop == true) (default false)
--eig-use-normop BOOL Solve the MdagM problem instead of M (MMdag if eig-use-dagger == true) (default false)
--eig-use-poly-acc BOOL Use Chebyshev polynomial acceleration in the eigensolver
Please be reminded that Lanczos solvers can only solve Hermitian matrices. We therefore default the solver to use the MdagM
matrix operator and the eigenvectors returned to you will be right singular vectors of M
. If you want the left singular vectors, use --eig-use-normop true
and --eig-use-dagger true
. Alternatively, use --eig-compute-svd true
to get n_ev/2 right and left singular vectors.
Without getting into the gritty details, e.g. Sorensen and Yang or Saad et al. the Chebyshev polynomial is used to suppress the unwanted components of the starting vector by projecting them to zero, and project the wanted part of the spectrum to the interval [0,1]. In an ideal sense, where the degree of the polynomial tends to infinity, the smallest eigenvalue will be mapped to just less than 1, and subsequent eigenvalues are mapped to values which become increasingly less than one. We show here, for clarity, typical output from a problem using Chebyshev acceleration. It is from a 16^3 * 64 anisotropic(2.38) wilson lattice of bare mass -0.4083:
RitzValue[0000]: (+9.9972832202911377e-01, +0.0000000000000000e+00) residual 3.1605766707798466e-05
RitzValue[0001]: (+9.9958580732345581e-01, +0.0000000000000000e+00) residual 2.0094249339308590e-04
RitzValue[0002]: (+9.9901866912841797e-01, +0.0000000000000000e+00) residual 5.6352277169935405e-05
RitzValue[0003]: (+9.9836647510528564e-01, +0.0000000000000000e+00) residual 5.1965605234727263e-04
RitzValue[0004]: (+9.9829232692718506e-01, +0.0000000000000000e+00) residual 1.5537920990027487e-04
RitzValue[0005]: (+9.9735695123672485e-01, +0.0000000000000000e+00) residual 3.2478163484483957e-04
RitzValue[0006]: (+9.9749773740768433e-01, +0.0000000000000000e+00) residual 3.2569846371188760e-04
RitzValue[0007]: (+9.9716216325759888e-01, +0.0000000000000000e+00) residual 2.5043883942998946e-04
RitzValue[0008]: (+9.9619883298873901e-01, +0.0000000000000000e+00) residual 8.3959067706018686e-04
RitzValue[0009]: (+9.9668407440185547e-01, +0.0000000000000000e+00) residual 2.1174944413360208e-04
RitzValue[0010]: (+9.9492019414901733e-01, +0.0000000000000000e+00) residual 1.1761279165511951e-04
RitzValue[0011]: (+9.9444347620010376e-01, +0.0000000000000000e+00) residual 2.4872852372936904e-04
RitzValue[0012]: (+9.9441063404083252e-01, +0.0000000000000000e+00) residual 1.1501581320771948e-04
RitzValue[0013]: (+9.9277526140213013e-01, +0.0000000000000000e+00) residual 1.7359634512104094e-04
RitzValue[0014]: (+9.9197441339492798e-01, +0.0000000000000000e+00) residual 5.5310265452135354e-05
RitzValue[0015]: (+9.9104154109954834e-01, +0.0000000000000000e+00) residual 4.6948611270636320e-05
RitzValue[0016]: (+9.9117732048034668e-01, +0.0000000000000000e+00) residual 2.2287280444288626e-05
RitzValue[0017]: (+9.8939478397369385e-01, +0.0000000000000000e+00) residual 1.4243196346797049e-04
RitzValue[0018]: (+9.8869597911834717e-01, +0.0000000000000000e+00) residual 5.7452856708550826e-05
RitzValue[0019]: (+9.8847001791000366e-01, +0.0000000000000000e+00) residual 3.8555528590222821e-05
RitzValue[0020]: (+9.8708879947662354e-01, +0.0000000000000000e+00) residual 5.3698193369200453e-05
RitzValue[0021]: (+9.8707050085067749e-01, +0.0000000000000000e+00) residual 1.8155218276660889e-04
RitzValue[0022]: (+9.8647540807723999e-01, +0.0000000000000000e+00) residual 6.3825000324868597e-06
RitzValue[0023]: (+9.8499268293380737e-01, +0.0000000000000000e+00) residual 2.8260240014787996e-06
EigValue[0000]: (+5.8665948191173413e-05, -9.7787320798929240e-14) residual 2.6585898922348861e-06
EigValue[0001]: (+5.3048188648382042e-05, -3.7522475982113617e-13) residual 1.0272436838931753e-06
EigValue[0002]: (+5.0591878513172599e-05, -3.8294328262414887e-14) residual 9.9061492164764786e-07
EigValue[0003]: (+5.0165841520283739e-05, -5.4687054368902595e-15) residual 1.0198635891356389e-06
EigValue[0004]: (+4.5083095304827671e-05, -1.1975237326611779e-13) residual 9.4613312739966204e-07
EigValue[0005]: (+4.4194353478463463e-05, +5.1124821610349177e-14) residual 9.5671668987051817e-07
EigValue[0006]: (+4.1298225830736364e-05, -1.4549534396403399e-13) residual 8.8610005377631751e-07
EigValue[0007]: (+3.5031547172718082e-05, +5.1144499575396692e-14) residual 9.7692600320442580e-07
EigValue[0008]: (+3.4374891371186802e-05, -2.3588335690513686e-13) residual 8.7738942511350615e-07
EigValue[0009]: (+3.1444132453230166e-05, +1.6659880566817665e-13) residual 9.6951055184035795e-07
EigValue[0010]: (+2.8076446118621838e-05, +3.0161444946074933e-13) residual 9.1210210939607350e-07
EigValue[0011]: (+2.2725920794555377e-05, -7.3138044052788593e-14) residual 9.4517548632211401e-07
EigValue[0012]: (+2.0862047144768000e-05, -2.8598314005820170e-13) residual 8.8274771314900136e-07
EigValue[0013]: (+1.9806635854017933e-05, -8.0396183208300365e-14) residual 9.5918073839129647e-07
EigValue[0014]: (+1.7295770148667583e-05, -2.9442324505256452e-14) residual 9.9019746357953409e-07
EigValue[0015]: (+1.2190263018742501e-05, +3.1244004825356804e-14) residual 9.7811835075845011e-07
EigValue[0016]: (+1.0912924172349022e-05, +2.0094914684970965e-13) residual 9.4419738161377609e-07
EigValue[0017]: (+1.0137079996516918e-05, +5.1747163338553289e-14) residual 8.8925912677950691e-07
EigValue[0018]: (+8.6395046842409958e-06, -1.3791511884154658e-13) residual 9.7570136858848855e-07
EigValue[0019]: (+8.1884400907224808e-06, -5.5533948323545103e-14) residual 9.3171115622681100e-07
EigValue[0020]: (+4.5215041065618726e-06, +3.8041330642200254e-14) residual 9.5015946044441080e-07
EigValue[0021]: (+4.0359849110594490e-06, +1.6142714165863333e-14) residual 9.9929286534461426e-07
EigValue[0022]: (+1.4269838528372897e-06, -7.2583627441699820e-14) residual 1.0032989621322486e-06
EigValue[0023]: (+1.0424518999431268e-06, +1.5489829374121987e-14) residual 1.0020094123319723e-06
Notice that The Ritz values are all ~1.0 and the largest Ritz value was found first. This is typical of an iterative solver such as Lanczos where the larger extreme of the spectrum appears soonest in the Krylov search space. The Chebyshev polynomial filter projected the smallest eigenvalue (1.042e-6) to the largest Ritz value (~0.997) and was therefore found first, and MUCH more quickly than if no filtering were applied.
The values one should use for Chebyshev polynomial acceleration are dependent on the problem. You will have to perform some degree of tuning on a typical problem, such as a thermalized configuration in your ensemble. The maximum eigenvalue will correspond to --eig-amax
. If you know this value, set --eig-amax
to ~110% of its value. I.e., if your maximum eigenvalue is 4.0, set --eig-amax 4.4
just to be safe. It is possible your ensemble could have larger eigenvalues than 4.0, and you must ensure --eig-amax
can cover it. If you do not know this value, pass --eig-amax 0.0
or do not pass the flag at all, and QUDA will estimate this value using power iterations.
The value of --eig-amin
is slightly trickier. It needs to be small enough to suppress the part of the eigenspecturm you don't want (which is most of it) but not so small that it suppresses any part you wish to find. The author has seen from experience that it pays to be very loose with this value and not attempt to make it too small too soon. You should set --eig-amin
to be around 1e0, perform a compuation, and lower --eig-amin
as you see fit. Keep lowering --eig-amin
and you will find that it converges faster and faster. Eventually, you will see that the eigensolver fails to converge at all, which means you have lowered --eig-amin
too far. Again, from experience, the author has seen that when the largest eigenvalue in the Krylov space is say ~1e-5, a value of --eig-amin 1e-4
(one order of magnitude larger) typically solves the problem the fastest.
We note that the --eig-spectrum
flag should specify which portion of the spectrum of the original operator should be computed. In other news, there's no need to "overthink" setting that flag when polynomial acceleration is enabled or disabled. As an example: when you request the smallest part of the portion (--eig-spectrum SR
) but also enable polynomial acceleration, "under the hood" it'll use LR for the polynomial operator.
We give here a procedure with a worked example to illustrate the above in a more algorithmic fashion.
- Without using poly acc, compute the largest eigenvalue of a typical problem matrix. Increase this value by ~5-10%. It will serve as the
--eig-amax
value. Use the following command:
./eigensolve_test --sdim 16 --tdim 16 --eig-n-conv 1 --eig-n-ev 24 --eig-n-kr 48 --eig-tol 1e-6 --mass 0.01 --eig-spectrum LR --solve-type direct-pc --eig-use-poly-acc false --eig-max-restarts 100
You should get the following result (to machine precision) for the largest eigenvalue:
EigValue[0000]: (+3.4324538833125655e+00, -1.0615358655876489e-10) residual 9.7341637113558738e-07
- Use a conservatively large overestimate for
--eig-amin
and compute the low spectrum. Decrease--eig-amin
making note of the wall clock time to convergence. If your guess of--eig-amin
is too small, the algorithm will fail to converge and no new eigenvectors will be locked. Use the following command:
./eigensolve_test --sdim 16 --tdim 16 --eig-n-conv 24 --eig-n-ev 24 --eig-n-kr 48 --eig-amin 1.0 --eig-amax 3.8 --eig-poly-deg 50 --eig-tol 1e-6 --mass 0.01 --eig-spectrum SR --solve-type direct-pc --eig-use-poly-acc true --eig-max-restarts 100
You should see that the solver converged with the following output:
TRLM computed the requested 24 vectors in 10 restart steps and 252 OP*x operations.
- Once you have discerned working values for
--eig-amin
and--eig-amax
, you should experiment with the degree of the Chebyshev polynomial, and the value ofamin
. A larger degree polynomial will perform more mat-vec operations in the solver, which can lead to fewer restarts to solution, but will cost more compute time. One must find the optimum balance between number of mat-vec operations, and the restart count to convergence. Try the following three commands. See which one is fastest.
command 1: Increase the polynomial size to 200, amin is unchanged.
./eigensolve_test --sdim 16 --tdim 16 --eig-n-conv 24 --eig-n-ev 24 --eig-n-kr 48 --eig-amin 1.0 --eig-amax 3.8 --eig-poly-deg 200 --eig-tol 1e-6 --mass 0.01 --eig-spectrum SR --solve-type direct-pc --eig-use-poly-acc true --eig-max-restarts 100
command 2: Polynomial size is unchanged, amin is reduced to 0.1
./eigensolve_test --sdim 16 --tdim 16 --eig-n-conv 24 --eig-n-ev 24 --eig-n-kr 48 --eig-amin 0.1 --eig-amax 3.8 --eig-poly-deg 50 --eig-tol 1e-6 --mass 0.01 --eig-spectrum SR --solve-type direct-pc --eig-use-poly-acc true --eig-max-restarts 100
command 3: Polynomial size 200, amin is reduced to 0.1
./eigensolve_test --sdim 16 --tdim 16 --eig-n-conv 24 --eig-n-ev 24 --eig-n-kr 48 --eig-amin 0.1 --eig-amax 3.8 --eig-poly-deg 200 --eig-tol 1e-6 --mass 0.01 --eig-spectrum SR --solve-type direct-pc --eig-use-poly-acc true --eig-max-restarts 100
You will see that command 2 had the fastest wall clock, but command 3 had the fewest iterations. This indicates that amin
is a critical parameter in eigen-decompositions, which you should take care to tune. A good rule of thumb is to find the best value of amin first with a fairly large polynomial degree, then reduce the degree of the polynomial until it is as small as it can be while still converging quickly.
As a final caveat, please be aware that the random lattices used in these tests are near degenerate and do not represent physical problems one will encounter. To be sure of your eigensolver tuning, please use a real example.
One can see the solver in action when it is used in the multigrid+lanczos solver
There are a few subtle differences involved in using the staggered eigensolver, accessible for reference via the test executable staggered_eigensolve_test
. The following discussion assumes QUDA has been compiled with support for staggered fermions and HISQ link construction enabled, as well as with bindings to QMP and QIO.
A representative command for calculating the low eigenvalues of the Schur-preconditioned HISQ operator (here, 4m^2 – D_eo D_oe
) from a provided gauge field, taking advantage of polynomial acceleration, is:
./staggered_eigensolve_test --dim [dimensions] --gridsize [node decomposition] --load-gauge [field] \
--dslash-type asqtad --mass [mass] --compute-fat-long true --tadpole-coeff [tadpole] \
--prec double --recon 18 \
--eig-n-conv 16 --eig-n-ev 18 --eig-n-kr 32 --eig-tol 1e-4 --eig-spectrum SR \
--eig-use-pc true --eig-use-normop false --eig-use-poly-acc true --eig-poly-deg 100 --eig-amin 1e-2
Breaking down the command further:
First line:
-
--dim
and--gridsize
correspond to the local volume and grid topology, as is standard for the QUDA test routines. -
--load-gauge [field]
:field
corresponds to the absolute or relative path to a gauge field you want to load.
Second line:
-
--dslash-type asqtad
: this labeling is a bit legacy, in nearly all cases now this corresponds to HISQ links. -
--mass
: mostly self-explanatory, but of note QUDA has the convention that the mass in the staggered operator is 2x the input mass (thus the 4 m^2 in the Schur operator noted above). -
--compute-fat-long true
: specifies that the HISQ fat and long links should be constructed. -
--tadpole-coeff [tadpole]
: self-explanatory - Deflation and non-relativistic quarks generally don't intersect, but for completeness one can specify the
epsilon
correction to the Naik term via--epsilon-naik [value]
.
Third line:
-
--prec double
: double precision operator -
--recon 18
: no reconstruction
Fourth line (the meat and potatoes of deflation, more details above):
-
--eig-n-conv 16
: the desired number of eigenvalues, 16 being representative. -
--eig-n-ev 18
: details of the Lanczos process, this is the number of candidate vectors preserved per restart. Empirically we’ve found making this slightly bigger than--eig-n-conv
is sufficient. -
--eig-n-kr 32
: details of the Lanczos process, this is the total size of the Krylov space generated before a restart. Empirically we’ve found making this 1.5x to 2x the desired number of eigenvalues works well. -
--eig-tol 1e-4
: The desired tolerance for the final eigenvectors -
--eig-use-normop false
: This one’s important for deflating the staggered Schur operator in particular: the default is true, which makes sense for any operator besides the staggered Schur preconditioned operator, but here we want to set it to false. -
--eig-use-pc true
: This one is used to specify using the even-even Schur op--eig-use-poly-acc true
: Take advantage of polynomial acceleration--eig-poly-deg 100
: Degree of polynomial to use, higher order = fewer Lanczos restarts but more operator applications, so there’s a time to solution trade-off. That said, regardless this needs to be high enough to “pull” the low eigenvalues “far enough apart”, otherwise some intermediate eigenvalues might be missed by the Lanczos process. A baseline diagnostic of this is to monitor the number of converged eigenvalues per restart, which can be seen by setting the verbosity to--verbosity verbose
: if the number of converged eigenvalues doesn’t monotonically increase, you need a higher order polynomial. This scales with volume more than condition number, which follows from a density of states-style argument. All that said, empirically we’ve found 100 is more than sufficient for ~32^4, and 300 is well more than sufficient for ~48^. These haven't been tuned aggressively. -
--eig-amin 1e-2
: the lower end of the suppression window for polynomial acceleration: any eigenvalue smaller than this will be enhanced, as is desired. For any problem worth deflating found 1e-2 is “good enough”---in my experiments for HISQ MG, this is true for volumes ranging from 48^3x96 to 144^3x288, to give you an idea of the robustness of this empirical observation. You want there to be at least--eig-n-conv
eigenvalues below--eig-amin
. If you’re just playing around with the eigensolver on small volumes, you may need to increase this to 1e-1 just to capture enough eigenvalues. -
--eig-amax 24
: the upper end of the suppression window. Unlike with-eig-amin
, you want this to be greater than the largest eigenvalue of the Schur operator, otherwise the Lanczos process will stall. You can make some free field spectrum arguments that 24 is an overly conservative value, but may as well err somewhat on the side of caution. You can remove this flag to exactly compute the largest eigenvalue as a first step.
We have implemented a block variant of the TRLM, as first described in this paper this paper. Some important modifications to the algorithm described in the paper have been made:
-
This implementation infers the residuals of the eigenvectors, rather than explicitly measuring the change in the eigenvalues during convergence. It uses the vector norms of the last block of vectors, in analogy to the way the non-blocked TRLM uses the final beta value.
-
Rather than keeping the arrow position fixed (as in the paper) this implementation has a variable arrow position to save on computation. Blocks of vectors, not individual ones, are deemed to be kept, locked, and converged.
-
In order to maintain feature 2., this implementation restricts the user to request values of n_ev and n_kr that are multiples of the block size.
Using the block TRLM is done in almost exactly the same fashion as the TRLM, with an addition of two parameter changes: the eigensolver type, and the block size:
--eig-type blktrlm
--eig-block-size <N>
The block TRLM in its present state is not as performant as TRLM. This is because block solvers in general must take advantage of block type matrix vector operations, which are not yet implemented in QUDA. When it is implemented, the following will apply.
The TRLM solver applies a Matrix * Vector type operation for every Lanczos vector. This is inefficient because the matrix data must be sent to the registers for every Lanczos vector. In contrast, when using a block solver, one needs only send N vectors and 1 matrix worth of information to the registers. I.e., one reduces the bandwidth burden on the matrix data transfer from N to 1, which can be hugely significant.
What is expected to be observed when using block solvers is that the algorithm will slow down
and more iterations of the algorithm are needed to reach convergence. This is because the N vectors in the block do not maintain orthonormality throughout the solve, and so individual vectors in the block can overlap in the search space, thus wasting algorithmic time. Conversely, when the hardware and software is capable of reducing the bandwidth of the matrix data transfer, the wall clock of the algorithm decreases significantly because the time it takes to transfer data and perform the computations is reduced.
The trick to getting the most out of a block solver is finding balance (as dictated by your problem and your architecture) between wall clock speed-up, and algorithmic slow-down. We give here a table of data displaying this balance:
Block Size | Blocked Lanczos Steps | Actual Lanczos Steps | Wall Clock |
---|---|---|---|
1 | 192 | 192 | 19.691888 |
2 | 126 | 252 | 25.566567 |
3 | 91 | 273 | 27.402543 |
4 | 72 | 288 | 28.373102 |
6 | 48 | 288 | 28.781603 |
8 | 36 | 288 | 28.105595 |
12 | 32 | 384 | 38.734093 |
16 | 28 | 448 | 44.646723 |
24 | 23 | 552 | 57.735302 |
By this data, we would need to see a 3x speed up for 24 multi RHS to break even.
The command used to compute this data is given, adjust as required:
#!/bin/bash
PREC=double
PRECON=single
SLOPPY=half
export QUDA_RESOURCE_PATH=.
DEG=100
# single solve
COMMAND="mpirun -np 2 ./eigensolve_test --sdim 16 --tdim 32 --gridsize 1 1 1 2 --load-gauge /scratch/lattices/wl_16_64_5p5_x2p38_um0p4086_cfg_1000.lime --mass -0.42 \
--eig-tol 1e-10 --anisotropy 2.38 --solve-type direct-pc --eig-poly-deg ${DEG} --eig-amin 7e-4 --eig-use-poly-acc true --eig-spectrum SR --verbosity verbose \
--eig-use-normop true --prec-precondition ${PRECON} --prec ${PREC} --prec-sloppy ${SLOPPY} --eig-n-ev 48 --eig-n-kr 192 --eig-batched-rotate 5 \
--eig-type trlm --eig-block-size 10 --eig-max-restarts 100 --eig-require-convergence false "
echo $COMMAND
$COMMAND
# block solve
for BLOCK in 1 2 3 4 6 8 12 16 24 ; do
COMMAND="mpirun -np 2 ./eigensolve_test --sdim 16 --tdim 32 --gridsize 1 1 1 2 --load-gauge /scratch/lattices/wl_16_64_5p5_x2p38_um0p4086_cfg_1000.lime --mass -0.42 \
--eig-tol 1e-10 --anisotropy 2.38 --solve-type direct-pc --eig-poly-deg ${DEG} --eig-amin 7e-4 --eig-use-poly-acc true --eig-spectrum SR --verbosity verbose \
--eig-use-normop true --prec-precondition ${PRECON} --prec ${PREC} --prec-sloppy ${SLOPPY} --eig-n-ev 48 --eig-n-kr 192 --eig-batched-rotate 5 --eig-type blktrlm \
--eig-block-size ${BLOCK} --eig-max-restarts 100 --eig-require-convergence false "
echo $COMMAND
$COMMAND
wait
done
We have a partial implementation of ARPACK's IRAM. It will compute the (Largest/Smallest - Real/Imaginary/Modulus) spectrum of any operator, but at the moment it can not accommodate the shift and invert, nor can it handle generalised eigenvalue problems. We do not allow the use of Chebyshev acceleration with IRAM (even if the operator is hermitian) nor do we allow for the computation of imaginary spectra for hermitian operators as they are numerically equivalent to zero. Aside from these restrictions, the IRAM works in very much the same way as the TRLM in that all the other options one would pass to TRLM will work for IRAM. There are some notable additional options in the IRAM solver:
--eig-use-eigen-qr <bool>
--eig-qr-tol <double>
for command line, or
eig_param.use_eigen_qr = <bool>
eig_param.qr_tol = <double>
in the QudaEigParam
structure. These control how IRAM will solve the upper Hessenberg matrix between restarts. When using use_eigen_qr = true
QUDA will use Eigen routines. These can be optionally accelerated by linking QUDA against OpenBLAS's LAPACKE library at compile time and setting the OMP_NUM_THREADS
environment variable to a sensible number for your system. Instructions for linking OpenBLAS are here. When use_eigen_qr = false
it will use some hand written eigensolver routines, which can also be accelerated by setting QUDA_OPENMP
to ON
at compile time. When using the hand written routines, it is advisable to use a qr_tol
value that it at least one order of magnitude smaller than the eig_tol
value.
We find that by using our hand written QR with OpenMP we see around a factor of 2x-3x
speed up over Eigen. This is due to our ability to control the QR tolerance, a feature that is not available in Eigen's QR (from upper Hessenberg). Discussion on the matter can be found here
One can use the eigensolver to deflate the initial guess when using a CG
, PCG
, or GCR
type solve. The eigensolver's precision is set with the flag --prec-precondition
. The eigensolver will compute the desired dumber of eigenvectors of the MdagM
operator (or staggered Schur preconditioned operator SSPO) and store them in memory. For CG and types solves or the SSPO, the deflation is done using the eigenvalues and eigenvectors. If a GCR
type solver is called, the SVD of M
is computed and stored, and an SVD type deflation is performed. One can also control the number of eigenvectors used in the deflation routines via the parameter --eig-n-ev-deflate/--mg-eig-n-ev-deflate/eig_param.n_ev_deflate
. This is useful when performing studies to decide how many eigenvectors are optimal for your workflow. One would compute a large number of eigenpairs, save them to disk, then use a variety of values of n_ev_deflate
in subsequent tests.
An example command is given:
./invert_test --sdim 8 --tdim 8 --prec double --prec-precondition single --prec-sloppy half --niter 10000 --eig-amax 50 --eig-amin 2.2 --eig-poly-deg 20 --eig-n-ev 32 --eig-n-conv 28 --eig-n-kr 96 --eig-n-ev-deflate 24 --eig-tol 1e-7 --tol 1e-6 --inv-deflate true --verbosity verbose
Notice that we instruct the eigensolver to compute the deflation space in single precision. The flag --inv-deflate
instructs QUDA to perform the deflation.
We have implemented efficient routines to deflate the coarse grid operator in Staggered and Wilson types solves using TRLM. However, it is highly beneficial towards algorithmic development to observe the complex spectrum of the coarse grid. Consider the MG+Lanczos example given here. We can adjust this command to give us the complex spectrum of the coarse operator with the following adjustments:
ARGSTR=$ARGSTR"--mg-eig 2 true --mg-eig-type 2 trlm --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 true "
becomes
ARGSTR=$ARGSTR"--mg-eig 2 true --mg-eig-type 2 iram --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 false "
i.e., simply changing the solver type to iram
and specifying an M type (not a normal MdagM type) operator. NB This will NOT deflate the coarse operator, merely dump the complex spectrum to stdout. Also, you must ensure that the coarse level solver is (ca-)gcr so that an M type operator can be used.
The Arnoldi can be used to compute the interior spectrum of many of the operators in QUDA which support gamma_5 Hermiticity. This can be accomplished via the flag --eig-compute-gamma5 true
. While the target application of this is computing the interior gamma_5 spectrum of the Wilson operator for DWF studies, it is broadly applicable for general investigations.
The current implementation supports the Wilson, clover, twisted mass, and twisted clover operators, as well as their asymmetric preconditioned variety. We note that, while the twisted operators do not have a Hermitian-indefinite gamma5 spectrum, they do have a [Hermitian indefinite] + i mu
spectrum (up to normalization), which can still be computed with the Arnoldi.
We also include support for the "gamma5"-Hermitian version of the staggered and HISQ operators, where "gamma5" is epsilon[\vec{x}] = (-1)^{x+y+z+t}
.
Representative examples:
Wilson:
> ./eigensolve_test --dim 16 16 16 16 --eig-use-poly-acc false --eig-type iram --eig-use-normop false --verbosity verbose --eig-spectrum SM --eig-n-conv 10 --eig-n-kr 96 --eig-compute-gamma5 true --dslash-type wilson
[...]
Eval[0000] = (-2.2776337787369458e-01,-3.2263610973431538e-11) residual = +7.7167819972379089e-07
Eval[0001] = (+2.2779790087688284e-01,+3.2532345366317151e-13) residual = +6.9921385394785853e-07
Eval[0002] = (+2.2798689420140045e-01,-3.5381381230707588e-11) residual = +6.9829633453627656e-07
Asymmetric even preconditioned clover (--compute-clover true
required)
> ./eigensolve_test --dim 16 16 16 16 --eig-use-poly-acc false --eig-type iram --eig-use-normop false --verbosity verbose --eig-spectrum SM --eig-n-conv 10 --eig-n-kr 96 --eig-compute-gamma5 true --dslash-type clover --compute-clover true --load-gauge l16t16b7p0 --solution-type mat-pc --solve-type direct-pc --matpc even-even-asym
[...]
Eval[0000] = (-2.2705569436510367e-01,-8.9887692022208311e-11) residual = +4.5530078377857505e-07
Eval[0001] = (+2.2736315626509213e-01,+1.5220950148903908e-10) residual = +4.5366328734401143e-07
Eval[0002] = (+2.2900355216901019e-01,-1.3146028458323850e-10) residual = +4.5202748439085751e-07
Asymmetric odd preconditioned twisted clover (--compute-clover true
required, note constant imaginary shift)
> ./eigensolve_test --dim 16 16 16 16 --eig-use-poly-acc false --eig-type iram --eig-use-normop false --verbosity verbose --eig-spectrum SM --eig-n-conv 10 --eig-n-kr 96 --eig-compute-gamma5 true --dslash-type twisted-clover --compute-clover true
[...]
Eval[0000] = (-2.1060260859486429e-01,+2.4390240320399430e-02) residual = +6.7901923507127621e-07
Eval[0001] = (+2.1077583222908181e-01,+2.4390241308534611e-02) residual = +6.6459152887689322e-07
Eval[0002] = (-2.1102031348749012e-01,+2.4390242371951198e-02) residual = +6.5539653817631313e-07
Eval[0003] = (+2.1111254105264204e-01,+2.4390243893740154e-02) residual = +6.3994287029589538e-07
HISQ operator (--compute-fat-long true
required, using a physical gauge field generally required for convergence)
> ./staggered_eigensolve_test --dim 16 16 16 16 --eig-use-poly-acc false --eig-type iram --eig-use-normop false --verbosity verbose --eig-spectrum SM --eig-n-conv 10 --eig-n-kr 96 --eig-compute-gamma5 true --dslash-type asqtad --compute-fat-long true --load-gauge l16t16b7p0
[...]
Eval[0000] = (-2.4226606040231682e-01,+8.4508341326245649e-11) residual = +1.4307993349381401e-06
Eval[0001] = (+2.4226605646187499e-01,-1.9121371911302636e-10) residual = +1.5292143171336735e-06
Eval[0002] = (+2.4241999460522604e-01,-4.9661614037817044e-11) residual = +1.4682627090793682e-06
Eval[0003] = (-2.4241997529230855e-01,+9.2030564554501668e-11) residual = +1.6315856020426295e-06
The eigensolver will allocate extra device memory during the Krylov compression step. This can become a severe burden when dealing with large lattice sizes, forcing HPC users to employ much larger partitions, thus suffering performance loss due to strong scaling violations. To combat this, we have implemented a rotation routine that will allocate as few as one extra vector throughout the entire solve. To use this feature, set the following to maximum desired number of extra vectors
--eig-batched-rotate <int>
for command line, or
eig_param.batched_rotate = <int>
in the QudaEigParam
structure. In preliminary testing, very little performance loss is seen when using 20 or so extra vectors. Tuning runs will take up more time for larger values of the batch size because there is usually a remainder batch to perform which can take on arbitrary values from 1 to the batch size. Each new instance will trigger a tune.
As a final caveat, the rotation routines will default to the regular rotation scheme during some particular restart if the batch size is greater than the extra space needed for that restart.
In order to use the ARPACK library to solve eigen problems in QUDA, follow the instructions given here. The ARPACK routines work in exactly the same way as the QUDA eigensolver routines, and as such accepts the same command line flags and QUDA parameters in the QudaEigParam
structure. The only extra parameter of note is:
--eig-arpack-check BOOL Cross check the device data against ARPACK (requires ARPACK, default false)
If you want to see how QUDA's eigensolver compares to ARPACK, set --eig-arpack-check true
. However, be warned... ARPACK will take some time to converge...