Skip to content

Latest commit

 

History

History
1065 lines (1032 loc) · 104 KB

附录H数学方法.md

File metadata and controls

1065 lines (1032 loc) · 104 KB

附录H 数学方法

参考手册列出了设备代码中支持的 C/C++ 标准库数学函数的所有函数及其描述,以及所有内部函数(仅在设备代码中支持)。

本附录在适用时提供了其中一些功能的准确性信息。它使用 ULP 进行量化。有关最后位置单元 (ULP: Unit in the Last Place, 上面是直译的,这里可以理解为最小精度单元) 定义的更多信息,请参阅 Jean-Michel Muller's paper On the definition of ulp(x), RR-5504, LIP RR-2005-09, INRIA, LIP. 2005, pp.16 at https://hal.inria.fr/inria-00070503/document

设备代码中支持的数学函数不设置全局 errno 变量,也不报告任何浮点异常来指示错误;因此,如果需要错误诊断机制,用户应该对函数的输入和输出实施额外的筛选。用户负责指针参数的有效性。用户不得将未初始化的参数传递给数学函数,因为这可能导致未定义的行为:函数在用户程序中内联,因此受到编译器优化的影响。

H.1. Standard Functions

本节中的函数可用于主机和设备代码。

本节指定每个函数在设备上执行时的错误范围,以及在主机不提供函数的情况下在主机上执行时的错误范围。

错误界限是从广泛但并非详尽的测试中生成的,因此它们不是保证界限。

Single-Precision Floating-Point Functions

加法和乘法符合 IEEE 标准,因此最大误差为 0.5 ulp。

将单精度浮点操作数舍入为整数的推荐方法是 rintf(),而不是 roundf()。 原因是 roundf() 映射到设备上的 4 条指令序列,而 rintf() 映射到单个指令。 truncf()ceilf()floorf() 也都映射到一条指令。

Table 7. Single-Precision Mathematical Standard Library Functions with Maximum ULP Error. The maximum error is stated as the absolute value of the difference in ulps between a correctly rounded single-precision result and the result returned by the CUDA library function.
Function Maximum ulp error
x+y

0 (IEEE-754 round-to-nearest-even)

x*y

0 (IEEE-754 round-to-nearest-even)

x/y

0 for compute capability ≥ 2 when compiled with -prec-div=true

2 (full range), otherwise

1/x

0 for compute capability ≥ 2 when compiled with -prec-div=true

1 (full range), otherwise

rsqrtf(x)

1/sqrtf(x)

2 (full range)

Applies to 1/sqrtf(x) only when it is converted to rsqrtf(x) by the compiler.

sqrtf(x)

0 when compiled with -prec-sqrt=true

Otherwise 1 for compute capability ≥ 5.2

and 3 for older architectures

cbrtf(x) 1 (full range)
rcbrtf(x) 1 (full range)
hypotf(x,y) 3 (full range)
rhypotf(x,y) 2 (full range)
norm3df(x,y,z) 3 (full range)
rnorm3df(x,y,z) 2 (full range)
norm4df(x,y,z,t) 3 (full range)
rnorm4df(x,y,z,t) 2 (full range)
normf(dim,arr) An error bound can't be provided because a fast algorithm is used with accuracy loss due to round-off
rnormf(dim,arr) An error bound can't be provided because a fast algorithm is used with accuracy loss due to round-off
expf(x) 2 (full range)
exp2f(x) 2 (full range)
exp10f(x) 2 (full range)
expm1f(x) 1 (full range)
logf(x) 1 (full range)
log2f(x) 1 (full range)
log10f(x) 2 (full range)
log1pf(x) 1 (full range)
sinf(x) 2 (full range)
cosf(x) 2 (full range)
tanf(x) 4 (full range)
sincosf(x,sptr,cptr) 2 (full range)
sinpif(x) 2 (full range)
cospif(x) 2 (full range)
sincospif(x,sptr,cptr) 2 (full range)
asinf(x) 4 (full range)
acosf(x) 3 (full range)
atanf(x) 2 (full range)
atan2f(y,x) 3 (full range)
sinhf(x) 3 (full range)
coshf(x) 2 (full range)
tanhf(x) 2 (full range)
asinhf(x) 3 (full range)
acoshf(x) 4 (full range)
atanhf(x) 3 (full range)
powf(x,y) 9 (full range)
erff(x) 2 (full range)
erfcf(x) 4 (full range)
erfinvf(x) 2 (full range)
erfcinvf(x) 4 (full range)
erfcxf(x) 4 (full range)
normcdff(x) 5 (full range)
normcdfinvf(x) 5 (full range)
lgammaf(x) 6 (outside interval -10.001 ... -2.264; larger inside)
tgammaf(x) 11 (full range)
fmaf(x,y,z) 0 (full range)
frexpf(x,exp) 0 (full range)
ldexpf(x,exp) 0 (full range)
scalbnf(x,n) 0 (full range)
scalblnf(x,l) 0 (full range)
logbf(x) 0 (full range)
ilogbf(x) 0 (full range)
j0f(x)

9 for |x| < 8

otherwise, the maximum absolute error is 2.2 x 10-6

j1f(x)

9 for |x| < 8

otherwise, the maximum absolute error is 2.2 x 10-6

jnf(n,x) For n = 128, the maximum absolute error is 2.2 x 10-6
y0f(x)

9 for |x| < 8

otherwise, the maximum absolute error is 2.2 x 10-6

y1f(x)

9 for |x| < 8

otherwise, the maximum absolute error is 2.2 x 10-6

ynf(n,x)

ceil(2 + 2.5n) for |x| < n

otherwise, the maximum absolute error is 2.2 x 10-6

cyl_bessel_i0f(x) 6 (full range)
cyl_bessel_i1f(x) 6 (full range)
fmodf(x,y) 0 (full range)
remainderf(x,y) 0 (full range)
remquof(x,y,iptr) 0 (full range)
modff(x,iptr) 0 (full range)
fdimf(x,y) 0 (full range)
truncf(x) 0 (full range)
roundf(x) 0 (full range)
rintf(x) 0 (full range)
nearbyintf(x) 0 (full range)
ceilf(x) 0 (full range)
floorf(x) 0 (full range)
lrintf(x) 0 (full range)
lroundf(x) 0 (full range)
llrintf(x) 0 (full range)
llroundf(x) 0 (full range)

Double-Precision Floating-Point Functions

将双精度浮点操作数舍入为整数的推荐方法是 rint(),而不是 round()。 原因是 round() 映射到设备上的 5 条指令序列,而 rint() 映射到单个指令。 trunc()、ceil() 和 floor() 也都映射到一条指令。

.
Function Maximum ulp error
x+y

0 (IEEE-754 round-to-nearest-even)

x*y

0 (IEEE-754 round-to-nearest-even)

x/y

0 (IEEE-754 round-to-nearest-even)

1/x

0 (IEEE-754 round-to-nearest-even)

sqrt(x) 0 (IEEE-754 round-to-nearest-even)
rsqrt(x)

1 (full range)

cbrt(x) 1 (full range)
rcbrt(x) 1 (full range)
hypot(x,y) 2 (full range)
rhypot(x,y) 1 (full range)
norm3d(x,y,z) 2 (full range)
rnorm3d(x,y,z) 1 (full range)
norm4d(x,y,z,t) 2 (full range)
rnorm4d(x,y,z,t) 1 (full range)
norm(dim,arr) An error bound can't be provided because a fast algorithm is used with accuracy loss due to round-off
rnorm(dim,arr) An error bound can't be provided because a fast algorithm is used with accuracy loss due to round-off
exp(x) 1 (full range)
exp2(x) 1 (full range)
exp10(x) 1 (full range)
expm1(x) 1 (full range)
log(x) 1 (full range)
log2(x) 1 (full range)
log10(x) 1 (full range)
log1p(x) 1 (full range)
sin(x) 2 (full range)
cos(x) 2 (full range)
tan(x) 2 (full range)
sincos(x,sptr,cptr) 2 (full range)
sinpi(x) 2 (full range)
cospi(x) 2 (full range)
sincospi(x,sptr,cptr) 2 (full range)
asin(x) 2 (full range)
acos(x) 2 (full range)
atan(x) 2 (full range)
atan2(y,x) 2 (full range)
sinh(x) 2 (full range)
cosh(x) 1 (full range)
tanh(x) 1 (full range)
asinh(x) 2 (full range)
acosh(x) 2 (full range)
atanh(x) 2 (full range)
pow(x,y) 2 (full range)
erf(x) 2 (full range)
erfc(x) 5 (full range)
erfinv(x) 5 (full range)
erfcinv(x) 6 (full range)
erfcx(x) 4 (full range)
normcdf(x) 5 (full range)
normcdfinv(x) 8 (full range)
lgamma(x) 4 (outside interval -11.0001 ... -2.2637; larger inside)
tgamma(x) 8 (full range)
fma(x,y,z) 0 (IEEE-754 round-to-nearest-even)
frexp(x,exp) 0 (full range)
ldexp(x,exp) 0 (full range)
scalbn(x,n) 0 (full range)
scalbln(x,l) 0 (full range)
logb(x) 0 (full range)
ilogb(x) 0 (full range)
j0(x)

7 for |x| < 8

otherwise, the maximum absolute error is 5 x 10-12

j1(x)

7 for |x| < 8

otherwise, the maximum absolute error is 5 x 10-12

jn(n,x) For n = 128, the maximum absolute error is 5 x 10-12
y0(x)

7 for |x| < 8

otherwise, the maximum absolute error is 5 x 10-12

y1(x)

7 for |x| < 8

otherwise, the maximum absolute error is 5 x 10-12

yn(n,x)

For |x| > 1.5n, the maximum absolute error is 5 x 10-12

cyl_bessel_i0(x) 6 (full range)
cyl_bessel_i1(x) 6 (full range)
fmod(x,y) 0 (full range)
remainder(x,y) 0 (full range)
remquo(x,y,iptr) 0 (full range)
modf(x,iptr) 0 (full range)
fdim(x,y) 0 (full range)
trunc(x) 0 (full range)
round(x) 0 (full range)
rint(x) 0 (full range)
nearbyint(x) 0 (full range)
ceil(x) 0 (full range)
floor(x) 0 (full range)
lrint(x) 0 (full range)
lround(x) 0 (full range)
llrint(x) 0 (full range)
llround(x) 0 (full range)

H.2. Intrinsic Functions

本节中的函数只能在设备代码中使用。

在这些函数中,有一些标准函数的精度较低但速度更快的版本。它们具有相同的名称,前缀为 __(例如 __sinf(x))。 它们更快,因为它们映射到更少的本机指令。 编译器有一个选项 (-use_fast_math),它强制下表 中的每个函数编译为其内在对应项。 除了降低受影响函数的准确性外,还可能导致特殊情况处理的一些差异。 一种更健壮的方法是通过调用内联函数来选择性地替换数学函数调用,仅在性能增益值得考虑的情况下以及可以容忍更改的属性(例如降低的准确性和不同的特殊情况处理)的情况下。

Table 9. Functions Affected by -use_fast_math
Operator/Function Device Function
x/y

__fdividef(x,y)

sinf(x)

__sinf(x)

cosf(x)

__cosf(x)

tanf(x) __tanf(x)
sincosf(x,sptr,cptr) __sincosf(x,sptr,cptr)
logf(x)

__logf(x)

log2f(x) __log2f(x)
log10f(x) __log10f(x)
expf(x) __expf(x)
exp10f(x) __exp10f(x)
powf(x,y) __powf(x,y)

Single-Precision Floating-Point Functions

__fadd_[rn,rz,ru,rd]()__fmul_[rn,rz,ru,rd]() 映射到编译器从不合并到 FMAD 中的加法和乘法运算。相比之下,由“*”和“+”运算符生成的加法和乘法将经常组合到 FMAD 中。

_rn 为后缀的函数使用舍入到最接近的偶数舍入模式运行。

_rz 为后缀的函数使用向零舍入模式进行舍入操作。

_ru 为后缀的函数使用向上舍入(到正无穷大)舍入模式运行。

_rd 为后缀的函数使用向下舍入(到负无穷大)舍入模式进行操作。

浮点除法的准确性取决于代码是使用 -prec-div=false 还是 -prec-div=true 编译的。使用-prec-div=false编译代码时,正则除法/运算符和__fdividef(x,y)精度相同,但对于2126 < |y| <2128__fdividef(x,y) 提供的结果为零,而 / 运算符提供的正确结果在下表 中规定的精度范围内。此外,对于 2126 < |y| <2128,如果 x 为无穷大,则 __fdividef(x,y) 提供 NaN(作为无穷大乘以零的结果),而 / 运算符返回无穷大。另一方面,当使用 -prec-div=true 或根本没有任何 -prec-div 选项编译代码时, / 运算符符合 IEEE 标准,因为它的默认值为 true。

Function Error bounds
__fadd_[rn,rz,ru,rd](x,y)

IEEE-compliant.

__fsub_[rn,rz,ru,rd](x,y)

IEEE-compliant.

__fmul_[rn,rz,ru,rd](x,y)

IEEE-compliant.

__fmaf_[rn,rz,ru,rd](x,y,z)

IEEE-compliant.

__frcp_[rn,rz,ru,rd](x) IEEE-compliant.
__fsqrt_[rn,rz,ru,rd](x) IEEE-compliant.
__frsqrt_rn(x) IEEE-compliant.
__fdiv_[rn,rz,ru,rd](x,y)

IEEE-compliant.

__fdividef(x,y) For |y| in [2-126, 2126], the maximum ulp error is 2.
__expf(x) The maximum ulp error is 2 + floor(abs(1.16 * x)).
__exp10f(x) The maximum ulp error is 2+ floor(abs(2.95 * x)).
__logf(x) For x in [0.5, 2], the maximum absolute error is 2-21.41, otherwise, the maximum ulp error is 3.
__log2f(x) For x in [0.5, 2], the maximum absolute error is 2-22, otherwise, the maximum ulp error is 2.
__log10f(x) For x in [0.5, 2], the maximum absolute error is 2-24, otherwise, the maximum ulp error is 3.
__sinf(x) For x in [-π,π], the maximum absolute error is 2-21.41, and larger otherwise.
__cosf(x) For x in [-π,π], the maximum absolute error is 2-21.19, and larger otherwise.
__sincosf(x,sptr,cptr) Same as __sinf(x) and __cosf(x).
__tanf(x) Derived from its implementation as __sinf(x) * (1/__cosf(x)).
__powf(x, y) Derived from its implementation as exp2f(y * __log2f(x)).

Double-Precision Floating-Point Functions

__dadd_rn()__dmul_rn() 映射到编译器从不合并到 FMAD 中的加法和乘法运算。 相比之下,由“*”和“+”运算符生成的加法和乘法将经常组合到 FMAD 中。

Table 11. Double-Precision Floating-Point Intrinsic Functions. (Supported by the CUDA Runtime Library with Respective Error Bounds)
Function Error bounds
__dadd_[rn,rz,ru,rd](x,y)

IEEE-compliant.

__dsub_[rn,rz,ru,rd](x,y)

IEEE-compliant.

__dmul_[rn,rz,ru,rd](x,y)

IEEE-compliant.

__fma_[rn,rz,ru,rd](x,y,z)

IEEE-compliant.

__ddiv_[rn,rz,ru,rd](x,y)(x,y)

IEEE-compliant.

Requires compute capability > 2.

__drcp_[rn,rz,ru,rd](x)

IEEE-compliant.

Requires compute capability > 2.

__dsqrt_[rn,rz,ru,rd](x)

IEEE-compliant.

Requires compute capability > 2.