Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significantly slower broadcasting #356

Open
astrozot opened this issue Aug 4, 2024 · 3 comments
Open

Significantly slower broadcasting #356

astrozot opened this issue Aug 4, 2024 · 3 comments

Comments

@astrozot
Copy link

astrozot commented Aug 4, 2024

I have been testing the speed of broadcast operations with OffsetArrays of StaticArrays, and it looks like there is a significant penalty in time. In fact, on my laptop I see

julia> using StaticArrays, OffsetArrays, BenchmarkTools

julia> xs = [SVector{2}(rand(2)) for _  1:10_000];

julia> ys = similar(xs);

julia> @benchmark (@. $ys = $xs / (first($xs)^2 + last($xs)^2))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.419 μs … 43.605 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.142 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.559 μs ±  2.524 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    ▁ ▅                                                     
  █▅▆▄▆█▃█▂▃▂▂▃▂▂▃▃▂▂▃▂▁▂▂▂▂▂▂▂▂▃▂▁▂▂▁▁▁▁▁▃▄▅▂▂▂▄▂▃▁▃▂█▆▂▁▁▅▆ ▃
  9.42 μs         Histogram: frequency by time        15.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

When using OffsetArrays instead these are the results:

julia> oxs = OffsetArray(xs, -1000);

julia> oys = OffsetArray(ys, -1000);

julia> @benchmark (@. $oys = $oxs / (first($oxs)^2 + last($oxs)^2))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  48.245 μs … 159.888 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.522 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.394 μs ±  10.011 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇███▄▁            ▃  ▃ ▂    ▁  ▄   ▃                         ▂
  ██████▇▇▁▇▆▆▇▅▇▃█▆█▇██▆█▇▆█▅█▆▅█▆▆▅█▆▆▆▆▆▆█▆▆▆▅▆▆▅▆▆▅▆▆▆▆▆▆▆ █
  48.2 μs       Histogram: log(frequency) by time      97.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The $&gt; 5 \times$ penalty is even larger than the SIMD vector size of my laptop (4 for Float64 arrays).

Any help or clarification is really appreciated!

@astrozot
Copy link
Author

astrozot commented Aug 4, 2024

Just to clarify: the issue disappears when working with normal arrays, as in

julia> as = rand(10_000);

julia> bs = similar(as);

julia> @benchmark (@. $bs = $as / (1.0 + $as^2))
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min  max):  4.668 μs  20.108 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     6.278 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.231 μs ±  1.076 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇   ▁      ▅▁▇▄█▄▇▃▆▁▂▁▁                                   ▂
  ██████▃██▁███████████████▇▇▇▆▆▆▇▇▇▇▇▇▇▇▇▇▇▅▆▆▇▆▄▅▆▆▆▅▆▅▅▄▆ █
  4.67 μs      Histogram: log(frequency) by time     10.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

It also disappears if, instead of broadcasting operations, I use explicit loops:

julia> function f!(ys, xs)
           for i ∈ eachindex(xs)
               @inbounds ys[i] = xs[i] / (first(xs[i])^2 + last(xs[i])^2)
           end
           ys
       end
f! (generic function with 1 method)

julia> @benchmark f!($ys, $xs)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.789 μs … 39.339 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.053 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.876 μs ±  2.016 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▅▅▄▅▅▃▄ ▂            ▃ ▂  ▂  ▂  ▁▄ ▃ ▄                   ▂
  █████████▅█▃██▃▁█▁█▁▇▆▁█▃█▁▃█▁▁█▁▄██▅█▆█▆▆▅█▅▄▄▃▆▄▅▃▃▁▁▄▄▄ █
  8.79 μs      Histogram: log(frequency) by time     16.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f!($oys, $oxs)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   8.808 μs … 44.438 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.446 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.056 μs ±  1.749 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▆▂▇ ▇ ▅  ▇ ▇▂             ▄   ▃   ▃  ▁    ▁   ▅▂   ▄    ▁ ▂
  █▅███▅█▆██▆█▄██▃█▁▄█▃▁█▁▁█▁▁█▇▁▁█▃▁▃█▃▁█▇▁▁▁█▁▅▅██▆▆▅█▅▅▆▆█ █
  8.81 μs      Histogram: log(frequency) by time      14.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@jishnub
Copy link
Member

jishnub commented Aug 4, 2024

I think the issue is that StaticArrays unrolls the loops, and OffsetArrays currently doesn't have specialized broadcast styles defined and can't access the specialized methods.

@astrozot
Copy link
Author

astrozot commented Aug 5, 2024

Thank you @jishnub , thank you for your reply. Looking at the source code, I saw a link to a similar issue reported here:

https://discourse.julialang.org/t/why-is-there-a-performance-hit-on-broadcasting-with-offsetarrays/32194

I am not entirely sure the situation is identical...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants