
'by' operations much slower when verbose=TRUE #6286

Closed
joshhwuu opened this issue Jul 16, 2024 · 12 comments · Fixed by #6296

@joshhwuu (Member) commented Jul 16, 2024

Seen while benchmarking #6228: by operations seem to run much slower with verbose=TRUE.

dt = data.table(a = 1:1000000)
system.time(copy(dt)[, 1, by = a])
#   user  system elapsed 
#  0.263   0.001   0.202

system.time(copy(dt)[, 1, by = a, verbose=TRUE])
#   user  system elapsed 
#  1.375   1.505   2.820

My initial guess for the culprit is the numerous calls to clock() made when verbose=TRUE.

# Output of sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Vancouver
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.15.99 testthat_3.2.1.1   devtools_2.4.5     usethis_2.2.3     

loaded via a namespace (and not attached):
 [1] miniUI_0.1.1.1    compiler_4.4.1    brio_1.1.5        promises_1.3.0   
 [5] Rcpp_1.0.12       stringr_1.5.1     callr_3.7.6       later_1.3.2      
 [9] fastmap_1.2.0     mime_0.12         R6_2.5.1          htmlwidgets_1.6.4
[13] desc_1.4.3        profvis_0.3.8     rprojroot_2.0.4   shiny_1.8.1.1    
[17] rlang_1.1.4       cachem_1.1.0      stringi_1.8.4     httpuv_1.6.15    
[21] fs_1.6.4          pkgload_1.3.4     memoise_2.0.1     cli_3.6.2        
[25] withr_3.0.0       magrittr_2.0.3    ps_1.7.6          processx_3.8.4   
[29] digest_0.6.35     rstudioapi_0.16.0 xtable_1.8-4      remotes_2.5.0    
[33] lifecycle_1.0.4   vctrs_0.6.5       glue_1.7.0        urlchecker_1.0.1 
[37] sessioninfo_1.2.2 pkgbuild_1.4.4    purrr_1.0.2       tools_4.4.1      
[41] ellipsis_0.3.2    htmltools_0.5.8.1

@joshhwuu (Member Author)

Investigating:

I replaced all calls to clock() with wallclock(), and performance improved dramatically:

# with wallclock()
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
# 0.008s elapsed (0.007s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.005s elapsed (0.041s cpu) 
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.031s for 1000000 groups
#   eval(j) took 0.028s for 1000000 calls
# 0.274s elapsed (0.273s cpu) 
#    user  system elapsed 
#   0.322   0.020   0.289

# with clock() ie current
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
# 0.008s elapsed (0.024s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.005s elapsed (0.043s cpu) 
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.453s for 1000000 groups
#   eval(j) took 0.438s for 1000000 calls
# 1.955s elapsed (0.509s cpu) 
#    user  system elapsed 
#   0.577   1.446   1.969

Can anyone comment on the difference between using clock() and wallclock()? If there aren't any issues with wallclock(), I'll file a PR to refactor the current calls.

@Anirban166 (Member)

Can reproduce. Adding a zero (10x the rows) makes the difference even more noticeable:

library(data.table) # 1.15.99
dt = data.table(a = 1:10000000)
system.time(copy(dt)[, 1, by = a])
#   user  system elapsed 
#  1.924   0.039   1.849 
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 10000000 rows and 1 columns
# 0.069s elapsed (0.155s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.034s elapsed (0.032s cpu)
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 4.140s for 10000000 groups
#   eval(j) took 4.099s for 10000000 calls
# 18.0s elapsed (6.423s cpu) 
#   user  system elapsed 
#  6.611  11.598  18.096 

@joshhwuu (Member Author)

Interesting, this doesn't happen with 1.15.4. I'll see if I can find the reason.

library(data.table) # 1.15.4
dt = data.table(a = 1:10000000)
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 10000000 rows and 1 columns
# 0.370s elapsed (0.020s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.050s elapsed # (0.050s cpu) 
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.258s for 10000000 groups
#   eval(j) took 0.234s for 10000000 calls
# 2.230s elapsed (2.150s cpu) 
#    user  system elapsed 
#    2.23    0.06    2.67

@Anirban166 (Member)

> Can anyone comment on the difference between using clock() and wallclock()?

I reckon clock() is used for profiling CPU time and likely ends up making more system calls in the process (it has been found to be slower before; see e.g. this comment).

wallclock() does seem to have a simpler implementation, so in theory the overhead difference could be true.

> Interesting, this doesn't happen with 1.15.4
> I'll see if I can find the reason

Good find! Seems like a potential performance regression has sneaked in.

@joshhwuu (Member Author) commented Jul 16, 2024

NVM, it seems this only affects non-Windows platforms. AFAIK there are slight implementation differences between clock() on Linux and on Windows. I just tried to reproduce this on Windows R with 1.15.99, and there is no issue:

library(data.table)
# data.table 1.15.99 IN DEVELOPMENT built 2024-07-16 23:30:59 UTC; joshu using 8 threads (see ?getDTthreads).  Latest news: r-datatable.com
dt = data.table(a = 1:1000000)
system.time(copy(dt)[, 1, by = a])
#    user  system elapsed 
#    0.17    0.00    0.17 
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
# 0.020s elapsed (0.000s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.029s for 1000000 groups
#   eval(j) took 0.022s for 1000000 calls
# 0.230s elapsed (0.220s cpu) 
#    user  system elapsed 
#    0.22    0.03    0.27 

Found this SO thread

@tdhock (Member) commented Jul 17, 2024

I don't think anybody cares about performance when verbose=TRUE, since it is for debugging

@joshhwuu (Member Author)

> I don't think anybody cares about performance when verbose=TRUE, since it is for debugging

Hm, I can understand that, but switching to wallclock() is trivial since it is already implemented, and it gives us consistent performance across platforms. Additionally, IMO a performance difference of over 10x seems a little egregious, even for a debug mode.

@tdhock (Member) commented Jul 17, 2024

ok if you want to make the change, then make sure to add an atime performance test

@tdhock (Member) commented Jul 17, 2024

https://github.com/Rdatatable/data.table/wiki/Performance-testing

Anirban166 added a commit to Anirban166/data.table that referenced this issue Jul 18, 2024
…ent)) is invalid or OS was the source of difference rather than the version
@jangorecki (Member) commented Jul 21, 2024

I can imagine that for verbose=TRUE we could sometimes add extra forder calls (or potentially other heavy calls) just to report more information to the console. I would say this is acceptable. For performance, verbose needs to be FALSE.

@aitap (Contributor) commented Aug 25, 2024

Edit: let's keep a distinction between two quantities. CPU time spent executing instructions by the current process is returned by clock() on non-Windows, by getrusage() on POSIX, and possibly by clock_gettime(CLOCK_PROCESS_CPUTIME_ID) on some POSIX implementations; Windows needs GetProcessTimes() because clock() doesn't conform there. Elapsed time, which includes waiting for I/O and for other processes, is measured by clock_gettime(CLOCK_MONOTONIC) and sometimes (if the clock is not adjusted in the background) by timespec_get()/time()/clock_gettime(CLOCK_REALTIME)/gettimeofday(); on Windows, by QueryPerformanceCounter() or clock(). It usually costs a system call to know the CPU time, but there are faster ways of obtaining a monotonically increasing time value.

Since the values of wallclock() are frequently subtracted to obtain durations, can we use clock_gettime(CLOCK_MONOTONIC) instead of CLOCK_REALTIME? Subtracting wall-clock values may cause some confusing bugs if the system time is adjusted in the background. POSIX requires both CLOCK_REALTIME and CLOCK_MONOTONIC.

Empirically, on amd64 Linux, both CLOCK_REALTIME and CLOCK_MONOTONIC use the vDSO and thus avoid the system-call penalty, but clock() is required to measure CPU time, so it uses CLOCK_PROCESS_CPUTIME_ID and costs an extra system call, as does getrusage().

@MichaelChirico (Member)

Thanks for pointing out CLOCK_MONOTONIC!

wallclock() (according to a comment that's unfortunately been lost) is adapted from R's own currentTime(), which uses CLOCK_REALTIME:

0b3cd65#diff-4052380a202e1373dcfdfa2fff9850d61b26758fc0475260b2871e70f0d153a1R129

https://github.com/r-devel/r-svn/blob/4778112deccaf909e93a0bec77d6c5ccd0e9155f/src/main/times.c#L105

CLOCK_REALTIME has been used ever since currentTime() was implemented 13 years ago; I don't see any discussion of the issue you raise on r-devel or Bugzilla either:

r-devel/r-svn@8cf6136
https://github.com/search?q=repo%3AMichaelChirico%2Fr-mailing-list-archive%20%22CLOCK_REALTIME%22&type=code
long bugzilla link

I would suggest starting this as a discussion on r-devel first; at a minimum, it's out of scope here and deserves its own issue.
