Added a GitHub Action to check for performance regressions within the incoming changes from PRs, along with two test cases based on historical regressions #6078

Merged on Apr 16, 2024 (15 commits; changes shown from 14 commits)
23 changes: 23 additions & 0 deletions .github/workflows/performance-tests.yml
@@ -0,0 +1,23 @@
name: Autocomment atime-based performance regression analysis on PRs

on:
  pull_request:
    branches:
      - '*'
    types:
      - opened
      - reopened
      - synchronize
    paths:
      - 'R/**'
      - 'src/**'

jobs:
  comment:
    runs-on: ubuntu-latest
    container: ghcr.io/iterative/cml:0-dvc2-base1
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
      repo_token: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: Anirban166/Autocomment-atime-results@v1.1.6
108 changes: 108 additions & 0 deletions inst/atime/tests.R
@@ -0,0 +1,108 @@
# A function to customize R package metadata and source files to facilitate version-specific installation and testing.
#
# This is specifically tailored for handling data.table which requires specific changes in non-standard files (such as the object file name in Makevars and version checking code in onLoad.R)
# to support testing across different versions (base and HEAD for PRs, commits associated with historical regressions, etc.) of the package.
# It appends a SHA1 hash to the package name (PKG.SHA), ensuring each version can be installed and tested separately.
#
# @param old.Package Current name of the package.
# @param new.Package New name of the package, including a SHA hash.
# @param sha SHA1 hash used for differentiating versions.
# @param new.pkg.path Path to the package files.
#
# @details
# The function modifies:
# - DESCRIPTION, updating the package name.
# - src/Makevars.*in, customizing the shared object file name and adjusting the build settings.
# - R/onLoad.R, adapting custom version checking for package loading operations.
# - src/init.c, renaming the R_init_ registration routine to match the new package name.
# - NAMESPACE, updating the useDynLib directive used for dynamic linking.
#
# @examples
# pkg.edit.fun("data.table", "data.table.some_SHA1_hash", "some_SHA1_hash", "/path/to/data.table")
#
# @return None (performs in-place file modifications)
# @note This setup is typically unnecessary for most packages but essential for data.table due to its unique configuration.
pkg.edit.fun = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
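
To make the renaming in pkg.edit.fun easier to follow, here is a minimal base-R sketch of the three name derivations it performs. The short SHA "abc123" and the sample strings are illustrative assumptions, not values from this PR; a real run would use a full 40-character SHA1 hash and the actual file contents.

```r
# Illustrative values only: "abc123" stands in for a real 40-character SHA1 hash.
old.Package <- "data.table"
sha <- "abc123"

# The same three derivations used inside pkg.edit.fun.
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)  # regex with an optional underscore
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)        # C-friendly name
new.Package_ <- paste0(Package_, "_", sha)                   # C-friendly name plus SHA

# The regex matches the name with or without the underscore, but not the dotted form.
matches <- grepl(Package_regex, c("data_table.so", "datatable.so", "data.table.so"))

# Hypothetical input string: renaming a registration routine such as the one in src/init.c.
renamed <- gsub(Package_regex, new.Package_, "R_init_data_table")

print(Package_regex)
print(new.Package_)
print(matches)
print(renamed)
```

Because the pattern treats the underscore as optional, one find-and-replace call can cover files that spell the name as either data_table or datatable.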

# A list of performance tests.
#
# Each entry in this list corresponds to a performance test and contains a sublist with three mandatory arguments:
# - N: A numeric sequence of data sizes to vary.

Review comment (Member):
off-topic / thinking aloud: @tdhock have you considered extending {atime} to time surfaces in 2 dimensions? I'm wondering whether we'd benefit from (nrow, ncol) 2D benchmarking in some cases. Currently we are restricted to a marginal approach (nrow, then ncol), but the interaction effects can sometimes be interesting.

Reply (Member):
atime is inherently one-dimensional (only N varies), but you can make both nrow and ncol vary with N.

# - setup: An expression evaluated for every data size before measuring time/memory.
# - expr: An expression that will be evaluated for benchmarking performance across different git commit versions.
# This must call a function from data.table using a syntax with double or triple colon prefix.
# The package name before the colons will be replaced by a new package name that uses the commit SHA hash.
# (For instance, data.table:::[.data.table will become data.table.some_40_digit_SHA1_hash:::[.data.table)
#
# Optional parameters that may be useful to configure tests:
# - times: Number of times each expression is evaluated (default is 10).
# - seconds.limit: The maximum median timing (in seconds) of an expression. No timings for larger N are computed past that threshold.
# - sha.vec: Named character vector or a list of vectors that specify data.table-specific commit SHAs for testing across those different git commit versions.
# For historical regressions, use the names 'Before', 'Regression', and 'Fixed'; otherwise, prefer descriptive names such as 'Slow' or 'Fast'.
# @note Please check https://github.com/tdhock/atime/blob/main/vignettes/data.table.Rmd for more information.
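
The package-prefix substitution described for expr can be illustrated with a small base-R sketch. The helper prefix_with_sha is hypothetical and is not part of atime, whose actual implementation may differ; "abc123" again stands in for a full SHA1 hash.

```r
# Hypothetical helper (not part of atime): swap the package prefix in a quoted call
# so that pkg::fun or pkg:::fun refers to the SHA-specific installation instead.
prefix_with_sha <- function(expr, pkg, sha) {
  code <- paste(deparse(expr), collapse = "\n")
  # Escape dots in the package name and capture the double or triple colon.
  pattern <- paste0(gsub(".", "[.]", pkg, fixed = TRUE), "(:{2,3})")
  replacement <- paste0(pkg, ".", sha, "\\1")
  # parse() only builds the call; nothing is evaluated here.
  parse(text = gsub(pattern, replacement, code))[[1]]
}

rewritten <- prefix_with_sha(quote(data.table:::shallow(dt)), "data.table", "abc123")
print(rewritten)
```

Only a package name followed directly by colons is rewritten, which is why expr must call the function with an explicit double or triple colon prefix.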
test.list <- list(
  # Performance regression discussed in: https://github.com/Rdatatable/data.table/issues/4311
  # Fixed in: https://github.com/Rdatatable/data.table/pull/4440
  "Test regression fixed in #4440" = list(
    pkg.edit.fun = pkg.edit.fun,
    N = 10^seq(3, 8),
    setup = quote({
      set.seed(1L)
      dt <- data.table(a = sample.int(N))
      setindexv(dt, "a")
    }),
    expr = quote(data.table:::shallow(dt)),
    Before = "9d3b9202fddb980345025a4f6ac451ed26a423be", # This needs to be changed later. It is currently set to the merge commit of the PR that fixed the regression (https://github.com/Rdatatable/data.table/pull/4440) because the source of the regression (the particular commit that introduced it) is not clear. In addition, older versions of data.table fail to install in this manner. (This includes commits from before Mar 20, 2020, when the issue that first mentioned the regression was created.)
    Regression = "b1b1832b0d2d4032b46477d9fe6efb15006664f4", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/0f0e7127b880df8459b0ed064dc841acd22f5b73) in the PR (https://github.com/Rdatatable/data.table/pull/4440/commits) that fixes the regression
    Fixed = "9d3b9202fddb980345025a4f6ac451ed26a423be"), # Merge commit in the PR that fixed the regression (https://github.com/Rdatatable/data.table/pull/4440)

  # Test based on: https://github.com/Rdatatable/data.table/issues/5424
  # Performance regression introduced by a commit in: https://github.com/Rdatatable/data.table/pull/4491
  # Fixed in: https://github.com/Rdatatable/data.table/pull/5463
  "Test regression fixed in #5463" = list(
    pkg.edit.fun = pkg.edit.fun,
    N = 10^seq(3, 8),
    setup = quote({
      n <- N/100
      set.seed(2L)
      dt <- data.table(
        g = sample(seq_len(n), N, TRUE),
        x = runif(N),
        key = "g")
      dt_mod <- copy(dt)
    }),
    expr = quote(data.table:::`[.data.table`(dt_mod, , N := .N, by = g)),
    Before = "be2f72e6f5c90622fe72e1c315ca05769a9dc854", # Commit preceding the regression-causing commit (https://github.com/Rdatatable/data.table/pull/4491/commits/e793f53466d99f86e70fc2611b708ae8c601a451) in the PR that introduced the issue (https://github.com/Rdatatable/data.table/pull/4491/commits)
    Regression = "e793f53466d99f86e70fc2611b708ae8c601a451", # Commit responsible for the regression in the PR that introduced the issue (https://github.com/Rdatatable/data.table/pull/4491/commits)
    Fixed = "58409197426ced4714af842650b0cc3b9e2cb842") # Last commit in the PR that fixed the regression (https://github.com/Rdatatable/data.table/pull/5463/commits)
)
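
The review discussion above notes that atime varies a single size N, but that nrow and ncol can both be made to depend on N. A minimal base-R sketch of that idea follows. The helper dims_from_N and its fixed rows-to-columns ratio are assumptions made for illustration; they appear neither in this PR nor in atime.

```r
# Hypothetical helper: derive both dimensions from the single size parameter N.
# The target nrow:ncol ratio is an assumption chosen for illustration.
dims_from_N <- function(N, ratio = 10) {
  ncol <- max(1L, as.integer(round(sqrt(N / ratio))))
  nrow <- max(1L, as.integer(N %/% ncol))
  c(nrow = nrow, ncol = ncol)
}

# Both dimensions grow together as N grows, as they could inside a test's setup expression.
for (N in 10^seq(3, 5)) {
  d <- dims_from_N(N)
  m <- matrix(0, nrow = d[["nrow"]], ncol = d[["ncol"]])
  cat(sprintf("N = %d: %d rows x %d columns\n", as.integer(N), nrow(m), ncol(m)))
}
```

In a test entry, the same derivation would sit inside setup so that each N produces a table whose row and column counts scale jointly. This still traces only one path through the (nrow, ncol) plane rather than a full timing surface.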