Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust the launch bounds to get_json_object to avoid spilling #2015

Merged

Conversation

revans2
Copy link
Collaborator

@revans2 revans2 commented May 3, 2024

We found in practice that the get_json_object kernel could be very slow due to register spilling. This asks CUDA to avoid spilling if possible.

Note that this has been upmerged to include changes equivalent to #1924

The comparison of the benchmarks are the following

['baseline.json', 'updated.json']
# get_json_object

## [0] NVIDIA RTX A6000

|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |           Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|----------------|---------|----------|
|   1000000    |      2      |   4.418 ms |       0.67% |   3.901 ms |       0.64% |    -516.929 us | -11.70% |   FAIL   |
|   10000000   |      2      |   4.923 ms |       0.40% |   4.369 ms |       1.21% |    -554.720 us | -11.27% |   FAIL   |
|  100000000   |      2      |  57.260 ms |       1.21% |  17.878 ms |       0.43% |  -39382.238 us | -68.78% |   FAIL   |
|  1000000000  |      2      | 513.896 ms |       0.31% | 151.902 ms |       0.22% | -361993.797 us | -70.44% |   FAIL   |
|   1000000    |      4      |   4.491 ms |       0.26% |   3.957 ms |       0.13% |    -534.262 us | -11.90% |   FAIL   |
|   10000000   |      4      |   5.006 ms |       3.18% |   4.419 ms |       1.82% |    -586.499 us | -11.72% |   FAIL   |
|  100000000   |      4      |  57.266 ms |       1.15% |  17.909 ms |       0.48% |  -39357.476 us | -68.73% |   FAIL   |
|  1000000000  |      4      | 513.419 ms |       0.50% | 150.293 ms |       0.13% | -363125.961 us | -70.73% |   FAIL   |
|   1000000    |      6      |   4.609 ms |       0.29% |   4.024 ms |       0.25% |    -585.725 us | -12.71% |   FAIL   |
|   10000000   |      6      |   5.097 ms |       0.50% |   4.455 ms |       0.44% |    -642.529 us | -12.61% |   FAIL   |
|  100000000   |      6      |  58.795 ms |       1.04% |  17.975 ms |       0.46% |  -40820.301 us | -69.43% |   FAIL   |
|  1000000000  |      6      | 518.776 ms |       0.47% | 151.316 ms |       0.21% | -367460.065 us | -70.83% |   FAIL   |
|   1000000    |      8      |   4.624 ms |       0.25% |   4.049 ms |       0.19% |    -575.162 us | -12.44% |   FAIL   |
|   10000000   |      8      |   5.134 ms |       0.30% |   4.499 ms |       2.72% |    -635.060 us | -12.37% |   FAIL   |
|  100000000   |      8      |  59.252 ms |       1.10% |  18.132 ms |       0.26% |  -41119.616 us | -69.40% |   FAIL   |
|  1000000000  |      8      | 526.177 ms |       0.54% | 152.719 ms |       0.21% | -373457.061 us | -70.98% |   FAIL   |

# Summary

- Total Matches: 16
  - Pass    (diff <= min_noise): 0
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  16

In my other tests based on NVIDIA/spark-rapids#10729 I saw speedups against the current version of get_json_object of 1.6x to 3.5x

For the legacy version that didn't have any validation I saw improvements between 1.04x and and 1.8x.

I ran the integration tests and they all passed also, so I think this is good to go.

@revans2
Copy link
Collaborator Author

revans2 commented May 3, 2024

build

@revans2
Copy link
Collaborator Author

revans2 commented May 3, 2024

@ttnghia could you take a look at this over #1924

@revans2
Copy link
Collaborator Author

revans2 commented May 3, 2024

build

@thirtiseven
Copy link
Collaborator

Just merged #1924 with a new comment address commit 28387a9 , please upmerge this, thanks

return char_range(_data + pos, len);
}

__device__ inline char_range slice_from(cudf::size_type pos) const
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slice_from is useless.

// save current token start pos, used by coping current token text
char const* current_token_start_pos;
cudf::size_type current_token_start_pos;
Copy link
Collaborator

@res-life res-life May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the following variables can be removed.

cudf::size_type current_token_start_pos;
cudf::size_type number_token_len;

If postpone the validatation into write_raw and copy_current_structure
Then we can remove these two variables.

Current logic is:
Validation a token will go to the end char of the token and save the start pos the the token. If need to copy the current token, have to go back and scan the token again.

Updated logic will be:
Change current token to expected token by only detect the first char of the next token. Postpone the validatation into write_raw and copy_current_structure.
e.g.:
{'k' : '123456789'}
When executing next_token when current pos is pointing to the string '123456789', just scanning 1st char ' decides the next expected token is String. Let curr pos point to the start of '123456789' and set current expected token as STRING. Then when executing write_raw and copy_current_structure, do the validation when copying.
After change, the get_json_object will only scan the whole JSON 1 time without go back.

I have a draft (about 30%) for this, but not sure the improvement gain.

@revans2
Copy link
Collaborator Author

revans2 commented May 6, 2024

build

@revans2 revans2 requested a review from ttnghia May 6, 2024 16:05
@revans2
Copy link
Collaborator Author

revans2 commented May 6, 2024

I have upmerged the code relative to the changes in #1924 and addressed the review comments here.

I reran the benchmark comparison and got the following.

['baseline.json', 'updated.json']
# get_json_object

## [0] NVIDIA RTX A6000

|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |           Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|----------------|---------|----------|
|   1000000    |      2      |   4.610 ms |       0.77% |   3.889 ms |       0.85% |    -721.561 us | -15.65% |   FAIL   |
|   10000000   |      2      |   5.030 ms |       0.20% |   4.364 ms |       4.69% |    -665.711 us | -13.23% |   FAIL   |
|  100000000   |      2      |  50.644 ms |       1.09% |  18.459 ms |       0.50% |  -32185.049 us | -63.55% |   FAIL   |
|  1000000000  |      2      | 453.540 ms |       0.68% | 151.467 ms |       0.24% | -302072.676 us | -66.60% |   FAIL   |
|   1000000    |      4      |   4.687 ms |       0.15% |   3.955 ms |       0.15% |    -732.169 us | -15.62% |   FAIL   |
|   10000000   |      4      |   5.139 ms |       0.48% |   4.412 ms |       2.38% |    -726.866 us | -14.14% |   FAIL   |
|  100000000   |      4      |  51.086 ms |       1.32% |  17.707 ms |       0.27% |  -33378.536 us | -65.34% |   FAIL   |
|  1000000000  |      4      | 453.716 ms |       0.30% | 150.117 ms |       0.18% | -303599.867 us | -66.91% |   FAIL   |
|   1000000    |      6      |   4.741 ms |       0.16% |   4.017 ms |       0.16% |    -724.545 us | -15.28% |   FAIL   |
|   10000000   |      6      |   5.168 ms |       0.22% |   4.456 ms |       9.16% |    -711.311 us | -13.76% |   FAIL   |
|  100000000   |      6      |  52.284 ms |       1.17% |  17.854 ms |       0.50% |  -34429.908 us | -65.85% |   FAIL   |
|  1000000000  |      6      | 459.636 ms |       0.55% | 151.526 ms |       0.19% | -308110.211 us | -67.03% |   FAIL   |
|   1000000    |      8      |   4.782 ms |       0.14% |   4.039 ms |       0.18% |    -743.326 us | -15.54% |   FAIL   |
|   10000000   |      8      |   5.239 ms |       1.00% |   4.492 ms |       0.17% |    -747.483 us | -14.27% |   FAIL   |
|  100000000   |      8      |  52.991 ms |       0.99% |  18.085 ms |       0.46% |  -34906.481 us | -65.87% |   FAIL   |
|  1000000000  |      8      | 464.851 ms |       0.49% | 152.728 ms |       0.20% | -312122.459 us | -67.14% |   FAIL   |

# Summary

- Total Matches: 16
  - Pass    (diff <= min_noise): 0
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  16

The numbers are really unchanged from the previous results, except that there are some small movement on the baseline numbers.

@revans2
Copy link
Collaborator Author

revans2 commented May 6, 2024

build

Copy link
Collaborator

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost entirely nits.

Comment on lines 140 to 142
__device__ inline bool eof(cudf::size_type pos) { return pos >= _len; }
__device__ inline bool is_null() { return _data == nullptr; }
__device__ inline bool is_empty() { return _len == 0; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 could be const.

Comment on lines 173 to 174
__device__ inline bool eof() { return _range.eof(_pos); }
__device__ inline bool is_null() { return _range.is_null(); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const here and current_char() and pos()

Comment on lines 262 to 263
__device__ inline bool eof(cudf::size_type pos) { return pos >= chars.size(); }
__device__ inline bool eof() { return curr_pos >= chars.size(); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

Comment on lines 522 to 523
char c = str.current_char();
int v = static_cast<int>(c);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like both of these could be const.

if (copy_destination != nullptr) { *copy_destination++ = str.current_char(); }
} else {
// escape_style::ESCAPED
int escape_chars = escape_char(str.current_char(), copy_destination);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

src/main/cpp/src/json_parser.cuh Outdated Show resolved Hide resolved
src/main/cpp/src/json_parser.cuh Outdated Show resolved Hide resolved
src/main/cpp/src/get_json_object.cu Show resolved Hide resolved
src/main/cpp/src/json_parser.cuh Show resolved Hide resolved
src/main/cpp/src/json_parser.cuh Outdated Show resolved Hide resolved
@revans2
Copy link
Collaborator Author

revans2 commented May 6, 2024

build

@revans2
Copy link
Collaborator Author

revans2 commented May 6, 2024

@nvdbaranec please take another look

ttnghia
ttnghia previously approved these changes May 7, 2024
nvdbaranec
nvdbaranec previously approved these changes May 7, 2024
// We have 1 for the minBlocksPerMultiprocessor in the launch bounds to avoid spilling from
// the kernel itself. By default NVCC uses a heuristic to find a balance between the
// maximum number of registers used by a kernel and the parallelism of the kernel.
// The lots of registers are used the parallelism will suffer. But in our case
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The lots of registers are used the parallelism will suffer. But in our case
// If lots of registers are used the parallelism may suffer. But in our case

* Note: when setting `allow_single_quotes` or `allow_unescaped_control_chars`,
* then JSON format is not conventional.
* Note: This is not conventional as it allows
* single quotes and unescaped control characters
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* single quotes and unescaped control characters
* single quotes and unescaped control characters

{
}

__device__ inline char_range(const cudf::string_view& input)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
__device__ inline char_range(const cudf::string_view& input)
__device__ inline char_range(cudf::string_view const& input)

// a member variable with a static method like this.
__device__ inline static char_range null() { return char_range(nullptr, 0); }

__device__ inline char_range(const char_range&) = default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
__device__ inline char_range(const char_range&) = default;
__device__ inline char_range(char_range const&) = default;


__device__ inline char_range(const char_range&) = default;
__device__ inline char_range(char_range&&) = default;
__device__ inline char_range& operator=(const char_range&) = default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
__device__ inline char_range& operator=(const char_range&) = default;
__device__ inline char_range& operator=(char_range const&) = default;

Comment on lines 522 to 523
const char c = str.current_char();
const int v = static_cast<int>(c);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const char c = str.current_char();
const int v = static_cast<int>(c);
char const c = str.current_char();
int const v = static_cast<int>(c);

if (copy_destination != nullptr) { *copy_destination++ = str.current_char(); }
} else {
// escape_style::ESCAPED
const int escape_chars = escape_char(str.current_char(), copy_destination);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const int escape_chars = escape_char(str.current_char(), copy_destination);
int const escape_chars = escape_char(str.current_char(), copy_destination);

char c = *str_pos;
switch (*str_pos) {
if (!str.eof()) {
const char c = str.current_char();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const char c = str.current_char();
char const c = str.current_char();

cudf::char_utf8 code_point = 0;
for (size_t i = 0; i < 4; i++) {
if (str.eof()) { return false; }
const char c = str.current_char();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const char c = str.current_char();
char const c = str.current_char();

// In UTF-8, the maximum number of bytes used to encode a single character
// is 4
char buff[4];
const cudf::size_type bytes = from_char_utf8(utf_char, buff);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const cudf::size_type bytes = from_char_utf8(utf_char, buff);
cudf::size_type const bytes = from_char_utf8(utf_char, buff);

if (!eof(curr_pos)) {
char c = *curr_pos;
if (!eof()) {
const char c = chars[curr_pos];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const char c = chars[curr_pos];
char const c = chars[curr_pos];

if (!eof(curr_pos)) {
char next_char_after_zero = *curr_pos;
if (!eof()) {
const char next_char_after_zero = chars[curr_pos];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const char next_char_after_zero = chars[curr_pos];
char const next_char_after_zero = chars[curr_pos];

char c = *curr_pos;
skip_whitespaces();
if (!eof()) {
const char c = chars[curr_pos];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const char c = chars[curr_pos];
char const c = chars[curr_pos];

Copy link
Collaborator

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the additional style comments. cudf enforces a very annoying east-const style that we need to follow.

@revans2 revans2 dismissed stale reviews from nvdbaranec and ttnghia via 2183f69 May 7, 2024 17:48
@revans2
Copy link
Collaborator Author

revans2 commented May 7, 2024

@ttnghia and @nvdbaranec please take another look. I think all of the comments are now addressed.

@revans2
Copy link
Collaborator Author

revans2 commented May 7, 2024

build

@gerashegalov
Copy link
Collaborator

Sorry for the additional style comments. cudf enforces a very annoying east-const style that we need to follow.

I don't see this being enforced in thirdparty/cudf/.clang-format since it does not have QualifierAlignment: Right

if I add this rule the clang-format pre-commit finds and auto-corrects many violations in spark-rapids-jni and cudf

@revans2 revans2 merged commit 980633d into NVIDIA:branch-24.06 May 7, 2024
3 checks passed
@revans2 revans2 deleted the get_json_obj_adjust_launch_bounds branch May 7, 2024 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants