Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor GetObject request with Range Header to Skip HeadRequest - Part 3 #397

Merged
merged 45 commits into from
Dec 28, 2023

Conversation

waahm7
Copy link
Contributor

@waahm7 waahm7 commented Dec 20, 2023

Issue #, if available:
#370

Description of changes:

  • This PR refactors our GetObject request with Range header logic to be parsed on the client side, as long as the Range includes a start-range.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -53,6 +53,12 @@ struct aws_s3_auto_ranged_get {
} synced_data;

uint32_t initial_message_has_range_header : 1;

bool initial_message_has_start_range;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using bool because we can not have a pointer to bit field https://stackoverflow.com/questions/13547352/c-cannot-take-address-of-bit-field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By putting these new variables between two other bit-fields, it prevents those bit-fields from getting packed together, and the only point of a bit-field is to get packed together and reduce the overall size of a struct.

IMHO, the bit-fields in aws-c-s3 aren't worth having because they make the code uglier (need to type != 0 instead of just treating them like a normal bool), and these structs are not a big contributor to memory usage. But if we're going to have bit-fields, we should get some benefit from them. Having bitfields that make your code ugly AND getting no benefit from them... is dumb.

I see how you just wanted to put related variables together though. Consider doing 1 of these:

  1. keep all bit-fields together at the end of the struct (the traditional place to put them)
  2. stop using bit-fields altogether in this struct (find/replace ugly code that does != 0)
  3. make this a bit-field (together with other bit-fields), but yeah you can't pass it by pointer so you'll need to make a temp variable bool initial_message_has_start_range_tmp and pass THAT to your function by pointer, and then set the bit-field to the tmp variable

Copy link
Contributor Author

@waahm7 waahm7 Dec 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed description, I have updated it to option 1 for now. I would have preferred option 2 but don't want to include that refactor in this PR.

@codecov-commenter
Copy link

codecov-commenter commented Dec 21, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (cfc395b) 88.79% compared to head (5ef1b80) 88.82%.

Additional details and impacted files

Impacted file tree graph

@@                   Coverage Diff                   @@
##           refactor-empty-file     #397      +/-   ##
=======================================================
+ Coverage                88.79%   88.82%   +0.03%     
=======================================================
  Files                       21       21              
  Lines                     6078     6141      +63     
=======================================================
+ Hits                      5397     5455      +58     
- Misses                     681      686       +5     
Files Coverage Δ
source/s3_auto_ranged_get.c 97.76% <100.00%> (+0.16%) ⬆️
source/s3_meta_request.c 90.80% <100.00%> (-0.01%) ⬇️
source/s3_util.c 98.01% <87.17%> (-1.61%) ⬇️

source/s3_util.c Outdated
}

/* verify that start-range <= end-range */
if (*out_initial_message_has_end_range && *out_initial_start_range > *out_initial_end_range) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we perform a HeadRequest with startRange > endRange such as bytes=20-10, S3 does not add a Content-Range header in the HeadRequest response, and Content-Length reflects the total length of the object. The CLI downloads the entire object. The current CRT behavior is that it fails with AWS_ERROR_S3_MISSING_CONTENT_RANGE_HEADER at https://github.com/awslabs/aws-c-s3/blob/main/source/s3_auto_ranged_get.c#L510C24-L510C66. According to the HTTP spec, “An int-range is invalid if the last-pos value is present and less than the first-pos.” . This changes it to AWS_ERROR_S3_INVALID_RANGE_HEADER error at request time.

@waahm7 waahm7 marked this pull request as ready for review December 22, 2023 22:19
@waahm7 waahm7 changed the title WIP | Refactor GetObject request with Range Header to Skip HeadRequest - Part 3 Refactor GetObject request with Range Header to Skip HeadRequest - Part 3 Dec 22, 2023
Copy link
Contributor

@graebm graebm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

many trivial comments

@@ -53,6 +53,12 @@ struct aws_s3_auto_ranged_get {
} synced_data;

uint32_t initial_message_has_range_header : 1;

bool initial_message_has_start_range;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By putting these new variables between two other bit-fields, it prevents those bit-fields from getting packed together, and the only point of a bit-field is to get packed together and reduce the overall size of a struct.

IMHO, the bit-fields in aws-c-s3 aren't worth having because they make the code uglier (need to type != 0 instead of just treating them like a normal bool), and these structs are not a big contributor to memory usage. But if we're going to have bit-fields, we should get some benefit from them. Having bitfields that make your code ugly AND getting no benefit from them... is dumb.

I see how you just wanted to put related variables together though. Consider doing 1 of these:

  1. keep all bit-fields together at the end of the struct (the traditional place to put them)
  2. stop using bit-fields altogether in this struct (find/replace ugly code that does != 0)
  3. make this a bit-field (together with other bit-fields), but yeah you can't pass it by pointer so you'll need to make a temp variable bool initial_message_has_start_range_tmp and pass THAT to your function by pointer, and then set the bit-field to the tmp variable

source/s3_auto_ranged_get.c Outdated Show resolved Hide resolved
source/s3_auto_ranged_get.c Outdated Show resolved Hide resolved
Comment on lines 264 to 265
// TODO: Should we align the first part? I need to run some benchmarks before re-adding first
// part alignment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another benefit of aligning the first part is: it makes it more likely that we get checksums, due to aligning with UploadPart boundaries

but oof, yeah, it would suck to send 2 requests when we could have gotten away with 1

just throwing out thoughts, you've got more of this in your head right now than I do...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will never get the checksum of the first partial part request, even if we align it to part_size, so we can't validate the checksum. I observed no difference in performance between aligning the first request or not aligning it. Therefore, I have removed the logic for aligning the first part.

Comment on lines 703 to 706
object_range_end = object_size - 1;
if (auto_ranged_get->initial_message_has_end_range) {
object_range_end = aws_min_u64(object_range_end, auto_ranged_get->initial_object_range_end);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utterly trivial: reads better as if/else

Suggested change
object_range_end = object_size - 1;
if (auto_ranged_get->initial_message_has_end_range) {
object_range_end = aws_min_u64(object_range_end, auto_ranged_get->initial_object_range_end);
}
if (auto_ranged_get->initial_message_has_end_range) {
object_range_end = aws_min_u64(object_range_end, auto_ranged_get->initial_object_range_end);
} else {
object_range_end = object_size - 1;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated object_range_end is used in aws_min_u64. I have changed it to an if/else structure and updated that line to directly use object_size - 1.

source/s3_util.c Outdated Show resolved Hide resolved
source/s3_util.c Outdated Show resolved Hide resolved
tests/s3_util_tests.c Outdated Show resolved Hide resolved
source/s3_util.c Outdated Show resolved Hide resolved
Comment on lines 155 to 158
bool initial_message_has_start_range;
bool initial_message_has_end_range;
uint64_t initial_object_range_start;
uint64_t initial_object_range_end;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are the bools "initial_message" but the ints "initial_object"? Is there some subtle difference I'm not picking up on?

(personally, I would omit the "initial_X" prefix from all of them)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No difference; I just kept it similar to existing initial_message_has_range_header, object_range_start, and object_range_end. I agree that it is confusing, especially initial_object_*. I have updated it to the following:

  bool initial_message_has_start_range;
  bool initial_message_has_end_range;
  uint64_t initial_range_start;
  uint64_t initial_range_end;

I think range_start and range_end would be confusing alongside object_range_start and object_range_end, so I have kept the initial prefix since initial_range_end and object_range_end might differ. Plus, object_range_end is part of the synced block, which we don't need for initial_range_end.

@@ -1355,7 +1357,7 @@ static int s_s3_meta_request_incoming_body(
}

if (request->send_data.response_body.capacity == 0) {
if (request->has_part_size_response_body) {
if (request->has_part_size_response_body && successful_response) {
Copy link
Contributor Author

@waahm7 waahm7 Dec 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore the buffer ticket, as the error response might not fit into the reserved buffer.

@waahm7 waahm7 merged commit 879a01e into refactor-empty-file Dec 28, 2023
30 checks passed
@waahm7 waahm7 deleted the refactor-ranged-get-object branch December 28, 2023 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants