Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[auto-ranged PUT]: CompleteMultipartUpload called while 1 multipart upload is missing #269

Open
grrtrr opened this issue Mar 16, 2023 · 6 comments
Labels
bug Something isn't working p2 This is a standard priority issue

Comments

@grrtrr
Copy link
Contributor

grrtrr commented Mar 16, 2023

CompleteMultipartUpload in auto-ranged-PUT failed due to a missing second (and final) UploadPart.

We ran into this problem with aws-c-s3 0.1.51 and aws-sdk-cpp 1.10.54 on Linux

Problem description

Our API issued an S3CrtClient->PutObject, and it resulted in the following error:

Expected: 'x is ok', with x := 'output_blob_->close()' [av::status::Status]
  x = PutObject() failed
  where: cloud/aws/s3/s3_streambuf.cc:93
  extra: s3://perception-prod-training-data/opt/a831200c/s2a/2023-02-15-bless-collect_dking_updateOverlapFeb10_latestIssues/test/36c6f8923fe514d6b5a28ac5dbdea034.rats: HTTP response code: 400
Resolved remote host IP address: 
Request ID: 2BAQ8WRZGGH3PPG1
Exception name: InvalidPart
Error message: Unable to parse ExceptionName: InvalidPart Message: One or more of the specified parts could not be found.  The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.
7 response headers:
connection : close
content-type : application/xml
date : Wed, 15 Feb 2023 16:39:26 GMT
server : AmazonS3
transfer-encoding : chunked
x-amz-id-2 : oFLatf5OZVV1Ny34iYIEhqA3Ft/+QRXSLzz6/K/c36nd4grUqR6kmhKJS/U32GftV1/GdmVEbOY=
x-amz-request-id : 2BAQ8WRZGGH3PPG1

Backtrace (most recent call first)
#10	 at 0x560a5b826f76 in av::cloud::aws::s3::S3Ostreambuf::PutObject()
#9	 at 0x560a5b82b63d in av::cloud::aws::s3::S3Ostreambuf::close()
#8	 at 0x560a5b822170 in av::cloud::aws::s3::S3Ostream::close()
#7	 at 0x560a5af329b8 in av::perception::s2a::DataExtractionModule::on_shutdown()
#6	 at 0x560a5bf40593 in av::framework::Module::on_executor_shutdown()
#5	 at 0x560a5ac5ed91 in av::detail::FuncImpl<>::invoke()
#4	 at 0x560a5beacc8b in av::BaseThreadPool::do_next()
#3	 at 0x560a5beacecd in std::thread::_State_impl<>::_M_run()
#2	 at 0x7fc1887c76df in <?>
#1	 at 0x7fc188ca26db in start_thread

The PutObject in frame 10 invokes the S3CrtClient->PutObject call.

Further investigation showed that the first UploadPart succeeded (visible in list-parts, there was no evidence (neither list-parts nor API logs) that the second UploadPart completed.

In our logs, there was no further aws-c-s3 error indicating a failed operation.

Open Question

The CompleteMultipartUpload request uses the ETags of the completed requests, so how could the CompleteMultiPartUpload have been sent with the ETag of the second UploadPart?

Perhaps it was sent with only 1 ETag (that of the first, successfully completed UploadPart).

@waahm7
Copy link
Contributor

waahm7 commented Mar 22, 2023

Hi,

Thank you for reporting the issue. Could you please attach the full CRT logs and provide some reproduction steps? Thank you!

@grrtrr
Copy link
Contributor Author

grrtrr commented Mar 22, 2023

@waahm7 - the problem is not easily reproducible, it seems to be a timing-related problem. So far, after 10s of thousands of batch jobs, we have only data for this instance. (It is possible that this has happened more frequently in production, since batch jobs are retried on failure.).

With regard to the CRT logs, we log at WARN level to stderr; the only other log entry other than the above error trace was this one:

[ERROR] 2023-02-15 16:39:26.623 S3MetaRequest [140468667021056] id=0x7fc125951d00 Meta request cannot recover from error 14343 (Invalid response status from request). (request=0x7fc15e00a600, response status=400)
terminate called after throwing an instance of 'av::CheckException'
  what():  Check failure at perception/s2a/dataset_extraction/data_extraction_module.cc:574:
  Expected: 'x is ok', with x := 'output_blob_->close()' [av::status::Status]
  x = PutObject() failed
  where: cloud/aws/s3/s3_streambuf.cc:93
  extra: s3://perception-prod-training-data/opt/a831200c/s2a/2023-02-15-bless-collect_dking_updateOverlapFeb10_latestIssues/test/36c6f8923fe514d6b5a28ac5dbdea034.rats: HTTP response code: 400
Resolved remote host IP address: 
Request ID: 2BAQ8WRZGGH3PPG1
[... rest as above]

The following is the summary of a multi-week-long effort to narrow down the cause of the problem:

  1. The CompleteMultipartUpload that failed with above error had the following data:
opt/a831200c/s2a/2023-02-15-bless-collect_dking_updateOverlapFeb10_latestIssues/test/36c6f8923fe514d6b5a28ac5dbdea034.rats
Object Size: -1 Bytes
Part Count: 1-2
Multipart Upload ID: bSMw15PyZS8eNZgoSxZ.qdZw4Awfm4beQGf9ERvdc8izMN5yFd6quULmunaMpJ8gUh2ZAjjidDlVr97lOBlSwhssgFeb_Zsb5CHN5fPYX6qHLjyRm7uxJZtnPerZGhbwsBU5xNStVDsjlr.Qnhe18s387J.7VybyeA2H9yrUnfk-
  1. As per the Part Count field, it seemed to expect 2 parts. However, only one part was uploaded: the CompleteMultipartUpload with ID 2BAQ8WRZGGH3PPG1 was preceded by 1 UploadPart with ID 2BAR2P5XFDM71V14 on the same TCP connection. There was no trace of another UploadPart call.
  2. The s3 key of the failed multipart upload did not show up in list-multipart-uploads or in list-parts.

Since the CompleteMultipartUpload request requires adding the ETag values returned by the individual UploadPart responses into the XML request body, the cause may be that either

  • both UploadPart calls completed, but the second one failed after the API response returned the ETag (seems unlikely); or
  • due some kind of race condition or unsuitable timing, the CompleteMultipartUpload "thought" there was only 1 instead of 2 ETags/UploadParts present, and thus tried to complete the multipart upload with a single UploadPart.

@waahm7
Copy link
Contributor

waahm7 commented Mar 23, 2023

Thank you for the details. Is the request getting paused and resumed later?

@grrtrr
Copy link
Contributor Author

grrtrr commented Mar 23, 2023

No, it does not use aws_s3_meta_request_pause.

@yasminetalby yasminetalby added bug Something isn't working needs-review This issue or pull request needs review from a core team member. p2 This is a standard priority issue labels Jun 23, 2023
@jmklix
Copy link
Member

jmklix commented Nov 15, 2023

@grrtrr there have been some update since this issue was opened. Are you still running into this with the latest version of this repo?

@jmklix jmklix added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-review This issue or pull request needs review from a core team member. labels Nov 15, 2023
@grrtrr
Copy link
Contributor Author

grrtrr commented Nov 15, 2023

@jmklix what are the updates and how do they fix the issue described here? In particular, have test cases been added to ensure the condition does not happen?

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

4 participants