Try to reproduce #19179 #19191

Open
wants to merge 1 commit into
base: main

Conversation

serathius
Member

@serathius serathius commented Jan 14, 2025

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


codecov bot commented Jan 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.78%. Comparing base (9eb85ee) to head (54813cf).
Report is 29 commits behind head on main.

Additional details and impacted files

see 26 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19191      +/-   ##
==========================================
+ Coverage   68.77%   68.78%   +0.01%     
==========================================
  Files         420      420              
  Lines       35641    35629      -12     
==========================================
- Hits        24511    24507       -4     
+ Misses       9707     9700       -7     
+ Partials     1423     1422       -1     


@serathius
Member Author

cc @fuweid any recommendation on how to reproduce #19179?

@fuweid
Member

fuweid commented Jan 15, 2025

Hi @serathius

I can't reproduce it locally. But I used etcd-dump-logs and found one thing in common across the first three failing pipeline runs:

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-amd64/1877585036438409216
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877364764741472256
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877466683589791744

There were only two kinds of requests before the compaction panic: creation and deletion.
Even though the Update type has a high weight in the random pick, there is no update among the first 400+ requests.
So the creation revision was deleted in the first compaction batch, before the panic. After restart, the etcd key index ignores keys whose only remaining revision is a tombstone. When etcd then resumes compaction, those keys are not in the index, so compaction deletes all of their tombstones. However, it should keep these tombstones.
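
To make the sequence above concrete, here is a toy model of it. This is only a sketch and not etcd's mvcc code: `backend`, `rebuildIndex`, `keepSet` and `compactBatch` are hypothetical names, and the rule "keep the newest revision at or below the compaction revision for each indexed key" is a simplifying assumption of the model.

```go
package main

import (
	"fmt"
	"sort"
)

// record is one stored revision of a key (hypothetical toy type).
type record struct {
	key       string
	tombstone bool
}

// backend maps revision -> record and stands in for the on-disk store.
type backend map[int64]record

// sortedRevs returns the stored revisions in ascending order.
func sortedRevs(b backend) []int64 {
	revs := make([]int64, 0, len(b))
	for rev := range b {
		revs = append(revs, rev)
	}
	sort.Slice(revs, func(i, j int) bool { return revs[i] < revs[j] })
	return revs
}

// rebuildIndex mimics restoring the in-memory index from the backend.
// Assumption taken from the comment: a key whose only surviving
// revision is a tombstone is ignored.
func rebuildIndex(b backend) map[string][]int64 {
	idx := map[string][]int64{}
	for _, rev := range sortedRevs(b) {
		idx[b[rev].key] = append(idx[b[rev].key], rev)
	}
	for key, revs := range idx {
		if len(revs) == 1 && b[revs[0]].tombstone {
			delete(idx, key)
		}
	}
	return idx
}

// keepSet returns the revisions compaction must retain: for every
// indexed key, the newest revision <= compactRev (toy rule).
func keepSet(idx map[string][]int64, compactRev int64) map[int64]bool {
	keep := map[int64]bool{}
	for _, revs := range idx {
		for i := len(revs) - 1; i >= 0; i-- {
			if revs[i] <= compactRev {
				keep[revs[i]] = true
				break
			}
		}
	}
	return keep
}

// compactBatch deletes at most batch revisions <= compactRev that are
// not in keep, and returns how many it deleted.
func compactBatch(b backend, keep map[int64]bool, compactRev int64, batch int) int {
	deleted := 0
	for _, rev := range sortedRevs(b) {
		if deleted == batch || rev > compactRev {
			break
		}
		if !keep[rev] {
			delete(b, rev)
			deleted++
		}
	}
	return deleted
}

// simulate reports whether the tombstone at revision 3 survives a
// two-batch compaction, optionally restarting between the batches.
func simulate(crash bool) bool {
	b := backend{2: {key: "a"}, 3: {key: "a", tombstone: true}}
	const compactRev = 3
	keep := keepSet(rebuildIndex(b), compactRev)
	compactBatch(b, keep, compactRev, 1) // first batch removes the creation revision
	if crash {
		// Restart: the index is rebuilt from what is left on disk.
		keep = keepSet(rebuildIndex(b), compactRev)
	}
	compactBatch(b, keep, compactRev, 1) // compaction resumes
	_, ok := b[3]
	return ok
}

func main() {
	fmt.Println("tombstone kept without crash:", simulate(false))
	fmt.Println("tombstone kept with crash:   ", simulate(true))
}
```

In this model the tombstone survives an uninterrupted compaction, but is lost when the index is rebuilt between the two batches, because the rebuilt index no longer knows the key.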

I am still investigating why there is no update before the compaction panic. Hope this information helps.

@serathius serathius changed the title Run Issue17780 10 times Reproduce #19179 Jan 16, 2025
@serathius serathius changed the title Reproduce #19179 Try to reproduce #19179 Jan 16, 2025
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@k8s-ci-robot

@serathius: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-verify 54813cf link true /test pull-etcd-verify


@serathius
Member Author

> There were only two kinds of requests before compaction panic. One is creation and other is deletion.

I think I can guess the reason. In the Kubernetes traffic, operations are chosen based on local state that is fed from watch. The traffic tries to balance the number of objects in storage to keep it around the average. If the watch was very delayed, there would be a long period where the traffic executes only creates, as it would not be aware of any objects. Once the watch caught up, the traffic would skip random operations and immediately go for deletions, as there would be too many objects in storage.

Still no progress on reproduction.

@k8s-ci-robot

PR needs rebase.



3 participants