PVF: drop backing jobs if it is too late #5616
base: master
Conversation
Resolves #4632

The new logic optimizes the distribution of execution jobs for disputes, approvals, and backings. Testing shows improved finality lag and candidate checking times, especially under heavy network load.

### Approach

This update adds prioritization to the PVF execution queue. The logic partially implements the suggestions from #4632 (comment). We use thresholds to determine how much a current priority can "steal" from lower ones:

- Disputes: 70%
- Approvals: 80%
- Backing System Parachains: 100%
- Backing: 100%

A threshold indicates the portion of the remaining execution capacity that the current priority may claim from the lower ones. For example:

- Disputes take 70%, leaving 30% for approvals and all backings.
- 80% of the remainder goes to approvals, which is 30% * 80% = 24% of the original 100%.
- If we had instead fixed the shares as parts of the original 100%, approvals could never take more than 24%, even if there are no disputes.

(A minimal sketch of this allocation logic follows the description.)

Assuming a maximum of 12 executions per block, with a 6-second window, 2 CPU cores, and a 2-second run time, we get these distributions:

- With disputes: 8 disputes, 3 approvals, 1 backing
- Without disputes: 9 approvals, 3 backings

It's worth noting that when there are no disputes and only one backing job, we continue processing approvals regardless of their fulfillment status.

### Versi Testing 40/20

Testing showed a slight difference in finality lag and candidate checking time between this pull request and its base on the master branch. The more loaded the network, the greater the observed difference.

Testing parameters:

- 40 validators (4 malicious)
- 20 gluttons with 2 seconds of PVF execution time
- 6 VRF modulo samples
- 12 required approvals

![Pasted Graphic 3](https://github.com/user-attachments/assets/8b6163a4-a1c9-44c2-bdba-ce1ef4b1eba7)
![Pasted Graphic 4](https://github.com/user-attachments/assets/9f016647-7727-42e8-afe9-04f303e6c862)

### Versi Testing 80/40

For this test, we compared the master branch with the branch from #5616. The second branch is based on the current one but removes backing jobs that have exceeded their time limits. We excluded malicious nodes to reduce noise from disputing and banning validators.

The results show that, under the same load, nodes experience less finality lag and reduced recovery and check times. Even parachains are running with a shorter block time, although it remains over 6 seconds.

Testing parameters:

- 80 validators (0 malicious)
- 40 gluttons with 2 seconds of PVF execution time
- 6 VRF modulo samples
- 30 required approvals

![image](https://github.com/user-attachments/assets/42bcc845-9115-4ae3-9910-286b77a60bbf)

---------

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
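A minimal, self-contained sketch of the thresholding described above, assuming that a priority only consumes slots for jobs it actually has queued. The names (`Priority`, `steal_threshold`, `distribute`) are illustrative and do not mirror the actual queue implementation:

```rust
// Illustrative model of the threshold-based allocation, not the real queue code.
#[derive(Clone, Copy, Debug)]
enum Priority {
    Dispute,
    Approval,
    BackingSystemParas,
    Backing,
}

/// Fraction of the *remaining* capacity that a priority may take away
/// from the priorities below it (70% / 80% / 100% / 100%).
fn steal_threshold(priority: Priority) -> f64 {
    match priority {
        Priority::Dispute => 0.70,
        Priority::Approval => 0.80,
        Priority::BackingSystemParas => 1.00,
        Priority::Backing => 1.00,
    }
}

/// Splits `total_slots` execution slots between priorities, walking from the
/// highest priority down. A priority only consumes slots for jobs it actually
/// has queued; unused capacity flows down to the lower priorities.
fn distribute(total_slots: usize, queued: &[(Priority, usize)]) -> Vec<(Priority, usize)> {
    let mut remaining = total_slots;
    let mut shares = Vec::new();
    for &(priority, jobs) in queued {
        let allowed = (remaining as f64 * steal_threshold(priority)).floor() as usize;
        let take = allowed.min(jobs);
        shares.push((priority, take));
        remaining -= take;
    }
    shares
}

fn main() {
    use Priority::*;
    // 12 execution slots per block, as in the description above.
    // With pending disputes: 8 disputes, 3 approvals, 1 backing.
    println!("{:?}", distribute(12, &[(Dispute, 20), (Approval, 20), (Backing, 20)]));
    // Without disputes: 9 approvals, 3 backings.
    println!("{:?}", distribute(12, &[(Dispute, 0), (Approval, 20), (Backing, 20)]));
}
```

Because the thresholds apply to whatever capacity is still left over, an empty higher priority passes its full share downwards, which reproduces the 8/3/1 split with disputes and the 9/3 split without them.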
LGTM! Using block numbers rather than timestamps is more robust and simpler. However, it can still happen that the relay parent expires while the candidate is being executed, which is no better than using timestamps.
Added a few more suggestions to simplify the code.
```rust
let Ok(mode) = prospective_parachains_mode(sender, relay_parent).await else {
	return None;
};
```
We should assume that async backing is always enabled; if it is not, log an error.
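A hedged sketch of what that suggestion could look like, reusing the snippet above; the match arms, log target, and message are assumptions, not the final code:

```rust
// Sketch only: treat anything other than async backing being enabled as an
// error instead of silently skipping. LOG_TARGET and the message are placeholders.
let allowed_ancestry_len = match prospective_parachains_mode(sender, relay_parent).await {
	Ok(ProspectiveParachainsMode::Enabled { allowed_ancestry_len, .. }) => allowed_ancestry_len,
	_ => {
		gum::error!(target: LOG_TARGET, ?relay_parent, "Async backing is expected to be enabled");
		return None
	},
};
```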
```diff
@@ -512,6 +523,21 @@ where
	Some(processed_code_hashes)
}

async fn maybe_update_active_leaf(
	mut validation_backend: impl ValidationBackend,
```
In a future PR we can do even better: keep track of the last finalized block and then drop approval executions for inclusion relay chain blocks that are already finalized.
Looks very good! 👍 An arguable question is whether we really need both `PvfExecKind` and `Priority`, as their functions are so close, but let it be like that for the sake of distinguishing their purposes.
Looks good, thank you @AndreiEres.
Did we already burn this in on Kusama validators?
```diff
@@ -625,10 +625,18 @@ async fn request_candidate_validation(
	candidate_receipt: CandidateReceipt,
	pov: Arc<PoV>,
	executor_params: ExecutorParams,
	mode: ProspectiveParachainsMode,
```
We don't need to support sync backing mode, so we should always pass in the `allowed_ancestry_len` here.
Fixes #5530

This PR introduces the removal of backing jobs that have been back pressured for longer than `allowedAncestryLen`, as these candidates are no longer viable. It is reasonable to expect a result for a backing job execution within `allowedAncestryLen` blocks. Therefore, we set the job TTL as a relay block number and synchronize the validation host by sending activated leaves.
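A minimal, self-contained sketch of the TTL idea described above, under the assumption that stale jobs are pruned whenever a new active leaf is signalled to the validation host. The names (`BackingJob`, `prune_stale_backing_jobs`, `ALLOWED_ANCESTRY_LEN`) are illustrative and do not mirror the actual validation-host code:

```rust
// Illustrative model only: a backing job stamped with the relay block number
// at which it was enqueued, dropped once it has waited too long.
struct BackingJob {
	enqueued_at: u32, // relay chain block number
}

/// Relay-chain blocks a backing job may wait before it is considered stale.
/// In the PR this corresponds to `allowedAncestryLen` from the async backing
/// parameters; the constant here is just an example value.
const ALLOWED_ANCESTRY_LEN: u32 = 3;

/// Called whenever a new active leaf is sent to the validation host:
/// drop every backing job whose candidate can no longer be built upon.
fn prune_stale_backing_jobs(queue: &mut Vec<BackingJob>, current_block: u32) {
	queue.retain(|job| current_block.saturating_sub(job.enqueued_at) <= ALLOWED_ANCESTRY_LEN);
}

fn main() {
	let mut queue = vec![
		BackingJob { enqueued_at: 10 },
		BackingJob { enqueued_at: 14 },
	];
	// At block 15 the job from block 10 has waited longer than
	// ALLOWED_ANCESTRY_LEN blocks and is dropped; the one from block 14 stays.
	prune_stale_backing_jobs(&mut queue, 15);
	assert_eq!(queue.len(), 1);
}
```

Stamping jobs with a relay block number rather than a wall-clock timestamp keeps the expiry in step with the chain itself, which matches the reviewer's point above about block numbers being the more robust choice.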