CAP: 0063
Title: Parallelism-friendly Transaction Scheduling
Working Group:
Owner: dmkozh@stellar.org
Authors: Dmytro Kozhevin <@dmkozh>
Consulted: Siddharth Suresh <@sisuresh>, Nicolas Barry <@MonsieurNicolas>, Graydon Hoare <@graydon>
Status: Draft
Created: 2024-12-18
Discussion: https://github.com/stellar/stellar-protocol/discussions/1602
Protocol version: 23
This CAP defines a new structure for transaction sets that allows for applying Smart Contract transactions with multiple threads while maintaining bounded application time.
As specified in the Preamble.
Every node in the Stellar network has to apply the transactions to the current ledger state in order to produce the next block. Currently, transaction application happens in a single-threaded fashion, which results in CPU under-utilization on most modern systems. Utilizing the idle cores for applying the transactions provides a straightforward way of increasing the number of transactions that may be included into a single ledger block without compromising the time for closing that ledger.
Smart Contract transactions on Stellar were designed with parallelism support in mind (see CAP-0046). Specifically, every transaction has to declare the entries it may read and modify. In theory, this allows Core nodes to pick a set of Smart Contract transactions and come up with a parallel execution schedule that produces exactly the same result 'as if' the transaction set had been applied sequentially. However, there is currently nothing in the protocol that prevents validators from nominating 'parallel-unfriendly' transaction sets that can't be parallelized at all (e.g. when all transactions want to modify the same key). Thus, the time necessary for applying any given transaction set may vary wildly between `1 * x` and `T * x`, where `T` is the number of threads a validator may use.
This CAP aims to solve this issue and allow for parallelization with a guaranteed upper bound on the execution time of transaction sets (provided validators support at least the specified minimum number of physical threads).
This CAP is aligned with the following Stellar Network Goals:
- The Stellar Network should run at scale and at low cost to all participants of the network.
This CAP introduces a new format for the Smart Contract (Soroban) phase of the transaction set. The new format organizes the transactions in the following way:
- Transactions are split into a sequence of 1 or more 'stages' that have to be applied sequentially.
- Every stage consists of multiple clusters of transactions, such that every cluster is completely independent of every other cluster within the stage; thus all clusters within a stage may be executed in parallel.
- Every cluster consists of potentially dependent transactions, so in the general case these transactions have to be applied sequentially with respect to each other.
The CAP also adds a new network setting for controlling the maximum number of clusters per stage, which effectively regulates the maximum degree of parallelism supported by the network.
The transaction sets defined by this CAP must not exceed the ledger limits specified by the network configuration, including the new limit on the maximum number of parallel clusters. The validation process allows validators to execute the smart contract transactions using multiple threads while maintaining the upper bound on the total modelled instructions executed by every thread and thus capping the overall execution time.
In order to maintain the read-only TTL update semantics, the CAP also defines a new algorithm for updating the TTLs of the ledger entries, specifically when multiple transactions within the same ledger update the TTL of the same entry.
This patch of XDR changes is based on the XDR files in commit `a41b2db15ea34a9f9da5326b996bb8a7ceb5740f` of stellar-xdr.
```diff
diff --git a/Stellar-contract-config-setting.x b/Stellar-contract-config-setting.x
index 9f09c7b..a8d3325 100644
--- a/Stellar-contract-config-setting.x
+++ b/Stellar-contract-config-setting.x
@@ -23,6 +23,16 @@ struct ConfigSettingContractComputeV0
uint32 txMemoryLimit;
};
+// Settings for running the contract transactions in parallel.
+struct ConfigSettingContractParallelComputeV0
+{
+ // Maximum number of clusters with dependent transactions allowed in a
+ // stage of parallel tx set component.
+ // This effectively sets the lower bound on the number of physical threads
+ // necessary to effectively apply transaction sets in parallel.
+ uint32 ledgerMaxDependentTxClusters;
+};
+
// Ledger access settings for contracts.
struct ConfigSettingContractLedgerCostV0
{
@@ -302,7 +312,8 @@ enum ConfigSettingID
CONFIG_SETTING_STATE_ARCHIVAL = 10,
CONFIG_SETTING_CONTRACT_EXECUTION_LANES = 11,
CONFIG_SETTING_BUCKETLIST_SIZE_WINDOW = 12,
- CONFIG_SETTING_EVICTION_ITERATOR = 13
+ CONFIG_SETTING_EVICTION_ITERATOR = 13,
+ CONFIG_SETTING_CONTRACT_PARALLEL_COMPUTE_V0 = 14
};
union ConfigSettingEntry switch (ConfigSettingID configSettingID)
@@ -335,5 +346,7 @@ case CONFIG_SETTING_BUCKETLIST_SIZE_WINDOW:
uint64 bucketListSizeWindow<>;
case CONFIG_SETTING_EVICTION_ITERATOR:
EvictionIterator evictionIterator;
+case CONFIG_SETTING_CONTRACT_PARALLEL_COMPUTE_V0:
+ ConfigSettingContractParallelComputeV0 contractParallelCompute;
};
}
diff --git a/Stellar-ledger.x b/Stellar-ledger.x
index 0fc03e2..77db46c 100644
--- a/Stellar-ledger.x
+++ b/Stellar-ledger.x
@@ -164,6 +164,33 @@ enum TxSetComponentType
TXSET_COMP_TXS_MAYBE_DISCOUNTED_FEE = 0
};
+// A collection of transactions that *may* have arbitrary read-write data
+// dependencies between each other, i.e. in a general case the transaction
+// execution order within a cluster may not be arbitrarily shuffled without
+// affecting the end result.
+typedef TransactionEnvelope DependentTxCluster<>;
+// A collection of clusters that are *guaranteed* to not have read-write
+// data dependencies in-between clusters, i.e. such that the cluster execution
+// order can be arbitrarily shuffled without affecting the end result. Thus
+// clusters can be executed in parallel with respect to each other.
+typedef DependentTxCluster ParallelTxExecutionStage<>;
+
+// Transaction set component that contains transactions organized in a
+// parallelism-friendly fashion.
+//
+// The component consists of several stages that have to be executed in
+// sequential order, each stage consists of several clusters that can be
+// executed in parallel, and the cluster itself consists of several
+// transactions that have to be executed in sequential order in a general case.
+struct ParallelTxsComponent
+{
+ int64* baseFee;
+ // A sequence of stages that *may* have arbitrary data dependencies between
+ // each other, i.e. in a general case the stage execution order may not be
+ // arbitrarily shuffled without affecting the end result.
+ ParallelTxExecutionStage executionStages<>;
+};
+
union TxSetComponent switch (TxSetComponentType type)
{
case TXSET_COMP_TXS_MAYBE_DISCOUNTED_FEE:
@@ -178,6 +205,8 @@ union TransactionPhase switch (int v)
{
case 0:
TxSetComponent v0Components<>;
+case 1:
+ ParallelTxsComponent parallelTxsComponent;
};
// Transaction sets are the unit used by SCP to decide on transitions
--
```
A new network configuration setting is introduced in order to regulate the maximum potential degree of parallelism supported by the network. The setting is `ledgerMaxDependentTxClusters` and it is introduced in a new `ConfigSettingContractParallelComputeV0` setting struct with the `CONFIG_SETTING_CONTRACT_PARALLEL_COMPUTE_V0` `ConfigSettingID`. During the protocol upgrade to version 23, the respective configuration setting ledger entry will be created and `ledgerMaxDependentTxClusters` will be set to 1 (no parallelism).

The setting is used for the validation of transaction sets and its exact semantics are described in the following sections.
The Soroban phase (phase 1 of the transaction set) has to contain a single component with version 1 (`ParallelTxsComponent`). The phase still has to contain only Soroban transactions and only a single Soroban phase is allowed; there is no protocol change around that.

`parallelTxsComponent` contains an optional `baseFee` that represents the discounted base fee to be used as the base fee for every transaction. If set, `baseFee` must not be greater than the base fee of any transaction in the component.

`parallelTxsComponent` must consist of zero or more `executionStages`. Every stage in `executionStages` must contain at least one non-empty `DependentTxCluster`. Thus, if a transaction set contains no Soroban transactions, it must contain a `parallelTxsComponent` with 0 `executionStages`.
The effective fee computation works in the same fashion as for the `TxSetComponent` component (described in CAP-0042). The effective fee for a given transaction in `ParallelTxsComponent` is computed in the following way:

- If `baseFee` is empty, then the transaction's effective fee is its fee bid.
- If `baseFee` is set, then the transaction's effective fee is set to the value of `baseFee`.
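To make the rule concrete, here is a minimal sketch (the `Transaction` type and its field name are hypothetical, not actual stellar-core types):

```rust
// Hypothetical transaction view; only the fee bid matters here.
struct Transaction {
    fee_bid: i64,
}

// Effective fee rule from above: the discounted `baseFee`, when present,
// overrides the bid; otherwise the transaction pays its own fee bid.
fn effective_fee(tx: &Transaction, base_fee: Option<i64>) -> i64 {
    base_fee.unwrap_or(tx.fee_bid)
}
```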
As per the already existing specification, every phase in a transaction set has to be valid in order for the set to get applied to the ledger. Beyond the basic XDR representation format described above, the full validation specification for the parallel Soroban phase is defined as follows:

- When set, `baseFee` must not be higher than the base fee of any transaction in the phase
- Every `DependentTxCluster` has to have its transactions sorted by their SHA-256 hashes in increasing order
- Every stage in `executionStages` must have its clusters sorted by the SHA-256 hash of the first transaction in the cluster (recall that empty clusters are not allowed)
- Stages must be sorted by the SHA-256 hash of the first transaction in the first cluster of the stage (recall that empty stages are not allowed)
- The number of clusters per stage must not exceed the value of the `ledgerMaxDependentTxClusters` network configuration setting
- Within a stage, footprint conflicts between the dependent transaction clusters are not allowed. The footprint conflict between two transactions is defined as follows: if a transaction A has a ledger key in its read-write footprint, and another transaction B has the same ledger key in its footprint (either read-only or read-write), then they are conflicting. A pair of clusters is considered to have a footprint conflict if any pair of transactions A from the first cluster and B from the second cluster have a conflict.
- The sum of 'sequential' instructions for the phase must not exceed the value of the `ledgerMaxInstructions` network configuration setting. 'Sequential' instructions are defined as follows (see the sketch after this list):
  - 'Sequential' instructions per cluster are defined as the sum of `sorobanData.resources.instructions` across all the transactions in the cluster
  - 'Sequential' instructions per stage are defined as the max of 'sequential' instructions across its clusters
  - 'Sequential' instructions for the phase are defined as the sum of 'sequential' instructions across all the stages in the phase
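The following sketch illustrates the footprint conflict check and the 'sequential' instructions computation; the types are hypothetical stand-ins for the XDR structures:

```rust
use std::collections::HashSet;

#[derive(Clone, PartialEq, Eq, Hash)]
struct LedgerKey(Vec<u8>);

struct Tx {
    read_only: HashSet<LedgerKey>,
    read_write: HashSet<LedgerKey>,
    instructions: i64, // sorobanData.resources.instructions
}

// Two transactions conflict when one's read-write footprint shares a key
// with any part of the other's footprint.
fn conflicts(a: &Tx, b: &Tx) -> bool {
    a.read_write
        .iter()
        .any(|k| b.read_only.contains(k) || b.read_write.contains(k))
        || b.read_write.iter().any(|k| a.read_only.contains(k))
}

type Cluster = Vec<Tx>;
type Stage = Vec<Cluster>;

// 'Sequential' instructions: sum within a cluster (applied sequentially),
// max across a stage's clusters (applied in parallel), sum across stages
// (applied sequentially).
fn sequential_instructions(stages: &[Stage]) -> i64 {
    stages
        .iter()
        .map(|stage| {
            stage
                .iter()
                .map(|cluster| cluster.iter().map(|tx| tx.instructions).sum::<i64>())
                .max()
                .unwrap_or(0)
        })
        .sum()
}
```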
Even though some of the transactions in the new phase are data-independent by design, there is still a protocol-defined order of application of the transactions. That is, when the transactions are applied, the results must be 'as if' they had been applied in that order. The application order is defined as follows:

- We define the transaction 'application comparison key' as `sha256(transaction_envelope_xdr) ^ sha256(transaction_set_xdr)`, where `^` is the bit-XOR operation. Note that this is the transaction application order key used by the current protocol as well.
- First, transactions in every cluster are sorted for apply using the comparison key
- Then clusters in every stage are sorted using the comparison key of the first transaction in the sorted cluster
- Then all the stages in the phase are sorted using the comparison key of the first transaction of the first sorted cluster in the sorted stage
- The application order is defined as the order in which transactions are visited while iterating the stages in the sort order defined above, i.e. it can be defined by the following pseudo-code:
```
for each stage in sorted_stages:
    for each sorted_cluster in stage:
        for each transaction in sorted_cluster:
            application_order.append(transaction)
```
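For concreteness, a sketch of the comparison key computation (using the `sha2` crate; XDR serialization of the inputs is elided):

```rust
use sha2::{Digest, Sha256};

// Application comparison key: bit-XOR of the transaction envelope hash and
// the transaction set hash, as defined above.
fn apply_order_key(tx_envelope_xdr: &[u8], tx_set_xdr: &[u8]) -> [u8; 32] {
    let tx_hash: [u8; 32] = Sha256::digest(tx_envelope_xdr).into();
    let set_hash: [u8; 32] = Sha256::digest(tx_set_xdr).into();
    let mut key = [0u8; 32];
    for i in 0..32 {
        key[i] = tx_hash[i] ^ set_hash[i];
    }
    key
}
```

Transactions within a cluster are then sorted by this key, clusters by the key of their first sorted transaction, and stages by the key of the first transaction of their first sorted cluster.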
During the application, the fees must be withdrawn and the sequence numbers must be bumped sequentially, in the application order defined above and before performing any operations inside the transactions. This is consistent with the current protocol behavior.

Then all the operations must be applied, and the changes in non-TTL ledger entries must be 'as if' the operations from every transaction had been applied sequentially in that order. The semantics for processing the TTL changes and computing the rent fees are changed with this CAP. The changes are described in the following section.
Currently, the changes to an entry's TTL are immediately applied to the ledger state, in the same way as changes to any other ledger entry. However, the protocol allows increasing the TTL even for the entries that have been declared in the footprint as read-only. Thus, if we wanted to preserve the current behavior, all the read-only footprint entries would potentially be read-write, and only transactions with non-overlapping footprints could belong to different clusters. This is not optimal for parallelization, given that common entries such as Stellar Asset Contract instances and some popular Wasms can be referenced by multiple transactions, but almost always in a read-only fashion.

Thus, this CAP proposes a new way of applying the TTL changes that makes the TTL changes for read-only entries commutative. The proposed TTL update algorithm ensures that the TTL changes are only observable in the following cases:
- When the corresponding entry is being modified in the transaction (i.e. is a part of the read-write footprint)
- After executing all the transactions
The detailed new algorithm is as follows (a code sketch follows the list):

- After fees have been withdrawn, but before applying any operations:
  - Snapshot the initial `liveUntilLedger` of every entry within the stage's footprint and store it in a 'current TTL map' keyed by the TTL entry keys
  - Also prepare an empty 'read-only update map' that will contain the resulting `liveUntilLedger` values for read-only entries
- When applying a Soroban operation:
  - Observe the relevant TTL changes thus far. For every key in the read-write footprint:
    - If the 'read-only update map' contains the key, set the value in the 'current TTL map' to the value from the 'read-only update map'
      - Note that this can't decrease the current TTL of the entry
    - Remove the key from the 'read-only update map'
  - Before applying the actual operation logic:
    - When reading the initial state of the entries, use the 'current TTL map' to determine the `liveUntilLedger` of every entry in the footprint
  - If an operation succeeds and the ledger changes caused by the operation must be materialized:
    - For every key in the read-only footprint that had its `liveUntilLedger` updated:
      - Set the value corresponding to the key in the 'read-only update map' to `max(updated_liveUntilLedger, existing_value_liveUntilLedger)` (if the key is missing from the map, just use `updated_liveUntilLedger`)
    - For every key in the read-write footprint (unconditionally):
      - If the entry still exists, set the value corresponding to the key in the 'current TTL map' to the new `liveUntilLedger` of the entry
      - If the entry was deleted, also remove the key from the 'current TTL map'
  - The effective rent fee for the operation is still computed based on the initial state of the entry at operation start and the end state of the entry
    - Notice, however, that the initial TTL of the entry itself is defined differently now
- After applying all the Soroban operations in the phase:
  - Iterate over all the remaining keys in the 'read-only update map' and update the `liveUntilLedger` for the corresponding entries with the value from the map
    - Note that this can't decrease the current TTL of the entry
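A minimal sketch of the bookkeeping above (types and names are hypothetical; persistence and fee logic are elided):

```rust
use std::collections::HashMap;

type TtlKey = [u8; 32];

struct TtlState {
    // 'current TTL map': the liveUntilLedger values observable by operations.
    current: HashMap<TtlKey, u32>,
    // 'read-only update map': accumulated read-only TTL extensions.
    read_only_updates: HashMap<TtlKey, u32>,
}

impl TtlState {
    // Before applying an operation: fold the accumulated read-only
    // extensions into the current map for every read-write key.
    fn observe_rw_keys(&mut self, rw_footprint: &[TtlKey]) {
        for key in rw_footprint {
            if let Some(ttl) = self.read_only_updates.remove(key) {
                // This can only increase the current TTL.
                self.current.insert(*key, ttl);
            }
        }
    }

    // After a successful operation: accumulate a read-only TTL extension.
    fn record_ro_extension(&mut self, key: TtlKey, new_live_until: u32) {
        let e = self.read_only_updates.entry(key).or_insert(new_live_until);
        *e = (*e).max(new_live_until);
    }

    // After a successful operation: materialize a read-write TTL change
    // (None means the entry was deleted).
    fn record_rw_change(&mut self, key: TtlKey, new_live_until: Option<u32>) {
        match new_live_until {
            Some(ttl) => {
                self.current.insert(key, ttl);
            }
            None => {
                self.current.remove(&key);
            }
        }
    }

    // After all operations in the phase: flush the remaining read-only
    // extensions; this never decreases a TTL.
    fn finalize(mut self) -> HashMap<TtlKey, u32> {
        for (key, ttl) in self.read_only_updates {
            let e = self.current.entry(key).or_insert(ttl);
            *e = (*e).max(ttl);
        }
        self.current
    }
}
```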
Notice that, while the algorithm is defined in terms of a single mutable 'current TTL map', in an implementation it can actually be separated into two parts: a fully immutable part for the entries that are always read-only within a stage, and a mutable part that may only be modified while applying the transactions of a single cluster. Thus, when applying transactions in parallel, no locking is necessary to update the maps during operation application. Entries might move between the maps only in-between the stages.
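A sketch of that split under the same hypothetical types (the merge of per-cluster maps in-between stages is elided):

```rust
use std::collections::HashMap;
use std::sync::Arc;

type TtlKey = [u8; 32];

// Shared immutably by every cluster thread in the stage: TTL snapshot for
// the keys that are read-only in every transaction of the stage.
struct StageTtls {
    read_only_snapshot: Arc<HashMap<TtlKey, u32>>,
}

// Owned by the single thread applying one cluster, so no locking is needed.
struct ClusterTtls {
    // Keys that appear in some read-write footprint within this cluster.
    read_write: HashMap<TtlKey, u32>,
    // Read-only extensions accumulated by this cluster's transactions;
    // merged (via max) with the other clusters' maps in-between stages.
    read_only_updates: HashMap<TtlKey, u32>,
}

// The read-write state shadows the immutable snapshot.
fn current_ttl(stage: &StageTtls, cluster: &ClusterTtls, key: &TtlKey) -> Option<u32> {
    cluster
        .read_write
        .get(key)
        .or_else(|| stage.read_only_snapshot.get(key))
        .copied()
}
```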
Transaction meta for every Soroban transaction will contain a 'correct' TTL change from the perspective of that transaction. That means that the initial entry TTLs are what the transaction has observed before execution, the final TTLs are what the transaction has updated them to, and the rent fee corresponds to the TTL changes. However, if a meta consumer wants to track the TTL of an entry after applying all the transactions in the ledger, they will need to never decrease the TTL of the entries they track. For example:
- Transaction 1 extends the TTL for entry E by 1000 ledgers and sets its `liveUntilLedger` to `currentLedgerSeq + 1000`
- Transaction 2 extends the TTL for entry E by 500 ledgers and sets its `liveUntilLedger` to `currentLedgerSeq + 500`
- The final `liveUntilLedger` for E is `currentLedgerSeq + 1000`
The only way the TTL of an entry can ever go down after applying a ledger is if the entry has been deleted by one transaction and then recreated by another transaction. Thus, the conceptual algorithm for tracking entry TTLs using the transaction meta is as follows (a sketch follows the list):

- Maintain a map from TTL keys to the corresponding `liveUntilLedger` values
- For every transaction in apply order:
  - For every TTL entry update:
    - If the TTL has been updated, set the value corresponding to the TTL key to `max(updated_liveUntilLedger, existing_value_liveUntilLedger)` (if the key is missing from the map, just use `updated_liveUntilLedger`)
    - If a TTL entry has been removed, remove the corresponding key from the map
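A sketch of this consumer-side tracking (the `TtlChange` view over the meta is hypothetical):

```rust
use std::collections::HashMap;

enum TtlChange {
    Updated { key: [u8; 32], live_until_ledger: u32 },
    Removed { key: [u8; 32] },
}

// `tx_metas` holds the TTL changes of every transaction, in apply order.
fn track_ttls(tx_metas: &[Vec<TtlChange>]) -> HashMap<[u8; 32], u32> {
    let mut ttls: HashMap<[u8; 32], u32> = HashMap::new();
    for changes in tx_metas {
        for change in changes {
            match change {
                // Never decrease a tracked TTL: another transaction may
                // report a smaller read-only extension for the same entry.
                TtlChange::Updated { key, live_until_ledger } => {
                    let e = ttls.entry(*key).or_insert(*live_until_ledger);
                    *e = (*e).max(*live_until_ledger);
                }
                // Deletion is the only way a TTL can go away.
                TtlChange::Removed { key } => {
                    ttls.remove(key);
                }
            }
        }
    }
    ttls
}
```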
The protocol only defines the application order; it cannot strictly define how to apply the operations, as long as the end result is equivalent. However, the phase structure and validation rules strongly suggest parallelization between the dependent transaction clusters (see the sketch after this list). Specifically:

- Within a single stage, every cluster is completely data-independent of every other cluster, and thus every cluster may be applied by a separate thread
- After every thread in the stage has succeeded, the changed ledger entries should be passed to the following stage, to be applied in parallel as well
- While the protocol guarantees that there are no data conflicts between the clusters, it's also possible that clusters contain 'sub-clusters' that are independent of each other, so implementations may use an even higher degree of parallelism than the protocol mandates. This, however, comes with no protocol guarantees in terms of the degree of parallelization
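A sketch of the suggested schedule (`Cluster`, `LedgerChanges` and `apply_cluster` are hypothetical; committing the changes between stages is elided):

```rust
use std::thread;

struct Cluster;
struct LedgerChanges;

// Applies the cluster's transactions sequentially, in apply order.
fn apply_cluster(cluster: &Cluster) -> LedgerChanges {
    LedgerChanges
}

fn apply_phase(stages: &[Vec<Cluster>]) {
    for stage in stages {
        // Clusters within a stage are data-independent, so each one can be
        // applied by its own thread.
        let changes: Vec<LedgerChanges> = thread::scope(|s| {
            let handles: Vec<_> = stage
                .iter()
                .map(|cluster| s.spawn(move || apply_cluster(cluster)))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });
        // The end of the scope acts as a barrier: the next stage starts only
        // after every cluster of this stage has been fully applied, with the
        // resulting ledger changes made visible to it.
        let _ = changes;
    }
}
```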
Just like today, this CAP does not specify the exact algorithm used to produce a valid value (transaction set), so as to make it easier for implementations to compete on the quality of transaction sets.
Here is a sketch of a possible greedy algorithm that produces relatively efficient Soroban phase contents:

- Define a small number of stages S (1-4) to generate
- For every stage, maintain a set of the dependent transaction clusters (initially empty)
- For every transaction, in the order of descending inclusion fee:
  - For every stage, try to fit the transaction into the stage:
    - Determine the potential new contents of the dependent transaction clusters:
      - A transaction that doesn't conflict with any other transaction forms a new, single-transaction cluster
      - Otherwise, merge the new transaction and all the clusters that have at least one conflict with the new transaction into a new cluster
    - Validate the potential new clusters: verify that it's possible to pack these clusters into `ledgerMaxDependentTxClusters` bins such that in every bin the total amount of instructions across all the transactions doesn't exceed `ledgerMaxInstructions / S` instructions (see the packing sketch after this list)
    - If the new clusters are valid, store them in the stage and consider the transaction added; otherwise, roll them back to the previous state
  - If a transaction doesn't fit into any stage, postpone it to the next ledger and trigger surge pricing
- The output phase consists of the final packings of the clusters of every stage into `ledgerMaxDependentTxClusters` bins
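The packing check itself can use any bin-packing heuristic; a first-fit-decreasing sketch:

```rust
// Can clusters with the given instruction counts be packed into at most
// `max_clusters` bins of `bin_capacity` (ledgerMaxInstructions / S)
// instructions each? First-fit-decreasing is one possible heuristic; the
// protocol does not prescribe one.
fn clusters_fit(cluster_instructions: &[i64], max_clusters: usize, bin_capacity: i64) -> bool {
    let mut sorted = cluster_instructions.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending
    let mut bins: Vec<i64> = Vec::new();
    for insns in sorted {
        if insns > bin_capacity {
            return false;
        }
        // Place the cluster into the first bin that still has capacity.
        match bins.iter_mut().find(|used| **used + insns <= bin_capacity) {
            Some(used) => *used += insns,
            None => {
                if bins.len() == max_clusters {
                    return false;
                }
                bins.push(insns);
            }
        }
    }
    true
}
```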
This proposal tries to strike a balance between parallelization efficiency, nomination flexibility, and simplicity of transaction set validation and application.
Parallelization efficiency:
- Sequential stages provide a simple scheduling solution for parallelizing the sets of transactions with a moderate amount of conflicts:
- For example, if a single 'hot' entry that multiple transactions read (e.g. an instance entry of a popular token contract) is updated, then the update can be scheduled in one stage, and all the reads may be efficiently parallelized in the remaining stages.
- On the other hand, the small, limited number of stages minimizes the amount of synchronization necessary: only a couple of barriers are needed
- Application time is limited in terms of modelled instructions, so the real apply time should be relatively predictable
- The TTL reconciliation logic allows parallelizing TTL changes, which are prevalent for otherwise read-only entries
Flexibility:
- The number and size of the stages is not fixed by the protocol, so it's possible to innovate on the nomination algorithms without protocol changes
- The maximum number of clusters is a network setting, so increasing the maximum parallelization degree supported by the network is a change that only requires a single vote and no code changes
- The protocol is also not prescriptive about the exact application logic
Simplicity:
- Storing the dependency clusters and stages inside transaction sets makes the validation fast and straightforward
- A minimal number of new settings is necessary
The changed TTL update logic also includes a slight change in TTL update semantics when several transactions try to update the TTL for the same key.

Currently, if N transactions want to increase the TTL of entry E by L ledgers, only the first (arbitrary) one will actually perform the increase and have the rent fee withdrawn.

With the changed algorithm (considering that the entry is always read-only), the entry TTL will be increased by N * L ledgers and the respective rent fees will be withdrawn from every transaction. The only real alternative to that is to increase the TTL by just L ledgers but still withdraw the fees, so the approach taken makes the fee consumption more fair.

Maintaining the current behavior is not an option in the general case, because a read-only key might have its TTL increased by transactions in different clusters, and thus the transactions can't observe the TTL changes made by each other.

Expanding the accumulation logic even to the transactions that must be applied sequentially both makes the logic more consistent and allows for further parallelization within a cluster where otherwise possible, i.e. with this logic read-only keys can be safely considered truly read-only no matter the execution schedule.
As soon as this CAP becomes active, validators will produce the new (v1) format of the Soroban phase in the transaction sets nominated in SCP.

The configuration entry of type `CONFIG_SETTING_CONTRACT_PARALLEL_COMPUTE_V0` will be created with `ledgerMaxDependentTxClusters` set to the initial value of 1 during the protocol transition.

This way, the only immediate change realized in the protocol will be the TTL update semantics change. Any protocol-supported parallel execution logic won't be active until `ledgerMaxDependentTxClusters` is increased by a separate SLP and a validator vote.
As transaction sets are transmitted over the network, the overlay network will have to be updated to support the new format.
The XDR changes introduced by the CAP should have minimal ecosystem effect, since only the internal SCP data structures are changed and these are not typically observed outside of the consensus protocol.
As the `ledgerMaxDependentTxClusters` setting grows, the demands on the hardware running Stellar Core will grow as well; in particular, multi-core processors will become necessary for nodes to stay in sync with the network in case most of the network bandwidth is being used. Specifically, at least `ledgerMaxDependentTxClusters + 1` cores would be necessary.
The changes on the security front are minimal as transaction semantics are not changed.
As usual, the validators have some degree of control over the contents of the transaction sets, especially in terms of simply censoring the transactions that are being included. However, this CAP doesn't meaningfully affect this risk in either direction.
Improvements to the mem-pooling, nomination and application logic can be implemented as a follow-up to the initial CAP implementation. These are not protocol changes and may be a part of any Core release (before or after the protocol upgrade).
TBD
TBD