CAP: 0065
Title: Reusable module cache
Working Group:
Owner: Graydon Hoare <@graydon>
Authors: Graydon Hoare <@graydon>
Consulted: Garand Tyson <@SirTyson>, Siddharth Suresh <@sisuresh>, Dmytro Kozhevin <@dmkozh>, Nicolas Barry <@MonsieurNicolas>
Status: Draft
Created: 2025-01-13
Discussion: https://github.com/stellar/stellar-protocol/discussions/1615
Protocol version: 23
Maintain a fully-populated cache of all live WASM modules -- parsed, validated and translated -- at all times, across all ledgers, eliminating parsing and validation costs from the WASM VM instantiation process.
As specified in the Preamble.
The existing protocol caches ready-to-run (parsed, validated and translated) WASM modules only within a single transaction; the cache is reset after each transaction. This means that any module that is used in multiple transactions in a transaction set, or multiple transaction sets over multiple ledgers, is repeatedly (and wastefully) re-parsed, re-validated and re-translated to its executable form.
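The repeated work can be illustrated with a toy count. The sketch below is only a simplification: the function name, the use of short strings as module hashes, and the sample workload are all hypothetical, and it models only "clearing" versus "not clearing" the cache, not the pre-population proposed by this CAP.

```python
def count_translations(transactions, reset_per_transaction):
    """Count parse/validate/translate operations for a list of transactions.

    `transactions` is a list of lists of module hashes invoked by each
    transaction (a hypothetical stand-in for real contract invocations).
    """
    cache = set()
    translations = 0
    for invoked in transactions:
        if reset_per_transaction:
            cache.clear()  # existing protocol: cache lives for one transaction
        for module_hash in invoked:
            if module_hash not in cache:
                translations += 1  # parse + validate + translate
                cache.add(module_hash)
    return translations


# Hypothetical workload: three transactions invoking modules "a" and "b".
txs = [["a", "b"], ["a"], ["a", "b", "a"]]
per_tx = count_translations(txs, reset_per_transaction=True)
reused = count_translations(txs, reset_per_transaction=False)
print(per_tx, reused)
```

With a per-transaction cache, module "a" is translated once in each of the three transactions; with a cache that is never cleared, it is translated once overall.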
This change is aligned with the goal of lowering the cost and increasing the scale of the network.
It would be ideal to cache modules in memory across transactions or even across ledgers. However, until recently there has been no consensus on how to attribute and charge fees to cover the cost of populating a longer-lived cache. The design of the module cache has therefore been limited to operating within a single transaction (with a clearly defined fee, and account that pays the fee).
With the protocol changes proposed in CAP-0062 and CAP-0066 the soroban state of the ledger is effectively partitioned into:
- a smaller, size-bounded, live (directly-usable), in-memory component (the "live bucketlist")
- a larger, unbounded, archived (not-directly-usable), on-disk component (the "hot archive bucketlist")
This is part of the longer-term state archival plan, and users can control the movement of ledger entries between the live and archived components by means of TTLs and restore operations.
Because of this new organization, a new caching strategy is possible: the module cache can restrict its population to contracts in the live, in-memory component of the ledger. The cost of populating the cache is then only charged when uploading or restoring ContractCodeEntry ledger entries to the in-memory component.
Moreover this cost may even be subsumed entirely into the IO resource fees associated with the upload or restore operation; correct calibration will be required to determine if a residual cost has to be charged for the parsing, validation and translation, separate from the IO costs.
This change depends on the adoption of CAP-0062 and CAP-0066.
The API between stellar-core and the soroban host is modified to allow the module cache to be reused across multiple soroban hosts.
When stellar-core starts up, it parses, validates and translates all WASM modules (each stored in a ContractCodeEntry ledger entry) in the live bucketlist and stores them in a single in-memory, reusable module cache.
When a ContractCodeEntry is uploaded or restored from the hot archive bucketlist, it is added to the reusable module cache after stellar-core finishes processing all transactions in the ledger containing the upload or restore operations.
When a ContractCodeEntry is evicted from the live bucketlist to the hot archive bucketlist, it is removed from the reusable module cache, again after all transactions in the associated ledger are processed.
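The lifecycle described above can be sketched as a small model. This is a minimal illustration under stated assumptions, not the stellar-core or soroban host API: the class, its method names, and the `translate` stand-in are all hypothetical. The point it shows is that cache changes are deferred until every transaction in the ledger has been processed.

```python
def translate(wasm):
    """Hypothetical stand-in for parsing, validating and translating a module."""
    return f"module({len(wasm)} bytes)"


class ReusableModuleCache:
    """Toy model of the reusable module cache (illustrative names only)."""

    def __init__(self, live_code_entries):
        # Startup: translate every ContractCodeEntry in the live bucketlist.
        self.modules = {h: translate(w) for h, w in live_code_entries.items()}
        self.pending_adds = {}
        self.pending_removes = set()

    def is_cached(self, code_hash):
        return code_hash in self.modules

    def note_upload_or_restore(self, code_hash, wasm):
        # Recorded during transaction execution; not yet visible in the cache.
        self.pending_adds[code_hash] = wasm

    def note_eviction(self, code_hash):
        self.pending_removes.add(code_hash)

    def finish_ledger(self):
        # Applied only after all transactions in the ledger are processed.
        for code_hash in self.pending_removes:
            self.modules.pop(code_hash, None)
        for code_hash, wasm in self.pending_adds.items():
            self.modules[code_hash] = translate(wasm)
        self.pending_adds.clear()
        self.pending_removes.clear()


cache = ReusableModuleCache({"h1": b"\x00asm-old"})
cache.note_upload_or_restore("h2", b"\x00asm-new")
cache.note_eviction("h1")
during_ledger = (cache.is_cached("h1"), cache.is_cached("h2"))
cache.finish_ledger()
after_ledger = (cache.is_cached("h1"), cache.is_cached("h2"))
print(during_ledger, after_ledger)
```

In this model, a module uploaded in a ledger is not in the cache while that ledger's transactions are still executing; it appears only once the ledger has finished.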
No other changes to the protocol occur. Users should observe no behavioural differences, just lower fees due to the elimination of parsing, validation and translation costs during transaction execution. The "cache" will have a 100% hit rate since transactions can only invoke contracts currently in the live bucketlist. The cache therefore no longer behaves like a cache so much as simply a ready-to-run representation of the content of the live bucketlist.
Uploads will still be charged the cost of a full parse of the contract, using the coarse worst-case cost model reserved for unknown contracts. This is the same cost model used during uploads before this CAP.
Executing a contract in the same ledger in which the contract is first uploaded will incur a full parse/validate/translate cost just before the execution, using the refined cost model of known contracts that was introduced in CAP-0054. In other words, the low-cost instantiation that is new to this CAP (omitting the parse/validate/translate costs) will only apply to ledgers after the ledger in which a contract is uploaded.
Evictions are performed before uploads or restorations, at the end of ledger close. In other words if the same entry is both evicted and uploaded in a single ledger, the state of the module cache after the ledger will contain the entry.
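The effect of this ordering can be shown with plain sets. The function below is a hypothetical sketch (the name and the set-of-hashes representation are assumptions made for illustration), showing only that applying evictions first and additions second leaves a same-ledger evicted-and-uploaded entry in the cache.

```python
def apply_ledger_close(cached, evicted, uploaded_or_restored):
    """Return the set of cached module hashes after a ledger closes."""
    after_evictions = cached - evicted              # step 1: evictions
    return after_evictions | uploaded_or_restored   # step 2: uploads/restores


# "a" is both evicted and uploaded in the same ledger, so it stays cached.
result = apply_ledger_close(
    {"a", "b"}, evicted={"a"}, uploaded_or_restored={"a", "c"}
)
print(sorted(result))
```

Had the two steps been applied in the opposite order, "a" would have been added and then removed, leaving the cache without an entry that is live in the bucketlist.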
None.
The only semantic change is that fees are reduced.
The design is very straightforward: there is a cache that stores work that only needs to be done once, so that it can be reused. This cache already exists in soroban; it is just cleared after each transaction. The change is to stop clearing it, increasing its effectiveness (to 100%).
The only real "rationale" is in explaining why we didn't have a persistent (cross-ledger) cache from the beginning of soroban's design. This restriction resulted from our inability to design a suitable way to charge fees to users for cache residency (and cache misses), since the fees charged would depend on the history of cache population and eviction events, which would not be visible or predictable to users.
It was frequently suggested that the network could charge fees to everyone "as if" they were experiencing a cache miss, and then refund the fee on a cache hit. Unfortunately the dominant cost being saved by caching is CPU time, which is not a refundable resource: there is a fixed amount of time per transaction set (based on the close time target) and that time has to be allocated to transactions as part of admission control, when building transaction sets, after which the time is "consumed" whether actually spent on-CPU while executing, or simply having excluded other transactions from the transaction set when building it. Various other strategies were proposed but none (until now) seemed satisfactory.
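The non-refundability argument can be made concrete with a toy admission-control model. Everything in this sketch is an assumption made for illustration: the greedy policy, the function name, and the CPU figures (40 units declared "as if" a cache miss, 10 units on a known hit, a budget of 100) are hypothetical and are not taken from stellar-core.

```python
def build_tx_set(candidates, cpu_budget):
    """Greedy admission: include transactions while declared CPU fits.

    `candidates` is a list of (name, declared_cpu) pairs. The greedy policy
    and the units are hypothetical simplifications of transaction-set building.
    """
    included, remaining = [], cpu_budget
    for name, declared_cpu in candidates:
        if declared_cpu <= remaining:
            included.append(name)
            remaining -= declared_cpu
    return included


# Charged "as if" every instantiation misses the cache (hypothetical numbers).
as_if_miss = build_tx_set([("t1", 40), ("t2", 40), ("t3", 40)], cpu_budget=100)
# Charged for a guaranteed cache hit (hypothetical numbers).
known_hit = build_tx_set([("t1", 10), ("t2", 10), ("t3", 10)], cpu_budget=100)
print(as_if_miss, known_hit)
```

In the first case "t3" is excluded when the set is built, so refunding "t1" and "t2" afterwards cannot recover the capacity; only knowing in advance that the instantiation is cheap lets all three transactions in.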
With CAP-0062 and CAP-0066, the population and eviction events are visible and controlled by the user (using rent bumps) and so we can move ahead with the simple, obvious design of "not clearing the cache after each transaction". Moreover since the design in CAP-0062 places all live ledger data in memory, we can ensure the cache is actually always full of every possibly-invocable contract: a 100% hit rate.
While there is a small risk that the cumulative size of the live bucketlist will exceed available memory, this is controlled by a network setting that the validators can vote to raise or lower as necessary to balance the needs of the network against the cost to node operators of running validators with more memory.
In practice we expect the live bucketlist to remain comfortably within the memory of any reasonably modern computer: ledger entries range from hundreds of bytes to low-kilobytes, most servers used as validators have tens of gigabytes of memory, and the great majority of ledger entries are dormant and can therefore be safely evicted to the disk without causing any inconvenience to users. Should they ever become active, the user only has to issue a restore operation.
Besides the changes in CAP-0062 and CAP-0066, upgrading to the changes in this CAP should impose no additional requirements for ecosystem users. CPU costs of transaction execution, and therefore fees charged, will decrease. That should be the only effect.
Besides the changes in CAP-0062 and CAP-0066, this CAP introduces no additional backward incompatibilities.
This CAP will substantially reduce transaction execution costs.
There is a slightly elevated hypothetical risk of service denial due to memory exhaustion, beyond CAP-0062 and CAP-0066: for example if a user determines a way to allocate a disproportionate amount of memory in a cached module such that the ledger entry byte-size limits of the live bucketlist imposed by the previous CAPs do not adequately limit the memory consumption of those same ledger entries when translated.
We are not presently aware of any method of attack that fits this hypothetical risk, but it is not out of the realm of possibility. We will continue to perform testing, code inspection and monitoring with an eye to this hypothetical risk.
TBD.
There are preliminary implementations ready, pending further testing, calibration and measurement, integration and review:
- The soroban side is in stellar/rs-soroban-env#1506
- The stellar-core side is in stellar/stellar-core#4621