Durability (& reliability) versus performance #101

dumblob · 2021-12-22T11:38:25Z

dumblob
Dec 22, 2021

First I'd like to say I like the decision to strictly conform to a (modern) SQL standard. I think this might be one of the selling points - "it was tested against vsql" would have a huge meaning in the bright future I envision 😉.

Here I'd like to discuss probably the biggest PITA any DB has to clearly decide and define. It's durability guarantees, i.e. a "fault model". This also determines the maximum achievable performance of the DB due to physical limitations.

So I always envisioned the following:

to have a DB API requiring one to specify on per-request basis whether it shall be treated as best-effort (i.e. no guarantee but presumably faster and with lower delays) or as guaranteed operation. See e.g. my rant ACID & strict serializability versus best effort on a per request basis redis/redis#6200 (comment) .

Note that best effort might to lead incorrect data compared to the surrounding global context but it must not lead to internally incorrect data within a transaction - e.g. if in one transaction I fetch two different rows from two different tables it still must guarantee that the data (i.e. both rows) returned are exactly the data from the point in time when the transaction began (i.e. mutually correct).
Guarantee durability in the very maximum technically possible way POSIX allows us to and the knowledge of HW-originated errors allows us to. Namely the same fault model as TigerBeetle has. SQLite3 allows almost for what TigerBeetle does (e.g. commit both the parent directory structure as well as the DB file itself and then fsync) but there are still some rough corners.

The first step would be to focus on calling fsync and waiting for it to finish (see below) after each transaction which has the potential to modify data.

Note also that Linux fsync is synchronous (and will stay so in the upcoming years) but on many other systems it's asynchronous with system-specific calls to wait for its return.

What do you think about implementing both points (1) and (2)?

elliotchance · 2021-12-23T17:02:33Z

elliotchance
Dec 23, 2021
Maintainer

First I'd like to say I like the decision to strictly conform to a (modern) SQL standard. I think this might be one of the selling points - "it was tested against vsql" would have a huge meaning in the bright future I envision 😉.

Thanks! Despite the 1700 pages that describe just part of the base SQL language, it's amazing how much is left to the "up to the database implementation". Still, it is a strong foundation I'll be able to follow for a while yet.

to have a DB API requiring one to specify on per-request basis whether it shall be treated as best-effort (i.e. no guarantee but presumably faster and with lower delays) or as guaranteed operation. See e.g. my rant ACID & strict serializability versus best effort on a per request basis redis/redis#6200 (comment) .

What does a request mean in this case? It could be a single query or a single roundtrip that might contain multiple (read or write) queries? Would this be affected by BEGIN/COMMIT/ROLLBACK passed in different roundtrips?

Note that best effort might to lead incorrect data compared to the surrounding global context but it must not lead to internally incorrect data within a transaction - e.g. if in one transaction I fetch two different rows from two different tables it still must guarantee that the data (i.e. both rows) returned are exactly the data from the point in time when the transaction began (i.e. mutually correct).

Isn't this the motivation for READ UNCOMMITED serialization mode? Where a reading query is blocked by nothing, but may include some rows that should not be visible. I imagine this is great for scenarios like reporting queries, that might go over many rows where absolute correctness is not the goal, but rather just to get it done despite the inflight rows.

Guarantee durability in the very maximum technically possible way POSIX allows us to and the knowledge of HW-originated errors allows us to. Namely the same fault model as TigerBeetle has. SQLite3 allows almost for what TigerBeetle does (e.g. commit both the parent directory structure as well as the DB file itself and then fsync) but there are still some rough corners.

This is a more interesting point of discussion in my opinion. Certainly one extreme is to fsync everything and always. However, even the file system and hardware itself might not give these guarantees as there as several layers between the software and physical storage that may cache/write later that could be influenced by a poorly timed power failure.

Without getting into the weeds too much, I think there is a reasonable balance where we should trust the fsync (if it says it's been written) and treat all queries as atomic and requiring an fsync at the end always. That is, a modification (such as a UPDATE) requires all dirty pages to be flushed before the query can be returned.

Since there is no shared memory, hard flushing on basically all operations is required. Even an incomplete transaction would probably not be able to keep dirty pages beyond a single query for risk of page conflicts that might happen when they get written later.

That being said, this allows the file system to be the database tuning. Modern file systems can treat an fsync like a memory update and as long as the machine can guarantee that page gets written eventually and all other processes see the same virtual page in memory, it can avoid writing to the medium until later. I suspect this is why the file system benchmarks are shown as being as fast (or even faster!) than the memory-only runs: https://github.com/elliotchance/vsql/blob/main/docs/benchmark.rst#results

0 replies

dumblob · 2021-12-23T20:35:26Z

dumblob
Dec 23, 2021
Author

to have a DB API requiring one to specify on per-request basis whether it shall be treated as best-effort (i.e. no guarantee but presumably faster and with lower delays) or as guaranteed operation. See e.g. my rant ACID & strict serializability versus best effort on a per request basis redis/redis#6200 (comment) .

What does a request mean in this case? It could be a single query or a single roundtrip that might contain multiple (read or write) queries? Would this be affected by BEGIN/COMMIT/ROLLBACK passed in different roundtrips?

A request is a collection of commands determined by the user as "one request". In its simpliest form (and accounting for the absolute vast majority) it's one transaction. But the user can determine that e.g. two consecutive commands can be treated as best effort and they should be understood as a special type of transaction wrapping the inner commands.

Note that best effort might to lead incorrect data compared to the surrounding global context but it must not lead to internally incorrect data within a transaction - e.g. if in one transaction I fetch two different rows from two different tables it still must guarantee that the data (i.e. both rows) returned are exactly the data from the point in time when the transaction began (i.e. mutually correct).

Isn't this the motivation for READ UNCOMMITED serialization mode? Where a reading query is blocked by nothing, but may include some rows that should not be visible. I imagine this is great for scenarios like reporting queries, that might go over many rows where absolute correctness is not the goal, but rather just to get it done despite the inflight rows.

Yeah, in a sense it's a bit similar but still it's quite different 😉. Generally what I mean by best effort vs guaranteed is just about delay and trading it for fairness. Suppose I have 4 incoming transactions (each representing a separate request) on the server in an incoming buffer, all of them SERIALIZABLE. But they are of different kinds: best effort, guaranteed, best effort, guaranteed (in this order).

Now suppose the server will start executing first (best effort) transaction and during this execution finds out that the next transaction (guaranteed) would get delayed if it proceeded with the current (best effort) transaction. In that case it'd drop the best effort transaction (rolling it back) and continued only with guaranteed transactions while dropping & cancelling (i.e. notifying the client about aborted transaction) all best effort transactions on the way until it can again keep up with guaranteed transactions already accepted into the incoming buffer.

Of course, if there are only best effort transactions in the buffer, then we'll proceed with processing the oldest ones and just drop one or two most recent ones from the buffer to make space for a newly incoming one (this artificially made space ensures guaranteed ones are at least partially accepted even under high load).

Note this algorithm couldn't be implemented exactly this way to work efficiently. It's just to showcase the idea 😉. In practice you'd probably want to use two separate buffers (one for best effort and one for guaranteed requests) and then in a work-stealing fashion choose which one to process, etc.

This is a more interesting point of discussion in my opinion. Certainly one extreme is to fsync everything and always. However, even the file system and hardware itself might not give these guarantees as there as several layers between the software and physical storage that may cache/write later that could be influenced by a poorly timed power failure.

Without getting into the weeds too much, I think there is a reasonable balance where we should trust the fsync (if it says it's been written) and treat all queries as atomic and requiring an fsync at the end always. That is, a modification (such as a UPDATE) requires all dirty pages to be flushed before the query can be returned.

Since there is no shared memory, hard flushing on basically all operations is required. Even an incomplete transaction would probably not be able to keep dirty pages beyond a single query for risk of page conflicts that might happen when they get written later.

That being said, this allows the file system to be the database tuning. Modern file systems can treat an fsync like a memory update and as long as the machine can guarantee that page gets written eventually and all other processes see the same virtual page in memory, it can avoid writing to the medium until later. I suspect this is why the file system benchmarks are shown as being as fast (or even faster!) than the memory-only runs: https://github.com/elliotchance/vsql/blob/main/docs/benchmark.rst#results

Yep, these are all interesting problems. There are some tricks to force a sooner writing to disk. But this doesn't change anything on the fact, that it's system-specific how to determine whether fsync() already finished or not. Linux is the only exception I know about which really guarantees (PG devs had very serious discussion with kernel devs about this few years ago), that after fsync() returns it's actually on disk. This is of course different from flush() (or alike) which doesn't fsync() and is thus subject to all the non-guarantees you talk about.

Btw. the benchmarks you did are actually something more or less to expect - I'm pretty certain it's because you don't wait for fsync() to actually finish on macOS and only wait for the return from the fsync() function call which is on macOS an asynchronous call running in background (so you should use macOS-specific call to find out whether it finished or not or use macOS-specific synchronous alternative if available).

2 replies

dumblob May 31, 2023
Author

@elliotchance I wonder if you actually resolved the mystery with benchmarks on macOS (I do not own any macOS to try it myself 😢)?

elliotchance May 31, 2023
Maintainer

I never investigated it. Since I'm kind of surprised with how good the performance is right now, I'm still prioritizing basic features and even more specifically features that require a change to the file format at this stage.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Durability (& reliability) versus performance #101

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Durability (& reliability) versus performance #101

dumblob Dec 22, 2021

Replies: 2 comments · 2 replies

elliotchance Dec 23, 2021 Maintainer

dumblob Dec 23, 2021 Author

dumblob May 31, 2023 Author

elliotchance May 31, 2023 Maintainer

dumblob
Dec 22, 2021

Replies: 2 comments 2 replies

elliotchance
Dec 23, 2021
Maintainer

dumblob
Dec 23, 2021
Author

dumblob May 31, 2023
Author

elliotchance May 31, 2023
Maintainer