An implementation of an LSMT (Log Structured Merge Tree) based key-value store. Follows DDIA's storage chapter.
Update the paths where you would like to save the database files: open
`src/main/java/com/satvik/satvikdb/utils/SimpleDbConstants.java` and
`src/main/java/com/satvik/satvikdb/utils/LsmDbConstants.java`,
and update the `ROOT_DIR` variable in each to match your desired path.

Run `src/test/java/com/satvik/satvikdb/SatvikDbTest.java` and see the files created in your `ROOT_DIR`.
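For example, the constant might look like this (the path value here is just a placeholder; pick any writable directory):

```java
// In SimpleDbConstants.java and LsmDbConstants.java
public static final String ROOT_DIR = "/tmp/satvikdb";
```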
- Put the (`key`, `value`) pair in the WAL file (Write Ahead Log).
- Put the tuple in the in-memory `SortedMap`. This data structure is also called a `memtable`. This is an LSM-based DB, so keys will be in sorted order.
- If the size of the map exceeds a certain threshold, dump the values to the filesystem as the latest `SSTable` file. Make sure the corresponding indexes are also created (sparse: roughly 1 entry per 100 records). Clear the in-memory map, making it ready for upcoming writes.
- The indexes store `ByteOffset`s of where the values were written, so that we can seek directly to that location. A `ByteOffset` is the location of a byte in the file, 0-indexed.
- Since the map already held the values in sorted order, the values in the files are also sorted.
- Run `Compaction` every once in a while, or when the number of files crosses a certain threshold. Refer to the compaction section below.
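A minimal sketch of this write path (class and method names here are illustrative, not necessarily the ones used in this repo):

```java
import java.io.IOException;
import java.util.SortedMap;
import java.util.TreeMap;

public class LsmWriteSketch {
    private static final int MEMTABLE_THRESHOLD = 3000; // flush once the memtable holds this many entries

    private final SortedMap<String, String> memtable = new TreeMap<>();

    public synchronized void put(String key, String value) throws IOException {
        appendToWal(key, value);   // 1. durability first: log the pair in the WAL
        memtable.put(key, value);  // 2. TreeMap keeps keys sorted (the memtable)
        if (memtable.size() >= MEMTABLE_THRESHOLD) {
            flushToSsTable();      // 3. dump sorted entries as the newest SSTable + sparse index
            memtable.clear();      // 4. ready for upcoming writes
            clearWal();            //    flushed entries no longer need WAL protection
        }
    }

    // Stubs standing in for the actual file I/O:
    private void appendToWal(String key, String value) throws IOException { /* append "key,value" */ }
    private void flushToSsTable() throws IOException { /* write entries in order, index every ~100th */ }
    private void clearWal() throws IOException { /* truncate the WAL */ }
}
```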
The chain of these `SSTable` files is also called a Log Structured Merge Tree (LSM Tree).
In case the `LsmWriteService` becomes overloaded for some reason and crashes, the outstanding writes will still remain in the WAL. Hence, each time the `LsmWriteService` is started, it first checks whether the WAL is empty. If not, it first writes the values pending in the WAL and then clears it.
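A sketch of that recovery step, assuming a simple `key,value` line format for the WAL (the repo's actual format may differ):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.SortedMap;
import java.util.TreeMap;

public class WalRecoverySketch {
    // Rebuild the memtable from writes left behind by a crash.
    static SortedMap<String, String> recover(Path walFile) throws IOException {
        SortedMap<String, String> memtable = new TreeMap<>();
        if (Files.exists(walFile) && Files.size(walFile) > 0) {
            for (String line : Files.readAllLines(walFile, StandardCharsets.UTF_8)) {
                int comma = line.indexOf(',');
                memtable.put(line.substring(0, comma), line.substring(comma + 1));
            }
            // Only clear the WAL after the recovered entries are safely flushed;
            // clearing earlier would lose them on a second crash.
        }
        return memtable;
    }
}
```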
Reads take roughly O(log n) time, where n is the number of keys present in the database.
- Look up the in-memory `SortedMap`.
- If found, return the value.
- If not, load the latest index file from the filesystem into memory.
- Apply a binary search on the index's keys. Find the `left` pointer, the entry just before `key`.
- If you reach the start with no `left` value, load the previous index file.
- As soon as you find a hit, load the `ByteOffset` of the `value`.
- Open the corresponding database file and seek from the `ByteOffset` returned by the index file.
- Start a linear search from here and return the first match for `key`.
- If there's no match for `key`, return null.
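A sketch of the two non-trivial pieces: the binary search over a sparse index (finding the `left` pointer) and the linear scan from a `ByteOffset`. The `key,value` line format and the `IndexEntry` type are assumptions for illustration:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.List;

public class LsmReadSketch {
    // One sparse-index entry: a key and the byte offset of its value in the SSTable file.
    record IndexEntry(String key, long byteOffset) {}

    // Binary search for the greatest indexed key <= target (the "left" pointer).
    static IndexEntry findLeft(List<IndexEntry> index, String target) {
        int lo = 0, hi = index.size() - 1;
        IndexEntry left = null;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).key().compareTo(target) <= 0) {
                left = index.get(mid); // candidate; keep looking right for a closer one
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return left; // null => fell off the start: try the previous index file
    }

    // Seek to the offset, then scan "key,value" lines until the key matches.
    static String scanFrom(RandomAccessFile db, long byteOffset, String key) throws IOException {
        db.seek(byteOffset);
        String line;
        while ((line = db.readLine()) != null) {
            int comma = line.indexOf(',');
            if (line.substring(0, comma).equals(key)) {
                return line.substring(comma + 1);
            }
        }
        return null; // not in this SSTable
    }
}
```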
As our database grows over time, we might have thousands of `SSTable` files, many of which might contain overwritten or old data. Run the merging process at some defined time interval. It:

- Loads all the database files.
- Runs a "merge K sorted lists" style algorithm on a batch of `n` files.
- Treats each `RandomAccessFile` as a "sorted array".
- If there are 2 keys with the same name, takes the one from the later file; we essentially only want to keep the latest value for a given `key`.
- After the new merged file is created, deletes the old files.
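An in-memory sketch of that K-way merge, using a min-heap of cursors (one per file). Equal keys tie-break toward the newer file, so the latest value wins and older duplicates are dropped:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class CompactionSketch {
    // A cursor into one sorted run: the current entry plus where it came from.
    record Cursor(String key, String value, int runIndex, int pos) {}

    // runs are sorted key/value lists; runIndex 0 is the oldest file.
    static List<String[]> compact(List<List<String[]>> runs) {
        // Order by key; for equal keys, the newer file (higher runIndex) polls first.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.<Cursor, String>comparing(Cursor::key)
                          .thenComparing(Comparator.comparingInt(Cursor::runIndex).reversed()));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                String[] e = runs.get(r).get(0);
                heap.add(new Cursor(e[0], e[1], r, 0));
            }
        }
        List<String[]> merged = new ArrayList<>();
        String lastKey = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            if (!c.key().equals(lastKey)) {             // first sighting = newest file's value
                merged.add(new String[] {c.key(), c.value()});
                lastKey = c.key();
            }                                           // duplicates from older files are skipped
            int next = c.pos() + 1;
            if (next < runs.get(c.runIndex()).size()) { // advance this file's cursor
                String[] e = runs.get(c.runIndex()).get(next);
                heap.add(new Cursor(e[0], e[1], c.runIndex(), next));
            }
        }
        return merged;
    }
}
```

Running `compact` on its own output returns the same list, which is the idempotence property noted below.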
Compaction is idempotent, i.e. you can run it any number of times and the result will not affect database reads or writes.
This flow is similar to what LevelDB and RocksDB do.
For 60,000 reads and writes, with 28 memtables on disk, ~3000 entries per file, and ~20 entries per index:

|                       | Write   | Read   |
|-----------------------|---------|--------|
| Without bloom filters | 2155 ms | 309 ms |
| With bloom filters    | 2681 ms | 83 ms  |

That's a ~3.7x read improvement (309 ms → 83 ms) at the cost of a small hit on write performance.
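The read win comes from skipping SSTables that definitely don't contain the key: a bloom filter can answer "definitely absent" without touching the file. A sketch using Guava's `BloomFilter` (not necessarily the implementation used here):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterSketch {
    public static void main(String[] args) {
        // One filter per SSTable, sized for ~3000 entries at a 1% false-positive rate.
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 3000, 0.01);

        filter.put("user:42"); // populated while flushing the memtable

        // On reads, "false" is definitive: the SSTable and its index can be skipped entirely.
        if (!filter.mightContain("user:99")) {
            System.out.println("definitely absent; skip this SSTable");
        }
    }
}
```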
This project also includes an implementation of a simple append-only database which stores the keys in the order in which they were written. It includes no optimisations. To see how that works, change the code in the main/test file to this:

```java
DbService dbService = DbFactory.getDbService(TypesEnum.SIMPLE);
```
- Directly writes into the database file in append mode.
- Corresponding indexes are also created (dense: 1 entry for each key).
- The index stores, for each key, the `ByteOffset` of its `value` in the file.
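A sketch of that append + dense-index step (the file layout and `key,value` format are assumptions for illustration):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SimpleDbWriteSketch {
    // Append one record to the database file and one entry to the dense index.
    static void put(RandomAccessFile db, RandomAccessFile index, String key, String value)
            throws IOException {
        db.seek(db.length());                   // append mode: always write at the end
        db.writeBytes(key + ",");
        long valueOffset = db.getFilePointer(); // the ByteOffset the index will store
        db.writeBytes(value + "\n");

        index.seek(index.length());
        index.writeBytes(key + "," + valueOffset + "\n"); // dense: one entry per key written
    }
}
```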
- Load the latest index file.
- Do a linear scan until it encounters the `key`.
- Note the `ByteOffset` of the corresponding `value`.
- Open the file, seek directly to the `ByteOffset`, and return the `value`.
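The matching read, under the same assumed layout:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SimpleDbReadSketch {
    // Linear-scan the dense index for the key, then seek straight to its value.
    static String get(RandomAccessFile index, RandomAccessFile db, String key) throws IOException {
        index.seek(0);
        String line;
        while ((line = index.readLine()) != null) { // each index line is "key,offset"
            int comma = line.indexOf(',');
            if (line.substring(0, comma).equals(key)) {
                long byteOffset = Long.parseLong(line.substring(comma + 1));
                db.seek(byteOffset);                // jump directly to the value
                return db.readLine();               // value runs up to the newline
            }
        }
        return null; // key not present in this index
    }
}
```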
- Implement a delete feature.
- Add a bloom filter to filter out missing keys. Currently, if the lookup key is not present in the database, we scan all the `SSTable` indexes, which is not very efficient.
- Optimise compaction to be size-tiered / leveled.