what-you-see-is-what-you-do columnar data serialization library for java

MarginaliaSearch/SlopData

Slop

Slop is a Java 22+ library for columnar data persistence. It will also build on Java 21 with the --enable-preview flag.

The library is in an early stage of development.

It is designed to be used for storing large amounts of data in a way that is both fast and memory-efficient. The data is write-once, and the Slop library offers many facilities for deciding how it should be stored and accessed. It does not replace a DBMS, but is a storage format for data at rest.

Slop was put together because Parquet support on Java outside the Hadoop ecosystem is a pain in the posterior.

It is designed to be used in the Marginalia Search engine for storing intermediate representations of crawled documents, and for batch processing the same data, but it can be used anywhere there is a need to store very large amounts of data.

Modern drives can read and write data at ~ 500 MB/s and RAM is even faster, but a lot of the speed is lost in the abstraction layers, often leaving you with a fraction of the theoretical speed when you're finally able to access the data.

When dealing with smaller amounts of data, this is not a problem, but when you're in the 100 GB+ range, the overheads start to become painful. Meanwhile, even consumer hardware is fully capable of dealing with these quantities of data if you take an axe to all the crap between the hardware and the programmer.

Slop is designed as a low-abstraction what-you-see-is-what-you-do library. The reason for this is to eliminate copies and other overheads that are common in higher-level libraries.

Additionally, many of the common tools Java offers for reading streams of data (e.g. InputStreams and most JDBC drivers) have shockingly bad performance, and the tools that let you do I/O faster tend to be finicky and hard to use.

The function of Slop is essentially to let you write homogeneous streams of data to disk and read them back as fast as possible.

To avoid the common frustration of having multiple representations of the data in both the application and storage layers, a lot of what would commonly be kept in a schema description is instead just implemented as code by the library consumer, reducing the number of places where the schema is defined and limiting the number of times the data is copied or transformed.

To aid with portability, Slop stores schema information in the file names of the data files, besides the actual name of the column itself.

A table of demographic information may end up stored in files like this:

cities.0.dat.s8[].gz
cities.0.dat-len.varint.bin
population.0.dat.s32le.bin
average-age.0.dat.f64le.gz

(Endianness is specified in the file name because old and new Java utilities for dealing with raw binary data have different default endianness and labelling the files makes it easier to know which is which.)
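Because the schema lives in the file names, a directory listing can be decoded with a few lines of code. The sketch below is not part of Slop. It assumes the layout seen in the listing above: column name, page number, a field such as dat or dat-len, the value type, and the storage suffix, separated by dots. The record SlopFileName and the labels it gives to each part are made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper (not a Slop class): splits a Slop-style data file name into its parts. */
record SlopFileName(String column, int page, String function, String type, String storage) {

    static SlopFileName parse(String fileName) {
        String[] parts = fileName.split("\\.");
        if (parts.length < 5) {
            throw new IllegalArgumentException("Not a recognised file name: " + fileName);
        }
        int n = parts.length;
        // Assumption: the last four dot-separated fields are fixed,
        // and everything before them is the column name.
        String column = String.join(".", Arrays.copyOfRange(parts, 0, n - 4));
        return new SlopFileName(
                column,
                Integer.parseInt(parts[n - 4]), // page number
                parts[n - 3],                   // e.g. "dat" or "dat-len"
                parts[n - 2],                   // e.g. "s32le", "varint", "s8[]"
                parts[n - 1]);                  // e.g. "bin" or "gz"
    }
}
```

Calling SlopFileName.parse("population.0.dat.s32le.bin") yields the column population, page 0, and the type s32le, which is enough to know that the file holds little-endian 32 bit integers.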

The Slop library offers a bare minimum of facilities to aid with data integrity, such as the SlopTable class, which is a wrapper that ensures consistent positions for a group of columns, and aids in closing the columns when they are no longer needed.

Beyond that, you're largely on your own to ensure that the data is consistent.

Why though?

Slop is fast: Slop generally outperforms most other storage formats available in Java (e.g. anything over JDBC, Parquet, Protobuf) when it comes to sequential reads and writes, at the cost of really only supporting this one use case.

It's often at least one to two orders of magnitude faster than some of Java's built-in tools for reading and writing data (e.g. Data...Stream, Object...Stream, with a buffered underlying stream).

You should however always benchmark your own use case to be sure.

Slop is compact: Depending on compression and encoding choices, the format will be smaller than a Parquet file containing the equivalent information.

Slop is simple: There isn't much magic going on under the hood in Slop.

It's designed with the philosophy that a competent programmer should be able to reverse engineer the format of the data by just looking at a directory listing of the data files.

There are no hidden indexes, magic numbers, no headers or footers, no block structures or checksums, no supplemental data besides the data as presented by ls.

Despite being a very obscure library, this gives the data a sort of portability.

Example

With Slop it's desirable to keep the schema information in the code.

Below is an idiomatic example of how to use Slop to store demographic data. The data is stored in a directory, and is written and read using the Population.Writer and Population.Reader classes. The Population class is itself a record, and the schema is stored as static fields in the Population class.

public record Population(String city, int population, double avgAge) {

    // This is the schema, and it's specified in code
    private static final StringColumn citiesColumn = new StringColumn("cities", StorageType.GZIP);
    private static final IntColumn populationColumn = new IntColumn("population", StorageType.PLAIN);
    private static final DoubleColumn averageAgeColumn = new DoubleColumn("average-age", StorageType.PLAIN);

    // Extend SlopTable to ensure that the columns are closed when the table is closed,
    // and to add basic sanity checks that ensure the columns are in sync.
    public static class Writer extends SlopTable {   // (SlopTable implements AutoCloseable) 
        private final StringColumn.Writer citiesWriter;
        private final IntColumn.Writer populationWriter;
        private final DoubleColumn.Writer avgAgeWriter;

        public Writer(Path baseDir) throws IOException {
            citiesWriter = citiesColumn.create(this, baseDir);
            populationWriter = populationColumn.create(this, baseDir);
            avgAgeWriter = averageAgeColumn.create(this, baseDir);
        }

        public void write(Population data) throws IOException {
            citiesWriter.put(data.city);
            populationWriter.put(data.population);
            avgAgeWriter.put(data.avgAge);
        }
    }

    // Reader also extends SlopTable, for the same reasons as the Writer
    public static class Reader extends SlopTable {
        private final StringColumn.Reader citiesReader;
        private final IntColumn.Reader populationReader;
        private final DoubleColumn.Reader avgAgeReader;

        public Reader(Path baseDir) throws IOException {
            citiesReader = citiesColumn.open(this, baseDir);
            populationReader = populationColumn.open(this, baseDir);
            avgAgeReader = averageAgeColumn.open(this, baseDir);
        }

        public boolean hasRemaining() throws IOException {
            return citiesReader.hasRemaining();
        }

        public Population read() throws IOException {
            return new Population(
                    citiesReader.get(),
                    populationReader.get(),
                    avgAgeReader.get()
            );
        }
    }
}

A distinguishing feature of Slop is that there is no set-up or configuration anywhere. You just specify what you want to do in the code, and it does what you've told it to. There's also very little in terms of indirection: the Reader and Writer classes associated with each column type are the actual implementation classes that do the reading and writing. You can go look at them, or even copy them and make your own column type if you want to.

Nested Records

Nested records are not supported in Slop, although array values are supported. If you need to store nested records, you've got the options of flattening them, representing them as arrays, or serializing them into a byte array and storing that.
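As a rough illustration of the last option, the sketch below serializes a small nested value into a byte[] by hand, using only java.nio. The GeoPoint record and its byte layout are made up for this example; the resulting array is the kind of value a ByteArrayColumn stores.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical nested value that we want to keep in a single byte[] column. */
record GeoPoint(double lat, double lon, String label) {

    /** Layout (an assumption of this sketch): f64le lat, f64le lon, s32le label length, UTF-8 label. */
    byte[] toBytes() {
        byte[] labelBytes = label.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer
                .allocate(2 * Double.BYTES + Integer.BYTES + labelBytes.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putDouble(lat).putDouble(lon).putInt(labelBytes.length).put(labelBytes);
        return buf.array(); // the buffer was sized exactly, so the backing array is the payload
    }

    static GeoPoint fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        double lat = buf.getDouble();
        double lon = buf.getDouble();
        byte[] labelBytes = new byte[buf.getInt()];
        buf.get(labelBytes);
        return new GeoPoint(lat, lon, new String(labelBytes, StandardCharsets.UTF_8));
    }
}
```

Flattening is usually the better choice when the nested fields are read on their own, since an opaque byte[] has to be decoded in full every time.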

Paging

Slop supports splitting up the data into multiple files, which is useful for large datasets as these can be read independently and in parallel. It's also useful in batch processing, as each file can be processed to completion, allowing for resumption of a terminated job without having to reprocess the entire dataset.
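The resumption idea can be sketched without touching Slop's API at all. The class below is hypothetical: it treats each page as an independent unit of work, processes the pages in parallel, and records a page as done only after its work has finished. A real job would persist the set of completed pages (for example in a small file) so that it survives a restart.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

/** Hypothetical driver (not a Slop class): runs each page independently and remembers which finished. */
final class PagedJob {
    private final Set<Integer> completed = new ConcurrentSkipListSet<>();

    /** Runs processPage for every page in [0, pageCount) that has not already completed. */
    void run(int pageCount, IntConsumer processPage) {
        IntStream.range(0, pageCount)
                .parallel()
                .filter(page -> !completed.contains(page))
                .forEach(page -> {
                    processPage.accept(page); // e.g. open the columns for this page and read to the end
                    completed.add(page);      // only reached if the page was processed without an exception
                });
    }

    Set<Integer> completedPages() {
        return completed;
    }
}
```

If processing a page throws, that page is never marked as completed, so calling run again picks it up along with any other pages that were not finished.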

TBW

Column Types

Integer

Type          Explanation
ByteColumn    8 bit integer
ShortColumn   16 bit integer
CharColumn    16 bit integer, unsigned [1]
IntColumn     32 bit integer
LongColumn    64 bit integer
VarintColumn  Variable byte coded integer

[1] matches the Java type
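For readers unfamiliar with variable byte coding, here is a minimal sketch of one common scheme (LEB128-style: seven data bits per byte, with the high bit signalling that more bytes follow). It only illustrates the idea. The exact byte layout VarintColumn uses may differ, so check the column's source before relying on it.

```java
import java.nio.ByteBuffer;

/** Illustrative variable-byte integer codec (LEB128-style); not taken from Slop's source. */
final class VarintSketch {

    /** Writes value with 7 data bits per byte, low bits first; a set high bit means "more bytes follow". */
    static void put(ByteBuffer out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    /** Reads one value written by put, advancing the buffer's position past it. */
    static long get(ByteBuffer in) {
        long result = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}
```

Small values take a single byte and larger values grow a byte at a time, which is why this kind of coding suits columns such as string lengths, where most values are small.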

Floating Point

Type          Explanation
FloatColumn   32 bit floating point
DoubleColumn  64 bit floating point

String

Type             Explanation
StringColumn     String with a separate varint coded length column
CStringColumn    String separated with a '\0' byte field separator
TxtStringColumn  String separated with a '\n' byte field separator
EnumColumn       String with a separate lexicon of values, data stored as varint ordinals

String columns permit the specification of a Charset. If possible, use StandardCharsets.US_ASCII, as it is significantly faster. Otherwise, UTF-8 works as well.
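To make the separator-based encodings concrete, the sketch below writes and reads strings terminated by a '\0' byte, using only the JDK. It illustrates the idea behind a column like CStringColumn and is not Slop's implementation; the class and method names are made up. It assumes the values themselves never contain the separator and that the charset never emits a zero byte for other characters, which holds for US_ASCII and UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Illustrative '\0'-separated string codec; a sketch of the idea, not Slop's implementation. */
final class CStringSketch {

    static byte[] encode(List<String> values, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String value : values) {
            out.writeBytes(value.getBytes(charset)); // the value must not contain '\0'
            out.write(0);                            // field separator
        }
        return out.toByteArray();
    }

    static List<String> decode(byte[] data, Charset charset) {
        List<String> values = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0) {
                values.add(new String(data, start, i - start, charset));
                start = i + 1;
            }
        }
        return values;
    }
}
```

The trade-off against a separate length column is that no second file is needed, but the reader has to scan for the separator and the separator byte can never appear in the data.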

Arrays

Type             Explanation
ByteArrayColumn  Stores byte[]
IntArrayColumn   Stores int[]
LongArrayColumn  Stores long[]

Storage Types

Slop supports plain storage, as well as compressed storage. Plain storage means the data can be memory mapped, reducing copies while reading.

Three types are currently supported:

Type   Explanation       Memory Mapped I/O
PLAIN  No compression    Yes
GZIP   Gzip compression  No
ZSTD   Zstd compression  No

See the StorageType enum.
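Since a plain file has no headers or block structure, it can in principle be read with nothing but the JDK. The sketch below is not Slop code. It assumes a file of raw little-endian 32 bit integers, like the population.0.dat.s32le.bin example earlier, and reads it through a memory mapping with java.nio. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative reader for a headerless file of little-endian 32 bit integers, using plain NIO. */
final class MappedIntsSketch {

    /** Writes the values back to back, 4 bytes each, least significant byte first. */
    static void write(Path file, int[] values) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.asIntBuffer().put(values);
        Files.write(file, buf.array());
    }

    /** Maps the whole file and copies its contents into an int[]. */
    static int[] readMapped(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping exposes the file's bytes directly; ByteBuffer defaults to
            // big-endian, so the order has to be set to match how the data was written.
            IntBuffer ints = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .asIntBuffer();
            int[] values = new int[ints.remaining()];
            ints.get(values);
            return values;
        }
    }
}
```

Copying into an int[] is done here only to keep the example small. Reading values straight out of the mapped buffer avoids that copy, which is the point of memory mapping in the first place.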

Extension

It is possible to extend Slop with new column types by simply writing a new column class.

Refer to the existing column types for a template.

Zip Storage

Sometimes it's impractical to keep the data unpacked in a directory. In this scenario, Slop offers the ability to pack the data into an uncompressed and carefully aligned Zip file, in such a way that it can still be accessed with zero overhead using memory mapping.

To this end, a utility class, "SlopTablePacker", is available.

Path pathToDir = Path.of("/tmp/foo");
Path pathToSlopZip = Path.of("/tmp/foo.slop.zip");

ByteColumn byteColumn = new ByteColumn("example", StorageType.PLAIN);

try (var table = new SlopTable(pathToDir)) {
    var writer = byteColumn.create(table);
    writer.put(...);
    writer.put(...);
    writer.put(...);
}

SlopTablePacker.packToSlopZip(pathToDir, pathToSlopZip);

try (var table = new SlopTable(pathToSlopZip)) {
    var reader = byteColumn.open(table);
    reader.get();
    reader.get(); 
    reader.get(); 
}
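For background on why an uncompressed zip can be memory mapped at all: an entry stored without compression sits in the archive as the exact bytes of the original file. The sketch below shows how such an entry can be written with java.util.zip. It is not how SlopTablePacker is implemented, and it makes no attempt at the alignment mentioned above; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Illustrative only: adds a file to a zip archive without compressing it. */
final class StoredZipSketch {

    static void addStored(ZipOutputStream zip, String name, byte[] data) throws IOException {
        // STORED entries must declare their size and checksum before any data is written.
        CRC32 crc = new CRC32();
        crc.update(data);

        ZipEntry entry = new ZipEntry(name);
        entry.setMethod(ZipEntry.STORED); // no compression: the bytes land in the archive unchanged
        entry.setSize(data.length);
        entry.setCompressedSize(data.length);
        entry.setCrc(crc.getValue());

        zip.putNextEntry(entry);
        zip.write(data);
        zip.closeEntry();
    }
}
```

A side effect of this layout is that the packed file remains an ordinary zip archive, so standard tools can list and extract the column files.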

Network Streaming

Slop can read data over network requests, though not when the data is zipped.

If a Slop table is available at "https://example.com/slop/", it can be opened by just providing this URI, e.g.

try (var table = new SlopTable(new URI("https://example.com/slop/"))) {
    // set up columns and read here
}

SQL support

If you feel like Slop could benefit from SQL support, you're almost certainly looking at the wrong tool for the job.

Why is it called "Slop"?

It's a funny word.
