Documentation for DB format? #329

Open
spikebike opened this issue Aug 16, 2024 · 7 comments

Comments

@spikebike

spikebike commented Aug 16, 2024

I'd like to add an indexer for the Cray scanner, which consumes Lustre changelogs. That way, instead of having duc scan the filesystem, I could use the Cray API, which is kept up to date from those changelogs.

I made a test dir:
/home/MyUser/tmp:
total 12
drwxrwxr-x 2 MyUser MyUser 4096 Aug 16 13:48 DirA
drwxrwxr-x 2 MyUser MyUser 4096 Aug 16 13:48 DirB
drwxrwxr-x 2 MyUser MyUser 4096 Aug 16 13:49 DirC

/home/MyUser/tmp/DirA:
total 4
-rw-rw-r-- 1 MyUser MyUser 2 Aug 16 13:48 1

/home/MyUser/tmp/DirB:
total 8
-rw-rw-r-- 1 MyUser MyUser 3 Aug 16 13:48 2
-rw-rw-r-- 1 MyUser MyUser 4 Aug 16 13:48 3

/home/MyUser/tmp/DirC:
total 12
-rw-rw-r-- 1 MyUser MyUser 5 Aug 16 13:49 4
-rw-rw-r-- 1 MyUser MyUser 6 Aug 16 13:49 5
-rw-rw-r-- 1 MyUser MyUser 6 Aug 16 13:49 6

I built the database with the SQLite backend, figuring it would be the easiest way to inspect the result. Here's the schema and dump:
$ cat schema
CREATE TABLE blobs(key unique primary key, value);
CREATE INDEX keys on blobs(key);

$ cat dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE blobs(key unique primary key, value);
INSERT INTO blobs VALUES('fc01/19c42d8',X'f9f311fb019c42d7fb66bfad15013102f907100105');
INSERT INTO blobs VALUES('fc01/19c42d9',X'f9f311fb019c42d7fb66bfad28013304f907100105013203f907100105');
INSERT INTO blobs VALUES('fc01/19c42da',X'f9f311fb019c42d7fb66bfad3b013506f907100105013405f907100105013606f907100105');
INSERT INTO blobs VALUES('fc01/19c42d7',X'0000fb66bfacce044469724102f917100202f9f311fb019c42d8044469724207f927100302f9f311fb019c42d9044469724311f937100402f9f311fb019c42da');
INSERT INTO blobs VALUES('duc_index_reports',X'2f686f6d652f6262726f61646c65792f746d7 [MANY ZEROS DELETED]');
INSERT INTO blobs VALUES('/home/MyUser/tmp',X'132f686f6d652f6262726f61646c65792f746d70f9f311fb019c42d7fb66bfad7efa0e8e37fb66bfad7efa0e8f0b06041af997100a');
CREATE INDEX keys on blobs(key);
COMMIT;

What's the value for "/home/MyUser/tmp"?

How does DirA/1 map to 'fc01/19c42d8' or similar?

Is the value for each dir some encoding of filename + size? So 3 files = file1+size,file2+size,file3+size or something?

What's the value for duc_index_reports?

@l8gravely
Collaborator

l8gravely commented Aug 16, 2024 via email

@spikebike
Author

Here's an overview https://people.cs.vt.edu/~butta/docs/CCGrid2020-BRINDEXER.pdf

Here's some docs:
https://support.hpe.com/hpesc/public/docDisplay?docId=sd00002255en_us&docLocale=en_US&page=GUID-4CDBFE00-A7D5-4A28-8CF6-DEC34E95DBA4.html

It seems like a pretty powerful and flexible tool. I'm using the brindexer/bin/query command, which I believe wraps an API call. I believe Cray's brindexer is usually configured to consume the Lustre change logs, so the DB should be near realtime.

Cray brindexer is much faster than robinhood when configured similarly (consuming changelogs), at least for our setup. But while brindexer is pretty fast, it's still annoyingly slow for interactive use on large directories. Hence duc, not to mention that our users are used to a dls command (a thin wrapper around duc ls).

So my plan was to periodically run this command instead of the normal duc index:
time /opt/cray/brindexer/bin/query --json -q "select size,path from entries_0 join path on pmd5=pathmd5 WHERE type='f'"

Then I'd pipe the output through some code I wrote that sums file sizes and tracks the totals per directory. I was going to try that before tracking per-file totals, which might run the scanner out of RAM. If RAM is not an issue, it might be enough to add a JSON import to duc. Here's an example of the input to import:

   {"path":"duc/projects","size":549364}
   {"path":"duc/db/scratch","size":1189888}
   {"path":"duc/scratch","size":2083884716}

@spikebike
Author

spikebike commented Aug 19, 2024

Found a simpler example:
$ /opt/cray/brindexer/bin/query --json -C path,name,size /kfs2/projects/MyProj/duc | head -10
{"name":"duc.db","path":"duc/.duc-index","size":275712}
{"name":"go1.17.2.linux-amd64.tar.gz","path":"duc/archive","size":0}
{"name":"projects","path":"duc/db","size":20480}
{"name":"scratch","path":"duc/db","size":4096}

To save RAM I had been collapsing the output to per-directory totals rather than per-file totals, so the listing above would yield 24576 for "duc/db". I'm not sure that's needed. Our larger directories hold more than 50M files.

There's also a fullname field, but we are trying to keep one database per /project directory so people can't see information about other groups' directories. So we put a duc.db in each user's directory with the same permissions as that directory, plus a wrapper that automatically picks the right database.
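
A wrapper of that kind might pick the database roughly like this. It is a sketch under stated assumptions: the layout `<projects root>/<project>/.duc-index/duc.db` and the function name dbFor are hypothetical, chosen only to make the idea concrete.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// dbFor maps a path inside a project to that project's database file.
// Assumed layout (hypothetical): <projectsRoot>/<project>/.duc-index/duc.db
func dbFor(projectsRoot, target string) (string, error) {
	rel, err := filepath.Rel(projectsRoot, target)
	if err != nil {
		return "", err
	}
	sep := string(filepath.Separator)
	// Reject the root itself and anything outside it.
	if rel == "." || rel == ".." || strings.HasPrefix(rel, ".."+sep) {
		return "", fmt.Errorf("%s is not inside a project under %s", target, projectsRoot)
	}
	// The first path component below the root names the project.
	project := strings.Split(rel, sep)[0]
	return filepath.Join(projectsRoot, project, ".duc-index", "duc.db"), nil
}

func main() {
	db, err := dbFor("/kfs2/projects", "/kfs2/projects/MyProj/duc/db")
	if err != nil {
		panic(err)
	}
	fmt.Println(db)
}
```

Because each database file carries the permissions of its project directory, the filesystem rather than the wrapper enforces who can read which index.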

@l8gravely
Collaborator

l8gravely commented Aug 19, 2024 via email

@spikebike
Author

Sadly, brindexer does not keep any per-directory totals; it just returns the actual metadata size of the dir, just like ls -ald /foo/dir. So you basically have to walk the tree, add things up (for the current dir and all parents), and do the work yourself, much like rbh-du (the robinhood equivalent).

I do have some simple code that takes the output of brindexer and keeps a running total for each dir in RAM (including updating parent dirs), hence the concern about running out of RAM: in my last scan of our filesystem the biggest dir had over 50M files and the JSON was 7.5GB. I don't see any easy way to do updates, and wasn't planning on doing so, at least in the first version.

So I'm just looking for a way to start a new duc.db from scratch with a json import from brindexer.

> So does your 'query' command return both file and directory changes?

I'm pretty sure I could query based on last modified time, not sure it's worth it. Currently running brindexer on our largest dir (50M files) takes about 20 minutes. Something I'd be willing to do daily or so, or by request. That would allow users who get complaints about using too much disk space to quickly tell what their dir sizes are. Your pending top N files sounds useful and top N dirs would be useful as well if that is planned.

> So how long does 'duc' take to do a full index?

Worst case for brindexer is 20 minutes on our largest dir, duc index seems around 5 times slower, but duc has to do real I/O not just a DB lookup. Here's an example:

$ time ./duc index /projects/MyProj
real 5m17.199s
user 0m1.150s
sys 1m1.683s
$ time /opt/cray/brindexer/bin/query --json -C path,name,size,type /kfs2/projects/MyProj
real 1m3.355s
user 3m11.541s
sys 0m29.087s

The way I used duc before was to run a daily index and replace the old index when the new one finished. We had some DB corruption and index crashes, and it didn't seem worth trying to update the databases in place, so I just started over with each index.

> Unless you can contribute some code and commit to running tests, I don't think I'll be able to do much in
> the near term.

I'd consider it, would you be likely to accept a pull request that allowed a json import -> new DB? Would you prefer a new binary or a flag to duc index?

The other possibility would be I could just contribute code that could end up in your project's ~/contrib.

I'll review the source files you mentioned. I haven't really decided whether I should:

  1. write some C to ingest JSON that's preprocessed by my Go code to get per-directory totals
  2. write some Go (easier/safer multithreading) to port index.c and write the DB files myself. I noticed tkrzw has Go bindings.

@l8gravely
Collaborator

l8gravely commented Aug 20, 2024 via email

@spikebike
Author

spikebike commented Aug 20, 2024

> Hmm... I wonder if I could setup a small test lustre system at home,
> even if it's slow, I could play with it. It's an idea at least. And
> I see that HPE offers RPMs of the brindexer stuff. Hmm...

Lustre is open source. It wouldn't be my first choice for a small parallel-filesystem test system, but its design is typical of parallel filesystems: a metadata server (MDS) tracks all file metadata, including which OSTs hold which ranges of blocks. The Lustre driver is in the kernel (which is kind of painful, but performant), so you talk to the MDS and it tells you which block ranges are on which OST. Each OST has a native filesystem (ext4 and ZFS are common) to hold the blobs, but they don't look like normal files; it's not as if ~joeUser is on OST1 and ~bobUser is on OST2. Striping of directories and/or files can be across one or more OSTs and can be configured at runtime.

I dug around some, and it seems like brindexer is part of ClusterStor. I did find a 5-year-old copy on GitHub, but with no license file or signs of life. It seems to be part of ClusterStor's policy engine; it's not clear whether it's open source.

> How does brindexer work with child directories? Does it get the
> results for the selected directory and all child directories?

Yes, much like find. You can make queries, so you could say for a given dir tree (and subdirs) list all files, all dirs, all files over 1GB, and any SQL query based on metadata, including timestamps. You can use this with a policy engine to say things like all big files not touched in a month go to cheap storage.

Some examples/overview:
https://wiki.lustre.org/images/8/8b/LUG2024-Scalable_Auto_Tiering-Jabas.pdf

> So I'd probably just ingest and throw away the JSON input as fast as
> possible.

Sensible, or just have duc json-index accept a pipe.

If you are interested in brindexer, I found a comparison between it and GUFI that has a fair bit of detail:
https://dl.acm.org/doi/pdf/10.5555/3571885.3571960

Presumably if we can get brindexer -> json -> duc working it should be pretty simple to do the same for GUFI.

> Is this brindexer part of a lustre commercial product?

I'm 90% sure it is, can't find any hint of a source repo, except for things like this that I don't think work:
import (
"cray.com/brindexer/fsentity"
"cray.com/brindexer/indexing"
"cray.com/brindexer/scan"
)

And stale and possibly not licensed: https://github.com/arnabkpaul/cray_brindexer/

GUFI sounds pretty similar and open source:
https://github.com/mar-file-system/GUFI

The quickstart for building the code, building an index, and querying the index looks very simple.

What I don't know is if GUFI can ingest Lustre changelogs. I'd rather not walk 100PB just to find the new files.

> You misunderstood my question. I was asking if brindexer will return
> just changes to the size or number of files in a directory, or also
> the fact that N files changed in the directory?

I believe you can write any SQL query on the metadata, so something like "all files since midnight" should work. It would be somewhat painful since a dir walk might take 20 minutes, but tracking the timestamp of the newest file in each dir and then using that date for incremental updates (select files newer than TIMESTAMP) should work.

> So this /kfs2/projects/MyProj has 50 million files, but they're spread
> across a bunch of sub-directories, right, in a tree structure? I've
> not really had any exposure to Lustre

Heh, not sure. I launched a DB query; hopefully it takes less than 20 minutes. From the perspective of duc, brindexer, and similar tools it's just a filesystem; nothing Lustre-specific is required. Various knobs impact performance (like caching, different pool performance characteristics, and striping across OSTs), but all the usual commands like find, du, ls, etc. work the same. The pain point is that it's easy to end up with a dir that has 50M files under it, and find, ls, or du can take a long time to run on that, hence the need for duc.

Ah:
$ /opt/cray/brindexer/bin/query --json -q "select size,path from entries_0 join path on pmd5=pathmd5 WHERE type='d'" /projects/MyProj | wc -l
10857063

Ouch, 10M dirs for 50M files.

> I don't know yet, let's see what you get? I think at first it might
> make sense to just create a new indexer (cmd-brindex.c maybe?) which
> sucks in the JSON formatted data (or talks to the API directly to save
> time/space) and generates the duc db from that info.

Sounds good, I'll give it a shot.

> Ugh, no! What is your go code talking to? A lustre API? Does it
> offer a C interface? Or is it a http type API? Hmm... since I don't
> know much about go, it does look like it can call C libraries. So
> maybe calling into the libduc/ stuff from your go code would work? I
> just hate the idea of going from Go -> JSON -> C when it doesn't make
> sense.

My go code is mostly:

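            // Add this record's size to every prefix of its path,
            // i.e. the path itself and each ancestor directory.
            components := strings.Split(record.Path, "/")
            for i := 1; i <= len(components); i++ {
                prefix := strings.Join(components[:i], "/")
                directorySizes[prefix] += record.Size
            }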

Basically, for each parent dir, add the record size to it. Nice and concise, but my favorite part of Go is the thread-safe multiple-producer -> multiple-consumer channels that are built into the language, which make it very easy to throw X CPUs at Y bits of work and have it "just work".

> Please feel free to post some sample code, and if you have some
> instructions on setting up a simple lustre test case, that would be
> ideal.

It's a fair bit of work and involves recompiling the kernel. Here's a good overview:
https://wiki.lustre.org/Installing_the_Lustre_Software

Generally I'd consider CephFS easier to set up and manage, but either would be a good introduction to parallel filesystems. I'd plan on at least 3 nodes (1 metadata server and 2 OSTs), but they could be virtual. Ceph is included in Ubuntu, and probably in Debian as well.
