Documentation for DB format? #329
>>>> "Bill" == Bill Broadley writes:
> I wanted to add a scanner for the Cray scanner, which consumes Lustre
> changelogs. That way, instead of duc scanning, I could use the Cray
> API, which is kept up to date by consuming changelogs.
Awesome! Do you have any pointers to the documentation of this tool?
Some questions I have:
1. Can you import the full current filesystem info from the API?
2. OR do you have to scan the filesystem using statx() and then attach
to the API to get changes after a certain time?
At its heart, duc is pretty simple: it just reads the data and stuffs
it into a DB using the directory path as the key, then puts more
stuff below that level. Take a look at src/libduc/index.c for a quick overview.
I'd also recommend looking at my 'tkrzw' branch, where I've been
putting a bunch of new code that will hopefully become version 1.5 in the
next few weeks:
- support for tkrzw backend, which handles really big volumes.
- support for file size histograms
- support for tracking top N largest files in filesystem
The last two features are just bare-bones and need more work, especially
for the reporting in GUI, UI and CGI formats.
So this is a good time to add new features if possible.
John
|
Here's an overview: https://people.cs.vt.edu/~butta/docs/CCGrid2020-BRINDEXER.pdf
Here's some docs:
It seems like a pretty powerful and flexible tool. I'm using the brindexer/bin/query command, which I believe wraps an API call. I believe Cray's brindexer is usually configured to consume the Lustre changelogs, so the DB should be near real time. At least for our setup, Cray brindexer is much faster than robinhood when both are configured similarly (consuming changelogs).
While brindexer is pretty fast, it's still pretty annoying for interactive use on large directories. Hence duc, not to mention that our users are used to a dls command (a thin wrapper around duc ls).
So my plan was to periodically run this command instead of the normal duc index:
Then pipe it through some code I wrote to sum the file sizes and track the totals per dir. I was going to try that before trying to track per-file totals, which might run the scanner out of RAM. If RAM is not an issue, I might just be able to add a JSON import to duc. Here's an example:
|
Found a simpler example:
To save RAM I'd been collapsing dirs to just show dir totals and not file totals, so the above would have 24576 for "duc/db"; not sure that's needed. Our larger directories are north of 50M files.
There's also a fullname, but we are trying to keep a database per /project directory, so people can't see info about other groups' directories. So we put a duc.db in each user's dir with the same permissions as that user's directory, then use a wrapper to automatically pick the right database.
|
>>>> "Bill" == Bill Broadley writes:
> Found a simpler example:
> $ /opt/cray/brindexer/bin/query --json -C path,name,size /kfs2/projects/MyProj/duc | head -10
> {"name":"duc.db","path":"duc/.duc-index","size":275712}
> {"name":"go1.17.2.linux-amd64.tar.gz","path":"duc/archive","size":0}
> {"name":"projects","path":"duc/db","size":20480}
> {"name":"scratch","path":"duc/db","size":4096}
> To save RAM I'd been collapsing dirs to just show dir totals and not
> file totals, so the above would have 24576 for "duc/db"; not sure that's
> needed. Our larger directories are north of 50M files.
So maybe we need to build an 'import' command, but... how to do
incremental updates is the question, because with a chain of
directories:
a -> b -> c -> d -> e -> f -> ... -> t -> u -> v
if we make a change to 't', we need to percolate that change back up
the tree properly. Does your 'query' command return both file and
directory changes?
I haven't really done any thinking about incremental updates, but
you're more than welcome to take a look at src/libduc/index.c and see
how stuff is done.
It might be that you need to sort the JSON report depth first, so
changes can get applied down below, and then the tree walked back up
with updates as needed.
So how long does 'duc' take to do a full index? On average, since I
know you have multiple projects. Unless you can contribute some code
and commit to running tests, I don't think I'll be able to do much in
the near term. I'm planning on releasing v1.5.0a soon for people to
test, but I've still got rough edges to polish.
> There's also a fullname, but we are trying to keep a database per
> /project directory, so people can't see info about other groups'
> directories. So we put a duc.db in each user's dir with the same
> permissions as that user's directory, then a wrapper to automatically
> pick the right database.
Sure, that's sort of what I do with the CGI: I have an indexer
that generates reports on a per-filesystem basis, and the main
index.html (see the contributed scripts) builds an index from the
DBs it finds in a directory.
|
Sadly brindexer does not keep any per-directory totals; it just returns the actual metadata size for the dir, just like ls -ald /foo/dir. So you basically have to walk the tree, add things up (for the current dir and all parents), and do the work yourself, much as rbh-du (the robinhood equivalent) does.
I do have some simple code that just takes the output of brindexer and keeps a running total for each dir in RAM (including updating parent dirs), thus the concerns about running out of RAM: in my last scan of our filesystem the biggest dir had over 50M files and the JSON was 7.5GB. I don't see any easy way to do updates, and wasn't planning on doing so, at least in the first version. So I'm just looking for a way to start a new duc.db from scratch with a JSON import from brindexer.
I'm pretty sure I could query based on last modified time, but I'm not sure it's worth it.
Currently, running brindexer on our largest dir (50M files) takes about 20 minutes, which is something I'd be willing to do daily or so, or on request. That would allow users who get complaints about using too much disk space to quickly see what their dir sizes are. Your pending top N files feature sounds useful, and top N dirs would be useful as well if that is planned.
Worst case for brindexer is 20 minutes on our largest dir; duc index seems around 5 times slower, but duc has to do real I/O, not just a DB lookup. Here's an example:
$ time ./duc index /projects/MyProj
The way I used duc before was just to do a daily index and replace the old index when the new index finished. We had some DB corruptions and index crashes, so it didn't seem worth trying to update the databases; I just started over with each index.
I'd consider it. Would you be likely to accept a pull request that allowed a JSON import -> new DB? Would you prefer a new binary or a flag to duc index?
The other possibility would be that I just contribute code that could end up in your project's contrib/.
I'll review the source files you mentioned; I haven't really decided if I should
|
>>>> "Bill" == Bill Broadley writes:
I'm going to be travelling the rest of this week, so I won't have a
ton of time to work on this, but I'll be thinking about it for sure.
Hmm... I wonder if I could set up a small test Lustre system at home;
even if it's slow, I could play with it. It's an idea at least. And
I see that HPE offers RPMs of the brindexer stuff. Hmm...
> Sadly brindexer does not keep any per-directory totals; it just returns
> the actual metadata size for the dir, just like ls -ald /foo/dir. So
> you basically have to walk the tree, add things up (for the current dir
> and all parents), and do the work yourself, much as rbh-du (the
> robinhood equivalent) does.
How does brindexer work with child directories? Does it get the
results for the selected directory and all child directories?
> I do have some simple code that just takes the output of brindexer and
> keeps a running total for each dir in RAM (including updating parent
> dirs), thus the concerns about running out of RAM. But in my last
> scan of our filesystem the biggest dir had over 50M files and the
> JSON was 7.5GB. I don't see any easy way to do updates, and wasn't
> planning on doing so, at least in the first version.
So I'd probably just ingest and throw away the JSON input as fast as
possible. Or is there an API for brindexer you can call directly? I
don't have any access to Cray stuff (it's been 26+ years since I last
touched Cray hardware), but this is all Lustre FS stuff...
Is this brindexer part of a commercial Lustre product?
Can I get it as a Debian package, since that's what I mostly run at home?
> So I'm just looking for a way to start a new duc.db from scratch with a JSON import from
> brindexer.
>> Does your 'query' command return both file and directory changes?
> I'm pretty sure I could query based on last modified time, but I'm not sure
> it's worth it.
You misunderstood my question. I was asking if brindexer will return
just changes to the size or number of files in a directory, or also
the fact that N files changed in the directory?
> Currently, running brindexer on our largest dir (50M files) takes
> about 20 minutes, which I'd be willing to do daily or so, or on
> request. That would allow users who get complaints about using too
> much disk space to quickly see what their dir sizes are. Your
> pending top N files feature sounds useful, and top N dirs would be
> useful as well if that is planned.
>> So how long does 'duc' take to do a full index?
> Worst case for brindexer is 20 minutes on our largest dir; duc index
> seems around 5 times slower, but duc has to do real I/O, not just a
> DB lookup. Here's an example:
> $ time ./duc index /projects/MyProj
> real 5m17.199s
> user 0m1.150s
> sys 1m1.683s
> $ time /opt/cray/brindexer/bin/query --json -C path,name,size,type /kfs2/projects/MyProj
> real 1m3.355s
> user 3m11.541s
> sys 0m29.087s
So this /kfs2/projects/MyProj has 50 million files, but they're spread
across a bunch of sub-directories in a tree structure, right? I've
not really had any exposure to Lustre.
> The way I used duc before was just to do a daily index and replace
> the old index when the new index finished. We had some DB
> corruptions and index crashes, so it didn't seem worth trying to update
> the databases; I just started over with each index.
>> Unless you can contribute some code and commit to running tests,
>> I don't think I'll be able to do much in the near term.
> I'd consider it. Would you be likely to accept a pull request that allowed a JSON import -> new
> DB? Would you prefer a new binary or a flag to duc index?
I don't know yet; let's see what you come up with. I think at first it might
make sense to just create a new indexer (cmd-brindex.c maybe?) which
sucks in the JSON-formatted data (or talks to the API directly to save
time/space) and generates the duc DB from that info.
> The other possibility would be that I just contribute code that
> could end up in your project's contrib/.
Sure! But honestly, I'd be happy to expand duc's reach.
> I'll review the source files you mentioned; I haven't really decided if I should
> 1. write some C to ingest JSON that's preprocessed by my Go code to
> get per-directory totals
Ugh, no! What is your Go code talking to? A Lustre API? Does it
offer a C interface? Or is it an HTTP-type API? Hmm... I don't
know much about Go, but it does look like it can call C libraries. So
maybe calling into the libduc/ stuff from your Go code would work? I
just hate the idea of going from Go -> JSON -> C when it doesn't make
sense.
> 2. write some Go (easier/safer multithreading) to port index.c and write the DB files myself. I
> noticed tkrzw has Go bindings.
That might be a better thing to do, but then your code would
definitely live in contrib/ because we couldn't be expected to keep
things in sync. You would have to write to a specific duc DB format
version and make sure you check it.
It certainly sounds like a fun project!
Please feel free to post some sample code, and if you have some
instructions on setting up a simple lustre test case, that would be
ideal.
John
|
Lustre is open source. It wouldn't be my first choice for a small test system for a parallel filesystem, but it has a design common to parallel filesystems: a metadata server that tracks all file metadata, including which OSTs have which ranges of blocks. The Lustre driver is in the kernel (which is kinda painful, but performant), so you talk to the metadata server (MDS) and it tells you which block ranges are on which OST. Each OST has a native filesystem (ext4 and ZFS are common) to hold the blobs, but they don't look like normal files; it's not like ~joeUser is on OST1 and ~bobUser is on OST2 or anything. Striping of directories and/or files can be across 1 or more OSTs and can be configured at runtime.
I dug around some, and it seems like brindexer is part of ClusterStor. I did find a 5-year-old copy on GitHub, but no license file or signs of life. It seems to be part of ClusterStor's policy engine; it's not clear if it's open source.
Yes, much like find. You can make queries, so for a given dir tree (and subdirs) you could list all files, all dirs, all files over 1GB, or anything else expressible as an SQL query on the metadata, including timestamps. You can use this with a policy engine to say things like "all big files not touched in a month go to cheap storage". Some examples/overview:
Sensible, or just have duc json-index accept a pipe.
If you are interested in brindexer, I found a comparison between it and GUFI that has a fair bit of detail:
Presumably if we can get brindexer -> JSON -> duc working, it should be pretty simple to do the same for GUFI.
I'm 90% sure it is; I can't find any hint of a source repo, except for things like this that I don't think work:
And this, which is stale and possibly not licensed: https://github.com/arnabkpaul/cray_brindexer/
GUFI sounds pretty similar and is open source:
The quickstart to build the code, build an index, and query the index looks very simple. What I don't know is whether GUFI can ingest Lustre changelogs. I'd rather not walk 100PB just to find the new files.
I believe you can write any SQL query on the metadata, so something like "all files since midnight" should work. It would be somewhat painful, since a dir walk might take 20 minutes, but tracking the timestamp of the newest file in each dir and then using that date for incremental updates (select files newer than TIMESTAMP) should work.
Heh, not sure; I launched a DB query, and hopefully it takes less than 20 minutes. From the perspective of duc, brindexer, and similar tools, it's just a filesystem; nothing Lustre-specific is required. Various knobs impact performance (like caching, different pool performance characteristics, and striping across OSTs), but all the usual commands like find, du, and ls work the same. The pain point is that it's easy to have a dir with 50M files under it, and running find, ls, or du on that can take a long time, thus the need for duc.
Ah:
Ouch, 10M dirs for 50M files.
Sounds good, I'll give it a shot.
My Go code is mostly:
Basically, for each parent dir, add the record size to it. Nice and concise, but my favorite part of Go is the thread-safe multiple-producer -> multiple-consumer channels that are part of the language, making it very easy to throw X CPUs at Y bits of work and have it "just work".
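A hypothetical sketch of the approach just described (not Bill's actual code): add each record's size to its own directory and to every parent directory, which gives du-style cumulative totals at every level. `entry` and `cumulativeTotals` are made-up names, and the input is assumed to already be parsed into (directory, size) pairs. It is single-threaded on purpose: Go maps are not safe for concurrent writes, so a channel-based version would fan the JSON parsing out to several goroutines and let one goroutine own the map.

```go
package main

import (
	"fmt"
	"path"
)

// entry is one file record: the directory it lives in and its size in bytes.
type entry struct {
	Dir  string
	Size int64
}

// cumulativeTotals adds each entry's size to its directory and to every
// ancestor of that directory (stopping before "." or "/"), so each key ends
// up holding the total for the whole subtree under it.
func cumulativeTotals(entries []entry) map[string]int64 {
	totals := make(map[string]int64)
	for _, e := range entries {
		for d := path.Clean(e.Dir); d != "." && d != "/"; d = path.Dir(d) {
			totals[d] += e.Size
		}
	}
	return totals
}

func main() {
	// Records from the brindexer sample in this thread.
	totals := cumulativeTotals([]entry{
		{"duc/.duc-index", 275712},
		{"duc/archive", 0},
		{"duc/db", 20480},
		{"duc/db", 4096},
	})
	fmt.Println(totals["duc"], totals["duc/db"]) // 300288 24576
}
```

The same inner loop, fed a size delta instead of a size, is how a change deep in a chain like a -> ... -> t would percolate back up to a.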
It's a fair bit of work and involves recompiling the kernel; here's a good overview:
Generally I'd consider CephFS to be easier to set up and manage, but either will be a good intro to parallel filesystems. I'd plan on at least 3 nodes (1 metadata and 2 OSTs), but they could be virtual. Ceph is included in Ubuntu, and probably in Debian as well.
|
I wanted to add a scanner for the Cray scanner, which consumes Lustre changelogs. That way, instead of duc scanning, I could use the Cray API, which is kept up to date by consuming changelogs.
I made a test dir:
/home/MyUser/tmp:
total 12
drwxrwxr-x 2 MyUser MyUser 4096 Aug 16 13:48 DirA
drwxrwxr-x 2 MyUser MyUser 4096 Aug 16 13:48 DirB
drwxrwxr-x 2 MyUser MyUser 4096 Aug 16 13:49 DirC
/home/MyUser/tmp/DirA:
total 4
-rw-rw-r-- 1 MyUser MyUser 2 Aug 16 13:48 1
/home/MyUser/tmp/DirB:
total 8
-rw-rw-r-- 1 MyUser MyUser 3 Aug 16 13:48 2
-rw-rw-r-- 1 MyUser MyUser 4 Aug 16 13:48 3
/home/MyUser/tmp/DirC:
total 12
-rw-rw-r-- 1 MyUser MyUser 5 Aug 16 13:49 4
-rw-rw-r-- 1 MyUser MyUser 6 Aug 16 13:49 5
-rw-rw-r-- 1 MyUser MyUser 6 Aug 16 13:49 6
I made the database SQLite, since I figured it would be the easiest way to inspect the result. Here's the schema and dump:
$ cat schema
CREATE TABLE blobs(key unique primary key, value);
CREATE INDEX keys on blobs(key);
$ cat dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE blobs(key unique primary key, value);
INSERT INTO blobs VALUES('fc01/19c42d8',X'f9f311fb019c42d7fb66bfad15013102f907100105');
INSERT INTO blobs VALUES('fc01/19c42d9',X'f9f311fb019c42d7fb66bfad28013304f907100105013203f907100105');
INSERT INTO blobs VALUES('fc01/19c42da',X'f9f311fb019c42d7fb66bfad3b013506f907100105013405f907100105013606f907100105');
INSERT INTO blobs VALUES('fc01/19c42d7',X'0000fb66bfacce044469724102f917100202f9f311fb019c42d8044469724207f927100302f9f311fb019c42d9044469724311f937100402f9f311fb019c42da');
INSERT INTO blobs VALUES('duc_index_reports',X'2f686f6d652f6262726f61646c65792f746d7 [MANY ZEROS DELETED]');
INSERT INTO blobs VALUES('/home/MyUser/tmp',X'132f686f6d652f6262726f61646c65792f746d70f9f311fb019c42d7fb66bfad7efa0e8e37fb66bfad7efa0e8f0b06041af997100a');
CREATE INDEX keys on blobs(key);
COMMIT;
What's the value for "/home/MyUser/tmp"?
How does DirA/1 map to 'fc01/19c42d8' or similar?
Is the value for each dir some encoding of filename + size? So 3 files = file1+size,file2+size,file3+size or something?
What's the value for duc_index_reports?