Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spotlight search using data in MySQL CNID DB #1409

Open
andylemin opened this issue Aug 15, 2024 · 7 comments
Open

Spotlight search using data in MySQL CNID DB #1409

andylemin opened this issue Aug 15, 2024 · 7 comments

Comments

@andylemin
Copy link

Is your feature request related to a problem? Please describe.

  • Setting up Tracker and DBUS etc, for Spotlight is difficult, complex, and fragile, and often does not work. It is a lot of effort just to be able to search using an index rather than sequential scan.

Describe the solution you'd like

  • have a simple filename-only based Spotlight search using the MySQL CNID DB.

I have seen very few, if any, truly useful cases in production of searching via the metadata that Tracker provides. The vast majority of searches are just filename searches, and the metadata often just adds noise to the results. As a result you often have to configure the tracker to not scan for extra data anyway.

Being able to have a simple name only search using the existing MySQL would be fast, simple, stable and efficient.

It would also allow for instant Spotlight searches without having to wait for the index scans to catch up.

@rdmark
Copy link
Member

rdmark commented Aug 15, 2024

The complexity / fragility argument I definitely buy. It's a lot of scaffolding for this one feature. I wonder though if the metadata search can be tweaked to make it more generally useful? Do you have examples from the wild of "useless" metadata filtering?

Regarding the MySQL proposal: Do you have a POC lined up already that demonstrates how this can be set up with the existing CNID backend? It's not immediately obvious to me how the CNID backend can be plugged in as a Spotlight backend as-is, but perhaps it is straight-forward.

As a counter-argument to using MySQL for this purpose: We still arguably end up with a "difficult" solution since setting up and administering a MySQL database is non-trivial for most users. Can a simpler database be used for this, e.g. SQLite?

Finally, as yet another path forward: Build a fully indexed Spotlight backend using Elasticsearch. This is what Samba has done. And since Samba and Netatalk share DNA, it might even be fairly easy to port over their solution.

Perhaps an overload of options in one comment. My apologies. :)

@Windoze345
Copy link

This already exists for dbd afaict

search db = BOOLEAN (default: no) (V)
Use fast CNID database namesearch instead of slow recursive filesystem search. Relies on a consistent CNID database, i.e. Samba or local filesystem access lead to inaccurate or wrong results. Works only for "dbd" CNID db volumes.

@andylemin
Copy link
Author

andylemin commented Sep 2, 2024

Hi @rdmark,
Sorry for the slow reply.

I think you are underselling how easy Netatalk makes it :) Thanks to all the hard work by the Netatalk contributors, getting the MySQL backend working is simple.

For example, for a production setup on FreeBSD and ZFS pool with NVMe Special Device/Vdev;

Create a ZFS Vol for MySQL data dir;

zfs create -o atime=off -o recordsize=32k -o special_small_blocks=16k -o compression=lz4 -o xattr=sa -o dnodesize=auto -o checksum=blake3 -o redundant_metadata=most -o sync=disabled -o mountpoint=/var/db/mysql tank/cnid

If you don't have a special device vdev, remove special_small_blocks=16k, and set recordsize=16k instead.

sync=disabled because it does not matter if the volume is lost/corrupted, as it takes just minutes to recreate (if you have a box with plenty of RAM and a warm ARC cache), so speed is more important.

Install MySQL

I personally recommend not changing the default MySQL data dir, as changing the default location can be painful, instead just mount the ZFS volume over the top of /var/db/mysql as above with mountpoint=/var/db/mysql.

pkg search mysql
pkg install mysql84-server mysql84-client
sudo service mysql-server enable
sudo service mysql-server start

Run insecure setup script and set root password

sudo mysql_secure_installation

Connect to MySQL and create cnid DB.

mysql -u root -p
CREATE DATABASE cnid;
Not sure if this is actually needed (it might be created automatically now) but I always do this out of habit.

Configure Netatalk to use MySQL

Arguably the hardest part about this is the afp.conf config;

https://netatalk.io/3.2/htmldocs/configuration - says very little, and does not mention anything about how Netatalk does the hard work for you of creating the DB Schema etc.

https://netatalk.io/3.2/htmldocs/afp.conf.5 - lists the 4 Global options under Miscellaneous Options, but does not give you any hints that you also need an associated setting for each of the volumes.. (this caught me out for ages). Ie, to search for cnid scheme in a different part of the docs. Having a complete example config (like below) in one place in the docs would help.

/usr/local/etc/afp.conf

[Global]
 cnid mysql host = localhost
 cnid mysql user = root
 cnid mysql pw = <PASSWORD>
 cnid mysql db = cnid

[share1]
 cnid scheme = mysql

Start/Restart Netatalk

/usr/local/etc/rc.d/netatalk restart

Netatalk will create the required MySQL tables and schema automatically :)

Rebuild/populate the CNID DB on MySQL

dbd -f -F /usr/local/etc/afpd.conf

https://netatalk.io/3.0/htmldocs/dbd.1

With the default CNID backend, this takes several hours for a share with a few million objects.
With MySQL CNID backend, this takes 10-20 minutes for me (around 10x faster in this setup).

Regarding "We still arguably end up with a "difficult" solution since setting up and administering a MySQL database is non-trivial for most users." - I don't see this, if a user can setup a Linux file server with Netatalk they could also setup MySQL as above.
As long as the user is not running MySQL for other applications on the same box, it is mostly just setup and forget.. Sure there are a few steps here, but it is almost zero maintenance after installed. If you have any issues, you can just destroy and recreate the DB.

For example, the default tuning values for MySQL 8+ allow for a Netatalk share with several million files, long before you have to increase any memory limits, or do any fancy tuning. And even if you do, there are tools like mysqltuner.pl etc.
Worst case you might have to rebuild the CNID DB with dbd occasionally.

To be fair, you might have to fiddle with the boot scripts to ensure that the ZFS Vol is mounted before MySQL starts, and that Netatalk starts last after MySQL. And users should probably have MySQL listen on 127.0.0.1 only.
Otherwise it is very little effort for a stable and significant speed boost.

Regarding user cases;
Tracker and Elasticsearch are intended to index file contents, not just file names. This would be useful for users hosting code and document file shares. More complex search for more complex file share.

I would guess that most users are using Netatalk for media file shares, where filename search is more useful. Simple search for simple file share.

This is what brought me to the idea of just using the MySQL backend for the simple case. It already has all the filenames, it is fast, and it is updated in realtime (as soon as a file is created (via Netatalk), it can be found in the MySQL backend)..
So no changes are needed for the MySQL schema, only some Netatalk code to run an SQL query for a given Spotlight query.. I would propose this is a small change for a majority user case.

NB; For files added to the share volumes not via Netatalk, I run the dbd command without -f via cron regularly, to update the CNID.

Regarding a POC; Would you like me to play with some example SQL queries against the cnid DB?

Thanks for your consideration

@andylemin
Copy link
Author

Looks like it could be as simple as; select * from <VolUUID> where Name LIKE '%<SpotlightQueryString>%';

Ie, Substituting '*' wildcard in Spotlight query for '%' should do the trick and is case insensitive too 🙂

There is no index on the Name field (obviously), so it should be O(n) time relative to table length.
A quick test showed this LIKE sequential scan can search 1M records in ~500ms consistently, and of course does not touch the spinning disks.

Keen to hear your thoughts?

@rdmark
Copy link
Member

rdmark commented Sep 7, 2024

I think you are underselling how easy Netatalk makes it :) Thanks to all the hard work by the Netatalk contributors, getting the MySQL backend working is simple.

Impressive research! You say "simple", and then proceed with 3 screens worth of instructions. ;)

Joke aside, I get that it's not "hard" per se, if you're familiar with operating a system like this. A good chunk of Netatalk users today are hobbyists who often don't have sysadmin experience, which is why I'm trying to streamline the installation process as much as possible.

The MySQL backend has a documentation deficit, for certain. Would you be open to filing a PR that adds the example, and any additional context that you think would help? The docs are defined in XML format here: https://github.com/Netatalk/netatalk/tree/main/doc

I do love the real-time updates and fast search of your proposed solution. It's definitely more practically useful than Tracker. What would be the next step?

@andylemin
Copy link
Author

andylemin commented Sep 8, 2024

Haha, I knew you would make a comment about the length of my example. 😉 It started out short..

Absolutely! I would be happy to build a documentation PR.

I can provide some narrative about the auto-setup goodness, and a complete afp.conf example.
A generic on-EXT example for home Linux users, and an on-ZFS production example.

The next steps are tricker for myself, as I have not written any C for over 20 years..
I would be interested in attempting it, but I would need time from someone patient and willing to answer my newbie questions, and who could nudge me in the right direction to ease the learning curve on the code base?

@rdmark
Copy link
Member

rdmark commented Sep 8, 2024

I am frankly an amateur myself but am of course offering to assist to the best of my ability.

In fact, I would say that the last 3 years of this project has been characterized by a bunch of non-Cdevs (re)learning C and figuring things out as we go along. :)

@rdmark rdmark added this to the release-4.1 milestone Nov 7, 2024
@rdmark rdmark removed this from the release-4.1 milestone Jan 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants