
Upload takes too long with duplicate_files = false. Maybe add hashes cache? #367

Open
vvrein opened this issue Nov 20, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@vvrein

vvrein commented Nov 20, 2024

Hi! Thank you for the great tool ❤️
I started using it recently, migrating from another pastebin, and as far as I can see, rustypaste with duplicate_files = false hashes every file in the upload folder to check whether the file being uploaded already exists.
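
Roughly, in Python terms, this is what I believe happens on every upload (just an illustration of why it scales with the total data size; the hash function and names here are made up, not the actual rustypaste code):

import hashlib
from pathlib import Path

def find_duplicate(upload_dir, new_content):
    # hypothetical sketch: hash the uploaded content, then re-hash every stored file
    new_hash = hashlib.sha256(new_content).hexdigest()
    for path in Path(upload_dir).iterdir():
        if path.is_file():
            # every byte of every stored file is read and hashed on each upload
            if hashlib.sha256(path.read_bytes()).hexdigest() == new_hash:
                return path.name
    return None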

Here is info about my existing uploads:

# ls -lha data/files/ | wc -l
727
# du -hs data/files/
3.9G    .data/files/

Uploaded file:

$ ls -lha favicon-152.png
-rw-r--r-- 1 vrein vrein 7.7K Dec 20  2020 favicon-152.png

Uploading time with duplicate_files = true:

$ time curl http://127.0.0.1:8880/ -F "file=@favicon-152.png"
http://127.0.0.1:8880/favicon-152.AqlQ6JKAp9.png

real    0m0.010s
user    0m0.006s
sys     0m0.003s

Uploading time with duplicate_files = false:

$ time curl http://127.0.0.1:8880/ -F "file=@favicon-152.png"
http://127.0.0.1:8880/favicon-152.AqlQ6JKAp9.png

real    0m10.411s
user    0m0.007s
sys     0m0.003s

I've added some random large files with dd if=/dev/urandom of=largefile bs=1M count=... and summarized the results in a table:

total files count    total files size    uploading time
727                  3.9G                10.411s
728 (+1 1G)          4.9G                13.137s
729 (+1x2G +1x1G)    6.9G                19.254s
3730 (+3k 1M)        6.9G                18.403s

Upload time mostly depends on total file size; the file count, unless it reaches a few million, should not have a drastic impact.

I think this is a really great feature, but with the current implementation the upload time grows as file size and count increase, so maybe a simple cache mechanism, like storing file hashes in memory or in a file, is worth implementing.

rustypaste version: built from source bf6dd31
os: arch linux
kernel: 6.10.10-arch1-1
processor: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz

Unfortunately I have no experience with Rust, so I can only help with testing and debugging :)

@tessus
Collaborator

tessus commented Nov 20, 2024

Thanks for the report.

What you are saying makes sense. I haven't checked the code yet, but it looks like hashes are built for all files every time a file is uploaded. Otherwise uploading a file with a few bytes wouldn't take 10s.

Adding a cache would be a nice enhancement. On a technical level there are a few options, but since the server does not use a database, only text files, some of these options are less performant.

hash is stored in extended attributes

Pros

  • fairly easy to implement

Contras

  • the extended attributes of all files have to be read on every upload
  • extended attributes are not available on Windows

hashes are stored in a text file with a mapping to the filename (see the sketch after this list)

Pros

  • only one file has to be read
  • available on Windows
  • fairly easy to implement

Contras

  • depending on the number of files on the server, the file can have many lines. But even with a very large hash-mapping file, the processing would probably be sub-second.

use of sqlite or any other engine that allows indexed or hashed access

Pros

  • fairly easy to implement
  • most performant

Contras

  • additional dependency
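
To make the text-file mapping option a bit more concrete, here is a rough Python sketch (the file format, names, and choice of hash are purely illustrative, not a design decision):

import hashlib

CACHE_FILE = "hashes.txt"  # illustrative: one "<hash> <filename>" pair per line

def load_cache(path=CACHE_FILE):
    cache = {}
    try:
        with open(path) as f:
            for line in f:
                file_hash, filename = line.split(maxsplit=1)
                cache[file_hash] = filename.strip()
    except FileNotFoundError:
        pass
    return cache

def lookup_or_register(content, filename, cache):
    # return the stored filename for duplicate content, otherwise register the new file
    file_hash = hashlib.sha256(content).hexdigest()
    if file_hash in cache:
        return cache[file_hash]
    cache[file_hash] = filename
    with open(CACHE_FILE, "a") as f:  # appending keeps the per-upload cost constant
        f.write(f"{file_hash} {filename}\n")
    return filename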

I am not the owner, so I can't make any design decision. I currently also don't have much time to implement any of this.
But maybe @orhun can add this feature in one of his live streaming coding sessions. ;-)

@tessus tessus added the enhancement New feature or request label Nov 20, 2024
@vvrein
Author

vvrein commented Nov 21, 2024

I think the implementation should match rustypaste's initial idea / philosophy.
Is it meant to serve thousands, tens of thousands, or millions of files? Up to tens of thousands, storing file hashes in a plain file should be fine. Beyond that, sqlite with an index may be the better choice.

I wrote a simple script to test search speed under different conditions.

hashes.py
import os
import random
import hashlib
import string
import sqlite3
import shutil
import timeit

HASH_FILE="hashes.txt"
HASH_DB="hashes.sqlite"
HASH_BYTES_DB="hashes_bytes.sqlite"
HASH_INDEXED_DB="hashes_indexed.sqlite"
FILENAME_LENGTH = 15
HASHES_COUNT = 1*10**6


def getrandline():
    rb = random.randbytes(10)
    line = hashlib.sha256(rb).hexdigest() \
            + " " * 5 \
            + ''.join(random.choices(string.ascii_letters, k=FILENAME_LENGTH)) \
            + ".txt"
    return line

# create hash text file with 1M records (85Mb)
if not os.path.exists(HASH_FILE):
    with open(HASH_FILE, 'a') as f:
        for _ in range(0, HASHES_COUNT):
            f.write(getrandline() + "\n")


# store hashes as text in sqlite (92Mb)
if not os.path.exists(HASH_DB):
    con = sqlite3.connect(HASH_DB)
    cur = con.cursor()
    cur.execute("CREATE TABLE hashes(hash, filename)")
    with open(HASH_FILE, "r") as f:
        for line in f:
            _hash, _filename = line.split()
            cur.execute("insert into hashes values (?,?)", (_hash, _filename))
    con.commit()
    con.close()


# try to store hashes as bytes. This does not work, since the values are stored as text in sqlite (92Mb)
#   the size is the same as in the previous variant
if not os.path.exists(HASH_BYTES_DB):
    con = sqlite3.connect(HASH_BYTES_DB)
    cur = con.cursor()
    cur.execute("CREATE TABLE hashes(hash blob, filename blob)")
    with open(HASH_FILE, "r") as f:
        for line in f:
            _hash, _filename = line.split()
            cur.execute("insert into hashes values (?,?)", (_hash.encode(), _filename.encode()))
    con.commit()
    con.close()


# same data as above, plus an index on hash; gives x1.77 overhead to size (163Mb)
if not os.path.exists(HASH_INDEXED_DB):
    shutil.copy(HASH_DB, HASH_INDEXED_DB)
    con = sqlite3.connect(HASH_INDEXED_DB)
    cur = con.cursor()
    cur.execute("create index hash_idx on hashes (hash)")
    con.commit()
    con.close()





search_in_file = """
def search_in_file(file, search):
    with open(file, "r") as f:
        for line in f:
            if search in line:
                return line
search_in_file(HASH_FILE, searching_for)
"""


search_in_db = """
con = sqlite3.connect(HASH_DB)
def search_in_db(con, search):
    cur = con.cursor()
    res = cur.execute("select * from hashes where hash = ?", (search,))
    return res.fetchone()
search_in_db(con, searching_for)
con.close()
"""

search_in_indexed_db = """
con = sqlite3.connect(HASH_INDEXED_DB)
def search_in_db(con, search):
    cur = con.cursor()
    res = cur.execute("select * from hashes where hash = ?", (search,))
    return res.fetchone()
search_in_db(con, searching_for)
con.close()
"""

searching_for = "some_hash_here"

num = 100
print(f"Search for hash {searching_for} {num} times in file " + str(timeit.timeit(search_in_file, number=num, globals=globals())))
print(f"Search for hash {searching_for} {num} times in sqlite db " + str(timeit.timeit(search_in_db, number=num, globals=globals())))
print(f"Search for hash {searching_for} {num} times in sqlite indexed db "+ str(timeit.timeit(search_in_indexed_db, number=num, globals=globals())))

Here are the results of searching 100 times for the fourth hash and for the fourth hash from the end:

filename                   entries    size    hash position    search time
hashes.txt                 1M         85M     -4               13.583s
hashes.sqlite              1M         92M     -4               6.052s
hashes_indexed.sqlite      1M         163M    -4               0.0103s
------------------------   --------   -----   --------------   -----------
hashes.txt                 1M         85M     4                0.002s
hashes.sqlite              1M         92M     4                6.007s
hashes_indexed.sqlite      1M         163M    4                0.011s

filename                   entries    size    hash position    search time
hashes.txt                 100K       8.5M    -4               1.378s
hashes.sqlite              100K       9.1M    -4               0.641s
hashes_indexed.sqlite      100K       17M     -4               0.010s
------------------------   --------   -----   --------------   -----------
hashes.txt                 100K       8.5M    4                0.001s
hashes.sqlite              100K       9.1M    4                0.655s
hashes_indexed.sqlite      100K       17M     4                0.010s

filename                   entries    size    hash position    search time
hashes.txt                 10K        870K    -4               0.139s
hashes.sqlite              10K        920K    -4               0.088s
hashes_indexed.sqlite      10K        1.7M    -4               0.010s
------------------------   --------   -----   --------------   -----------
hashes.txt                 10K        870K    4                0.001s
hashes.sqlite              10K        920K    4                0.086s
hashes_indexed.sqlite      10K        1.7M    4                0.010s

So the most consistent and position-independent results are obtained with sqlite plus an index on hash (sqlite without an index is consistent too, but not as fast), but it has the cons @tessus mentioned earlier.

@vvrein
Author

vvrein commented Nov 21, 2024

But considering that cache entries should not only be created, but also updated (uploading another file with the same name) and deleted (expired files and files deleted by the user), it may be easier to do this with sqlite.

@tessus
Collaborator

tessus commented Nov 22, 2024

but updated (upload another file with same name)

Nope. Files with the same name will be rejected by the server.

deleted (expired files and files deleted by user)

Yep, you are correct.

I love sqlite for clients in a single user env. I am not so happy with sqlite on a server though. I have never run perf tests, but an insert requires a lock, and I am not sure what isolation levels are supported by sqlite. What I am trying to say is that the bottleneck for uploading could then be the max transaction rate for inserts. There could be multiple threads trying to insert data.
However, I doubt that there are thousands of users trying to upload thousands of files. (at the same time)
Afaik rustypaste was not designed to handle that kind of traffic.
Additionally, inserting the hash could be done asynchronously. For the upload itself only the calculated hash is important, and reads can be done without a lock if done via an uncommitted read. (But that means there could be duplicate files in certain situations.)
Then again, one could also just insert the hash. If it fails, it means the file exists already. No reads necessary at all. But then we are bound by insert performance.
A lot of possible ways to implement this... ;-) And as always there are trade-offs.
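
For illustration, the insert-only idea could look roughly like this (Python's sqlite3 here is just to show the principle; the schema is made up, and a primary key on the hash makes the duplicate check implicit):

import sqlite3

con = sqlite3.connect("hashes.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS hashes (hash TEXT PRIMARY KEY, filename TEXT)")

def register_upload(con, file_hash, filename):
    # try to insert; a constraint violation means the content already exists
    try:
        with con:  # commits on success, rolls back on error
            con.execute("INSERT INTO hashes VALUES (?, ?)", (file_hash, filename))
        return filename
    except sqlite3.IntegrityError:
        row = con.execute("SELECT filename FROM hashes WHERE hash = ?", (file_hash,)).fetchone()
        return row[0]  # hand back the filename of the already-stored file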

Anyway, let's see what orhun thinks about this.

@orhun orhun changed the title Upload goes too long with duplicate_files = false. Maybe add hashes cache? Upload takes too long with duplicate_files = false. Maybe add hashes cache? Dec 7, 2024
@orhun
Owner

orhun commented Dec 7, 2024

Hey guys, sorry for the late reply! Just started to get into the flow of things over here again.

First of all, I agree that the current way of handling duplicate files is not the most efficient and definitely not very scalable. So it should be fixed in a way that does not stray from the design philosophy of rustypaste.

Secondly, thanks for exploring the different solutions out there and benchmarking them! This definitely helps greatly with considering these options.

So the most consistent and position independent results are got with sqlite + index on hash (sqlite without index is consistent too, but not so fast

I'm not a big fan of adding a database solution for this project due to the reasons listed by @tessus, and additionally:

  • It's an endless pit. Supporting X db would mean we can support Y db in the future. And then we just end up with a Z cloud database in the end.
  • I like files. Files are databases.

So we should simply use a file for caching them, I think. E.g. a file that contains all the hashes and is updated periodically would be a good start imo.
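
For example, a cronjob-style rebuild could be as simple as this rough sketch (paths, file format, and hash are made up for illustration):

import hashlib
import os

def rebuild_hash_cache(upload_dir, cache_file="hashes.txt"):
    # re-hash everything in one background pass and atomically swap in the new cache file
    tmp = cache_file + ".tmp"
    with open(tmp, "w") as out:
        for name in os.listdir(upload_dir):
            path = os.path.join(upload_dir, name)
            if not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            out.write(f"{digest.hexdigest()} {name}\n")
    os.replace(tmp, cache_file)  # readers never see a half-written cache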

hash is stored in extended attributes

I also like this btw - OS is a problem though. There was an attempt in #292 IIRC.

Either way, I would appreciate it if someone went ahead and implemented this by any means. This definitely should be fixed.

@tessus
Collaborator

tessus commented Dec 7, 2024

So we should simply use a file for caching them I think. e.g. a file that contains all the hashes and updating them periodically would be a good start imo.

Well, the file must be updated when a file is uploaded, otherwise there could be duplicate files.
The file must include the hash and the filename, otherwise the original filename cannot be returned when a duplicate file is uploaded.
There is an edge case though. What if someone uploads a file x with the name y, and after a while someone uploads the same file x with the name z? (y and z are given via the header, or random names are off and x not only has the same content but also the same name.)
In that case the second person who uploads x will get the filename y instead of z. IMO that is valid if duplicates are set to false, but we should maybe add it to the readme.

The hash also has to be deleted from the hash cache file when a file is deleted on the server.

I have to look into this, but won't we run into file locking issues when multiple threads upload or delete files?
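
One possible way around most of the locking issues would be to keep the cache in memory, serialize mutations behind a single lock, and rewrite the file atomically. A rough sketch of the idea (not tied to how the server actually shares state between threads):

import os
import threading

class HashCache:
    # hash -> filename map guarded by one lock; the on-disk copy is replaced atomically
    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()
        self.entries = {}

    def remove_file(self, filename):
        # drop the entry for a deleted or expired file and persist the change
        with self.lock:
            self.entries = {h: n for h, n in self.entries.items() if n != filename}
            self._flush()

    def _flush(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            for file_hash, name in self.entries.items():
                f.write(f"{file_hash} {name}\n")
        os.replace(tmp, self.path)  # no partially written file is ever visible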

This is certainly an interesting challenge. Not sure, if I get to it soon. Maybe you could use this for one of your live coding sessions...

@orhun
Owner

orhun commented Dec 9, 2024

Well, the file must be updated when a file is uploaded, otherwise there could be duplicate files.
The file must include the hash and the filename, otherwise the original filename cannot be returned when a duplicate file is uploaded.

Ah I see, good point.

In that case the second person who uploads x, will get the filename y instead of z. IMO it is valid, if duplicates are set to false. But we should maybe add it to the readme.

That's quite an edge case :D But yes, I think this should be mentioned in the README. Alternatively, we should add more fixture tests to verify the behavior of combinations of different features, I think. There might be other cases we are missing.

The hash also has to be deleted from the hash cache file when a file is deleted on the server.

Yup, that's why I suggested a cronjob-like approach.

I have to look into this, but won't we run into file locking issues, when multiple threads upload or delete files?

Probably we will. #246 might help here though.

Maybe you could use this for one of your live coding sessions...

Haha yeah, good idea!
