
Upload takes too long with duplicate_files = false. Maybe add hashes cache? #367

Open
vvrein opened this issue Nov 20, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@vvrein

vvrein commented Nov 20, 2024

Hi! Thank you for the great tool ❤️
I started using it recently, migrating from another pastebin, and as far as I can see, rustypaste with duplicate_files = false hashes every file in the upload folder to check whether the file being uploaded already exists.
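
Roughly, in Python terms, this is what I believe happens on every upload (just an illustration of why it scales with the total data size; the hash function and names here are made up, not the actual rustypaste code):

import hashlib
from pathlib import Path

def find_duplicate(upload_dir, new_content):
    # hypothetical sketch: hash the uploaded content, then re-hash every stored file
    new_hash = hashlib.sha256(new_content).hexdigest()
    for path in Path(upload_dir).iterdir():
        if path.is_file():
            # every byte of every stored file is read and hashed on each upload
            if hashlib.sha256(path.read_bytes()).hexdigest() == new_hash:
                return path.name
    return None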

Here is info about my existing uploads:

# ls -lha data/files/ | wc -l
727
# du -hs data/files/
3.9G    .data/files/

Uploaded file:

$ ls -lha favicon-152.png
-rw-r--r-- 1 vrein vrein 7.7K Dec 20  2020 favicon-152.png

Uploading time with duplicate_files = true:

$ time curl http://127.0.0.1:8880/ -F "file=@favicon-152.png"
http://127.0.0.1:8880/favicon-152.AqlQ6JKAp9.png

real    0m0.010s
user    0m0.006s
sys     0m0.003s

Uploading time with duplicate_files = false:

$ time curl http://127.0.0.1:8880/ -F "file=@favicon-152.png"
http://127.0.0.1:8880/favicon-152.AqlQ6JKAp9.png

real    0m10.411s
user    0m0.007s
sys     0m0.003s

I've added some random large files with dd if=/dev/urandom of=largefile bs=1M count=... and summarized the results in a table:

total files count    total files size    uploading time
727                  3.9G                10.411s
728 (+1 1G)          4.9G                13.137s
729 (+1x2G +1x1G)    6.9G                19.254s
3730 (+3k 1M)        6.9G                18.403s

Upload time mostly depends on total file size; the file count, unless it reaches a few million, should not have a drastic impact.

I think this is a really great feature, but with the current implementation the upload time grows as file size and count increase, so maybe a simple cache mechanism, like storing file hashes in memory or in a file, is worth implementing.

rustypaste version: built from source bf6dd31
os: arch linux
kernel: 6.10.10-arch1-1
processor: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz

Unfortunately I have no experience with Rust, so I can only help with testing and debugging :)

@tessus
Collaborator

tessus commented Nov 20, 2024

Thanks for the report.

What you are saying makes sense. I haven't checked the code yet, but it looks like hashes are built for all files every time a file is uploaded. Otherwise uploading a file with a few bytes wouldn't take 10s.

Adding a cache would be a nice enhancement. On a technical level there are a few options, but since the server does not use a database, only text files, some of these options are less performant.

hash is stored in extended attributes

Pros

  • fairly easy to implement

Contras

  • the extended attributes of all files have to be read on every upload
  • extended attributes are not available on Windows

hashes are stored in a text file with a mapping to the filename (see the sketch after this list)

Pros

  • only one file has to be read
  • available on Windows
  • fairly easy to implement

Contras

  • depending on the number of files on the server, the file can have many lines. But even with a very large hash-mapping file, the processing would probably be sub-second.

use of sqlite or any other engine that allows indexed or hashed access

Pros

  • fairly easy to implement
  • most performant

Contras

  • additional dependency
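
To make the text-file mapping option a bit more concrete, here is a rough Python sketch (the file format, names, and choice of hash are purely illustrative, not a design decision):

import hashlib

CACHE_FILE = "hashes.txt"  # illustrative: one "<hash> <filename>" pair per line

def load_cache(path=CACHE_FILE):
    cache = {}
    try:
        with open(path) as f:
            for line in f:
                file_hash, filename = line.split(maxsplit=1)
                cache[file_hash] = filename.strip()
    except FileNotFoundError:
        pass
    return cache

def lookup_or_register(content, filename, cache):
    # return the stored filename for duplicate content, otherwise register the new file
    file_hash = hashlib.sha256(content).hexdigest()
    if file_hash in cache:
        return cache[file_hash]
    cache[file_hash] = filename
    with open(CACHE_FILE, "a") as f:  # appending keeps the per-upload cost constant
        f.write(f"{file_hash} {filename}\n")
    return filename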

I am not the owner, so I can't make any design decision. I currently also don't have much time to implement any of this.
But maybe @orhun can add this feature in one of his live streaming coding sessions. ;-)

@tessus tessus added the enhancement New feature or request label Nov 20, 2024
@vvrein
Author

vvrein commented Nov 21, 2024

I think the implementation should match rustypaste's initial idea / philosophy.
Is it meant to serve thousands, tens of thousands, or millions of files? Up to tens of thousands, storing file hashes in a plain file should be fine. Beyond that, sqlite with an index may be the better choice.

I wrote a simple script to test search speed under different conditions.

hashes.py
import os
import random
import hashlib
import string
import sqlite3
import shutil
import timeit

HASH_FILE="hashes.txt"
HASH_DB="hashes.sqlite"
HASH_BYTES_DB="hashes_bytes.sqlite"
HASH_INDEXED_DB="hashes_indexed.sqlite"
FILENAME_LENGTH = 15
HASHES_COUNT = 1*10**6


def getrandline():
    rb = random.randbytes(10)
    line = hashlib.sha256(rb).hexdigest() \
            + " " * 5 \
            + ''.join(random.choices(string.ascii_letters, k=FILENAME_LENGTH)) \
            + ".txt"
    return line

# create hash text file with 1M records (85Mb)
if not os.path.exists(HASH_FILE):
    with open(HASH_FILE, 'a') as f:
        for _ in range(0, HASHES_COUNT):
            f.write(getrandline() + "\n")


# store hashes as text in sqlite (92Mb)
if not os.path.exists(HASH_DB):
    con = sqlite3.connect(HASH_DB)
    cur = con.cursor()
    cur.execute("CREATE TABLE hashes(hash, filename)")
    with open(HASH_FILE, "r") as f:
        for line in f:
            _hash, _filename = line.split()
            cur.execute("insert into hashes values (?,?)", (_hash, _filename))
    con.commit()
    con.close()


# try to store hashes as bytes. This does not work, since the values are stored as text in sqlite (92Mb)
#   the size is the same as in the previous variant
if not os.path.exists(HASH_BYTES_DB):
    con = sqlite3.connect(HASH_BYTES_DB)
    cur = con.cursor()
    cur.execute("CREATE TABLE hashes(hash blob, filename blob)")
    with open(HASH_FILE, "r") as f:
        for line in f:
            _hash, _filename = line.split()
            cur.execute("insert into hashes values (?,?)", (_hash.encode(), _filename.encode()))
    con.commit()
    con.close()


# same data as above, plus an index on hash; gives x1.77 overhead to size (163Mb)
if not os.path.exists(HASH_INDEXED_DB):
    shutil.copy(HASH_DB, HASH_INDEXED_DB)
    con = sqlite3.connect(HASH_INDEXED_DB)
    cur = con.cursor()
    cur.execute("create index hash_idx on hashes (hash)")
    con.commit()
    con.close()





search_in_file = """
def search_in_file(file, search):
    with open(file, "r") as f:
        for line in f:
            if search in line:
                return line
search_in_file(HASH_FILE, searching_for)
"""


search_in_db = """
con = sqlite3.connect(HASH_DB)
def search_in_db(con, search):
    cur = con.cursor()
    res = cur.execute("select * from hashes where hash = ?", (search,))
    return res.fetchone()
search_in_db(con, searching_for)
con.close()
"""

search_in_indexed_db = """
con = sqlite3.connect(HASH_INDEXED_DB)
def search_in_db(con, search):
    cur = con.cursor()
    res = cur.execute("select * from hashes where hash = ?", (search,))
    return res.fetchone()
search_in_db(con, searching_for)
con.close()
"""

searching_for = "some_hash_here"

num = 100
print(f"Search for hash {searching_for} {num} times in file " + str(timeit.timeit(search_in_file, number=num, globals=globals())))
print(f"Search for hash {searching_for} {num} times in sqlite db " + str(timeit.timeit(search_in_db, number=num, globals=globals())))
print(f"Search for hash {searching_for} {num} times in sqlite indexed db "+ str(timeit.timeit(search_in_indexed_db, number=num, globals=globals())))

Here are the results of searching 100 times for the fourth hash and for the fourth hash from the end:

filename                   entries    size    hash position    search time
hashes.txt                 1M         85M     -4               13.583s
hashes.sqlite              1M         92M     -4               6.052s
hashes_indexed.sqlite      1M         163M    -4               0.0103s
------------------------   --------   -----   --------------   -----------
hashes.txt                 1M         85M     4                0.002s
hashes.sqlite              1M         92M     4                6.007s
hashes_indexed.sqlite      1M         163M    4                0.011s

filename                   entries    size    hash position    search time
hashes.txt                 100K       8.5M    -4               1.378s
hashes.sqlite              100K       9.1M    -4               0.641s
hashes_indexed.sqlite      100K       17M     -4               0.010s
------------------------   --------   -----   --------------   -----------
hashes.txt                 100K       8.5M    4                0.001s
hashes.sqlite              100K       9.1M    4                0.655s
hashes_indexed.sqlite      100K       17M     4                0.010s

filename                   entries    size    hash position    search time
hashes.txt                 10K        870K    -4               0.139s
hashes.sqlite              10K        920K    -4               0.088s
hashes_indexed.sqlite      10K        1.7M    -4               0.010s
------------------------   --------   -----   --------------   -----------
hashes.txt                 10K        870K    4                0.001s
hashes.sqlite              10K        920K    4                0.086s
hashes_indexed.sqlite      10K        1.7M    4                0.010s

So the most consistent and position-independent results are obtained with sqlite plus an index on hash (sqlite without an index is consistent too, but not as fast), but it has the cons @tessus mentioned earlier.

@vvrein
Author

vvrein commented Nov 21, 2024

But considering that cache entries should not only be created, but also updated (uploading another file with the same name) and deleted (expired files and files deleted by the user), it may be easier to do this with sqlite.

@tessus
Collaborator

tessus commented Nov 22, 2024

but updated (upload another file with same name)

Nope. Files with the same name will be rejected by the server.

deleted (expired files and files deleted by user)

Yep, you are correct.

I love sqlite for clients in a single user env. I am not so happy with sqlite on a server though. I have never run perf tests, but an insert requires a lock, and I am not sure what isolation levels are supported by sqlite. What I am trying to say is that the bottleneck for uploading could then be the max transaction rate for inserts. There could be multiple threads trying to insert data.
However, I doubt that there are thousands of users trying to upload thousands of files. (at the same time)
Afaik rustypaste was not designed to handle that kind of traffic.
Additionally, inserting the hash could be done asynchronously. For the upload itself only the calculated hash is important, and reads can be done without a lock if done via an uncommitted read. (But that means there could be duplicate files in certain situations.)
Then again, one could also just insert the hash. If it fails, it means the file exists already. No reads necessary at all. But then we are bound by insert performance.
A lot of possible ways to implement this... ;-) And as always there are trade-offs.
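
For illustration, the insert-only idea could look roughly like this (Python's sqlite3 here is just to show the principle; the schema is made up, and a primary key on the hash makes the duplicate check implicit):

import sqlite3

con = sqlite3.connect("hashes.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS hashes (hash TEXT PRIMARY KEY, filename TEXT)")

def register_upload(con, file_hash, filename):
    # try to insert; a constraint violation means the content already exists
    try:
        with con:  # commits on success, rolls back on error
            con.execute("INSERT INTO hashes VALUES (?, ?)", (file_hash, filename))
        return filename
    except sqlite3.IntegrityError:
        row = con.execute("SELECT filename FROM hashes WHERE hash = ?", (file_hash,)).fetchone()
        return row[0]  # hand back the filename of the already-stored file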

Anyway, let's see what orhun thinks about this.

@orhun orhun changed the title Upload goes too long with duplicate_files = false. Maybe add hashes cache? Upload takes too long with duplicate_files = false. Maybe add hashes cache? Dec 7, 2024
@orhun
Owner

orhun commented Dec 7, 2024

Hey guys, sorry for the late reply! Just started to get into the flow of things over here again.

First of all, I agree that the current way of handling duplicate files is not the most efficient and definitely not very scalable. So it should be fixed in a way that does not stray from the design philosophy of rustypaste.

Secondly, thanks for exploring the different solutions out there and benchmarking them! This definitely helps greatly with considering these options.

So the most consistent and position independent results are got with sqlite + index on hash (sqlite without index is consistent too, but not so fast

I'm not a big fan of adding a database solution for this project due to the reasons listed by @tessus, and additionally:

  • It's an endless pit. Supporting X db would mean we can support Y db in the future. And then we just end up with a Z cloud database in the end.
  • I like files. Files are databases.

So we should simply use a file for caching them, I think. E.g. a file that contains all the hashes and is updated periodically would be a good start imo.
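
For example, a cronjob-style rebuild could be as simple as this rough sketch (paths, file format, and hash are made up for illustration):

import hashlib
import os

def rebuild_hash_cache(upload_dir, cache_file="hashes.txt"):
    # re-hash everything in one background pass and atomically swap in the new cache file
    tmp = cache_file + ".tmp"
    with open(tmp, "w") as out:
        for name in os.listdir(upload_dir):
            path = os.path.join(upload_dir, name)
            if not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            out.write(f"{digest.hexdigest()} {name}\n")
    os.replace(tmp, cache_file)  # readers never see a half-written cache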

hash is stored in extended attributes

I also like this btw - OS is a problem though. There was an attempt in #292 IIRC.

Either way, I would appreciate it if someone went ahead and implemented this by any means. This definitely should be fixed.

@tessus
Collaborator

tessus commented Dec 7, 2024

So we should simply use a file for caching them I think. e.g. a file that contains all the hashes and updating them periodically would be a good start imo.

Well, the file must be updated when a file is uploaded, otherwise there could be duplicate files.
The file must include the hash and the filename, otherwise the original filename cannot be returned when a duplicate file is uploaded.
There is an edge case though. What if someone uploads a file x with the name y, and after a while someone uploads the same file x with the name z? (y and z are given via the header, or random names are off and x not only has the same content but also the same name.)
In that case the second person who uploads x will get the filename y instead of z. IMO that is valid if duplicates are set to false, but we should maybe add it to the readme.

The hash also has to be deleted from the hash cache file when a file is deleted on the server.

I have to look into this, but won't we run into file locking issues when multiple threads upload or delete files?
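
One possible way around most of the locking issues would be to keep the cache in memory, serialize mutations behind a single lock, and rewrite the file atomically. A rough sketch of the idea (not tied to how the server actually shares state between threads):

import os
import threading

class HashCache:
    # hash -> filename map guarded by one lock; the on-disk copy is replaced atomically
    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()
        self.entries = {}

    def remove_file(self, filename):
        # drop the entry for a deleted or expired file and persist the change
        with self.lock:
            self.entries = {h: n for h, n in self.entries.items() if n != filename}
            self._flush()

    def _flush(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            for file_hash, name in self.entries.items():
                f.write(f"{file_hash} {name}\n")
        os.replace(tmp, self.path)  # no partially written file is ever visible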

This is certainly an interesting challenge. Not sure, if I get to it soon. Maybe you could use this for one of your live coding sessions...

@orhun
Owner

orhun commented Dec 9, 2024

Well, the file must be updated when a file is uploaded, otherwise there could be duplicate files.
The file must include the hash and the filename, otherwise the original filename cannot be returned when a duplicate file is uploaded.

Ah I see, good point.

In that case the second person who uploads x, will get the filename y instead of z. IMO it is valid, if duplicates are set to false. But we should maybe add it to the readme.

That's quite an edge case :D But yes, I think this should be mentioned in the README. Alternatively, we should add more fixture tests to verify the behavior of combinations of different features, I think. There might be other cases we are missing.

The hash also has to be deleted from the hash cache file when a file is deleted on the server.

Yup, that's why I suggested a cronjob-like approach.

I have to look into this, but won't we run into file locking issues, when multiple threads upload or delete files?

Probably we will. #246 might help here though.

Maybe you could use this for one of your live coding sessions...

Haha yeah, good idea!
