Upload takes too long with duplicate_files = false. Maybe add hashes cache? #367
Thanks for the report. What you are saying makes sense. I haven't checked the code yet, but it looks like hashes are built for all files every time a file is uploaded; otherwise uploading a file with a few bytes wouldn't take 10s. Adding a cache would be a nice enhancement.

On a technical level there are a few options, but since the server does not use a database, only text files, these options are less performant:

- The hash is stored in extended attributes (see the sketch below)
  - Pros:
  - Cons:
- Hashes are stored in a text file with a mapping to the filename
  - Pros:
  - Cons:
- Use of sqlite or any other engine that allows indexed or hashed access
  - Pros:
  - Cons:

I am not the owner, so I can't make any design decision. I currently also don't have much time to implement any of this.
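For the extended-attributes option, here is a minimal sketch of the idea (in Python purely for illustration, since rustypaste itself is written in Rust; it assumes a Linux filesystem with `user.*` xattr support, and the attribute name is made up):

```python
import hashlib
import os

XATTR_NAME = "user.sha256"  # hypothetical attribute name


def set_cached_hash(path: str) -> str:
    """Hash the file once and store the digest in an extended attribute."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.setxattr(path, XATTR_NAME, digest.encode())
    return digest


def get_cached_hash(path: str) -> str:
    """Return the cached digest, computing and storing it on a cache miss."""
    try:
        return os.getxattr(path, XATTR_NAME).decode()
    except OSError:
        return set_cached_hash(path)
```

The upside is that the cache lives with the file itself, so there is no separate state to keep in sync when files are deleted; the downside is that it only works on filesystems and operating systems that support extended attributes.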
I think that the way of implementation should correlate with the considerations above, so I wrote a simple script to test search speed in different conditions.

hashes.py:

```python
import os
import random
import hashlib
import string
import sqlite3
import shutil
import timeit

HASH_FILE = "hashes.txt"
HASH_DB = "hashes.sqlite"
HASH_BYTES_DB = "hashes_bytes.sqlite"
HASH_INDEXED_DB = "hashes_indexed.sqlite"
FILENAME_LENGTH = 15
HASHES_COUNT = 1 * 10**6


def getrandline():
    # build a "<sha256 hex digest>     <random filename>.txt" line
    rb = random.randbytes(10)
    line = hashlib.sha256(rb).hexdigest() \
        + " " * 5 \
        + ''.join(random.choices(string.ascii_letters, k=FILENAME_LENGTH)) \
        + ".txt"
    return line


# create hash text file with 1M records (85 MB)
if not os.path.exists(HASH_FILE):
    with open(HASH_FILE, 'a') as f:
        for _ in range(0, HASHES_COUNT):
            f.write(getrandline() + "\n")

# store hashes as text in sqlite (92 MB)
if not os.path.exists(HASH_DB):
    con = sqlite3.connect(HASH_DB)
    cur = con.cursor()
    cur.execute("CREATE TABLE hashes(hash, filename)")
    with open(HASH_FILE, "r") as f:
        for line in f:
            _hash, _filename = line.split()
            cur.execute("insert into hashes values (?,?)", (_hash, _filename))
    con.commit()
    con.close()

# try to store hashes as bytes. This does not work, since the values are stored
# as text in sqlite; the size is the same as in the previous variant (92 MB)
if not os.path.exists(HASH_BYTES_DB):
    con = sqlite3.connect(HASH_BYTES_DB)
    cur = con.cursor()
    cur.execute("CREATE TABLE hashes(hash blob, filename blob)")
    with open(HASH_FILE, "r") as f:
        for line in f:
            _hash, _filename = line.split()
            cur.execute("insert into hashes values (?,?)", (_hash.encode(), _filename.encode()))
    con.commit()
    con.close()

# copy the sqlite db and add an index on the hash column; gives an x1.77 size overhead (163 MB)
if not os.path.exists(HASH_INDEXED_DB):
    shutil.copy(HASH_DB, HASH_INDEXED_DB)
    con = sqlite3.connect(HASH_INDEXED_DB)
    cur = con.cursor()
    cur.execute("create index hash_idx on hashes (hash)")
    con.commit()
    con.close()

search_in_file = """
def search_in_file(file, search):
    with open(file, "r") as f:
        for line in f:
            if search in line:
                return line

search_in_file(HASH_FILE, searching_for)
"""

search_in_db = """
con = sqlite3.connect(HASH_DB)

def search_in_db(con, search):
    cur = con.cursor()
    res = cur.execute("select * from hashes where hash = ?", (search,))
    return res.fetchone()

search_in_db(con, searching_for)
con.close()
"""

search_in_indexed_db = """
con = sqlite3.connect(HASH_INDEXED_DB)

def search_in_db(con, search):
    cur = con.cursor()
    res = cur.execute("select * from hashes where hash = ?", (search,))
    return res.fetchone()

search_in_db(con, searching_for)
con.close()
"""

searching_for = "some_hash_here"
num = 100

print(f"Search for hash {searching_for} {num} times in file " + str(timeit.timeit(search_in_file, number=num, globals=globals())))
print(f"Search for hash {searching_for} {num} times in sqlite db " + str(timeit.timeit(search_in_db, number=num, globals=globals())))
print(f"Search for hash {searching_for} {num} times in sqlite indexed db " + str(timeit.timeit(search_in_indexed_db, number=num, globals=globals())))
```

Here are the results of searching 100 times for the fourth hash and for the fourth hash from the end:
So the most consistent and position-independent results are obtained with sqlite plus an index on the hash column.
But given that cache entries must not only be created, but also updated (when another file with the same name is uploaded) and deleted (for expired files and files deleted by the user), it may be easier to do this with sqlite.
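To make that concrete, here is a rough sketch of what such a sqlite-backed cache could look like (table and column names are my own, chosen for illustration; this is not existing rustypaste behaviour):

```python
import sqlite3

con = sqlite3.connect("hash_cache.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS hashes(filename TEXT PRIMARY KEY, hash TEXT)")
con.execute("CREATE INDEX IF NOT EXISTS hash_idx ON hashes(hash)")


def add_entry(filename: str, digest: str) -> None:
    # created on upload; REPLACE covers re-uploading a file with the same name
    with con:
        con.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?)", (filename, digest))


def remove_entry(filename: str) -> None:
    # deleted when a file expires or is removed by the user
    with con:
        con.execute("DELETE FROM hashes WHERE filename = ?", (filename,))


def find_duplicate(digest: str):
    # indexed lookup used by the duplicate check on upload
    row = con.execute("SELECT filename FROM hashes WHERE hash = ?", (digest,)).fetchone()
    return row[0] if row else None
```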
Nope. Files with the same name will be rejected by the server.
Yep, you are correct. I love sqlite for clients in a single user env. I am not so happy with sqlite on a server though. I have never run perf tests, but an insert requires a lock, and I am not sure what isolation levels are supported by sqlite. What I am trying to say is that the bottleneck for uploading could then be the max transaction rate for inserts. There could be multiple threads trying to insert data. Anyway, let's see what orhun thinks about this.
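For what it's worth, SQLite does serialize writes, but its WAL mode lets readers run alongside a single writer; a hedged sketch of how the cache connection from the previous example could be configured (the pragma values are arbitrary):

```python
import sqlite3

# one connection per thread; WAL keeps duplicate-check reads from being
# blocked while another thread inserts a freshly uploaded hash
con = sqlite3.connect("hash_cache.sqlite", timeout=5.0)
con.execute("PRAGMA journal_mode=WAL")
con.execute("PRAGMA busy_timeout=5000")  # wait up to 5s for a write lock instead of failing


def insert_hash(filename: str, digest: str) -> None:
    # writes are still serialized by SQLite itself, so the insert rate
    # remains the potential bottleneck mentioned above
    with con:  # implicit BEGIN/COMMIT
        con.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?)", (filename, digest))
```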
Hey guys, sorry for the late reply! Just started to get into the flow of things over here again.

First of all, I agree that the current way of handling duplicate files is not the most efficient and most definitely not very scalable. So it should be fixed in a way that does not derail from the design philosophy of `rustypaste`.

Secondly, thanks for exploring the different solutions out there and benchmarking them! This definitely helps greatly with considering these options.
I'm not a big fan of adding a database solution for this project due to the reasons listed by @tessues and additionally:
So we should simply use a file for caching them, I think. E.g. a file that contains all the hashes, updated periodically, would be a good start imo.
I also like this btw - OS support is a problem though. There was an attempt in #292 IIRC. Either way, I would appreciate it if someone goes ahead and implements this by any means. This definitely should be fixed.
Well, the file must be updated when a file is uploaded, otherwise there could be duplicate files. The hash also has to be deleted from the hash cache file when a file is deleted on the server. I have to look into this, but won't we run into file locking issues when multiple threads upload or delete files? This is certainly an interesting challenge. Not sure if I get to it soon. Maybe you could use this for one of your live coding sessions...
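As a rough sketch of that file-based variant (the cache file name, the `<hash> <filename>` line format, and the use of an advisory `flock` are all my assumptions, not current rustypaste behaviour), upload and delete could update the cache like this on a Unix-like system:

```python
import fcntl

CACHE_FILE = "hashes.cache"  # hypothetical "<sha256> <filename>" lines


def _locked(mode):
    # take an exclusive advisory lock for the duration of the update;
    # the lock is released when the file object is closed
    f = open(CACHE_FILE, mode)
    fcntl.flock(f, fcntl.LOCK_EX)
    return f


def add_hash(digest: str, filename: str) -> None:
    with _locked("a") as f:
        f.write(f"{digest} {filename}\n")


def remove_hash(filename: str) -> None:
    # rewrite the file without the deleted entry; the lock keeps concurrent
    # uploads/deletes from interleaving their writes
    with _locked("r+") as f:
        kept = [line for line in f if not line.rstrip().endswith(" " + filename)]
        f.seek(0)
        f.truncate()
        f.writelines(kept)
```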
Ah I see, good point.
That's quite an edge case :D But yes, I think this should be mentioned in the README. Alternatively, we should add more fixture tests to verify the behavior of combinations of different features, I think. There might be other cases we are missing.
Yup, that's why I suggested a cronjob-like approach.
Probably we will. #246 might help here though.
Haha yeah, good idea!
Hi! Thank you for the great tool ❤️
I started to use it recently, migrating from another pastebin, and as far as I can see, rustypaste with `duplicate_files = false` tries to hash every file in the upload folder to check whether the file being uploaded already exists.

Here is some info about my existing uploads:
Uploaded file:

Uploading time with `duplicate_files = true`:

Uploading time with `duplicate_files = false`:

I've added some random large files with `dd if=/dev/urandom of=largefile bs=1M count=...` and summarized the results in a table:

Upload time mostly depends on the total size of the files; the file count, unless it reaches a few million, should not have a drastic impact.
I think this is a really great feature, but with the current implementation the upload time is prone to grow as file size and count increase, so maybe adding a simple cache mechanism, like storing file hashes in memory or in a file, is worth implementing.
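As a very rough sketch of the in-memory variant (Python for illustration only, since rustypaste is written in Rust; the mtime-based invalidation is my own assumption, not something discussed in this thread):

```python
import hashlib
import os

# path -> (mtime, sha256 hex digest)
_hash_cache: dict[str, tuple[float, str]] = {}


def file_hash(path: str) -> str:
    """Return the cached digest, re-hashing only if the file changed on disk."""
    mtime = os.path.getmtime(path)
    cached = _hash_cache.get(path)
    if cached and cached[0] == mtime:
        return cached[1]
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    _hash_cache[path] = (mtime, h.hexdigest())
    return h.hexdigest()


def is_duplicate(upload_dir: str, candidate_digest: str) -> bool:
    # after the first upload, only new or modified files are re-hashed
    return any(
        file_hash(os.path.join(upload_dir, name)) == candidate_digest
        for name in os.listdir(upload_dir)
    )
```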
rustypaste version: built from source bf6dd31
os: arch linux
kernel: 6.10.10-arch1-1
processor: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz
Unfortunately I have no experience with Rust, so I can only help with testing and debugging :)