Extreme score from Query-Likelihood Quantized Index #572
Comments
I just realized that |
Sounds good. I also tried |
Yeah, not passing |
I haven't figured this one out yet, but I can definitely see that something is broken. For one, a quantized index using any of the non-blocked encodings is fundamentally broken -- but I think I have an idea why and how to fix it. Second, I see that at compression time, there is a score that wants to be written: 4294967295, which happens to be a 32-bit int with all 1s, or |
Also, BM25 doesn't seem to be affected. |
Sounds good, thank you so much! It seems we are very close to finding the bug that occurs when using qld as the ranking function. Yes, all bm25 outputs look good, according to the results from previous runs. |
But I have a question about this: so PISA does not store a quantized score in a fixed 1 byte, but will use more bytes if the score is very large? |
BTW, when you say this, is a quantized bm25 index using elias_fano encoding also broken? |
Yeah, I believe so, but I would have to confirm that. Some tests I wrote fail for those indexes, so there's clearly something wrong. |
Scores are written the same way as frequencies, so it depends on the encoding used. Quantization really just happens when computing the score; if that score is 256, then the codec will write it. |
@J9rryGou Actually nvm about what I said about non-blocked encodings. They also seem to work for BM25 after all. |
Due to quantization, some scores can be 0, but our frequency encoding (which is used for scores) assumes positive values. To fix it, we quantize into a range starting at 1 instead. Fixes: #572
@J9rryGou the culprit is how we encode frequencies: we always encode frequency - 1 (because they are all positive). When some scores are quantized to 0, it breaks down, because we end up with 2^32-1 after that subtraction (underflow). Could you please try the fix branch #575 and report back if it fixes the issue? Note that I've discovered a different issue with the PL2 & DPH scorers, but both QLD and BM25 should work fine. |
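To make the underflow concrete, here is a minimal sketch of what happens when a codec stores `value - 1` in an unsigned 32-bit integer. The function name `encode_minus_one` is hypothetical and is not taken from PISA's code; it only assumes, as described above, that the stored value is the input minus one.

```cpp
#include <cstdint>

// Hypothetical stand-in for a codec step that stores `value - 1`.
// This is an illustration, not PISA's actual encoding function.
constexpr std::uint32_t encode_minus_one(std::uint32_t value) {
    // Unsigned arithmetic wraps modulo 2^32, so 0 - 1 yields 4294967295.
    return static_cast<std::uint32_t>(value - 1U);
}

// A positive input is stored as expected.
static_assert(encode_minus_one(1U) == 0U, "1 is stored as 0");
// A score quantized to 0 wraps around to 2^32 - 1 (all bits set).
static_assert(encode_minus_one(0U) == 4294967295U, "0 wraps to 2^32 - 1");
```

The second assertion reproduces the 4294967295 value reported earlier in the thread.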
Due to quantization, some scores can be 0, but our frequency encoding (which is used for scores) assumes positive values. To fix it, we quantize into a range starting at 1 instead. Changelog-changed: Scores are quantized starting at 1 instead of 0 Fixes: #572 Signed-off-by: Michal Siedlaczek <michal@siedlaczek.me>
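The idea of quantizing into a range that starts at 1 can be sketched as follows. This is an assumed linear quantizer written for illustration only: the names `max_level` and `quantize_from_one`, the truncating rounding, and the parameter choices are hypothetical and may differ from what PISA actually does.

```cpp
#include <cstdint>

// Largest value representable with `bits` bits (assumes 1 < bits < 32).
constexpr std::uint32_t max_level(unsigned bits) {
    return (1U << bits) - 1U;
}

// Hypothetical linear quantizer: maps a score in [0, max_score] to an
// integer in [1, 2^bits - 1], so the result is never 0.
// Assumes 0 <= score <= max_score and max_score > 0.
constexpr std::uint32_t quantize_from_one(double score, double max_score, unsigned bits) {
    // Scale into [0, max_level - 1], truncate, then shift up by one.
    return 1U + static_cast<std::uint32_t>(
        score / max_score * static_cast<double>(max_level(bits) - 1U));
}

// The lowest score now maps to 1, so storing `value - 1` cannot underflow.
static_assert(quantize_from_one(0.0, 10.0, 8) == 1U, "minimum maps to 1");
// The highest score still fits in 8 bits.
static_assert(quantize_from_one(10.0, 10.0, 8) == 255U, "maximum maps to 255");
```

With this mapping, every quantized score is positive, which matches the assumption the frequency encoding makes.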
I created a quantized index by following:
Then I used my edited evaluate_queries to run a query set selected from TREC05.
I found some extremely high scores for a document. Is there anything wrong with my code?
![0OLK%BO`IIM@)FODYH45D4](https://private-user-images.githubusercontent.com/106114570/298657669-4b046c72-1e6b-4e30-9c5b-f01fded24e14.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMDM0ODIsIm5iZiI6MTczOTIwMzE4MiwicGF0aCI6Ii8xMDYxMTQ1NzAvMjk4NjU3NjY5LTRiMDQ2YzcyLTFlNmItNGUzMC05YzViLWYwMWZkZWQyNGUxNC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMFQxNTU5NDJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kYTk5ZTc5Y2I0NjBmZWEwOTg4ZGEyNzg1NGU1NjJhMTFhYTM4ZmUyNDhiYzFlNTc5ZTI5ZjU1NDBkMDUyYTQ5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.HoOvLbLd2ftHlyNBcO9uvsSTNxOnU99FjxTIqz6jd7s)