Extreme score from Query-Likelihood Quantized Index #572

Closed
J9rryGou opened this issue Jan 22, 2024 · 14 comments · Fixed by #575
Labels
bug Something isn't working

Comments

@J9rryGou

J9rryGou commented Jan 22, 2024

I created a quantized index with the following commands:

cd /home/jg6226/code/raw_pisa/build
./bin/create_wand_data -c /hdd1/data/ssd2_data_backup/ssd2/data/index/cw09b/CW09B.url.inv -o /ssd2/data/index/cw09b_ql_index/CW09B.ql.quantized.wand --quantize --scorer qld -b 128

./bin/compress_inverted_index -c /hdd1/data/ssd2_data_backup/ssd2/data/index/cw09b/CW09B.url.inv -o /ssd2/data/index/cw09b_ql_index/CW09B.ql.quantized.index.opt -e block_simdbp --quantize --scorer qld --wand /ssd2/data/index/cw09b_ql_index/CW09B.ql.quantized.wand --check

Then I ran my modified version of evaluate_queries on a set of queries selected from TREC05:

cd /home/jg6226/code/20230101_pisa_termscore_small_size/pisa/build
./bin/evaluate_queries_didordered -e block_simdbp -a ranked_or -i /ssd2/data/index/cw09b_quantized_index/CW09B.quantized.index.opt -q /home/jg6226/data/Hit_Ratio_Project/TREC0506_query/cleaned_query/trec05_testing_queries.txt -k 1000 --scorer quantized --wand /ssd2/data/index/cw09b_quantized_index/CW09B.quantized.wand  --documents /home/jg6226/data/index/cw09b/CW09B.url.fwd.doclex --terms /home/jg6226/data/index/cw09b/CW09B.fwd.termlex -f /home/jg6226/data/Hit_Ratio_Project/TREC0506_query/evaluate_result/trec05_testing_quantized_output.txt -d

I found that some documents get extremely high scores. Is there anything wrong with my code?

@J9rryGou J9rryGou added the bug Something isn't working label Jan 22, 2024
@elshize
Member

elshize commented Jan 27, 2024

@J9rryGou I fixed a bug with quantization in #573. Can you check whether you're still getting this issue?

@elshize
Member

elshize commented Feb 5, 2024

I just realized that --check has no effect when compressing with quantization. I will see if this can be implemented.

@J9rryGou
Author

J9rryGou commented Feb 5, 2024

> I just realized that --check has no effect when compressing with quantization. I will see if this can be implemented.

Sounds good.

I also tried running compress_inverted_index without --check; the index still has the issue I mentioned above.

@elshize
Member

elshize commented Feb 5, 2024

Yeah, not passing --check will have no effect; the flag is simply ignored for quantized indexes. I'll work on implementing the check for quantized indexes, then maybe that can reveal something...

@elshize
Member

elshize commented Feb 6, 2024

I haven't figured this one out yet, but I definitely see something is broken.

For one, a quantized index using any of the non-blocked encodings is fundamentally broken -- but I think I have an idea why and how to fix it.

Second, I see that at compression time, there is a score that wants to be written: 4294967295, which happens to be a 32-bit int with all 1s, or 2^32 - 1. Not sure yet why but it's a lead.

@elshize
Member

elshize commented Feb 6, 2024

Also, BM25 doesn't seem to be affected.

@J9rryGou
Author

J9rryGou commented Feb 6, 2024

> I haven't figured this one out yet, but I definitely see something is broken.
>
> For one, a quantized index using any of the non-blocked encodings is fundamentally broken -- but I think I have an idea why and how to fix it.
>
> Second, I see that at compression time, there is a score that wants to be written: 4294967295, which happens to be a 32-bit int with all 1s, or 2^32 - 1. Not sure yet why but it's a lead.

Sounds good, thank you so much! It seems we are very close to finding the bug that occurs when using qld as the ranking function. Yes, all outputs using bm25 look good, according to the results from previous runs.

@J9rryGou
Author

J9rryGou commented Feb 6, 2024

> I haven't figured this one out yet, but I definitely see something is broken.
>
> For one, a quantized index using any of the non-blocked encodings is fundamentally broken -- but I think I have an idea why and how to fix it.
>
> Second, I see that at compression time, there is a score that wants to be written: 4294967295, which happens to be a 32-bit int with all 1s, or 2^32 - 1. Not sure yet why but it's a lead.

But I have a question about this:
The quantized score should be in the range 0 to 255 (256 is very rare, but I did see it occur in quantized bm25 scores). Maybe PISA stores the quantized score like this: if it is in the range 0 to 255, it uses one byte; if it is 256, it uses 2 bytes. That would explain why it worked well before your modification of the quantizer. In the quantized qld index there are some extremely large scores, and PISA would store them with more bytes (maybe up to 8 bytes? I see some scores that are even larger than 2^32 - 1, but I am not 100% sure). This could explain why the quantized qld index is about 47GB, whereas the quantized bm25 index is about 25GB.

So PISA does not store each quantized score in a fixed 1 byte, but uses more bytes if the score is very large?

@J9rryGou
Author

J9rryGou commented Feb 6, 2024

> For one, a quantized index using any of the non-blocked encodings is fundamentally broken -- but I think I have an idea why and how to fix it.

BTW, when you say this, is a quantized bm25 index using elias_fano encoding also broken?

@elshize
Member

elshize commented Feb 6, 2024

> > For one, a quantized index using any of the non-blocked encodings is fundamentally broken -- but I think I have an idea why and how to fix it.
>
> BTW, when you say this, is a quantized bm25 index using elias_fano encoding also broken?

Yeah, I believe so, but I would have to confirm that. Some tests I wrote fail for those indexes, so there's clearly something wrong.

@elshize
Member

elshize commented Feb 7, 2024

> So PISA does not store each quantized score in a fixed 1 byte, but uses more bytes if the score is very large?

Scores are written the same way as frequencies, so the size depends on the encoding used. Quantization only happens when the score is computed; if that score is 256, then the codec will write 256.
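
To illustrate why the stored size depends on the codec and not on a fixed one-byte slot, here is a minimal sketch of a generic variable-byte (VByte) encoder. This is an assumption for illustration only: it is not PISA's block_simdbp codec, and the function name `vbyte_encode` is hypothetical. It only shows that, under a variable-length code, larger values take more bytes.

```python
def vbyte_encode(value: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte.

    Generic VByte scheme (NOT PISA's actual codec): the high bit of each
    byte is a continuation flag, set when more bytes follow.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while value >= 0x80:
        # Emit the low 7 bits and mark that another byte follows.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte, continuation flag clear
    return bytes(out)


if __name__ == "__main__":
    # Print each value next to its encoded length in bytes.
    for v in (100, 255, 256, 2**32 - 1):
        print(v, len(vbyte_encode(v)))
```

With a scheme like this, a handful of huge values such as 2^32 - 1 would inflate the index even though most quantized scores fit in one or two bytes.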

@elshize
Member

elshize commented Feb 8, 2024

@J9rryGou Actually nvm about what I said about non-blocked encodings. They also seem to work for BM25 after all.

elshize added a commit that referenced this issue Feb 8, 2024
Due to quantization, some scores can be 0, but our frequency encoding
(which is used for scores) assumes positive values. To fix it, we
quantize into a range starting at 1 instead.

Fixes: #572
@elshize
Member

elshize commented Feb 8, 2024

@J9rryGou the culprit is how we encode frequencies: we always encode frequency - 1 (because they are all positive). When some scores are quantized to 0, it breaks down, because we end up with 2^32-1 after that subtraction (underflow).
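
The underflow described above can be reproduced with a small sketch. Python integers do not wrap, so the unsigned 32-bit arithmetic is emulated with a mask; the helper name `encode_frequency` is hypothetical and is not taken from the PISA code base.

```python
UINT32_MASK = 0xFFFFFFFF  # emulate unsigned 32-bit wraparound


def encode_frequency(value: int) -> int:
    """Mimic storing `value - 1` in an unsigned 32-bit integer.

    Hypothetical helper: it assumes the encoder subtracts 1 because
    every frequency is expected to be at least 1.
    """
    return (value - 1) & UINT32_MASK


if __name__ == "__main__":
    print(encode_frequency(1))  # smallest valid frequency
    print(encode_frequency(0))  # a score quantized to 0 wraps around
```

A value of 0 wraps to 4294967295, which matches the score observed at compression time earlier in this thread.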

Could you please try the fix branch #575 and report back if it fixes the issue?

Note that I've discovered a different issue with the PL2 and DPH scorers, but both QLD and BM25 should work fine.

elshize added a commit that referenced this issue Feb 12, 2024
Due to quantization, some scores can be 0, but our frequency encoding
(which is used for scores) assumes positive values. To fix it, we
quantize into a range starting at 1 instead.

Changelog-changed: Scores are quantized starting at 1 instead of 0
Fixes: #572
Signed-off-by: Michal Siedlaczek <michal@siedlaczek.me>
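
As a rough illustration of the change described in this commit message, the sketch below maps scores linearly onto a range that starts at 1. The function name `quantize`, its parameters, and the exact rounding are assumptions made for the example; the actual quantizer in PISA may compute the mapping differently.

```python
import math


def quantize(score: float, max_score: float, bits: int = 8) -> int:
    """Linearly map a score in [0, max_score] onto integers [1, 2**bits - 1].

    Hypothetical sketch: starting the range at 1 guarantees that a later
    `value - 1` encoding step never sees 0.
    """
    if max_score <= 0.0:
        raise ValueError("max_score must be positive")
    if not 0.0 <= score <= max_score:
        raise ValueError("score must lie in [0, max_score]")
    top = (1 << bits) - 1  # largest quantized value, 255 for 8 bits
    # Scale into [0, top - 1], round half up, then shift by 1.
    return 1 + math.floor(score / max_score * (top - 1) + 0.5)


if __name__ == "__main__":
    for s in (0.0, 5.0, 10.0):
        print(s, quantize(s, 10.0))
```

Because the smallest output is 1, subtracting 1 before encoding yields 0 at worst and no longer underflows.
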
@elshize
Member

elshize commented Feb 13, 2024

@J9rryGou I closed this with the fix in #575. If you encounter this issue again on the new version, feel free to reopen it or open a new one.
