Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outsourcing and improving the posting writing of IndexImpl.Text.cpp #1699

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

Flixtastic
Copy link
Contributor

This PR is to further clean up the IndexImpl.Text file while also improving the functionality of the frequency and gap encoding. This extends to a possibilty to better compress and store floats or doubles.

Copy link

codecov bot commented Jan 4, 2025

Codecov Report

Attention: Patch coverage is 93.01310% with 16 lines in your changes missing coverage. Please review.

Project coverage is 89.85%. Comparing base (acb6633) to head (a64e848).

Files with missing lines Patch % Lines
src/index/TextIndexReadWrite.cpp 91.52% 6 Missing and 4 partials ⚠️
src/index/TextIndexReadWrite.h 90.62% 2 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1699      +/-   ##
==========================================
- Coverage   89.86%   89.85%   -0.02%     
==========================================
  Files         389      391       +2     
  Lines       37308    37317       +9     
  Branches     4204     4202       -2     
==========================================
+ Hits        33527    33531       +4     
- Misses       2485     2487       +2     
- Partials     1296     1299       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most important thing is:
Please tell me in which places you have changed something, and where you only extracted, s.t. we can properly review the nontrivial changes.
I really like this idea, everything that makes the index class smaller is good.

Comment on lines 596 to 601
std::ranges::copy(TextIndexReadWrite::readFreqComprList<Id, Score>(
tbmd._cl._nofElements, tbmd._cl._startScorelist,
static_cast<size_t>(tbmd._cl._lastByte + 1 -
tbmd._cl._startScorelist),
textIndexFile_, &Id::makeFromInt),
idTable.getColumn(2).begin());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of the following (index-breaking) suggestion, which makes this code possibly simpler
(maybe we can postpone it to another PR, if this stalls your work here):

  1. We consistently directly compress and store the bits of the ID (as they are also consecutive for positive integers, the gap encoding and frequency encoding should still work). This gets rid of all the Id::makeFromBlaIndex(BlaIndex::make(...)) calls in the transform and copy calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: After some thought please remember this idea, but probably this is for future changes, as it is rather intrusive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would theoretically work fine but there is one slight Problem. This problem has do to with simple8b encoding after gap encoding. If we try to gap encode IDs the first element will be the starting ID without any encoding. Because IDs use their first few bits to determine what type of ID they are there will be a one in the first 4 bits of the ID. This then becomes a problem in simple8b encoding, since it only works for uint64_t with the first 4 bits being 0.

Comment on lines 41 to 49
std::vector<uint64_t> textRecordList(firstElements.begin(),
firstElements.end());
std::vector<WordIndex> wordIndexList(secondElements.begin(),
secondElements.end());
std::vector<Score> scoreList(thirdElements.begin(), thirdElements.end());

GapEncode<uint64_t> textRecordEncoder(textRecordList);
FrequencyEncode<WordIndex> wordIndexEncoder(wordIndexList);
FrequencyEncode<Score> scoreEncoder(scoreList);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these really need a vector or can we make them work with the lazy views directly (I will see once I get there).

src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
off_t& currentOffset) {
TextIndexReadWrite::writeVectorAndMoveOffset(encodedVector_, nofElements, out,
currentOffset);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you iin this file point out to me the places (via comments) where you have changed anything except for just copying and extracting it here?

The reason is, that a lot of code requires modernization here, but I would prefer to first quickly do the extraction, and then modernize in a separate step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments of 09d4a97

Comment on lines +147 to +150
explicit GapEncode(const TypedVector& vectorToEncode);

void writeToFile(ad_utility::File& out, size_t nofElements,
off_t& currentOffset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think you can get away with a lazy view as the input to the constructor.
  2. Would the code in IndexImpl.cpp become simpler if you make a static function that does the encoding + writing in one step (same for the other encoders).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also maybe part for a separate PR, your work here is valuable, by moving it to a separate file it now has a size where we can see the possible improvements much simpler.

src/index/TextIndexReadWrite.h Outdated Show resolved Hide resolved
@Flixtastic Flixtastic requested a review from joka921 January 9, 2025 19:38
@sparql-conformance
Copy link

@Flixtastic
Copy link
Contributor Author

The 2 SonarQube issues can't be adressed with the current implementation of ranges if I am correct.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small reviews from the diff, I'll have another pass over everything with your // MODIFIED comments.

Comment on lines +78 to +79
size_t writeList(const std::vector<Numeric> data, size_t nofElements,
ad_utility::File& file) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argument should be const vector& (doesn't sonar also say so?)

  1. Is nofElements the same as data.size() , in that case this redundant argument should be removed.

  2. As a replacement of const std::vector<...>& you can always use (as the argument type) std::span<const Numeric> here
    (it is more generic and more modern).

but 1 and 2 are more important than 3.

Comment on lines +124 to +129
ql::ranges::transform(frequencyMap.begin(), frequencyMap.end(),
std::back_inserter(frequencyVector),
[](const auto& kv) { return kv; });
ql::ranges::sort(
frequencyVector.begin(), frequencyVector.end(),
[](const auto& a, const auto& b) { return a.second > b.second; });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ql::ranges::transform(frequencyMap.begin(), frequencyMap.end(),
std::back_inserter(frequencyVector),
[](const auto& kv) { return kv; });
ql::ranges::sort(
frequencyVector.begin(), frequencyVector.end(),
[](const auto& a, const auto& b) { return a.second > b.second; });
ql::ranges::transform(frequencyMap,
std::back_inserter(frequencyVector),
[](const auto& kv) { return kv; });
ql::ranges::sort(
frequencyVector,
[](const auto& a, const auto& b) { return a.second > b.second; });

That's one of the points of ql::ranges.

And can't the first transorm with the identity function just simpler be

ql::ranges::copy(frequeycMap, back_inserter(vector));

const TypedVector getEncodedVector() { return encodedVector_; }
const CodeMap& getCodeMap() { return codeMap_; }
const CodeBook& getCodeBook() { return codeBook_; }
TypedVector getEncodedVector() { return encodedVector_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TypedVector getEncodedVector() { return encodedVector_; }
const TypedVector& getEncodedVector() const { return encodedVector_; }

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional small comments.

Comment on lines +64 to +65
const std::function<To(From)>& transformer = [](From x) {
return static_cast<To>(x);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now,
please make the transformation a template parameter (also for the gap encoding reading below), doesn't sonarcloud complain here?

template <typename To, typename From, typename Transformer = decltype(ad_utility::staticCast<To>)
vector<To> readFreqList(..., Transformer transformer = {}) {...}

Comment on lines +71 to +100
size_t nofCodebookBytes;
vector<uint64_t> frequencyEncodedResult;
frequencyEncodedResult.resize(nofElements + 250);
off_t current = from;
size_t ret = textIndexFile.read(&nofCodebookBytes, sizeof(size_t), current);
LOG(TRACE) << "Nof Codebook Bytes: " << nofCodebookBytes << '\n';
AD_CONTRACT_CHECK(sizeof(size_t) == ret);
current += ret;
std::vector<From> codebook;
codebook.resize(nofCodebookBytes / sizeof(From));
ret = textIndexFile.read(codebook.data(), nofCodebookBytes, current);
current += ret;
AD_CONTRACT_CHECK(ret == size_t(nofCodebookBytes));
std::vector<uint64_t> simple8bEncoded;
simple8bEncoded.resize(nofElements);
ret = textIndexFile.read(simple8bEncoded.data(), nofBytes - (current - from),
current);
current += ret;
AD_CONTRACT_CHECK(size_t(current - from) == nofBytes);
LOG(DEBUG) << "Decoding Simple8b code...\n";
ad_utility::Simple8bCode::decode(simple8bEncoded.data(), nofElements,
frequencyEncodedResult.data());
LOG(DEBUG) << "Reverting frequency encoded items to actual IDs...\n";
frequencyEncodedResult.resize(nofElements);
vector<To> result;
result.reserve(frequencyEncodedResult.size());
ql::ranges::for_each(frequencyEncodedResult, [&](const auto& encoded) {
result.push_back(transformer(codebook.at(encoded)));
});
LOG(DEBUG) << "Done reading frequency-encoded list. Size: " << result.size()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In one of your next PRs you can refactor the reading and writing here, using the functionalities in
util/serialization , because all those manual read and offset += ret calls are very hard to read.

Comment on lines +134 to +139
From previous = 0;
for (size_t i = 0; i < gapEncodedVector.size(); ++i) {
previous += gapEncodedVector[i];
result.push_back(transformer(previous));
}
LOG(DEBUG) << "Done reading gap-encoded list. Size: " << result.size()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it (same for the frequency encoding)
happen, that To and From are the same type, (and the transformation therefore does nothing).
In this case you could simply return the read vector, without the (then redundant) copy.

Comment on lines +41 to +49
std::vector<uint64_t> textRecordList(firstElements.begin(),
firstElements.end());
std::vector<WordIndex> wordIndexList(secondElements.begin(),
secondElements.end());
std::vector<Score> scoreList(thirdElements.begin(), thirdElements.end());

GapEncode textRecordEncoder(textRecordList);
FrequencyEncode wordIndexEncoder(wordIndexList);
FrequencyEncode scoreEncoder(scoreList);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor the FrequencyEncode and GapEncode s.t. the constructor argument becomes a template, then you can pass in the lazy views firstElements... etc. directly, without copying them to a vector first
(more efficient AND less code).

Comment on lines +94 to +102
template <typename T>
void writeVectorAndMoveOffset(const std::vector<T>& vectorToWrite,
size_t nofElements, ad_utility::File& file,
off_t& currentOffset) {
size_t bytes =
textIndexReadWrite::writeList(vectorToWrite, nofElements, file);
currentOffset += bytes;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a separate PR, you can refactor this to use our serialization library, which makes the reading and writing of such types to and from disk much more readable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants