-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Outsourcing and improving the posting writing of IndexImpl.Text.cpp #1699
base: master
Are you sure you want to change the base?
Conversation
…yet, commit is used to initialize branch
…quency and gap compressed lists.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1699 +/- ##
==========================================
- Coverage 89.86% 89.85% -0.02%
==========================================
Files 389 391 +2
Lines 37308 37317 +9
Branches 4204 4202 -2
==========================================
+ Hits 33527 33531 +4
- Misses 2485 2487 +2
- Partials 1296 1299 +3 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most important thing is:
Please tell me in which places you have changed something, and where you only extracted, s.t. we can properly review the nontrivial changes.
I really like this idea, everything that makes the index class smaller is good.
src/index/IndexImpl.Text.cpp
Outdated
std::ranges::copy(TextIndexReadWrite::readFreqComprList<Id, Score>( | ||
tbmd._cl._nofElements, tbmd._cl._startScorelist, | ||
static_cast<size_t>(tbmd._cl._lastByte + 1 - | ||
tbmd._cl._startScorelist), | ||
textIndexFile_, &Id::makeFromInt), | ||
idTable.getColumn(2).begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of the following (index-breaking) suggestion, which makes this code possibly simpler
(maybe we can postpone it to another PR, if this stalls your work here):
- We consistently directly compress and store the bits of the ID (as they are also consecutive for positive integers, the gap encoding and frequency encoding should still work). This gets rid of all the
Id::makeFromBlaIndex(BlaIndex::make(...))
calls in thetransform
andcopy
calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again: After some thought please remember this idea, but probably this is for future changes, as it is rather intrusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would theoretically work fine but there is one slight Problem. This problem has do to with simple8b encoding after gap encoding. If we try to gap encode IDs the first element will be the starting ID without any encoding. Because IDs use their first few bits to determine what type of ID they are there will be a one in the first 4 bits of the ID. This then becomes a problem in simple8b encoding, since it only works for uint64_t with the first 4 bits being 0.
src/index/TextIndexReadWrite.cpp
Outdated
std::vector<uint64_t> textRecordList(firstElements.begin(), | ||
firstElements.end()); | ||
std::vector<WordIndex> wordIndexList(secondElements.begin(), | ||
secondElements.end()); | ||
std::vector<Score> scoreList(thirdElements.begin(), thirdElements.end()); | ||
|
||
GapEncode<uint64_t> textRecordEncoder(textRecordList); | ||
FrequencyEncode<WordIndex> wordIndexEncoder(wordIndexList); | ||
FrequencyEncode<Score> scoreEncoder(scoreList); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these really need a vector
or can we make them work with the lazy views
directly (I will see once I get there).
off_t& currentOffset) { | ||
TextIndexReadWrite::writeVectorAndMoveOffset(encodedVector_, nofElements, out, | ||
currentOffset); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you iin this file point out to me the places (via comments) where you have changed anything except for just copying and extracting it here?
The reason is, that a lot of code requires modernization here, but I would prefer to first quickly do the extraction, and then modernize in a separate step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments of 09d4a97
explicit GapEncode(const TypedVector& vectorToEncode); | ||
|
||
void writeToFile(ad_utility::File& out, size_t nofElements, | ||
off_t& currentOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think you can get away with a lazy view as the input to the constructor.
- Would the code in
IndexImpl.cpp
become simpler if you make a static function that does the encoding + writing in one step (same for the other encoders).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also maybe part for a separate PR, your work here is valuable, by moving it to a separate file it now has a size where we can see the possible improvements much simpler.
…th using vector.data()
Conformance check passed ✅No test result changes. |
Quality Gate passedIssues Measures |
The 2 SonarQube issues can't be adressed with the current implementation of ranges if I am correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small reviews from the diff, I'll have another pass over everything with your // MODIFIED
comments.
size_t writeList(const std::vector<Numeric> data, size_t nofElements, | ||
ad_utility::File& file) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argument should be const vector&
(doesn't sonar also say so?)
-
Is
nofElements
the same asdata.size()
, in that case this redundant argument should be removed. -
As a replacement of
const std::vector<...>&
you can always use (as the argument type)std::span<const Numeric>
here
(it is more generic and more modern).
but 1 and 2 are more important than 3
.
ql::ranges::transform(frequencyMap.begin(), frequencyMap.end(), | ||
std::back_inserter(frequencyVector), | ||
[](const auto& kv) { return kv; }); | ||
ql::ranges::sort( | ||
frequencyVector.begin(), frequencyVector.end(), | ||
[](const auto& a, const auto& b) { return a.second > b.second; }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ql::ranges::transform(frequencyMap.begin(), frequencyMap.end(), | |
std::back_inserter(frequencyVector), | |
[](const auto& kv) { return kv; }); | |
ql::ranges::sort( | |
frequencyVector.begin(), frequencyVector.end(), | |
[](const auto& a, const auto& b) { return a.second > b.second; }); | |
ql::ranges::transform(frequencyMap, | |
std::back_inserter(frequencyVector), | |
[](const auto& kv) { return kv; }); | |
ql::ranges::sort( | |
frequencyVector, | |
[](const auto& a, const auto& b) { return a.second > b.second; }); |
That's one of the points of ql::ranges
.
And can't the first transorm with the identity function just simpler be
ql::ranges::copy(frequeycMap, back_inserter(vector));
const TypedVector getEncodedVector() { return encodedVector_; } | ||
const CodeMap& getCodeMap() { return codeMap_; } | ||
const CodeBook& getCodeBook() { return codeBook_; } | ||
TypedVector getEncodedVector() { return encodedVector_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypedVector getEncodedVector() { return encodedVector_; } | |
const TypedVector& getEncodedVector() const { return encodedVector_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some additional small comments.
const std::function<To(From)>& transformer = [](From x) { | ||
return static_cast<To>(x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now,
please make the transformation a template parameter (also for the gap encoding reading below), doesn't sonarcloud complain here?
template <typename To, typename From, typename Transformer = decltype(ad_utility::staticCast<To>)
vector<To> readFreqList(..., Transformer transformer = {}) {...}
size_t nofCodebookBytes; | ||
vector<uint64_t> frequencyEncodedResult; | ||
frequencyEncodedResult.resize(nofElements + 250); | ||
off_t current = from; | ||
size_t ret = textIndexFile.read(&nofCodebookBytes, sizeof(size_t), current); | ||
LOG(TRACE) << "Nof Codebook Bytes: " << nofCodebookBytes << '\n'; | ||
AD_CONTRACT_CHECK(sizeof(size_t) == ret); | ||
current += ret; | ||
std::vector<From> codebook; | ||
codebook.resize(nofCodebookBytes / sizeof(From)); | ||
ret = textIndexFile.read(codebook.data(), nofCodebookBytes, current); | ||
current += ret; | ||
AD_CONTRACT_CHECK(ret == size_t(nofCodebookBytes)); | ||
std::vector<uint64_t> simple8bEncoded; | ||
simple8bEncoded.resize(nofElements); | ||
ret = textIndexFile.read(simple8bEncoded.data(), nofBytes - (current - from), | ||
current); | ||
current += ret; | ||
AD_CONTRACT_CHECK(size_t(current - from) == nofBytes); | ||
LOG(DEBUG) << "Decoding Simple8b code...\n"; | ||
ad_utility::Simple8bCode::decode(simple8bEncoded.data(), nofElements, | ||
frequencyEncodedResult.data()); | ||
LOG(DEBUG) << "Reverting frequency encoded items to actual IDs...\n"; | ||
frequencyEncodedResult.resize(nofElements); | ||
vector<To> result; | ||
result.reserve(frequencyEncodedResult.size()); | ||
ql::ranges::for_each(frequencyEncodedResult, [&](const auto& encoded) { | ||
result.push_back(transformer(codebook.at(encoded))); | ||
}); | ||
LOG(DEBUG) << "Done reading frequency-encoded list. Size: " << result.size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In one of your next PRs you can refactor the reading and writing here, using the functionalities in
util/serialization
, because all those manual read
and offset += ret
calls are very hard to read.
From previous = 0; | ||
for (size_t i = 0; i < gapEncodedVector.size(); ++i) { | ||
previous += gapEncodedVector[i]; | ||
result.push_back(transformer(previous)); | ||
} | ||
LOG(DEBUG) << "Done reading gap-encoded list. Size: " << result.size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it (same for the frequency encoding)
happen, that To and From are the same type, (and the transformation therefore does nothing).
In this case you could simply return the read vector, without the (then redundant) copy.
std::vector<uint64_t> textRecordList(firstElements.begin(), | ||
firstElements.end()); | ||
std::vector<WordIndex> wordIndexList(secondElements.begin(), | ||
secondElements.end()); | ||
std::vector<Score> scoreList(thirdElements.begin(), thirdElements.end()); | ||
|
||
GapEncode textRecordEncoder(textRecordList); | ||
FrequencyEncode wordIndexEncoder(wordIndexList); | ||
FrequencyEncode scoreEncoder(scoreList); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactor the FrequencyEncode
and GapEncode
s.t. the constructor argument becomes a template, then you can pass in the lazy views firstElements
... etc. directly, without copying them to a vector
first
(more efficient AND less code).
template <typename T> | ||
void writeVectorAndMoveOffset(const std::vector<T>& vectorToWrite, | ||
size_t nofElements, ad_utility::File& file, | ||
off_t& currentOffset) { | ||
size_t bytes = | ||
textIndexReadWrite::writeList(vectorToWrite, nofElements, file); | ||
currentOffset += bytes; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a separate PR, you can refactor this to use our serialization library, which makes the reading and writing of such types to and from disk much more readable.
This PR is to further clean up the IndexImpl.Text file while also improving the functionality of the frequency and gap encoding. This extends to a possibilty to better compress and store floats or doubles.