Outsourcing and improving the posting writing of IndexImpl.Text.cpp #1699

Flixtastic · 2025-01-04T22:40:16Z

This PR further cleans up the IndexImpl.Text file and also improves the functionality of the frequency and gap encoding. It also opens up the possibility of compressing and storing floats or doubles more effectively.


codecov · 2025-01-04T23:25:16Z

Codecov Report

Attention: Patch coverage is 93.01310% with 16 lines in your changes missing coverage. Please review.

Project coverage is 89.85%. Comparing base (acb6633) to head (a64e848).

Files with missing lines	Patch %	Lines
src/index/TextIndexReadWrite.cpp	91.52%	6 Missing and 4 partials ⚠️
src/index/TextIndexReadWrite.h	90.62%	2 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1699      +/-   ##
==========================================
- Coverage   89.86%   89.85%   -0.02%     
==========================================
  Files         389      391       +2     
  Lines       37308    37317       +9     
  Branches     4204     4202       -2     
==========================================
+ Hits        33527    33531       +4     
- Misses       2485     2487       +2     
- Partials     1296     1299       +3


joka921

The most important thing is:
Please tell me in which places you have changed something, and where you only extracted, s.t. we can properly review the nontrivial changes.
I really like this idea, everything that makes the index class smaller is good.

joka921 · 2025-01-08T19:48:06Z

src/index/IndexImpl.Text.cpp

+  std::ranges::copy(TextIndexReadWrite::readFreqComprList<Id, Score>(
+                        tbmd._cl._nofElements, tbmd._cl._startScorelist,
+                        static_cast<size_t>(tbmd._cl._lastByte + 1 -
+                                            tbmd._cl._startScorelist),
+                        textIndexFile_, &Id::makeFromInt),
+                    idTable.getColumn(2).begin());


What do you think of the following (index-breaking) suggestion, which makes this code possibly simpler
(maybe we can postpone it to another PR, if this stalls your work here):

We consistently directly compress and store the bits of the ID (as they are also consecutive for positive integers, the gap encoding and frequency encoding should still work). This gets rid of all the Id::makeFromBlaIndex(BlaIndex::make(...)) calls in the transform and copy calls.

Again: After some thought please remember this idea, but probably this is for future changes, as it is rather intrusive.

This would theoretically work fine, but there is one slight problem, which has to do with the Simple8b encoding that is applied after the gap encoding. If we gap-encode IDs, the first element will be the starting ID without any encoding. Because IDs use their first few bits to determine what type of ID they are, there will be a one somewhere in the first 4 bits of the ID. This then becomes a problem in the Simple8b encoding, since it only works for uint64_t values whose first 4 bits are 0.
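The constraint described above can be sketched in a few lines. This is only an illustration under stated assumptions: the ID layout (a 4-bit type tag in the most significant bits) is hypothetical and not taken from the QLever sources, and `makeTaggedId`, `fitsInSimple8b`, and `gapEncode` are made-up helper names. The only fact relied on is that Simple8b reserves 4 selector bits per 64-bit word, so a single value has to fit into 60 bits.

```cpp
#include <cstdint>
#include <vector>

// ASSUMPTION: hypothetical ID layout with 4 tag bits on top of 60 payload bits.
constexpr uint64_t kPayloadBits = 60;

// Builds a hypothetical tagged ID from a type tag and a payload.
constexpr uint64_t makeTaggedId(uint64_t tag, uint64_t payload) {
  return (tag << kPayloadBits) | payload;
}

// Simple8b keeps 4 selector bits per word, so a value must fit into 60 bits.
constexpr bool fitsInSimple8b(uint64_t value) {
  return (value >> kPayloadBits) == 0;
}

// Gap encoding: the first element is stored verbatim, every later element
// is stored as the difference to its predecessor.
std::vector<uint64_t> gapEncode(const std::vector<uint64_t>& sorted) {
  std::vector<uint64_t> gaps;
  gaps.reserve(sorted.size());
  uint64_t previous = 0;
  for (uint64_t value : sorted) {
    gaps.push_back(value - previous);
    previous = value;
  }
  return gaps;
}
```

With three IDs that share the same tag, the gaps between them are small and fit, but the first (unencoded) element still carries the tag bits and does not.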

joka921 · 2025-01-08T19:53:23Z

src/index/TextIndexReadWrite.cpp

+  std::vector<uint64_t> textRecordList(firstElements.begin(),
+                                       firstElements.end());
+  std::vector<WordIndex> wordIndexList(secondElements.begin(),
+                                       secondElements.end());
+  std::vector<Score> scoreList(thirdElements.begin(), thirdElements.end());
+
+  GapEncode<uint64_t> textRecordEncoder(textRecordList);
+  FrequencyEncode<WordIndex> wordIndexEncoder(wordIndexList);
+  FrequencyEncode<Score> scoreEncoder(scoreList);


Do these really need a vector, or can we make them work with the lazy views directly? (I will see once I get there.)


joka921 · 2025-01-08T19:57:54Z

src/index/TextIndexReadWrite.cpp

+                               off_t& currentOffset) {
+  TextIndexReadWrite::writeVectorAndMoveOffset(encodedVector_, nofElements, out,
+                                               currentOffset);
+}


Can you point out to me (via comments) the places in this file where you have changed anything beyond just copying and extracting it here?

The reason is that a lot of code here requires modernization, but I would prefer to first quickly do the extraction, and then modernize in a separate step.

See comments of 09d4a97

joka921 · 2025-01-08T20:00:15Z

src/index/TextIndexReadWrite.h

+  explicit GapEncode(const TypedVector& vectorToEncode);
+
+  void writeToFile(ad_utility::File& out, size_t nofElements,
+                   off_t& currentOffset);


I think you can get away with a lazy view as the input to the constructor.

Would the code in IndexImpl.cpp become simpler if you made a static function that does the encoding + writing in one step (same for the other encoders)?

(Also maybe part of a separate PR. Your work here is valuable: by moving the code to a separate file, it now has a size where we can see the possible improvements much more easily.)


sparql-conformance · 2025-01-15T14:24:28Z

Conformance check passed ✅

No test result changes.

Details: https://qlever.cs.uni-freiburg.de/sparql-conformance-ui?cur=a64e848577b9c4cd1a3fae414f3f54fc2f3dbcd8&prev=acb6633debc7341985341aff147b5038cc8d951b

sonarqubecloud · 2025-01-15T15:21:40Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code


Flixtastic · 2025-01-15T21:29:08Z

The 2 SonarQube issues can't be addressed with the current implementation of ranges, if I am correct.

joka921

Small reviews from the diff, I'll have another pass over everything with your // MODIFIED comments.

joka921 · 2025-01-16T11:20:44Z

src/index/TextIndexReadWrite.cpp

+size_t writeList(const std::vector<Numeric> data, size_t nofElements,
+                 ad_utility::File& file) {


1. The argument should be const vector& (doesn't sonar also say so?).

2. Is nofElements the same as data.size()? In that case this redundant argument should be removed.

3. As a replacement for const std::vector<...>& you can always use std::span<const Numeric> as the argument type here (it is more generic and more modern).

But 1 and 2 are more important than 3.

joka921 · 2025-01-16T11:24:11Z

src/index/TextIndexReadWrite.cpp

+  ql::ranges::transform(frequencyMap.begin(), frequencyMap.end(),
+                        std::back_inserter(frequencyVector),
+                        [](const auto& kv) { return kv; });
+  ql::ranges::sort(
+      frequencyVector.begin(), frequencyVector.end(),
+      [](const auto& a, const auto& b) { return a.second > b.second; });


Suggested change

-  ql::ranges::transform(frequencyMap.begin(), frequencyMap.end(),
-                        std::back_inserter(frequencyVector),
-                        [](const auto& kv) { return kv; });
-  ql::ranges::sort(
-      frequencyVector.begin(), frequencyVector.end(),
-      [](const auto& a, const auto& b) { return a.second > b.second; });
+  ql::ranges::transform(frequencyMap,
+                        std::back_inserter(frequencyVector),
+                        [](const auto& kv) { return kv; });
+  ql::ranges::sort(
+      frequencyVector,
+      [](const auto& a, const auto& b) { return a.second > b.second; });

That's one of the points of ql::ranges.

And can't the first transform with the identity function simply be

ql::ranges::copy(frequencyMap, std::back_inserter(frequencyVector));

joka921 · 2025-01-16T11:25:44Z

src/index/TextIndexReadWrite.h

-  const TypedVector getEncodedVector() { return encodedVector_; }
-  const CodeMap& getCodeMap() { return codeMap_; }
-  const CodeBook& getCodeBook() { return codeBook_; }
+  TypedVector getEncodedVector() { return encodedVector_; }


Suggested change

-  TypedVector getEncodedVector() { return encodedVector_; }
+  const TypedVector& getEncodedVector() const { return encodedVector_; }

joka921

Some additional small comments.

joka921 · 2025-01-16T11:39:08Z

src/index/TextIndexReadWrite.h

+    const std::function<To(From)>& transformer = [](From x) {
+      return static_cast<To>(x);


For now, please make the transformation a template parameter (also for the gap encoding reading below). Doesn't sonarcloud complain here?

template <typename To, typename From, typename Transformer = decltype(ad_utility::staticCast<To>)>
vector<To> readFreqList(..., Transformer transformer = {}) {...}
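A self-contained sketch of that pattern follows. It is not the PR's `readFreqComprList`: the file reading and Simple8b decoding are left out, `decodeWithCodebook` is a hypothetical name, and `StaticCast` is a locally defined stand-in because the exact definition of `ad_utility::staticCast` is not shown in this thread.

```cpp
#include <cstdint>
#include <vector>

// ASSUMPTION: local stand-in for a default "static_cast to `To`" functor.
template <typename To>
struct StaticCast {
  template <typename From>
  constexpr To operator()(From&& x) const {
    return static_cast<To>(x);
  }
};

// Maps each frequency code back to its codebook entry and applies the
// transformer. Because the transformer is a template parameter, the call can
// be inlined and no `std::function` indirection is involved.
template <typename To, typename From, typename Transformer = StaticCast<To>>
std::vector<To> decodeWithCodebook(const std::vector<From>& codebook,
                                   const std::vector<uint64_t>& codes,
                                   Transformer transformer = {}) {
  std::vector<To> result;
  result.reserve(codes.size());
  for (uint64_t code : codes) {
    result.push_back(transformer(codebook.at(code)));
  }
  return result;
}
```

A caller that only needs a cast writes `decodeWithCodebook<int64_t>(codebook, codes)`, while a caller with a custom conversion passes a lambda as the third argument and lets `Transformer` be deduced.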

joka921 · 2025-01-16T11:40:04Z

src/index/TextIndexReadWrite.h

+  size_t nofCodebookBytes;
+  vector<uint64_t> frequencyEncodedResult;
+  frequencyEncodedResult.resize(nofElements + 250);
+  off_t current = from;
+  size_t ret = textIndexFile.read(&nofCodebookBytes, sizeof(size_t), current);
+  LOG(TRACE) << "Nof Codebook Bytes: " << nofCodebookBytes << '\n';
+  AD_CONTRACT_CHECK(sizeof(size_t) == ret);
+  current += ret;
+  std::vector<From> codebook;
+  codebook.resize(nofCodebookBytes / sizeof(From));
+  ret = textIndexFile.read(codebook.data(), nofCodebookBytes, current);
+  current += ret;
+  AD_CONTRACT_CHECK(ret == size_t(nofCodebookBytes));
+  std::vector<uint64_t> simple8bEncoded;
+  simple8bEncoded.resize(nofElements);
+  ret = textIndexFile.read(simple8bEncoded.data(), nofBytes - (current - from),
+                           current);
+  current += ret;
+  AD_CONTRACT_CHECK(size_t(current - from) == nofBytes);
+  LOG(DEBUG) << "Decoding Simple8b code...\n";
+  ad_utility::Simple8bCode::decode(simple8bEncoded.data(), nofElements,
+                                   frequencyEncodedResult.data());
+  LOG(DEBUG) << "Reverting frequency encoded items to actual IDs...\n";
+  frequencyEncodedResult.resize(nofElements);
+  vector<To> result;
+  result.reserve(frequencyEncodedResult.size());
+  ql::ranges::for_each(frequencyEncodedResult, [&](const auto& encoded) {
+    result.push_back(transformer(codebook.at(encoded)));
+  });
+  LOG(DEBUG) << "Done reading frequency-encoded list. Size: " << result.size()


In one of your next PRs you can refactor the reading and writing here using the functionality in util/serialization, because all those manual read and offset += ret calls are very hard to read.

joka921 · 2025-01-16T11:44:46Z

src/index/TextIndexReadWrite.h

+  From previous = 0;
+  for (size_t i = 0; i < gapEncodedVector.size(); ++i) {
+    previous += gapEncodedVector[i];
+    result.push_back(transformer(previous));
+  }
+  LOG(DEBUG) << "Done reading gap-encoded list. Size: " << result.size()


Does it happen (same for the frequency encoding) that To and From are the same type, so that the transformation does nothing?
In this case you could simply return the read vector, without the (then redundant) copy.

joka921 · 2025-01-16T11:50:21Z

src/index/TextIndexReadWrite.cpp

+  std::vector<uint64_t> textRecordList(firstElements.begin(),
+                                       firstElements.end());
+  std::vector<WordIndex> wordIndexList(secondElements.begin(),
+                                       secondElements.end());
+  std::vector<Score> scoreList(thirdElements.begin(), thirdElements.end());
+
+  GapEncode textRecordEncoder(textRecordList);
+  FrequencyEncode wordIndexEncoder(wordIndexList);
+  FrequencyEncode scoreEncoder(scoreList);


Refactor the FrequencyEncode and GapEncode s.t. the constructor argument becomes a template, then you can pass in the lazy views firstElements... etc. directly, without copying them to a vector first
(more efficient AND less code).

joka921 · 2025-01-16T11:56:53Z

src/index/TextIndexReadWrite.cpp

+template <typename T>
+void writeVectorAndMoveOffset(const std::vector<T>& vectorToWrite,
+                              size_t nofElements, ad_utility::File& file,
+                              off_t& currentOffset) {
+  size_t bytes =
+      textIndexReadWrite::writeList(vectorToWrite, nofElements, file);
+  currentOffset += bytes;
+}
+


In a separate PR, you can refactor this to use our serialization library, which makes the reading and writing of such types to and from disk much more readable.

Flixtastic and others added 3 commits January 3, 2025 00:52

First try at outsourcing the writing of the text index. Doesn't work yet, commit is used to initialize branch

912ebce

Solved issues with compression and also outsourced the reading of frequency and gap compressed lists.

2d6f3ad

Merge branch 'ad-freiburg:master' into text-index-compression

0b2b516

Flixtastic and others added 6 commits January 6, 2025 13:14

Merge branch 'ad-freiburg:master' into text-index-compression

ff19077

Renaming of TextIndexWriteRead to TextIndexReadWrite

75cf7d1

Improved the readFreqComprList to now be more logical and usable

5b29ca5

Clean up code

57d0ef7

Adjusted the readGapComprList to follow the usage of readFreqComprList

66b60b6

Merge branch 'ad-freiburg:master' into text-index-compression

e07488e

joka921 reviewed Jan 8, 2025

Flixtastic and others added 3 commits January 9, 2025 12:44

Merge branch 'ad-freiburg:master' into text-index-compression

fdb6932

Added comments to TextIndexReadWrite.h. Those comments are only for review and should be removed later

09d4a97

Implemented the small changes requested. (Using ql::views and lowercase namespace)

777329a

Flixtastic requested a review from joka921 January 9, 2025 19:38

Addressed SonarQube issues and removed the usage of new and delete with using vector.data()

a64e848

joka921 reviewed Jan 16, 2025

joka921 requested changes Jan 16, 2025
