Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add map type support #13906

Merged
merged 1 commit into from
Oct 2, 2024
Merged

Add map type support #13906

merged 1 commit into from
Oct 2, 2024

Conversation

xiangfu0
Copy link
Contributor

@xiangfu0 xiangfu0 commented Aug 29, 2024

  • Support Map type which can be defined with a String key to a typed value.
  • Support MapItem function intMap['key'] to extract the map value from a given key.
  • Default implementation is based on forward index and use MapUtils to ser/de the Map to and from bytes.

Sample schema:

{
  "schemaName" : "testMapFields",
  "enableColumnBasedNullHandling" : false,
  "dateTimeFieldSpecs" : [ {
    "name" : "ts",
    "dataType" : "TIMESTAMP",
    "fieldType" : "DATE_TIME",
    "notNull" : false,
    "format" : "TIMESTAMP",
    "granularity" : "1:DAYS"
  } ],
  "complexFieldSpecs" : [ {
    "name" : "stringMap",
    "dataType" : "MAP",
    "fieldType" : "COMPLEX",
    "notNull" : false,
    "childFieldSpecs" : {
      "key" : {
        "name" : "key",
        "dataType" : "STRING",
        "fieldType" : "DIMENSION",
        "notNull" : false
      },
      "value" : {
        "name" : "value",
        "dataType" : "STRING",
        "fieldType" : "DIMENSION",
        "notNull" : false
      }
    }
  }, {
    "name" : "intMap",
    "dataType" : "MAP",
    "fieldType" : "COMPLEX",
    "notNull" : false,
    "childFieldSpecs" : {
      "key" : {
        "name" : "key",
        "dataType" : "STRING",
        "fieldType" : "DIMENSION",
        "notNull" : false
      },
      "value" : {
        "name" : "value",
        "dataType" : "INT",
        "fieldType" : "DIMENSION",
        "notNull" : false
      }
    }
  } ]
}

Sample query 1: Selection only

SELECT * FROM myMapTable ORDER BY ts

Sample response:

{"dataSchema":{"columnNames":["intMap","stringMap","ts"],"columnDataTypes":["MAP","MAP","TIMESTAMP"]},"rows":[[{},{},"2024-10-02 14:04:25.009"],[{"k0":1},{"k0":"1"},"2024-10-02 14:04:25.01"],[{"k0":2,"k1":2},{"k0":"2","k1":"2"},"2024-10-02 14:04:25.011"],[{"k0":3,"k1":3,"k2":3},{"k0":"3","k1":"3","k2":"3"},"2024-10-02 14:04:25.012"],[{"k3":4,"k0":4,"k1":4,"k2":4},{"k3":"4","k0":"4","k1":"4","k2":"4"},"2024-10-02 14:04:25.013"],[{"k3":5,"k4":5,"k0":5,"k1":5,"k2":5},{"k3":"5","k4":"5","k0":"5","k1":"5","k2":"5"},"2024-10-02 14:04:25.014"],[{"k3":6,"k4":6,"k5":6,"k0":6,"k1":6,"k2":6},{"k3":"6","k4":"6","k5":"6","k0":"6","k1":"6","k2":"6"},"2024-10-02 14:04:25.015"],[{"k0":7,"k1":7,"k2":7,"k3":7,"k4":7,"k5":7,"k6":7},{"k0":"7","k1":"7","k2":"7","k3":"7","k4":"7","k5":"7","k6":"7"},"2024-10-02 14:04:25.016"],[{"k0":8,"k1":8,"k2":8,"k3":8,"k4":8,"k5":8,"k6":8,"k7":8},{"k0":"8","k1":"8","k2":"8","k3":"8","k4":"8","k5":"8","k6":"8","k7":"8"},"2024-10-02 14:04:25.017"],[{"k0":9,"k1":9,"k2":9,"k3":9,"k4":9,"k5":9,"k6":9,"k7":9,"k8":9},{"k0":"9","k1":"9","k2":"9","k3":"9","k4":"9","k5":"9","k6":"9","k7":"9","k8":"9"},"2024-10-02 14:04:25.018"]]}

Sample query 2: Selection order-by

SELECT stringMap['k0'], intMap['k0'] FROM myMapTable ORDER BY ts

Sample response:

{"dataSchema":{"columnNames":["item(stringMap,'k0')","item(intMap,'k0')"],"columnDataTypes":["STRING","INT"]},"rows":[["null",-2147483648],["1",1],["2",2],["3",3],["4",4],["5",5],["6",6],["7",7],["8",8],["9",9]]}

Sample query 3: Aggregation only

SELECT MAX(intMap['k0']), MAX(intMap['k1']) FROM myMapTable

Sample response:

{"dataSchema":{"columnNames":["max(item(intMap,'k0'))","max(item(intMap,'k1'))"],"columnDataTypes":["DOUBLE","DOUBLE"]},"rows":[[999.0,999.0]]}

Sample query 4: Aggregation group-by

SELECT stringMap['k0'] AS key, MIN(intMap['k0']) AS value FROM myMapTable GROUP BY key ORDER BY value

Sample response:

{"dataSchema":{"columnNames":["key","value"],"columnDataTypes":["STRING","DOUBLE"]},"rows":[["null",-2.147483648E9],["1",1.0],["2",2.0],["3",3.0],["4",4.0],["5",5.0],["6",6.0],["7",7.0],["8",8.0],["9",9.0]]}

Sample query 5: Filter

SELECT stringMap['k2'] FROM myMapTable WHERE stringMap['k1']  = '25'

Sample response:

{"dataSchema":{"columnNames":["item(stringMap,'k2')"],"columnDataTypes":["STRING"]},"rows":[["25"]]}

Sample query 6: Filter on non-existing key

SELECT stringMap['k2'] FROM myMapTable WHERE stringMap['kk']  = '25'

Sample response:

{"dataSchema":{"columnNames":["item(stringMap,'k2')"],"columnDataTypes":["STRING"]},"rows":[]}

Sample query 7: Select non-existing key

SELECT stringMap['kkk'], intMap['kkk'] FROM myMapTable

Sample response:

{"dataSchema":{"columnNames":["item(stringMap,'kkk')","item(intMap,'kkk')"],"columnDataTypes":["STRING","INT"]},"rows":[["null",-2147483648],["null",-2147483648],["null",-2147483648],["null",-2147483648],["null",-2147483648],["null",-2147483648],["null",-2147483648],["null",-2147483648],["null",-2147483648],["null",-2147483648]]}

@xiangfu0 xiangfu0 force-pushed the feature-map-type-new branch from e1e08ce to e211795 Compare August 29, 2024 07:51
@codecov-commenter
Copy link

codecov-commenter commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 20.64777% with 588 lines in your changes missing coverage. Please review.

Project coverage is 63.86%. Comparing base (59551e4) to head (b59e4d8).
Report is 1110 commits behind head on master.

Files with missing lines Patch % Lines
...or/impl/stats/MapColumnPreIndexStatsCollector.java 0.00% 104 Missing ⚠️
...egment/local/segment/index/map/NullDataSource.java 0.00% 43 Missing ⚠️
...ent/index/readers/map/ImmutableMapIndexReader.java 0.00% 43 Missing ⚠️
.../local/segment/index/map/MutableMapDataSource.java 0.00% 39 Missing ⚠️
...ocal/segment/index/map/ImmutableMapDataSource.java 0.00% 37 Missing ⚠️
...ator/transform/function/ItemTransformFunction.java 0.00% 34 Missing ⚠️
...local/segment/index/map/MapIndexReaderWrapper.java 0.00% 29 Missing ⚠️
.../segment/local/segment/index/map/MapIndexType.java 44.00% 27 Missing and 1 partial ⚠️
...ent/local/segment/index/map/BaseMapDataSource.java 0.00% 27 Missing ⚠️
...ent/local/segment/index/map/MapKeyIndexReader.java 0.00% 23 Missing ⚠️
... and 39 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13906      +/-   ##
============================================
+ Coverage     61.75%   63.86%   +2.11%     
- Complexity      207     1537    +1330     
============================================
  Files          2436     2618     +182     
  Lines        133233   144230   +10997     
  Branches      20636    22061    +1425     
============================================
+ Hits          82274    92115    +9841     
- Misses        44911    45323     +412     
- Partials       6048     6792     +744     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.84% <20.64%> (+2.13%) ⬆️
java-21 63.73% <20.64%> (+2.10%) ⬆️
skip-bytebuffers-false 63.86% <20.64%> (+2.11%) ⬆️
skip-bytebuffers-true 63.70% <20.64%> (+35.97%) ⬆️
temurin 63.86% <20.64%> (+2.11%) ⬆️
unittests 63.86% <20.64%> (+2.11%) ⬆️
unittests1 55.55% <20.10%> (+8.66%) ⬆️
unittests2 34.39% <7.69%> (+6.65%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@xiangfu0 xiangfu0 force-pushed the feature-map-type-new branch 7 times, most recently from 8c43e8d to 1290317 Compare September 1, 2024 04:16
@xiangfu0 xiangfu0 force-pushed the feature-map-type-new branch 2 times, most recently from ca8599d to ecb6208 Compare September 7, 2024 18:36
int size = positionOffsetInVariableBufferAndGetLength(rowId, colId);
ByteBuffer buffer = _variableSizeData.slice();
buffer.limit(size);
return MapUtils.deserializeMap(buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How expensive will be deserializeMap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be expensive for general map conversion and it's ok to be slow for select the whole map column.
The major purpose of introducing MAP type is to support more indexing for filtering and aggregation.

@Override
public Object convert(Object value, PinotDataType sourceType) {
switch (sourceType) {
case OBJECT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we adding Object Type as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Source type may not be recognized as MAP, it could be OBJECT as well.

@Override
public BlockValSet getBlockValueSet(String[] paths) {
// TODO: only support one level of path for now, e.g. `map.key`
assert paths.length == 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to assert or throw exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens at query runtime, so it will only fail the query execution

* Evaluates myMap['foo']
*/
public class ItemTransformFunction extends BaseTransformFunction {
public static final String FUNCTION_NAME = "item";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Name item?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (numRows == 0) {
return values;
}
if (storedType == DataType.UNKNOWN) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Unknown type allowed during write? Should this be an exception?

if (_fieldSpec.isSingleValueField()) {
_maxNumValuesPerMVEntry = -1;
} else {
_maxNumValuesPerMVEntry = columnMetadata.getMaxNumberOfMultiValues();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MultiValue MAP Column is not supported. What does this mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAP is considered as a single value column.

}

@Override
public ColumnMetadata getKeyMetadata(String key) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these methods return any metadata value. Is that expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are not implemented fully, so ignore.

}

@Override
public boolean isSingleValue() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

* @param context Reader context
* @return MAP type single-value at the given document id
*/
default Map<String, Object> getMap(int docId, T context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What indexes are supported on MAP type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not yet implemented


// Besides the value bytes, we store: size, length for each key, length for each value
long bufferSize = (1 + 2 * (long) size) * Integer.BYTES;
byte[][] keyBytesArray = new byte[size][];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for very large map will these become performance bottlenecks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes and expected.

@xiangfu0 xiangfu0 force-pushed the feature-map-type-new branch 4 times, most recently from d13b0f5 to 0bd5fa8 Compare October 1, 2024 02:34
@xiangfu0 xiangfu0 force-pushed the feature-map-type-new branch from 0bd5fa8 to b59e4d8 Compare October 1, 2024 03:39
@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels Oct 1, 2024
@xiangfu0 xiangfu0 merged commit ea103c1 into apache:master Oct 2, 2024
21 checks passed
@xiangfu0 xiangfu0 deleted the feature-map-type-new branch October 2, 2024 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants