
Conversation


@gortiz gortiz commented Jun 3, 2024

This PR includes several changes to the code that builds, serializes, and deserializes DataBlocks, in order to improve performance.

The changes here do not modify the binary format (the included tests verify that). Instead, I've changed how the code works to reduce allocations and copies. I'm sure more can be done to improve performance without breaking the binary format, and even more could be done if we decide to break the format.

The PR includes four benchmarks: one that builds a DataBlock from a List&lt;Object[]&gt;, one that serializes that DataBlock, one that deserializes it, and one that does all three in sequence.
The old version of this benchmark is in dbeeaaf.
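To illustrate the three phases the benchmarks exercise, here is a minimal, hypothetical round trip for a single INT column. This is only a sketch of the build/serialize/deserialize shape; the real benchmarks use Pinot's DataBlockBuilder and serdes across all the listed types, and none of these class or method names come from the PR.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the three benchmark phases for one INT column.
public class BenchmarkPhasesSketch {

  // Phase 1: "build" — collapse List<Object[]> rows into a primitive column.
  public static int[] build(List<Object[]> rows) {
    int[] column = new int[rows.size()];
    for (int i = 0; i < rows.size(); i++) {
      column[i] = (Integer) rows.get(i)[0];
    }
    return column;
  }

  // Phase 2: "serialize" — a length prefix followed by big-endian ints.
  public static byte[] serialize(int[] column) {
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * (column.length + 1));
    buf.putInt(column.length);
    for (int v : column) {
      buf.putInt(v);
    }
    return buf.array();
  }

  // Phase 3: "deserialize" — read the column back from the bytes.
  public static int[] deserialize(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int[] column = new int[buf.getInt()];
    for (int i = 0; i < column.length; i++) {
      column[i] = buf.getInt();
    }
    return column;
  }

  public static void main(String[] args) {
    List<Object[]> rows = new ArrayList<>();
    rows.add(new Object[]{7});
    rows.add(new Object[]{42});
    int[] back = deserialize(serialize(build(rows)));
    System.out.println(back[0] + " " + back[1]); // prints "7 42"
  }
}
```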

The results of the benchmark that builds, serializes, and deserializes are:

| Benchmark | _dataType | _rows | _blockType | _version | ops/ms | MB/op |
|---|---|---|---|---|---|---|
| all | BIG_DECIMAL | 10000 | COLUMNAR | old | 6.84 | 2.51 MB |
| all | BIG_DECIMAL | 10000 | COLUMNAR | new | 10.13 | 1.13 MB |
| all | BIG_DECIMAL | 10000 | ROW | old | 6.25 | 2.50 MB |
| all | BIG_DECIMAL | 10000 | ROW | new | 9.26 | 1.13 MB |
| all | BIG_DECIMAL | 1000000 | COLUMNAR | old | 0.04 | 257.40 MB |
| all | BIG_DECIMAL | 1000000 | COLUMNAR | new | 0.09 | 111.71 MB |
| all | BIG_DECIMAL | 1000000 | ROW | old | 0.04 | 257.39 MB |
| all | BIG_DECIMAL | 1000000 | ROW | new | 0.08 | 111.71 MB |
| all | BYTES | 10000 | COLUMNAR | old | 3.54 | 5.23 MB |
| all | BYTES | 10000 | COLUMNAR | new | 17.34 | 1.09 MB |
| all | BYTES | 10000 | ROW | old | 3.54 | 5.22 MB |
| all | BYTES | 10000 | ROW | new | 15.77 | 1.09 MB |
| all | BYTES | 1000000 | COLUMNAR | old | 0.02 | 551.55 MB |
| all | BYTES | 1000000 | COLUMNAR | new | 0.11 | 108.16 MB |
| all | BYTES | 1000000 | ROW | old | 0.02 | 551.54 MB |
| all | BYTES | 1000000 | ROW | new | 0.10 | 108.16 MB |
| all | INT | 10000 | COLUMNAR | old | 65.15 | 0.28 MB |
| all | INT | 10000 | COLUMNAR | new | 177.29 | 0.07 MB |
| all | INT | 10000 | ROW | old | 25.27 | 0.31 MB |
| all | INT | 10000 | ROW | new | 64.06 | 0.07 MB |
| all | INT | 1000000 | COLUMNAR | old | 0.34 | 33.36 MB |
| all | INT | 1000000 | COLUMNAR | new | 0.85 | 4.84 MB |
| all | INT | 1000000 | ROW | old | 0.21 | 37.35 MB |
| all | INT | 1000000 | ROW | new | 0.49 | 4.84 MB |
| all | LONG | 10000 | COLUMNAR | old | 37.42 | 0.52 MB |
| all | LONG | 10000 | COLUMNAR | new | 146.98 | 0.11 MB |
| all | LONG | 10000 | ROW | old | 23.11 | 0.51 MB |
| all | LONG | 10000 | ROW | new | 65.48 | 0.11 MB |
| all | LONG | 1000000 | COLUMNAR | old | 0.17 | 65.36 MB |
| all | LONG | 1000000 | COLUMNAR | new | 0.64 | 8.84 MB |
| all | LONG | 1000000 | ROW | old | 0.15 | 65.35 MB |
| all | LONG | 1000000 | ROW | new | 0.43 | 8.84 MB |
| all | LONG_ARRAY | 10000 | COLUMNAR | old | 3.68 | 4.51 MB |
| all | LONG_ARRAY | 10000 | COLUMNAR | new | 8.73 | 0.90 MB |
| all | LONG_ARRAY | 10000 | ROW | old | 3.59 | 4.50 MB |
| all | LONG_ARRAY | 10000 | ROW | new | 10.21 | 0.90 MB |
| all | LONG_ARRAY | 1000000 | COLUMNAR | old | 0.02 | 479.52 MB |
| all | LONG_ARRAY | 1000000 | COLUMNAR | new | 0.09 | 88.62 MB |
| all | LONG_ARRAY | 1000000 | ROW | old | 0.02 | 479.51 MB |
| all | LONG_ARRAY | 1000000 | ROW | new | 0.09 | 88.62 MB |
| all | STRING | 10000 | COLUMNAR | old | 24.28 | 0.47 MB |
| all | STRING | 10000 | COLUMNAR | new | 30.64 | 0.08 MB |
| all | STRING | 10000 | ROW | old | 16.61 | 0.34 MB |
| all | STRING | 10000 | ROW | new | 23.92 | 0.08 MB |
| all | STRING | 1000000 | COLUMNAR | old | 0.18 | 49.38 MB |
| all | STRING | 1000000 | COLUMNAR | new | 0.26 | 4.86 MB |
| all | STRING | 1000000 | ROW | old | 0.14 | 37.37 MB |
| all | STRING | 1000000 | ROW | new | 0.21 | 20.86 MB |
| all | STRING_ARRAY | 10000 | COLUMNAR | old | 1.94 | 4.02 MB |
| all | STRING_ARRAY | 10000 | COLUMNAR | new | 2.92 | 1.96 MB |
| all | STRING_ARRAY | 10000 | ROW | old | 1.84 | 4.01 MB |
| all | STRING_ARRAY | 10000 | ROW | new | 2.86 | 1.96 MB |
| all | STRING_ARRAY | 1000000 | COLUMNAR | old | 0.01 | 412.55 MB |
| all | STRING_ARRAY | 1000000 | COLUMNAR | new | 0.02 | 192.53 MB |
| all | STRING_ARRAY | 1000000 | ROW | old | 0.01 | 412.54 MB |
| all | STRING_ARRAY | 1000000 | ROW | new | 0.02 | 192.53 MB |

As you can see, throughput improves by roughly 1.5x to 5x, and the reduction in allocation is even larger. This is especially important because, while this benchmark runs on a machine with plenty of memory, a high allocation rate can be problematic in production: a single query allocating too much may heavily affect the latency of other queries due to GC (even with non-blocking GCs!).

One of the key techniques behind the performance gain is the use of special types of buffers and streams. To keep this PR smaller, I created #13304, which contains only the buffer and stream changes. We can merge that PR before merging this one.

TODO:

  • Clean up the code
  • Create tests that verify the binary format is not broken

@gortiz gortiz mentioned this pull request Jun 3, 2024
@codecov-commenter

codecov-commenter commented Jun 3, 2024

Codecov Report

Attention: Patch coverage is 53.10263% with 393 lines in your changes missing coverage. Please review.

Project coverage is 57.84%. Comparing base (59551e4) to head (bd6d9e5).
Report is 1119 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...apache/pinot/common/datablock/DataBlockEquals.java | 6.93% | 157 Missing and 4 partials ⚠️ |
| ...g/apache/pinot/common/datablock/BaseDataBlock.java | 7.86% | 81 Missing and 1 partial ⚠️ |
| ...pinot/common/datablock/ZeroCopyDataBlockSerde.java | 62.55% | 66 Missing and 16 partials ⚠️ |
| .../apache/pinot/common/datablock/DataBlockUtils.java | 61.70% | 15 Missing and 3 partials ⚠️ |
| ...java/org/apache/pinot/common/utils/DataSchema.java | 0.00% | 14 Missing ⚠️ |
| ...e/pinot/segment/spi/memory/CompoundDataBuffer.java | 30.76% | 9 Missing ⚠️ |
| .../pinot/core/common/datablock/DataBlockBuilder.java | 96.38% | 3 Missing and 5 partials ⚠️ |
| ...rg/apache/pinot/common/datablock/RowDataBlock.java | 0.00% | 4 Missing ⚠️ |
| ...ache/pinot/common/datablock/ColumnarDataBlock.java | 0.00% | 3 Missing ⚠️ |
| .../apache/pinot/common/datablock/DataBlockSerde.java | 70.00% | 2 Missing and 1 partial ⚠️ |
| ... and 4 more | | |
Additional details and impacted files
```
@@             Coverage Diff              @@
##             master   #13303      +/-   ##
============================================
- Coverage     61.75%   57.84%   -3.91%
- Complexity      207      219      +12
============================================
  Files          2436     2615     +179
  Lines        133233   143536   +10303
  Branches      20636    22053    +1417
============================================
+ Hits          82274    83028     +754
- Misses        44911    54020    +9109
- Partials       6048     6488     +440
```
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 57.82% <53.10%> (-3.89%) ⬇️ |
| java-21 | 57.73% <53.10%> (-3.89%) ⬇️ |
| skip-bytebuffers-false | 57.84% <53.10%> (-3.91%) ⬇️ |
| skip-bytebuffers-true | 57.69% <53.10%> (+29.96%) ⬆️ |
| temurin | 57.84% <53.10%> (-3.91%) ⬇️ |
| unittests | 57.84% <53.10%> (-3.91%) ⬇️ |
| unittests1 | 40.75% <53.10%> (-6.14%) ⬇️ |
| unittests2 | 27.86% <0.00%> (+0.13%) ⬆️ |

Flags with carried forward coverage won't be shown.


@gortiz gortiz force-pushed the multi-stage-serde branch 3 times, most recently from 270f679 to 2152e36 Compare June 4, 2024 08:31
Comment on lines -32 to -41
```java
public void testSerdeCorrectness(BaseDataBlock dataBlock)
    throws IOException {
  byte[] bytes = dataBlock.toBytes();
  ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
  int versionType = DataBlockUtils.readVersionType(byteBuffer);
  BaseDataBlock deserialize = deserialize(byteBuffer, versionType);

  assertEquals(byteBuffer.position(), bytes.length, "Buffer position should be at the end of the buffer");
  assertEquals(deserialize, dataBlock, "Deserialized data block should be the same as the original data block");
}
```
gortiz (Contributor, Author):

Serde correctness is now verified in the DataBlockSerde tests.

Comment on lines -190 to +154
```java
  rowBuilder._fixedSizeDataByteArrayOutputStream.write(byteBuffer.array(), 0, byteBuffer.position());
}

CompoundDataBuffer.Builder varBufferBuilder = new CompoundDataBuffer.Builder(ByteOrder.BIG_ENDIAN, true)
    .addPagedOutputStream(varSize);
```
gortiz (Contributor, Author):

This is probably the biggest performance improvement when creating the block. The old version allocated a large array (which is expensive, as it may fall outside the TLAB) and then copied it into the ArrayOutputStream, which probably allocated that amount of bytes again.

Now we simply reuse the whole byte buffer, adding it to the builder, which is essentially a list of byte buffers that can later be used to send the data over the network or to read the data directly in another local stage.
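As a rough sketch of the difference (illustrative names only, not Pinot's actual CompoundDataBuffer API): instead of flattening every buffer into one large array up front, the builder can adopt the buffers as-is and only materialize a contiguous copy if a consumer really needs one.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the zero-copy idea: keep a list of the original
// buffers instead of eagerly flattening them into one big byte[].
// Names are hypothetical, not Pinot's actual classes.
public class CompoundBufferSketch {
  private final List<ByteBuffer> _buffers = new ArrayList<>();
  private int _totalBytes = 0;

  // Zero-copy: adopt the buffer as-is (caller must not mutate it afterwards).
  public CompoundBufferSketch add(ByteBuffer buffer) {
    _buffers.add(buffer.asReadOnlyBuffer());
    _totalBytes += buffer.remaining();
    return this;
  }

  public int totalBytes() {
    return _totalBytes;
  }

  // Copying happens at most once, and only when a contiguous view is truly
  // required (e.g. handing the data to a layer that needs a single array).
  public byte[] toContiguous() {
    byte[] result = new byte[_totalBytes];
    int offset = 0;
    for (ByteBuffer buf : _buffers) {
      ByteBuffer dup = buf.duplicate();
      int len = dup.remaining();
      dup.get(result, offset, len);
      offset += len;
    }
    return result;
  }

  public static void main(String[] args) {
    CompoundBufferSketch builder = new CompoundBufferSketch();
    builder.add(ByteBuffer.wrap(new byte[]{1, 2, 3}))
        .add(ByteBuffer.wrap(new byte[]{4, 5}));
    System.out.println(builder.totalBytes());      // 5
    System.out.println(builder.toContiguous()[3]); // 4
  }
}
```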

```java
  }
  columnarBuilder._fixedSizeDataByteArrayOutputStream.write(byteBuffer.array(), 0, byteBuffer.position());
```
gortiz (Contributor, Author):

Here again we were copying bytes. In this case the allocation is smaller, but still problematic.

```java
  // @Param(value = {"0", "10", "90"})
  int _nullPerCent = 10;

  // @Param(value = {"direct_small", "heap_small"})
```
gortiz (Contributor, Author):

This can be used to study performance with different allocators. heap_small is the fastest by a large margin.

@gortiz gortiz marked this pull request as ready for review June 10, 2024 12:10
@gortiz gortiz force-pushed the multi-stage-serde branch from 5033ea4 to 5332e3f Compare June 14, 2024 10:29
@gortiz gortiz requested a review from Jackie-Jiang June 24, 2024 11:01
@gortiz
Copy link
Contributor Author

gortiz commented Aug 28, 2024

I think I've applied most, if not all, of the suggestions. Please take another look at this PR; it would be great to merge it soon.

@yashmayya yashmayya (Contributor) left a comment:

Thanks for this improvement @gortiz, the numbers look super impressive! I've left some minor comments and questions (many of which are simply to improve my understanding of these areas in Pinot 😄).

Comment on lines 92 to 94
```java
} catch (AssertionError e) {
  throw new AssertionError(
      "Error comparing Row/Column Block at (" + rowId + "," + colId + ") of Type: " + columnDataType + "!", e);
```
Contributor:

What's the purpose of this catch block?

gortiz (Contributor, Author):

TBH I don't remember. Probably to catch errors produced by assertions in DataBlockTestUtils.getElement. I can change the code to make it clearer.

```java
}

@Test(dataProvider = "blocks")
void testSerde(String desc, DataBlock block) {
```
Contributor:

Could we also add tests for RowDataBlock / ColumnarDataBlock and if possible, including some edge cases like null dictionary, null data schema etc.?

gortiz (Contributor, Author):

It would be great, but it is not easy to create actual row and columnar blocks here, given that DataBlockBuilder lives in pinot-core. I remember trying to move it to pinot-common, but I guess I didn't commit the change because there was some issue (maybe another dependency on pinot-core).

Instead, we test these serde properties in DataBlockSerdeTest. That is not great, but it is easier to implement right now.

including some edge cases like null dictionary, null data schema etc.?

I don't think these are valid cases for row/column blocks

@gortiz gortiz force-pushed the multi-stage-serde branch from ef88e9d to 83b6790 Compare August 30, 2024 11:46

```java
static {
  SERDES = new EnumMap<>(DataBlockSerde.Version.class);
  SERDES.put(DataBlockSerde.Version.V1_V2, new ZeroCopyDataBlockSerde());
```
Collaborator:

`V1_V2` is hard-coded here and in `ZeroCopyDataBlockSerde.getVersion`. Change to `SERDES.put(ZeroCopyDataBlockSerde.VERSION, new ZeroCopyDataBlockSerde())`?

Also, do you plan to make this configurable in the future, or punt on it until the next version is added?

gortiz (Contributor, Author):

If we end up having more serdes, we will probably add more entries to the map and/or add the ability to choose between them via configuration.

For example, we could create a better implementation of the V1_V2 format. Then we could keep both serdes and switch from one to the other via configuration. If we create new protocol versions (say, a V3), we would add them here, although we would need a way to decide which version to use (as explained in some doc, probably just setting the version as a config while the cluster is heterogeneous).
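A minimal sketch of this version-keyed registry idea, following the reviewer's suggestion of registering via the serde's own version constant so the version is not hard-coded in two places. All names here are illustrative, not the actual Pinot classes.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical version-keyed serde registry; Pinot's real
// DataBlockSerde interface and versions differ.
public class SerdeRegistrySketch {
  public enum Version { V1_V2 }

  public interface Serde {
    Version getVersion();
  }

  public static class ZeroCopySerde implements Serde {
    // Keeping the version as a constant on the serde avoids hard-coding
    // V1_V2 both here and at the registration site.
    public static final Version VERSION = Version.V1_V2;

    @Override
    public Version getVersion() {
      return VERSION;
    }
  }

  private static final Map<Version, Serde> SERDES = new EnumMap<>(Version.class);

  static {
    // Register via the serde's own constant so the two stay in sync.
    SERDES.put(ZeroCopySerde.VERSION, new ZeroCopySerde());
  }

  public static Serde forVersion(Version version) {
    Serde serde = SERDES.get(version);
    if (serde == null) {
      throw new IllegalArgumentException("No serde registered for " + version);
    }
    return serde;
  }
}
```

Making the chosen version a cluster config, as discussed above, would then only need to change the key passed to the lookup.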

```diff
@@ -74,22 +77,26 @@ public static void main(String[] args)
         .addProfiler(GCProfiler.class));
   }

-  @Param(value = {"INT", "LONG", "STRING", "BYTES", "BIG_DECIMAL", "LONG_ARRAY", "STRING_ARRAY"})
+  // @Param(value = {"INT", "LONG", "STRING", "BYTES", "BIG_DECIMAL", "LONG_ARRAY", "STRING_ARRAY"})
```
Collaborator:

nit: Should this be removed or reinstated?

Why are STRING_ARRAY and BIG_DECIMAL removed?

gortiz (Contributor, Author):

We had the same discussion below:

> I think it is useful to keep them there for easy access to other common configurations. The current configuration (always use 10k rows and only the {"INT", "LONG", "STRING", "BYTES", "LONG_ARRAY"} types) is useful, but keeping the larger list of params and the other row counts may also be useful in more extensive tests.

@gortiz gortiz merged commit e5df02c into apache:master Sep 18, 2024
21 checks passed
@gortiz gortiz deleted the multi-stage-serde branch September 18, 2024 10:08
@abhioncbr abhioncbr added the release-notes Referenced by PRs that need attention when compiling the next release notes label Jan 14, 2025
Labels: enhancement, multi-stage (Related to the multi-stage query engine), performance, release-notes (Referenced by PRs that need attention when compiling the next release notes)
8 participants