
Conversation


@gortiz gortiz commented Jun 3, 2024

This PR includes several changes to the code that builds, serializes, and deserializes DataBlocks, in order to improve performance.

The changes here do not modify the binary format (the included tests verify that). Instead, I've changed how the code works to reduce allocations and copies. I'm sure more can be done to improve performance without breaking the binary format, and even more could be done if we decide to break the format.

The PR includes four benchmarks: one that builds a DataBlock from a List&lt;Object[]&gt;, one that serializes that DataBlock, one that deserializes it, and one that does all three in sequence.
The old version of this benchmark is in dbeeaaf.
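To illustrate the three phases the benchmarks exercise, here is a minimal, hypothetical round trip for a single INT column. This is only a sketch of the build/serialize/deserialize shape; the real benchmarks use Pinot's DataBlockBuilder and serdes across all the listed types, and none of these class or method names come from the PR.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the three benchmark phases for one INT column.
public class BenchmarkPhasesSketch {

  // Phase 1: "build" — collapse List<Object[]> rows into a primitive column.
  public static int[] build(List<Object[]> rows) {
    int[] column = new int[rows.size()];
    for (int i = 0; i < rows.size(); i++) {
      column[i] = (Integer) rows.get(i)[0];
    }
    return column;
  }

  // Phase 2: "serialize" — a length prefix followed by big-endian ints.
  public static byte[] serialize(int[] column) {
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * (column.length + 1));
    buf.putInt(column.length);
    for (int v : column) {
      buf.putInt(v);
    }
    return buf.array();
  }

  // Phase 3: "deserialize" — read the column back from the bytes.
  public static int[] deserialize(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int[] column = new int[buf.getInt()];
    for (int i = 0; i < column.length; i++) {
      column[i] = buf.getInt();
    }
    return column;
  }

  public static void main(String[] args) {
    List<Object[]> rows = new ArrayList<>();
    rows.add(new Object[]{7});
    rows.add(new Object[]{42});
    int[] back = deserialize(serialize(build(rows)));
    System.out.println(back[0] + " " + back[1]); // prints "7 42"
  }
}
```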

The results of the benchmark that builds, serializes, and deserializes are:

| Benchmark | _dataType | _rows | _blockType | _version | ops/ms | MB/op |
|---|---|---|---|---|---|---|
| all | BIG_DECIMAL | 10000 | COLUMNAR | old | 6.84 | 2.51 MB |
| all | BIG_DECIMAL | 10000 | COLUMNAR | new | 10.13 | 1.13 MB |
| all | BIG_DECIMAL | 10000 | ROW | old | 6.25 | 2.50 MB |
| all | BIG_DECIMAL | 10000 | ROW | new | 9.26 | 1.13 MB |
| all | BIG_DECIMAL | 1000000 | COLUMNAR | old | 0.04 | 257.40 MB |
| all | BIG_DECIMAL | 1000000 | COLUMNAR | new | 0.09 | 111.71 MB |
| all | BIG_DECIMAL | 1000000 | ROW | old | 0.04 | 257.39 MB |
| all | BIG_DECIMAL | 1000000 | ROW | new | 0.08 | 111.71 MB |
| all | BYTES | 10000 | COLUMNAR | old | 3.54 | 5.23 MB |
| all | BYTES | 10000 | COLUMNAR | new | 17.34 | 1.09 MB |
| all | BYTES | 10000 | ROW | old | 3.54 | 5.22 MB |
| all | BYTES | 10000 | ROW | new | 15.77 | 1.09 MB |
| all | BYTES | 1000000 | COLUMNAR | old | 0.02 | 551.55 MB |
| all | BYTES | 1000000 | COLUMNAR | new | 0.11 | 108.16 MB |
| all | BYTES | 1000000 | ROW | old | 0.02 | 551.54 MB |
| all | BYTES | 1000000 | ROW | new | 0.10 | 108.16 MB |
| all | INT | 10000 | COLUMNAR | old | 65.15 | 0.28 MB |
| all | INT | 10000 | COLUMNAR | new | 177.29 | 0.07 MB |
| all | INT | 10000 | ROW | old | 25.27 | 0.31 MB |
| all | INT | 10000 | ROW | new | 64.06 | 0.07 MB |
| all | INT | 1000000 | COLUMNAR | old | 0.34 | 33.36 MB |
| all | INT | 1000000 | COLUMNAR | new | 0.85 | 4.84 MB |
| all | INT | 1000000 | ROW | old | 0.21 | 37.35 MB |
| all | INT | 1000000 | ROW | new | 0.49 | 4.84 MB |
| all | LONG | 10000 | COLUMNAR | old | 37.42 | 0.52 MB |
| all | LONG | 10000 | COLUMNAR | new | 146.98 | 0.11 MB |
| all | LONG | 10000 | ROW | old | 23.11 | 0.51 MB |
| all | LONG | 10000 | ROW | new | 65.48 | 0.11 MB |
| all | LONG | 1000000 | COLUMNAR | old | 0.17 | 65.36 MB |
| all | LONG | 1000000 | COLUMNAR | new | 0.64 | 8.84 MB |
| all | LONG | 1000000 | ROW | old | 0.15 | 65.35 MB |
| all | LONG | 1000000 | ROW | new | 0.43 | 8.84 MB |
| all | LONG_ARRAY | 10000 | COLUMNAR | old | 3.68 | 4.51 MB |
| all | LONG_ARRAY | 10000 | COLUMNAR | new | 8.73 | 0.90 MB |
| all | LONG_ARRAY | 10000 | ROW | old | 3.59 | 4.50 MB |
| all | LONG_ARRAY | 10000 | ROW | new | 10.21 | 0.90 MB |
| all | LONG_ARRAY | 1000000 | COLUMNAR | old | 0.02 | 479.52 MB |
| all | LONG_ARRAY | 1000000 | COLUMNAR | new | 0.09 | 88.62 MB |
| all | LONG_ARRAY | 1000000 | ROW | old | 0.02 | 479.51 MB |
| all | LONG_ARRAY | 1000000 | ROW | new | 0.09 | 88.62 MB |
| all | STRING | 10000 | COLUMNAR | old | 24.28 | 0.47 MB |
| all | STRING | 10000 | COLUMNAR | new | 30.64 | 0.08 MB |
| all | STRING | 10000 | ROW | old | 16.61 | 0.34 MB |
| all | STRING | 10000 | ROW | new | 23.92 | 0.08 MB |
| all | STRING | 1000000 | COLUMNAR | old | 0.18 | 49.38 MB |
| all | STRING | 1000000 | COLUMNAR | new | 0.26 | 4.86 MB |
| all | STRING | 1000000 | ROW | old | 0.14 | 37.37 MB |
| all | STRING | 1000000 | ROW | new | 0.21 | 20.86 MB |
| all | STRING_ARRAY | 10000 | COLUMNAR | old | 1.94 | 4.02 MB |
| all | STRING_ARRAY | 10000 | COLUMNAR | new | 2.92 | 1.96 MB |
| all | STRING_ARRAY | 10000 | ROW | old | 1.84 | 4.01 MB |
| all | STRING_ARRAY | 10000 | ROW | new | 2.86 | 1.96 MB |
| all | STRING_ARRAY | 1000000 | COLUMNAR | old | 0.01 | 412.55 MB |
| all | STRING_ARRAY | 1000000 | COLUMNAR | new | 0.02 | 192.53 MB |
| all | STRING_ARRAY | 1000000 | ROW | old | 0.01 | 412.54 MB |
| all | STRING_ARRAY | 1000000 | ROW | new | 0.02 | 192.53 MB |

As you can see, throughput improves by roughly 1.5x to 5x, and the reduction in allocation is even larger. This is especially important because, while this benchmark runs on a machine with plenty of memory, a high allocation rate can be problematic in production: a single query allocating too much may heavily affect the latency of other queries due to GC (even with non-blocking GCs!).

One of the key techniques behind the performance gain is the use of special types of buffers and streams. To keep this PR smaller, I created #13304, which contains only the buffer and stream changes. We can merge that PR before merging this one.

TODO:

  • Clean up the code
  • Create tests that verify the binary format is not broken

@gortiz gortiz mentioned this pull request Jun 3, 2024
@codecov-commenter

codecov-commenter commented Jun 3, 2024

Codecov Report

Attention: Patch coverage is 53.10263% with 393 lines in your changes missing coverage. Please review.

Project coverage is 57.84%. Comparing base (59551e4) to head (bd6d9e5).
Report is 1119 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...apache/pinot/common/datablock/DataBlockEquals.java | 6.93% | 157 Missing and 4 partials ⚠️ |
| ...g/apache/pinot/common/datablock/BaseDataBlock.java | 7.86% | 81 Missing and 1 partial ⚠️ |
| ...pinot/common/datablock/ZeroCopyDataBlockSerde.java | 62.55% | 66 Missing and 16 partials ⚠️ |
| .../apache/pinot/common/datablock/DataBlockUtils.java | 61.70% | 15 Missing and 3 partials ⚠️ |
| ...java/org/apache/pinot/common/utils/DataSchema.java | 0.00% | 14 Missing ⚠️ |
| ...e/pinot/segment/spi/memory/CompoundDataBuffer.java | 30.76% | 9 Missing ⚠️ |
| .../pinot/core/common/datablock/DataBlockBuilder.java | 96.38% | 3 Missing and 5 partials ⚠️ |
| ...rg/apache/pinot/common/datablock/RowDataBlock.java | 0.00% | 4 Missing ⚠️ |
| ...ache/pinot/common/datablock/ColumnarDataBlock.java | 0.00% | 3 Missing ⚠️ |
| .../apache/pinot/common/datablock/DataBlockSerde.java | 70.00% | 2 Missing and 1 partial ⚠️ |
| ... and 4 more | | |
Additional details and impacted files
```
@@             Coverage Diff              @@
##             master   #13303      +/-   ##
============================================
- Coverage     61.75%   57.84%   -3.91%
- Complexity      207      219      +12
============================================
  Files          2436     2615     +179
  Lines        133233   143536   +10303
  Branches      20636    22053    +1417
============================================
+ Hits          82274    83028     +754
- Misses        44911    54020    +9109
- Partials       6048     6488     +440
```
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 57.82% <53.10%> (-3.89%) ⬇️ |
| java-21 | 57.73% <53.10%> (-3.89%) ⬇️ |
| skip-bytebuffers-false | 57.84% <53.10%> (-3.91%) ⬇️ |
| skip-bytebuffers-true | 57.69% <53.10%> (+29.96%) ⬆️ |
| temurin | 57.84% <53.10%> (-3.91%) ⬇️ |
| unittests | 57.84% <53.10%> (-3.91%) ⬇️ |
| unittests1 | 40.75% <53.10%> (-6.14%) ⬇️ |
| unittests2 | 27.86% <0.00%> (+0.13%) ⬆️ |

Flags with carried forward coverage won't be shown.


@gortiz gortiz force-pushed the multi-stage-serde branch 3 times, most recently from 270f679 to 2152e36 Compare June 4, 2024 08:31
Comment on lines -32 to -41
```java
public void testSerdeCorrectness(BaseDataBlock dataBlock)
    throws IOException {
  byte[] bytes = dataBlock.toBytes();
  ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
  int versionType = DataBlockUtils.readVersionType(byteBuffer);
  BaseDataBlock deserialize = deserialize(byteBuffer, versionType);

  assertEquals(byteBuffer.position(), bytes.length, "Buffer position should be at the end of the buffer");
  assertEquals(deserialize, dataBlock, "Deserialized data block should be the same as the original data block");
}
```
gortiz (Contributor, Author):

Serde correctness is now verified in the DataBlockSerde tests.

Comment on lines -190 to +154
```java
  rowBuilder._fixedSizeDataByteArrayOutputStream.write(byteBuffer.array(), 0, byteBuffer.position());
}

CompoundDataBuffer.Builder varBufferBuilder = new CompoundDataBuffer.Builder(ByteOrder.BIG_ENDIAN, true)
    .addPagedOutputStream(varSize);
```
gortiz (Contributor, Author):

This is probably the biggest performance improvement when creating the block. The old version allocated a large array (which is expensive, as it may fall outside the TLAB) and then copied it into the ArrayOutputStream, which probably allocated that amount of bytes again.

Now we simply reuse the whole byte buffer, adding it to the builder, which is essentially a list of byte buffers that can later be used to send the data over the network or to read the data directly in another local stage.
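As a rough sketch of the difference (illustrative names only, not Pinot's actual CompoundDataBuffer API): instead of flattening every buffer into one large array up front, the builder can adopt the buffers as-is and only materialize a contiguous copy if a consumer really needs one.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the zero-copy idea: keep a list of the original
// buffers instead of eagerly flattening them into one big byte[].
// Names are hypothetical, not Pinot's actual classes.
public class CompoundBufferSketch {
  private final List<ByteBuffer> _buffers = new ArrayList<>();
  private int _totalBytes = 0;

  // Zero-copy: adopt the buffer as-is (caller must not mutate it afterwards).
  public CompoundBufferSketch add(ByteBuffer buffer) {
    _buffers.add(buffer.asReadOnlyBuffer());
    _totalBytes += buffer.remaining();
    return this;
  }

  public int totalBytes() {
    return _totalBytes;
  }

  // Copying happens at most once, and only when a contiguous view is truly
  // required (e.g. handing the data to a layer that needs a single array).
  public byte[] toContiguous() {
    byte[] result = new byte[_totalBytes];
    int offset = 0;
    for (ByteBuffer buf : _buffers) {
      ByteBuffer dup = buf.duplicate();
      int len = dup.remaining();
      dup.get(result, offset, len);
      offset += len;
    }
    return result;
  }

  public static void main(String[] args) {
    CompoundBufferSketch builder = new CompoundBufferSketch();
    builder.add(ByteBuffer.wrap(new byte[]{1, 2, 3}))
        .add(ByteBuffer.wrap(new byte[]{4, 5}));
    System.out.println(builder.totalBytes());      // 5
    System.out.println(builder.toContiguous()[3]); // 4
  }
}
```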

```java
  }
  columnarBuilder._fixedSizeDataByteArrayOutputStream.write(byteBuffer.array(), 0, byteBuffer.position());
```
gortiz (Contributor, Author):

Here again we were copying bytes. In this case the allocation is smaller, but still problematic.

```java
  // @Param(value = {"0", "10", "90"})
  int _nullPerCent = 10;

  // @Param(value = {"direct_small", "heap_small"})
```
gortiz (Contributor, Author):

This can be used to study performance with different allocators. heap_small is the fastest by a large margin.

@gortiz gortiz marked this pull request as ready for review June 10, 2024 12:10
@gortiz gortiz force-pushed the multi-stage-serde branch from 5033ea4 to 5332e3f Compare June 14, 2024 10:29
@gortiz gortiz requested a review from Jackie-Jiang June 24, 2024 11:01
@gortiz
Copy link
Contributor Author

gortiz commented Aug 28, 2024

I think I've applied most, if not all, of the suggestions. Please take another look at this PR; it would be great to merge it soon.

@yashmayya yashmayya (Contributor) left a comment:

Thanks for this improvement @gortiz, the numbers look super impressive! I've left some minor comments and questions (many of which are simply to improve my understanding of these areas in Pinot 😄).

Comment on lines 92 to 94
```java
} catch (AssertionError e) {
  throw new AssertionError(
      "Error comparing Row/Column Block at (" + rowId + "," + colId + ") of Type: " + columnDataType + "!", e);
```
Contributor:

What's the purpose of this catch block?

gortiz (Contributor, Author):

TBH I don't remember. Probably to catch errors produced by assertions in DataBlockTestUtils.getElement. I can change the code to make it clearer.

```java
}

@Test(dataProvider = "blocks")
void testSerde(String desc, DataBlock block) {
```
Contributor:

Could we also add tests for RowDataBlock / ColumnarDataBlock and if possible, including some edge cases like null dictionary, null data schema etc.?

gortiz (Contributor, Author):

It would be great, but it is not easy to create actual row and columnar blocks here, given that DataBlockBuilder lives in pinot-core. I remember trying to move it to pinot-common, but I guess I didn't commit the change because there was some issue (maybe another dependency on pinot-core).

Instead, we test these serde properties in DataBlockSerdeTest. That is not great, but it is easier to implement right now.

including some edge cases like null dictionary, null data schema etc.?

I don't think these are valid cases for row/column blocks

@gortiz gortiz force-pushed the multi-stage-serde branch from ef88e9d to 83b6790 Compare August 30, 2024 11:46

```java
static {
  SERDES = new EnumMap<>(DataBlockSerde.Version.class);
  SERDES.put(DataBlockSerde.Version.V1_V2, new ZeroCopyDataBlockSerde());
```
Collaborator:

`V1_V2` is hard-coded here and in `ZeroCopyDataBlockSerde.getVersion`. Change to `SERDES.put(ZeroCopyDataBlockSerde.VERSION, new ZeroCopyDataBlockSerde())`?

Also, do you plan to make this configurable in the future, or punt on it until the next version is added?

gortiz (Contributor, Author):

If we end up having more serdes, we will probably add more entries to the map and/or add the ability to choose between them via configuration.

For example, we could create a better implementation of the V1_V2 format. Then we could keep both serdes and switch from one to the other via configuration. If we create new protocol versions (say, a V3), we would add them here, although we would need a way to decide which version to use (as explained in some doc, probably just setting the version as a config while the cluster is heterogeneous).
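A minimal sketch of this version-keyed registry idea, following the reviewer's suggestion of registering via the serde's own version constant so the version is not hard-coded in two places. All names here are illustrative, not the actual Pinot classes.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical version-keyed serde registry; Pinot's real
// DataBlockSerde interface and versions differ.
public class SerdeRegistrySketch {
  public enum Version { V1_V2 }

  public interface Serde {
    Version getVersion();
  }

  public static class ZeroCopySerde implements Serde {
    // Keeping the version as a constant on the serde avoids hard-coding
    // V1_V2 both here and at the registration site.
    public static final Version VERSION = Version.V1_V2;

    @Override
    public Version getVersion() {
      return VERSION;
    }
  }

  private static final Map<Version, Serde> SERDES = new EnumMap<>(Version.class);

  static {
    // Register via the serde's own constant so the two stay in sync.
    SERDES.put(ZeroCopySerde.VERSION, new ZeroCopySerde());
  }

  public static Serde forVersion(Version version) {
    Serde serde = SERDES.get(version);
    if (serde == null) {
      throw new IllegalArgumentException("No serde registered for " + version);
    }
    return serde;
  }
}
```

Making the chosen version a cluster config, as discussed above, would then only need to change the key passed to the lookup.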

```diff
@@ -74,22 +77,26 @@ public static void main(String[] args)
         .addProfiler(GCProfiler.class));
   }

-  @Param(value = {"INT", "LONG", "STRING", "BYTES", "BIG_DECIMAL", "LONG_ARRAY", "STRING_ARRAY"})
+  // @Param(value = {"INT", "LONG", "STRING", "BYTES", "BIG_DECIMAL", "LONG_ARRAY", "STRING_ARRAY"})
```
Collaborator:

nit: Should this be removed or reinstated?

Why are STRING_ARRAY and BIG_DECIMAL removed?

gortiz (Contributor, Author):

We had the same discussion below:

> I think it is useful to keep them there for easy access to other common configurations. The current configuration (always use 10k rows and only the {"INT", "LONG", "STRING", "BYTES", "LONG_ARRAY"} types) is useful, but keeping the larger list of params and the other row counts may also be useful in more extensive tests.

@gortiz gortiz merged commit e5df02c into apache:master Sep 18, 2024
21 checks passed
@gortiz gortiz deleted the multi-stage-serde branch September 18, 2024 10:08
@abhioncbr abhioncbr added the release-notes Referenced by PRs that need attention when compiling the next release notes label Jan 14, 2025
Labels: enhancement, multi-stage (Related to the multi-stage query engine), performance, release-notes (Referenced by PRs that need attention when compiling the next release notes)
8 participants