[CK_TILE] Fixing Type Conversions in PassThroughPack8 #2769

SamiAario-AMD · 2025-09-02T12:12:11Z

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

SamiAario-AMD · 2025-09-05T12:23:28Z

Most of the work in this branch went into fixing the type conversions in PassThroughPack8. I added "baseline" converters from pk_int4 to fp8_t, bf16_t and bf8_t that go through an intermediate conversion to float. Although this passes validation, it is likely to perform poorly. I therefore looked into a faster conversion based on a lookup table: since a pk_int4 element can take only 16 values, the register footprint is small, and performance is likely to be optimal or close to it, provided the lookup table is generated at compile time by declaring it constexpr. This works for bf16 but not for fp8 or bf8, because compiler support is currently lacking. I did, however, leave the commented-out implementations of these two non-working solutions in the code so that they can be adopted once compiler support exists. I also checked that they pass validation when they are not declared constexpr.

I did not modify the existing conversion from pk_int4 to fp16 since it passes validation, but it is probably worthwhile to compare its performance with a lookup-table-based one.
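The lookup-table idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in the PR: the names `small_int_to_bf16_bits`, `make_bf16_table` and `bf16_table` are hypothetical, and the mapping from a 4-bit code `n` to the value `n - 8` is an assumption about the encoding. The sketch builds the bf16 bit patterns arithmetically so that it needs no bit casting at all, whereas the PR (per its commit messages) uses `float_to_bf16_rtn_raw` for this step.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: encode a small integer (|v| <= 8) as raw bf16 bits.
// bf16 layout: 1 sign bit, 8 exponent bits (bias 127), 7 mantissa bits.
// Every integer in [-8, 7] is exactly representable, so no rounding occurs.
constexpr std::uint16_t small_int_to_bf16_bits(int v)
{
    if(v == 0)
        return 0;
    const unsigned sign = v < 0 ? 1u : 0u;
    const unsigned mag  = static_cast<unsigned>(v < 0 ? -v : v);
    unsigned e = 0; // floor(log2(mag))
    while((mag >> (e + 1)) != 0)
        ++e;
    const unsigned mantissa = (mag << (7 - e)) & 0x7fu; // drop the implicit 1
    return static_cast<std::uint16_t>((sign << 15) | ((127u + e) << 7) | mantissa);
}

// Assumption for illustration: the 4-bit code n stands for the value n - 8.
constexpr std::array<std::uint16_t, 16> make_bf16_table()
{
    std::array<std::uint16_t, 16> table{};
    for(std::size_t n = 0; n < 16; ++n)
        table[n] = small_int_to_bf16_bits(static_cast<int>(n) - 8);
    return table;
}

// The whole table is evaluated at compile time.
constexpr auto bf16_table = make_bf16_table();

static_assert(bf16_table[8] == 0x0000, "0.0 in bf16");
static_assert(bf16_table[9] == 0x3f80, "1.0 in bf16");
static_assert(bf16_table[0] == 0xc100, "-8.0 in bf16");
```

Because the table is a constant with only 16 entries, a conversion reduces to masking out a nibble and indexing, which is what makes the approach attractive compared with a round trip through float.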

SamiAario-AMD · 2025-09-08T14:53:24Z

I should mention that bf8 x pk_i4 was not part of the ticket, but adding it was straightforward using the same approach as for fp8 x pk_i4.


Copilot

Pull Request Overview

This PR fixes type conversion issues in the PassThroughPack8 implementation by correcting bit shifting operations, implementing constexpr lookup tables for more reliable data type conversions, and updating test files to properly handle boolean return values from the run_gemm_combinations function.

Updates the return type of run_gemm_combinations from int to bool and fixes return value handling in test files
Fixes incorrect bit shifting in PassThroughPack8::operator() for bf16x8_t conversion
Implements constexpr lookup table alternatives for fp8, bf8, and bf16 conversions to improve reliability
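As a rough illustration of the first point, a test `main` that receives a boolean result has to translate it into a process exit code explicitly. The sketch below is an assumption about the general pattern, not the PR's test code; `run_gemm_combinations_stub` and `exit_code_from` are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for the test helper: the real function runs the GEMM
// configurations and reports whether all of them passed.
bool run_gemm_combinations_stub(bool all_passed) { return all_passed; }

// Map the boolean result onto an exit code. Success must become 0;
// returning the bool itself from main would report 1 on success.
int exit_code_from(bool passed) { return passed ? EXIT_SUCCESS : EXIT_FAILURE; }
```

A test `main` would then end with something like `return exit_code_from(run_gemm_combinations_stub(true));`.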

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.


File	Description
test/ck_tile/gemm/test_gemm_pipeline_universal_run_test.inc	Changes return type and logic for run_gemm_combinations
test/ck_tile/gemm/test_gemm_pipeline_basic_run_test.inc	Changes return type and logic for run_gemm_combinations
test/ck_tile/gemm/*.cpp	Updates main functions to handle boolean return values properly
include/ck_tile/ops/elementwise/unary_element_wise_operation.hpp	Fixes bit shift bug and adds constexpr lookup table implementations
include/ck_tile/core/numeric/bfloat16.hpp	Replaces union-based bit casting with constexpr bit_cast
test/ck_tile/gemm/test_gemm_pipeline_smoke_util.hpp	Adds new type configuration specializations
test/ck_tile/elementwise/CMakeLists.txt	Removes incorrect conditional compilation logic


Copilot · 2025-09-23T10:21:06Z

include/ck_tile/ops/elementwise/unary_element_wise_operation.hpp

    {
        y.lo = i4_to_bhalf4(bit_cast<int>(x));
-        y.hi = i4_to_bhalf4(bit_cast<int>(x) >> 16);
+        y.hi = i4_to_bhalf4(bit_cast<int>(x) >> 8);


The bit shift should be 16, not 8. The original code had >> 16 which correctly extracts the upper 16 bits for the high half of the bf16x8_t. Shifting by 8 will cause incorrect data extraction.

Suggested change

-        y.hi = i4_to_bhalf4(bit_cast<int>(x) >> 8);
+        y.hi = i4_to_bhalf4(bit_cast<int>(x) >> 16);

SamiAario-AMD · 2025-09-24

The new i4_to_bhalf4 was modeled after the existing i4_to_half4 function. Both use the same layout and data extraction for pk_int4_t, and therefore the new operator for bf16x8_t shifts by the same amount as the existing operator for fp16x8_t (see lines #355 to #359).

Validation for both functions was added in test_gemm_pipeline_universal_bf16.cpp and test_gemm_pipeline_universal_fp16.cpp.
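The choice of shift can be checked with a small compile-time calculation. The sketch assumes, following the layout described later in this thread, that one call reads the nibbles at bit offsets 0, 4, 16 and 20 of its argument; the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: one call reads the nibbles at bit offsets 0, 4, 16 and 20,
// i.e. the bits selected by this mask.
constexpr std::uint32_t nibbles_read_by_one_call = 0x00ff00ffu;

// Bits of the ORIGINAL 32-bit word that a call sees when its argument has
// been shifted right by `shift`.
constexpr std::uint32_t bits_covered(unsigned shift)
{
    return nibbles_read_by_one_call << shift;
}

// lo call on x, hi call on x >> 8: all eight nibbles are read exactly once.
constexpr std::uint32_t covered_with_shift_8 = bits_covered(0) | bits_covered(8);

// lo call on x, hi call on x >> 16: the nibbles at 8, 12, 24 and 28 are never read.
constexpr std::uint32_t covered_with_shift_16 = bits_covered(0) | bits_covered(16);

static_assert(covered_with_shift_8 == 0xffffffffu, "shift by 8 covers the word");
static_assert(covered_with_shift_16 == 0x00ff00ffu, "shift by 16 leaves gaps");
```

Under this assumption a shift of 16 would read the nibbles at offsets 16 and 20 a second time and skip half of the input, which is consistent with the author's explanation for shifting by 8.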

Copilot · 2025-09-23T10:21:07Z

include/ck_tile/ops/elementwise/unary_element_wise_operation.hpp

+                    bf16_lookup_table[(q >> 16) & 0xf],
+                    bf16_lookup_table[(q >> 4) & 0xf],
+                    bf16_lookup_table[(q >> 20) & 0xf]};


The bit extraction pattern is inconsistent with the expected packed int4 layout. The shifts should be 0, 4, 8, 12 to extract consecutive 4-bit values, not 0, 16, 4, 20. This will result in incorrect value extraction from the packed integer.

Suggested change

-                    bf16_lookup_table[(q >> 16) & 0xf],
-                    bf16_lookup_table[(q >> 4) & 0xf],
-                    bf16_lookup_table[(q >> 20) & 0xf]};
+                    bf16_lookup_table[(q >> 4) & 0xf],
+                    bf16_lookup_table[(q >> 8) & 0xf],
+                    bf16_lookup_table[(q >> 12) & 0xf]};

SamiAario-AMD · 2025-09-24

Again, this is based on the existing layout and data extraction for pk_int4_t used by i4_to_half4. Lines #49 and #50 of i4_to_half4 correspond to the behavior here: the LO is formed from the offsets 0 and 16, and the HI from the offsets 4 and 20.
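To make the resulting element order concrete, the following sketch extracts the eight nibbles of a 32-bit word in the order described above. It is an illustration only: `extract4` and `extract8` are hypothetical helpers, and the combination of a first call on `q` and a second call on `q >> 8` is taken from the operator discussed earlier in this review.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the described order: the lo pair comes from
// bit offsets 0 and 16, the hi pair from bit offsets 4 and 20.
constexpr std::array<std::uint32_t, 4> extract4(std::uint32_t q)
{
    return {{(q >> 0) & 0xf, (q >> 16) & 0xf, (q >> 4) & 0xf, (q >> 20) & 0xf}};
}

// Eight values: one call on q, a second call on q >> 8.
constexpr std::array<std::uint32_t, 8> extract8(std::uint32_t q)
{
    const auto lo = extract4(q);
    const auto hi = extract4(q >> 8);
    return {{lo[0], lo[1], lo[2], lo[3], hi[0], hi[1], hi[2], hi[3]}};
}

// Nibble k of 0x76543210 holds the value k, so the result shows which
// nibble position feeds each output element.
constexpr auto order = extract8(0x76543210u);

static_assert(order[0] == 0 && order[1] == 4 && order[2] == 1 && order[3] == 5, "");
static_assert(order[4] == 2 && order[5] == 6 && order[6] == 3 && order[7] == 7, "");
```

The output order 0, 4, 1, 5, 2, 6, 3, 7 shows that the extraction is interleaved rather than consecutive, so shifts of 0, 4, 8, 12 would only be right for a different packing of the input.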

Copilot · 2025-09-23T10:21:07Z

include/ck_tile/ops/elementwise/unary_element_wise_operation.hpp

+                   fp8_lookup_table[(q >> 16) & 0xf],
+                   fp8_lookup_table[(q >> 4) & 0xf],
+                   fp8_lookup_table[(q >> 20) & 0xf]};


The bit extraction pattern is inconsistent with the expected packed int4 layout. The shifts should be 0, 4, 8, 12 to extract consecutive 4-bit values, not 0, 16, 4, 20. This will result in incorrect value extraction from the packed integer.

Suggested change

-                   fp8_lookup_table[(q >> 16) & 0xf],
-                   fp8_lookup_table[(q >> 4) & 0xf],
-                   fp8_lookup_table[(q >> 20) & 0xf]};
+                   fp8_lookup_table[(q >> 4) & 0xf],
+                   fp8_lookup_table[(q >> 8) & 0xf],
+                   fp8_lookup_table[(q >> 12) & 0xf]};

SamiAario-AMD · 2025-09-24

The situation here is identical to the one above.

Copilot · 2025-09-23T10:21:07Z

include/ck_tile/ops/elementwise/unary_element_wise_operation.hpp

+                   bf8_lookup_table[(q >> 16) & 0xf],
+                   bf8_lookup_table[(q >> 4) & 0xf],
+                   bf8_lookup_table[(q >> 20) & 0xf]};


The bit extraction pattern is inconsistent with the expected packed int4 layout. The shifts should be 0, 4, 8, 12 to extract consecutive 4-bit values, not 0, 16, 4, 20. This will result in incorrect value extraction from the packed integer.

Suggested change

-                   bf8_lookup_table[(q >> 16) & 0xf],
-                   bf8_lookup_table[(q >> 4) & 0xf],
-                   bf8_lookup_table[(q >> 20) & 0xf]};
+                   bf8_lookup_table[(q >> 4) & 0xf],
+                   bf8_lookup_table[(q >> 8) & 0xf],
+                   bf8_lookup_table[(q >> 12) & 0xf]};

SamiAario-AMD · 2025-09-24

This is another identical instance of the existing layout for pk_int4_t being used.


SamiAario-AMD force-pushed the LWPCK-3548 branch 4 times, most recently from 3c9798c to 6d7c174 Compare September 5, 2025 11:55

SamiAario-AMD force-pushed the LWPCK-3548 branch 2 times, most recently from 2ded46c to d6b990c Compare September 5, 2025 17:23

SamiAario-AMD marked this pull request as ready for review September 8, 2025 09:58

SamiAario-AMD requested review from illsilin, carlushuang, qianfengz, aosewski, poyenc, geyyer, bartekxk, andriy-ca, afagaj, asleepzzz, tenpercent, ThomasNing, coderfeli, shumway and vidyasagar-amd as code owners September 8, 2025 09:58

SamiAario-AMD force-pushed the LWPCK-3548 branch from e904753 to d960cdb Compare September 9, 2025 06:27

SamiAario-AMD added 5 commits September 10, 2025 10:19

Change the return type of run_gemm_combinations in the basic tests

2dfbb22

Change the return type of run_gemm_combinations in the universal tests

cf35f78

Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4

1b115f3

Add universal GEMM test for fp8 x pk_i4

022d658

Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4.

729f631

SamiAario-AMD requested a review from aska-0096 as a code owner September 12, 2025 07:51

SamiAario-AMD added 6 commits September 16, 2025 16:34

Merge branch 'develop' into LWPCK-3548

e08b285

Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary…

a786741

…_element_wise_operation.hpp

Use explicit macros for enabling and disabling the the constexpr look…

abab85e

…up based converters

Merge branch 'develop' into LWPCK-3548

00f3792

Merge branch 'develop' into LWPCK-3548

9521ec2

Fix two failing tests

c5e3701

DDEle changed the title ~~Lwpck 3548~~ [CK_TILE] Fixing Type Conversions in PassThroughPack8 Sep 18, 2025

SamiAario-AMD added 5 commits September 18, 2025 13:13

Merge branch 'develop' into LWPCK-3548

1f9d21d

Merge branch 'develop' into LWPCK-3548

c9f3e13

Merge branch 'develop' into LWPCK-3548

da546c4

Merge branch 'develop' into LWPCK-3548

9f07656

Merge branch 'develop' into LWPCK-3548

1ef29c8

bartekxk requested a review from Copilot September 23, 2025 10:19

Copilot AI reviewed Sep 23, 2025


bartekxk previously approved these changes Sep 24, 2025


SamiAario-AMD added 6 commits September 24, 2025 10:21

Merge branch 'develop' into LWPCK-3548

98f005b

Merge branch 'develop' into LWPCK-3548

95d1b3e

Merge branch 'develop' into LWPCK-3548

472c6a7

Merge branch 'develop' into LWPCK-3548

6892004

Avoid union-based type punning in float_to_bf16_rtn_raw to make it co…

ea464d7

…nstexpr compliant

Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16…

4586d30

… lookup table for use in conversions from pk_int4 to bf16

SamiAario-AMD dismissed bartekxk’s stale review via 4586d30 September 24, 2025 11:27

bartekxk previously approved these changes Sep 25, 2025


Merge branch 'develop' into LWPCK-3548

d917303

SamiAario-AMD dismissed bartekxk’s stale review via deb46a6 September 26, 2025 08:06

On ROCm 7.0.1 we need an explicit cast to from uint16_t to bf16_t

28c69dc

SamiAario-AMD force-pushed the LWPCK-3548 branch from deb46a6 to 28c69dc Compare September 26, 2025 08:09

SamiAario-AMD added 2 commits September 26, 2025 11:09

Merge branch 'develop' into LWPCK-3548

6a37228

Merge branch 'develop' into LWPCK-3548

55b5455
