Skip to content

Conversation

SamiAario-AMD
Copy link
Contributor

@SamiAario-AMD SamiAario-AMD commented Sep 2, 2025

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@SamiAario-AMD SamiAario-AMD force-pushed the LWPCK-3548 branch 4 times, most recently from 3c9798c to 6d7c174 Compare September 5, 2025 11:55
@SamiAario-AMD
Copy link
Contributor Author

Most of the work in this branch went into fixing the type conversions in PassThroughPack8. I added "baseline" converters from pkint4 to fp8_t, bf16_t and bf8_t via an intermediate conversion to float. Although this passes validation, it is likely to be poor performance-wise. This is why I looked into more performant conversion using a lookup table: since there are only 16 pkint4 values the register footprint is small and performance is likely to be optimal or close to it, provided we generate the lookup table at compile time by declaring it a constexpr. This works for bf16 but not for fp8 nor bf8 because compiler support is currently lacking. I did however leave the commented out implementations of these two non-working solutions in the code, so they can be adopted once compiler support exists. I also checked that these pass validation when they are not declared constexpr.

I did not modify the existing conversion from pkint4 to fp16 since it passes validation, but it is probably worthwhile to compare its performance to a lookup-table based one.

@SamiAario-AMD
Copy link
Contributor Author

I should mention that bf8 x pk_i4 was not part of the ticket but adding it was straightforward using the same approach as with fp8 x pk_i4.

@DDEle DDEle changed the title Lwpck 3548 [CK_TILE] Fixing Type Conversions in PassThroughPack8 Sep 18, 2025
@bartekxk bartekxk requested a review from Copilot September 23, 2025 10:19
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes type conversion issues in the PassThroughPack8 implementation by correcting bit shifting operations, implementing constexpr lookup tables for more reliable data type conversions, and updating test files to properly handle boolean return values from the run_gemm_combinations function.

  • Updates the return type of run_gemm_combinations from int to bool and fixes return value handling in test files
  • Fixes incorrect bit shifting in PassThroughPack8::operator() for bf16x8_t conversion
  • Implements constexpr lookup table alternatives for fp8, bf8, and bf16 conversions to improve reliability

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/ck_tile/gemm/test_gemm_pipeline_universal_run_test.inc Changes return type and logic for run_gemm_combinations
test/ck_tile/gemm/test_gemm_pipeline_basic_run_test.inc Changes return type and logic for run_gemm_combinations
test/ck_tile/gemm/*.cpp Updates main functions to handle boolean return values properly
include/ck_tile/ops/elementwise/unary_element_wise_operation.hpp Fixes bit shift bug and adds constexpr lookup table implementations
include/ck_tile/core/numeric/bfloat16.hpp Replaces union-based bit casting with constexpr bit_cast
test/ck_tile/gemm/test_gemm_pipeline_smoke_util.hpp Adds new type configuration specializations
test/ck_tile/elementwise/CMakeLists.txt Removes incorrect conditional compilation logic

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

{
y.lo = i4_to_bhalf4(bit_cast<int>(x));
y.hi = i4_to_bhalf4(bit_cast<int>(x) >> 16);
y.hi = i4_to_bhalf4(bit_cast<int>(x) >> 8);
Copy link
Preview

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bit shift should be 16, not 8. The original code had >> 16 which correctly extracts the upper 16 bits for the high half of the bf16x8_t. Shifting by 8 will cause incorrect data extraction.

Suggested change
y.hi = i4_to_bhalf4(bit_cast<int>(x) >> 8);
y.hi = i4_to_bhalf4(bit_cast<int>(x) >> 16);

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functionality of the new i4_to_bhalf4 was modeled after the existing i4_to_half4 function. These make use of the same layout and data extraction for pk_int4_t, and therefore the new operator for bf16x8_t shifts by the same amount as the existing operator for fp16x8_t (see lines #355 to #359).

Validation for both functions was added in test_gemm_pipeline_universal_bf16.cpp and test_gemm_pipeline_universal_fp16.cpp.

Comment on lines +174 to +176
bf16_lookup_table[(q >> 16) & 0xf],
bf16_lookup_table[(q >> 4) & 0xf],
bf16_lookup_table[(q >> 20) & 0xf]};
Copy link
Preview

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bit extraction pattern is inconsistent with the expected packed int4 layout. The shifts should be 0, 4, 8, 12 to extract consecutive 4-bit values, not 0, 16, 4, 20. This will result in incorrect value extraction from the packed integer.

Suggested change
bf16_lookup_table[(q >> 16) & 0xf],
bf16_lookup_table[(q >> 4) & 0xf],
bf16_lookup_table[(q >> 20) & 0xf]};
bf16_lookup_table[(q >> 4) & 0xf],
bf16_lookup_table[(q >> 8) & 0xf],
bf16_lookup_table[(q >> 12) & 0xf]};

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

@SamiAario-AMD SamiAario-AMD Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this is based on the existing layout and data extraction for pk_int4_t, used by i4_to_half4. Lines #49 and #50 in the function i4_to_half4 correspond to the behavior here, where the LO is formed from offsets at 0 and 16, and HI from offsets at 4 and 20.

Comment on lines +252 to +254
fp8_lookup_table[(q >> 16) & 0xf],
fp8_lookup_table[(q >> 4) & 0xf],
fp8_lookup_table[(q >> 20) & 0xf]};
Copy link
Preview

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bit extraction pattern is inconsistent with the expected packed int4 layout. The shifts should be 0, 4, 8, 12 to extract consecutive 4-bit values, not 0, 16, 4, 20. This will result in incorrect value extraction from the packed integer.

Suggested change
fp8_lookup_table[(q >> 16) & 0xf],
fp8_lookup_table[(q >> 4) & 0xf],
fp8_lookup_table[(q >> 20) & 0xf]};
fp8_lookup_table[(q >> 4) & 0xf],
fp8_lookup_table[(q >> 8) & 0xf],
fp8_lookup_table[(q >> 12) & 0xf]};

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The situation here is identical to the above.

Comment on lines +344 to +346
bf8_lookup_table[(q >> 16) & 0xf],
bf8_lookup_table[(q >> 4) & 0xf],
bf8_lookup_table[(q >> 20) & 0xf]};
Copy link
Preview

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bit extraction pattern is inconsistent with the expected packed int4 layout. The shifts should be 0, 4, 8, 12 to extract consecutive 4-bit values, not 0, 16, 4, 20. This will result in incorrect value extraction from the packed integer.

Suggested change
bf8_lookup_table[(q >> 16) & 0xf],
bf8_lookup_table[(q >> 4) & 0xf],
bf8_lookup_table[(q >> 20) & 0xf]};
bf8_lookup_table[(q >> 4) & 0xf],
bf8_lookup_table[(q >> 8) & 0xf],
bf8_lookup_table[(q >> 12) & 0xf]};

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another identical instance of the use of the existing layout for pk_int4_t.

bartekxk
bartekxk previously approved these changes Sep 24, 2025
bartekxk
bartekxk previously approved these changes Sep 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants