Commit e11f694

spolifroni-amd, lpaoletti, and vidyasagar-amd authored

first commit of the glossary (#2702)

* first commit of the glossary
* minor changes
* Update docs/reference/Composable-Kernel-Glossary.rst
  Co-authored-by: Leo Paoletti <[email protected]>
* Update docs/reference/Composable-Kernel-Glossary.rst
  Co-authored-by: Leo Paoletti <[email protected]>
* Update Composable-Kernel-Glossary.rst

Co-authored-by: Leo Paoletti <[email protected]>
Co-authored-by: Vidyasagar Ananthan <[email protected]>
1 parent 4eb4158 commit e11f694

File tree

* docs/index.rst
* docs/reference/Composable-Kernel-Glossary.rst
* docs/sphinx/_toc.yml.in

3 files changed: +264 −1 lines changed

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ The Composable Kernel repository is located at `https://github.com/ROCm/composab
 * :doc:`Composable Kernel API reference <./doxygen/html/namespace_c_k>`
 * :doc:`CK Tile API reference <./doxygen/html/namespaceck__tile>`
 * :doc:`Composable Kernel complete API class list <./doxygen/html/annotated>`
+* :doc:`Composable Kernel glossary <./reference/Composable-Kernel-Glossary>`

 To contribute to the documentation refer to `Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.

docs/reference/Composable-Kernel-Glossary.rst

Lines changed: 256 additions & 0 deletions

@@ -0,0 +1,256 @@
.. meta::
   :description: Composable Kernel glossary of terms
   :keywords: composable kernel, glossary

***************************************************
Composable Kernel glossary
***************************************************

.. glossary::
   :sorted:
   arithmetic logic unit
      The arithmetic logic unit (ALU) is the GPU component responsible for arithmetic and logic operations.

   compute unit
      The compute unit (CU) is the parallel vector processor in an AMD GPU with multiple :term:`ALUs<arithmetic logic unit>`. Each compute unit runs all the :term:`wavefronts<wavefront>` in a :term:`work group`. A compute unit is equivalent to NVIDIA's streaming multiprocessor.
   matrix core
      A matrix core is a specialized GPU unit that accelerates matrix operations for AI and deep learning tasks. A GPU contains multiple matrix cores.

   register
      Registers are the fastest tier of memory. They're used for storing temporary values during computations and are private to the :term:`work-items<work-item>` that use them.
   VGPR
      See :term:`vector general purpose register`.

   vector general purpose register
      A vector general purpose register (VGPR) is a :term:`register` that stores individual thread data. Each thread in a :term:`wave<wavefront>` has its own set of VGPRs for private variables and calculations.

   SGPR
      See :term:`scalar general purpose register`.

   scalar general purpose register
      A scalar general purpose register (SGPR) is a :term:`register` shared by all the :term:`work-items<work-item>` in a :term:`wave<wavefront>`. SGPRs are used for constants, addresses, and control flow common across the entire wave.
   LDS
      See :term:`local data share`.

   local data share
      Local data share (LDS) is high-bandwidth, low-latency on-chip memory accessible to all the :term:`work-items<work-item>` in a :term:`work group`. LDS is equivalent to NVIDIA's shared memory.

   LDS banks
      LDS banks are a type of memory organization where consecutive addresses are distributed across multiple memory banks for parallel access. LDS banks are used to prevent memory access conflicts and improve bandwidth when LDS is used.

   global memory
      Global memory is the main device memory accessible by all threads, offering high capacity but higher latency than shared memory.
   pinned memory
      Pinned memory is :term:`host` memory that is page-locked to accelerate transfers between the CPU and GPU.

   dense tensor
      A dense tensor is a tensor where most of its elements are non-zero. Dense tensors are typically stored in a contiguous block of memory.

   sparse tensor
      A sparse tensor is a tensor where most of its elements are zero. Typically only the non-zero elements of a sparse tensor and their indices are stored.
   host
      Host refers to the CPU and the main memory system that manages GPU execution. The host is responsible for launching kernels, transferring data, and coordinating overall computation.

   device
      Device refers to the GPU hardware that runs parallel kernels. The device contains the :term:`compute units<compute unit>`, memory hierarchy, and specialized accelerators.
   work-item
      A work-item is the smallest unit of parallel execution. A work-item runs a single independent instruction stream on a single data element. A work-item is equivalent to an NVIDIA thread.

   wavefront
      Also referred to as a wave, a wavefront is a group of :term:`work-items<work-item>` that run the same instruction. A wavefront is equivalent to an NVIDIA warp.

   work group
      A work group is a collection of :term:`work-items<work-item>` that can synchronize and share memory. A work group is equivalent to NVIDIA's thread block.
   grid
      A grid is a collection of :term:`work groups<work group>` that run a kernel. Each work group within the grid operates independently and can be scheduled on a different :term:`compute unit`. A grid can be organized into one, two, or three dimensions. A grid is equivalent to an NVIDIA grid.

   block size
      The block size is the number of :term:`work-items<work-item>` in a :term:`work group`.
   SIMT
      See :term:`single-instruction, multi-thread`.

   single-instruction, multi-thread
      Single-instruction, multi-thread (SIMT) is a parallel computing model where all the :term:`work-items<work-item>` within a :term:`wavefront` run the same instruction on different data.

   SIMD
      See :term:`single-instruction, multi-data`.

   single-instruction, multi-data
      Single-instruction, multi-data (SIMD) is a parallel computing model where the same instruction is run with different data simultaneously.
   occupancy
      Occupancy is the ratio of active :term:`wavefronts<wavefront>` to the maximum possible number of wavefronts.
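      A worked example with hypothetical numbers: if a :term:`compute unit` can hold 32 wavefronts but a kernel's register usage limits it to 8 active wavefronts, occupancy is 8 / 32 = 25%.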
   kernel
      A kernel is a function that runs an :term:`operation` or a collection of operations. A kernel runs in parallel on several :term:`work-items<work-item>` across the GPU. In Composable Kernel, kernels require :term:`pipelines<pipeline>`.

   operation
      An operation is a computation on input data.

   pipeline
      A Composable Kernel pipeline schedules the sequence of operations for a :term:`kernel`, such as the data loading, computation, and storage phases. A pipeline consists of a :term:`problem` and a :term:`policy`.
   tile partitioner
      The tile partitioner defines the mapping between the :term:`problem` dimensions and the GPU hierarchy. It specifies :term:`tile` sizes at the :term:`work group` level and determines :term:`grid` dimensions by dividing the problem size by the tile sizes.
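      A minimal sketch of the divide-and-round-up arithmetic described above; the names are hypothetical, not Composable Kernel API:

      .. code-block:: cpp

         // Ceiling division: launch enough tiles to cover the whole problem.
         constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

         // Hypothetical example: a 1000 x 2000 problem with 128 x 256
         // work-group tiles launches an 8 x 8 grid of work groups.
         const int grid_m = ceil_div(/*problem M*/ 1000, /*tile M*/ 128); // 8
         const int grid_n = ceil_div(/*problem N*/ 2000, /*tile N*/ 256); // 8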
   problem
      The problem is the part of the :term:`pipeline` that defines input and output shapes, data types, and mathematical :term:`operations<operation>`.

   policy
      The policy is the part of the :term:`pipeline` that defines memory access patterns and hardware-specific optimizations.
   user customized tile pipeline
      A user customized tile pipeline is a :term:`tile` :term:`pipeline` that combines custom :term:`problem` and :term:`policy` components for specialized computations.

   user customized tile pipeline optimization
      User customized tile pipeline optimization is the process of tuning the :term:`tile` size, memory access pattern, and hardware utilization for specific workloads.

   tile programming API
      The :term:`tile` programming API is Composable Kernel's high-level interface for defining tile-based computations with predefined hardware mappings for data loading and storing.
   coordinate transformation primitives
      Coordinate transformation primitives are Composable Kernel utilities for converting between different coordinate systems.

   reference kernel
      A reference :term:`kernel` is a baseline kernel implementation used to verify correctness and performance. Composable Kernel provides two reference kernels: one for the CPU and one for the GPU.

   launch parameters
      Launch parameters are the configuration values, such as :term:`grid` and :term:`block size`, that determine how a :term:`kernel` is mapped to hardware resources.
   memory coalescing
      Memory coalescing is an optimization strategy where consecutive :term:`work-items<work-item>` access consecutive memory addresses in such a way that a single memory transaction serves multiple work-items.
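      A minimal HIP-style sketch (hypothetical kernels, not Composable Kernel API) contrasting a coalesced access pattern with a strided one:

      .. code-block:: cpp

         // Coalesced: work-item i reads element i, so consecutive work-items
         // in a wavefront touch consecutive addresses and the hardware can
         // combine the accesses into a few memory transactions.
         __global__ void copy_coalesced(const float* in, float* out, int n)
         {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < n) out[i] = in[i];
         }

         // Uncoalesced: work-item i reads element i * stride, scattering the
         // wavefront's accesses across many separate transactions.
         __global__ void copy_strided(const float* in, float* out, int n, int stride)
         {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i * stride < n) out[i] = in[i * stride];
         }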
   alignment
      Alignment is a memory management strategy where data structures are stored at addresses that are multiples of a specific value.
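      A one-line sketch: ``alignas`` places a buffer on a 16-byte boundary so that 128-bit vector loads of four floats stay aligned:

      .. code-block:: cpp

         // 16-byte alignment matches a float4-style (128-bit) vector load.
         alignas(16) float staging[256];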
   bank conflict
      A bank conflict occurs when multiple :term:`work-items<work-item>` in a :term:`wavefront` access different addresses that map to the same shared memory bank.
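      A minimal HIP-style sketch of the classic case and its :term:`padding` fix, assuming hypothetical sizes and 32 LDS banks of 4-byte words:

      .. code-block:: cpp

         __global__ void transpose_tile_example()
         {
             // Without padding, tile[r][c] and tile[r + 1][c] map to the same
             // bank, so a wavefront reading a column conflicts 32 ways:
             //   __shared__ float tile[32][32];
             // Padding each row by one element staggers rows across banks and
             // removes the conflict at the cost of a little extra LDS.
             __shared__ float tile[32][32 + 1];
             tile[threadIdx.y][threadIdx.x] = 0.0f; // placeholder use
         }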
   padding
      Padding is the addition of extra elements, often zeros, to tensor edges in order to control output size in convolution and pooling, or to align data for memory access.

   transpose
      Transpose is an :term:`operation` that reorders tensor axes, classically swapping the two axes of a matrix, often to match :term:`kernel` input formats or optimize memory access patterns.

   permute
      Permute is an :term:`operation` that rearranges tensor axes into an arbitrary new order, often to match :term:`kernel` input formats or optimize memory access patterns.

   host-device transfer
      A host-device transfer is the process of moving data between :term:`host` and :term:`device` memory.
   stride
      A stride is the step size to move from one element to the next in a specific dimension of a tensor or matrix. In convolution and pooling, the stride determines how far the :term:`kernel` moves at each step.

   dilation
      Dilation is the spacing between :term:`kernel` elements in convolution :term:`operations<operation>`, allowing the receptive field to grow without increasing kernel size.
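      Stride, dilation, padding, and kernel size combine in the standard convolution output-size relation (general convolution arithmetic, not specific to Composable Kernel):

      .. math::

         L_{out} = \left\lfloor \frac{L_{in} + 2 \cdot \text{padding} - \text{dilation} \cdot (K - 1) - 1}{\text{stride}} \right\rfloor + 1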
   Im2Col
      Im2Col is a data transformation technique that converts image data to column format, typically so that a convolution can be computed as a :term:`GEMM`.

   Col2Im
      Col2Im is a data transformation technique that converts column data back to image format.
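      A minimal single-channel sketch (hypothetical helper, no padding, stride 1): each output column gathers one K x K patch, after which the convolution is a matrix multiply over ``col``:

      .. code-block:: cpp

         // img is H x W; col is (K*K) x (out_h * out_w), both row-major.
         void im2col(const float* img, int H, int W, int K, float* col)
         {
             int out_h = H - K + 1, out_w = W - K + 1;
             for (int oy = 0; oy < out_h; ++oy)
                 for (int ox = 0; ox < out_w; ++ox)
                     for (int ky = 0; ky < K; ++ky)
                         for (int kx = 0; kx < K; ++kx)
                             col[(ky * K + kx) * (out_h * out_w) + (oy * out_w + ox)] =
                                 img[(oy + ky) * W + (ox + kx)];
         }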
   fast changing dimension
      The fast changing dimension is the innermost dimension in a memory layout, whose consecutive elements are adjacent in memory.

   outer dimension
      The outer dimension is the slower-changing dimension in a memory layout.

   inner dimension
      The inner dimension is the faster-changing dimension in a memory layout.
   tile
      A tile is a sub-region of a tensor or matrix that is processed by a :term:`work group` or :term:`work-item`. Tiles are the rectangular data blocks that serve as the unit of computation and memory transfer in Composable Kernel, and they are the basis for tiled algorithms.

   block tile
      A block tile is a memory :term:`tile` processed by a :term:`work group`.

   wave tile
      A wave :term:`tile` is a sub-tile processed by a single :term:`wavefront` within a :term:`work group`. The wave tile is the base level of granularity of the :term:`single-instruction, multi-thread (SIMT)<single-instruction, multi-thread>` model.
   tile distribution
      The tile distribution is the hierarchical data mapping from :term:`work-items<work-item>` to data in memory.

   tile window
      A tile window is a viewport into a larger tensor that defines the position and boundaries of the current :term:`tile` for computation.
   load tile
      Load tile is an operation that transfers data from :term:`global memory` or the :term:`local data share` to :term:`vector general purpose registers<vector general purpose register>`.

   store tile
      Store tile is an operation that transfers data from :term:`vector general purpose registers<vector general purpose register>` to :term:`global memory` or the :term:`local data share`.

   descriptor
      A descriptor is a metadata structure that defines :term:`tile` properties, memory layouts, and coordinate transformations for Composable Kernel :term:`operations<operation>`.
   input
      See :term:`problem shape`.

   problem shape
      The problem shape specifies the dimensions and data types of the input tensors that define the :term:`problem`.

   vector
      The vector is the smallest data unit processed by an individual :term:`work-item`. A vector typically holds four to sixteen elements, depending on the data type and hardware.
   elementwise
      An elementwise :term:`operation` is an operation applied to each tensor element independently.

   epilogue
      The epilogue is the final stage of a kernel. Activation functions, bias, and other post-processing steps are applied in the epilogue.
   Add+Multiply
      See :term:`fused add multiply`.

   fused add multiply
      A common fused :term:`operation` in machine learning and linear algebra, where an :term:`elementwise` addition is immediately followed by a multiplication. Fused add multiply is often used for bias and scaling in neural network layers.
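      A minimal elementwise sketch (hypothetical epilogue loop, not Composable Kernel API) of fusing the two steps into a single pass over the data:

      .. code-block:: cpp

         // One fused pass: bias add immediately followed by scaling, instead
         // of two separate kernels and an extra round trip through memory.
         void fused_add_multiply(const float* acc, const float* bias,
                                 float scale, float* out, int n)
         {
             for (int i = 0; i < n; ++i)
                 out[i] = (acc[i] + bias[i]) * scale;
         }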
   MFMA
      See :term:`matrix fused multiply-add`.

   matrix fused multiply-add
      Matrix fused multiply-add (MFMA) is a :term:`matrix core` instruction for GEMM :term:`operations<operation>`.
   GEMM
      See :term:`general matrix multiply`.

   general matrix multiply
      A general matrix multiply (GEMM) is a core matrix :term:`operation` in linear algebra and deep learning. A GEMM is defined as :math:`C = {\alpha}AB + {\beta}C`, where :math:`A`, :math:`B`, and :math:`C` are matrices, and :math:`\alpha` and :math:`\beta` are scalars.
   VGEMM
      See :term:`naive GEMM`.

   vanilla GEMM
      See :term:`naive GEMM`.

   naive GEMM
      The naive GEMM, sometimes referred to as a vanilla GEMM or VGEMM, is the simplest form of :term:`GEMM` in Composable Kernel. The naive GEMM is defined as :math:`C = AB`, where :math:`A`, :math:`B`, and :math:`C` are matrices. The naive GEMM is the baseline GEMM that all other GEMM :term:`operations<operation>` build on.
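      A minimal reference-style sketch of :math:`C = AB` in plain C++ (row-major layout assumed; illustrative only, not the Composable Kernel implementation):

      .. code-block:: cpp

         // C (M x N) = A (M x K) * B (K x N), all row-major.
         void naive_gemm(const float* A, const float* B, float* C,
                         int M, int N, int K)
         {
             for (int m = 0; m < M; ++m)
                 for (int n = 0; n < N; ++n)
                 {
                     float acc = 0.0f;
                     for (int k = 0; k < K; ++k)
                         acc += A[m * K + k] * B[k * N + n];
                     C[m * N + n] = acc;
                 }
         }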
   GGEMM
      See :term:`grouped GEMM`.

   grouped GEMM
      A grouped GEMM is a :term:`kernel` that calls multiple :term:`VGEMMs<naive GEMM>`. Each call can have a different :term:`problem shape`.

   batched GEMM
      A batched GEMM is a :term:`kernel` that calls :term:`VGEMMs<naive GEMM>` with different batches of data. All the data batches have the same :term:`problem shape`.
   Split-K GEMM
      Split-K GEMM is a parallelization strategy that partitions the reduction dimension (K) of a :term:`GEMM` across multiple :term:`compute units<compute unit>`, increasing parallelism for large matrix multiplications.
   GEMV
      See :term:`general matrix vector multiplication`.

   general matrix vector multiplication
      General matrix vector multiplication (GEMV) is an :term:`operation` where a matrix is multiplied by a vector, producing another vector (:math:`y = Ax`).

docs/sphinx/_toc.yml.in

Lines changed: 7 additions & 1 deletion
@@ -34,8 +34,14 @@ subtrees:
         title: Composable Kernel vector utilities
       - file: reference/Composable-Kernel-wrapper.rst
         title: Composable Kernel wrapper
+      - file: doxygen/html/namespace_c_k.rst
+        title: CK API reference
+      - file: doxygen/html/namespaceck__tile.rst
+        title: CK Tile API reference
       - file: doxygen/html/annotated.rst
-        title: Composable Kernel class list
+        title: Full API class list
+      - file: reference/Composable-Kernel-Glossary.rst
+        title: Glossary

   - caption: About
     entries:
