API Reference
Kernel Definition & Launch

| API | Description |
|---|---|
| | Decorate a Python function for GPU compilation |
| | Launch kernel with given grid shape and compile-time constants |
| | Type annotation for compile-time constant parameters |
Buffers

| API | Description |
|---|---|
| | Create a GPU buffer from a numpy array (unified memory, zero-copy) |
| | Allocate a zeroed float32 buffer |
| | Create a GPU buffer from a numpy array |
| | Return a numpy view of the buffer data |
Program Identity

| API | Description |
|---|---|
| | Threadgroup index along |
| | Thread index within the threadgroup |
| | Lane index within the simdgroup (0-31) |
Index Generation

| API | Description |
|---|---|
| | Tile of |
| | Ceiling division: |
| | Smallest power of 2 >= |
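
The two integer helpers described above are commonly used to size a launch grid and to pad a block to a power of two. The sketch below shows the arithmetic in plain Python; the names `ceil_div` and `next_power_of_2` are hypothetical stand-ins chosen for illustration and are not taken from this library's API.

```python
def ceil_div(n: int, d: int) -> int:
    """Ceiling division for d > 0, using integer arithmetic only."""
    return (n + d - 1) // d


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


# Example: cover a 1000-element vector with 256-wide blocks.
n_elements = 1000
block = 256
grid = ceil_div(n_elements, block)     # number of blocks needed
padded = next_power_of_2(n_elements)   # padded power-of-two size
```

With these inputs, `grid` is 4 (four blocks of 256 cover 1000 elements) and `padded` is 1024.
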
Element-wise Memory

| API | Description |
|---|---|
| | Load elements; masked-off lanes read 0 |
| | Store elements; masked-off lanes are skipped |
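
The masking semantics in the table can be emulated on the CPU with NumPy. This is only a sketch of the behaviour described (masked-off lanes read 0 on load and are skipped on store); `masked_load` and `masked_store` are hypothetical helper names, not this library's functions.

```python
import numpy as np


def masked_load(buf, offsets, mask):
    """Emulate a masked load: lanes where mask is False read 0."""
    safe = np.where(mask, offsets, 0)  # keep every index in bounds
    return np.where(mask, buf[safe], 0).astype(buf.dtype)


def masked_store(buf, offsets, values, mask):
    """Emulate a masked store: lanes where mask is False write nothing."""
    buf[offsets[mask]] = values[mask]


n = 10                                    # vector length, not a multiple of block
block = 4
x = np.arange(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

pid = 2                                   # third block handles offsets 8..11
offsets = pid * block + np.arange(block)
mask = offsets < n                        # last two lanes fall off the end
vals = masked_load(x, offsets, mask)
masked_store(out, offsets, vals * 2, mask)
```

The mask is what lets the final block run at full width even though only two of its four lanes map to valid elements.
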
Tile Memory

| API | Description |
|---|---|
| | Load a 2D tile from row-major memory |
| | Store a 2D tile to row-major memory |
| | Zero-initialized tile (accumulator init) |
| | Tile matrix multiply-accumulate: |
Control Flow

| API | Description |
|---|---|
| | Tiling loop (K-dimension iteration, multi-pass algorithms) |
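
To illustrate how a zero-initialized accumulator, tile loads, multiply-accumulate, and a K-dimension tiling loop fit together, here is a CPU-side NumPy sketch of a tiled matrix multiply. It mirrors the pattern only; `tiled_matmul` and its tile-size parameters are assumptions made for this example, and NumPy slicing stands in for the tile load and store operations.

```python
import numpy as np


def tiled_matmul(a, b, tile_m=8, tile_n=8, tile_k=8):
    """Compute a @ b one output tile at a time, looping over K in tiles."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            rows = min(tile_m, m - i)
            cols = min(tile_n, n - j)
            acc = np.zeros((rows, cols), dtype=np.float32)  # accumulator init
            for kk in range(0, k, tile_k):                  # K-dimension loop
                a_tile = a[i:i + tile_m, kk:kk + tile_k]    # "load" A tile
                b_tile = b[kk:kk + tile_k, j:j + tile_n]    # "load" B tile
                acc += a_tile @ b_tile                      # multiply-accumulate
            c[i:i + tile_m, j:j + tile_n] = acc             # "store" C tile
    return c


rng = np.random.default_rng(0)
a = rng.standard_normal((20, 12)).astype(np.float32)
b = rng.standard_normal((12, 17)).astype(np.float32)
c = tiled_matmul(a, b)
```

On a GPU, each `(i, j)` output tile would typically map to one threadgroup, and only the inner K loop would remain inside the kernel.
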
Math Operations
All operate element-wise on scalars and tiles:

| API | Description |
|---|---|
| | Exponential |
| | Natural logarithm |
| | Square root |
| | Absolute value |
| | Hyperbolic tangent |
| | Conditional select |
| | Element-wise max |
| | Element-wise min |

Standard Python arithmetic (+, -, *, /, <, >, etc.) works inside kernels.
Reductions

| API | Description |
|---|---|
| | Sum-reduce tile to scalar |
| | Max-reduce tile to scalar |
| | Min-reduce tile to scalar |
Simdgroup Operations

| API | Description |
|---|---|
| | Context manager: execute on a subset of simdgroups |
| | XOR-based lane exchange within a simdgroup |
| | Broadcast from one lane to all lanes |
| | Threadgroup memory barrier |
| | Allocate threadgroup (shared) memory |
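
XOR-based lane exchange is the building block of a butterfly reduction: at each step, every lane adds the value held by the lane whose index differs in one bit. The NumPy sketch below emulates a 32-lane simdgroup on the CPU; `xor_shuffle` and `simd_all_reduce_sum` are hypothetical names for illustration and do not reflect this library's function signatures.

```python
import numpy as np

SIMD_WIDTH = 32  # lanes per simdgroup, matching the 0-31 lane index range


def xor_shuffle(values, lane_mask):
    """Each lane i receives the value held by lane i XOR lane_mask."""
    lanes = np.arange(SIMD_WIDTH)
    return values[lanes ^ lane_mask]


def simd_all_reduce_sum(values):
    """Butterfly reduction: after log2(32) = 5 steps every lane holds the sum."""
    acc = values.copy()
    offset = SIMD_WIDTH // 2
    while offset >= 1:
        acc = acc + xor_shuffle(acc, offset)
        offset //= 2
    return acc


lane_values = np.arange(SIMD_WIDTH, dtype=np.float32)  # lane i holds value i
reduced = simd_all_reduce_sum(lane_values)
```

Because every lane ends up with the full sum (496 for the inputs above), no separate broadcast step is needed after the reduction.
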
Tile Scheduling

| API | Description |
|---|---|
| | Apply tile scheduling pattern ( |
Autotuning

| API | Description |
|---|---|
| | Decorator for automatic parameter search |
| | A set of constexpr values to benchmark |
| | Autotune once and return a fast dispatcher |
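
The general idea behind "autotune once, then dispatch" can be sketched in plain Python: time each candidate configuration, keep the fastest, and return a callable bound to it. Everything here (`autotune`, `make_dispatcher`, `blocked_sum`, the `block` parameter) is a hypothetical illustration of the technique, not this library's decorator or config API.

```python
import time


def autotune(fn, configs, args, repeats=3):
    """Time fn(*args, **cfg) for each config and return the fastest config."""
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        fn(*args, **cfg)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args, **cfg)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg


def make_dispatcher(fn, configs, args):
    """Run the search once, then return a callable that reuses the winner."""
    best = autotune(fn, configs, args)

    def dispatch(*call_args):
        return fn(*call_args, **best)

    dispatch.best_config = best
    return dispatch


def blocked_sum(data, block=1):
    """Toy workload whose speed depends on the block parameter."""
    total = 0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total


configs = [{"block": 16}, {"block": 64}, {"block": 256}]
data = list(range(10_000))
fast_sum = make_dispatcher(blocked_sum, configs, (data,))
result = fast_sum(data)
```

Which configuration wins depends on the machine, so the sketch only guarantees that the chosen one comes from the candidate set and that the result is unchanged.
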