Your First Kernel¶
This tutorial walks through writing, launching, and understanding a simple GPU kernel with meTile.
The Kernel¶
import numpy as np
import metile

@metile.kernel
def add(X, Y, Out, N, BLOCK: metile.constexpr):
    pid = metile.program_id(0)
    offs = pid * BLOCK + metile.arange(0, BLOCK)
    mask = offs < N
    x = metile.load(X + offs, mask=mask)
    y = metile.load(Y + offs, mask=mask)
    metile.store(Out + offs, x + y, mask=mask)
Let’s break this down line by line.
@metile.kernel
    Marks this function for GPU compilation. When you call it, meTile traces the Python code, compiles it to a Metal shader, and dispatches it on the GPU.
X, Y, Out
    Device pointers to GPU memory. These map to device float* in Metal.
N
    A runtime scalar, passed as a constant int& to the shader.
BLOCK: metile.constexpr
    A compile-time constant. The value is baked directly into the shader. Changing it triggers recompilation.
metile.program_id(0)
    Returns the index of this program instance along axis 0. If you launch 4 instances, they get pid = 0, 1, 2, 3. This is analogous to blockIdx.x in CUDA or get_program_id(0) in Triton.
metile.arange(0, BLOCK)
    Creates a tile (vector) of consecutive indices [0, 1, 2, ..., BLOCK-1].
offs = pid * BLOCK + metile.arange(0, BLOCK)
    Each program instance handles a contiguous chunk of BLOCK elements. Instance 0 handles [0..BLOCK-1], instance 1 handles [BLOCK..2*BLOCK-1], etc.
mask = offs < N
    A boolean mask that prevents out-of-bounds accesses when N is not a multiple of BLOCK.
metile.load(X + offs, mask=mask)
    Loads BLOCK elements from memory. Masked-off lanes read zero.
metile.store(Out + offs, x + y, mask=mask)
    Stores results. Masked-off lanes are skipped.
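The masked load/store semantics can be mimicked on the CPU with plain NumPy. The sketch below is only an illustration of what the program instances collectively compute (the loop over pid stands in for the parallel launch); it is not meTile code, and add_reference is a name invented here:

```python
import numpy as np

def add_reference(x, y, n, block):
    """CPU sketch of the add kernel: each 'program instance' pid
    handles one contiguous chunk of `block` elements."""
    out = np.zeros_like(x)
    num_instances = (n + block - 1) // block   # same as metile.cdiv(n, block)
    for pid in range(num_instances):
        offs = pid * block + np.arange(block)  # metile.arange(0, BLOCK)
        mask = offs < n                        # guards the ragged last chunk
        # masked load: out-of-bounds lanes read zero
        xv = np.where(mask, x[np.minimum(offs, n - 1)], 0.0)
        yv = np.where(mask, y[np.minimum(offs, n - 1)], 0.0)
        # masked store: out-of-bounds lanes are skipped entirely
        out[offs[mask]] = (xv + yv)[mask]
    return out
```

Note that the mask matters on the last instance: with n = 10 and block = 4, instance 2 gets offs = [8, 9, 10, 11], and only lanes 8 and 9 load or store.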
Launching¶
N = 1024
x = metile.Buffer(data=np.random.randn(N).astype(np.float32))
y = metile.Buffer(data=np.random.randn(N).astype(np.float32))
out = metile.Buffer.zeros((N,))
grid = (metile.cdiv(N, 256),) # ceil(1024 / 256) = 4 program instances
add[grid](x, y, out, N, BLOCK=256)
print(out.numpy()[:5])
metile.Buffer
    Wraps a Metal buffer in unified memory. CPU and GPU share the same physical memory on Apple Silicon, so there is no copy between host and device.
metile.Buffer.zeros((N,))
    Allocates a zeroed buffer of N float32 elements.
metile.cdiv(N, 256)
    Ceiling division: ceil(N / 256). Utility for computing grid sizes.
add[grid](...)
    The [grid] subscript sets the number of program instances (threadgroups). The kernel is compiled on first call and cached for subsequent calls with the same constexprs.
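Ceiling division ensures the grid covers every element even when N is not a multiple of the block size; the last instance simply handles a partial block. A one-line Python equivalent (a sketch, not the library source):

```python
def cdiv(n, block):
    """Ceiling division: the smallest number of blocks that covers n elements."""
    return (n + block - 1) // block

print(cdiv(1024, 256))  # → 4
print(cdiv(1000, 256))  # → 4 (instance 3 handles only 232 elements)
```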
The Compilation Pipeline¶
When you call add[grid](...), meTile:
1. Traces the Python function with symbolic values to build a Tile IR
2. Lowers the Tile IR to Metal IR (Apple GPU-specific primitives)
3. Optimizes via IR-to-IR passes (vectorization, loop splitting, constant folding)
4. Emits MSL (Metal Shading Language) source code
5. Compiles with xcrun metal -O2 (or JIT if Xcode is unavailable)
6. Dispatches the compute pipeline on the GPU
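The compile-on-first-call behavior can be pictured as a dictionary keyed by the kernel and its constexpr values. This is a hypothetical sketch of the caching idea only; the function and cache names are invented, and the compiled pipeline is stubbed out as a string:

```python
_pipeline_cache = {}

def get_pipeline(kernel_name, constexprs):
    """Hypothetical sketch: compile once per (kernel, constexprs) combination,
    then reuse the cached result on every later launch."""
    key = (kernel_name, tuple(sorted(constexprs.items())))
    if key not in _pipeline_cache:
        # In reality: trace -> Tile IR -> Metal IR -> MSL -> compiled pipeline.
        _pipeline_cache[key] = f"pipeline<{kernel_name}, {constexprs}>"
    return _pipeline_cache[key]
```

This is why changing BLOCK triggers recompilation: it changes the cache key, so a new shader is built and cached alongside the old one.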
You can inspect any stage with the METILE_DEBUG environment variable:
METILE_DEBUG=msl python my_script.py # see the generated Metal shader
METILE_DEBUG=tile_ir python my_script.py # see the Tile IR
METILE_DEBUG=all python my_script.py # see everything
What’s Next¶
Language Reference for the full language reference
Softmax for a more complex kernel with reductions and multiple passes
Matrix Multiply (GEMM) for a tile-level matrix multiply with dot and tile_load