Memory Model
Apple Silicon has a unified memory architecture where the CPU and GPU share the same physical
memory. meTile exposes this directly through metile.Buffer.
Buffers
import numpy as np
import metile
# Create from numpy (zero-copy, the GPU reads the same memory)
x = metile.Buffer(data=np.random.randn(1024).astype(np.float32))
# Allocate zeroed
out = metile.Buffer.zeros((1024,))
# Allocate from existing numpy array
arr = np.zeros(1024, dtype=np.float32)
buf = metile.Buffer.from_numpy(arr)
# Read results back to numpy (also zero-copy)
result = out.numpy()
There is no explicit host-to-device copy. When you create a metile.Buffer, the data lives
in unified memory accessible to both CPU and GPU. After a kernel writes to a buffer, call
MetalDevice.get().sync() to wait until the GPU has finished, then read the buffer directly:
from metile.runtime.metal_device import MetalDevice
kernel[grid](x, out, N, BLOCK=256)
MetalDevice.get().sync() # wait for GPU to finish
print(out.numpy()) # read results
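What "zero-copy" means on the CPU side can be illustrated without a GPU. The sketch below is an analogy in plain Python, not meTile code: it assumes nothing about how metile.Buffer is implemented and uses the standard-library memoryview to create two views of one allocation, so a write through one view is visible through the other without any copy.

```python
# Analogy only: two views over one allocation, standing in for the CPU and
# GPU sides of a unified-memory buffer. No meTile or Metal calls are used.
backing = bytearray(4 * 4)                  # 16 bytes = four float32 slots

cpu_view = memoryview(backing).cast("f")    # view the bytes as float32
gpu_view = memoryview(backing).cast("f")    # second view over the same bytes

gpu_view[0] = 2.5                           # the "kernel" writes a result
print(cpu_view[0])                          # the CPU side sees it with no copy
print(cpu_view.tolist())
```

The analogy has no counterpart for synchronization. On real hardware the GPU may still be working when the CPU reaches the read, which is why the sync() call above comes before out.numpy().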
Inside Kernels
Inside @metile.kernel functions, buffer parameters become device pointers. You access memory
through metile.load and metile.store for element-wise access, and through metile.tile_load and metile.tile_store for 2D tiles:
# Element-wise access with pointer arithmetic
offs = pid * BLOCK + metile.arange(0, BLOCK)
x = metile.load(X + offs, mask=offs < N)
metile.store(Out + offs, x * 2.0, mask=offs < N)
# 2D tile access for matrix operations
a = metile.tile_load(A, row, col, stride, (ROWS, COLS))
metile.tile_store(C, row, col, stride, result, (ROWS, COLS))
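To see which elements each program instance touches, the offset arithmetic in the first snippet can be reproduced in plain Python. This is a sketch, not meTile code: `range` stands in for metile.arange, the helper name `offsets_for` is hypothetical, and the ceiling-division grid size is an assumption about how the launch grid would typically be chosen.

```python
def offsets_for(pid: int, block: int) -> list[int]:
    """Offsets covered by one program instance: pid * BLOCK + arange(0, BLOCK)."""
    return [pid * block + i for i in range(block)]

N, BLOCK = 10, 4
# Ceiling division (assumed grid size): enough instances to cover all N elements.
num_programs = (N + BLOCK - 1) // BLOCK

for pid in range(num_programs):
    offs = offsets_for(pid, BLOCK)
    in_bounds = [o < N for o in offs]       # same predicate as mask=offs < N
    print(pid, offs, in_bounds)
```

With these values only the last instance produces offsets at or beyond N. The `mask=offs < N` argument in the snippet above covers exactly those lanes.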
Masking
When the data size is not a multiple of the block size, use masks to prevent out-of-bounds memory access:
offs = pid * BLOCK + metile.arange(0, BLOCK)
mask = offs < N # boolean mask: True for valid elements
x = metile.load(X + offs, mask=mask) # masked-off lanes read 0
metile.store(Out + offs, x, mask=mask) # masked-off lanes are skipped
N = 10, BLOCK = 4, pid = 2 (last program instance)
offs = [8, 9, 10, 11]
mask = [T, T, F, F] # offsets 10 and 11 are out of bounds
load: reads x[8], x[9], returns 0 for indices 10, 11
store: writes out[8], out[9], skips indices 10, 11
Masking is essential for correctness. Without it, the last program instance would read/write past the end of the array.
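The worked example above can be checked with a small CPU emulation. This is a sketch in plain Python, not meTile's implementation: `masked_load` and `masked_store` are hypothetical helpers that mimic the described behavior (masked-off lanes read 0 on load and are skipped on store) using ordinary lists.

```python
def masked_load(src, offs, mask, other=0.0):
    """Return src[o] for enabled lanes and `other` for masked-off lanes."""
    return [src[o] if m else other for o, m in zip(offs, mask)]

def masked_store(dst, offs, values, mask):
    """Write values[i] to dst[offs[i]] for enabled lanes; skip the rest."""
    for o, v, m in zip(offs, values, mask):
        if m:
            dst[o] = v

N, BLOCK, pid = 10, 4, 2
x = [float(i) for i in range(N)]
out = [0.0] * N

offs = [pid * BLOCK + i for i in range(BLOCK)]   # [8, 9, 10, 11]
mask = [o < N for o in offs]                     # [True, True, False, False]

vals = masked_load(x, offs, mask)                # never indexes x[10] or x[11]
masked_store(out, offs, vals, mask)              # writes out[8] and out[9] only
print(vals)
print(out[8:])
```

Replacing `mask` with all-True values makes `masked_load` raise IndexError at offset 10, the list-based analogue of the out-of-bounds access that masking prevents on the GPU.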