Fusion Guide
RunMat's acceleration layer recognises multiple flavours of fusible graph and hands each one to your GPU provider as a single kernel. The patterns it targets show up across linear algebra, signal processing, imaging, and solver workloads; keeping them fused prevents redundant memory traffic and lets us reuse provider kernels to ship quickly.
- Elementwise & reductions: Collapse dozens of scalar operations into one dispatch and prevent repeated reads/writes of the same tensor.
- Matmul epilogues: Fusing scale, bias, and activation work avoids launching a second kernel that re-reads the full output matrix, and is how RunMat meets its matmul + activation parity goals.
- Covariance / Gram / power-step / explained-variance chains: Iterative factorisations spend most of their time in repeated "multiply, renormalise, measure" loops. Treating each stage as a fusion kind keeps eigensolvers and Krylov methods resident on the GPU.
- Image normalisation: Imaging and sensor pipelines often start with per-frame whitening plus gain/bias adjustments. Folding statistics and affine transforms into one kernel removes several launches per frame.
We prioritised these groups because they appear across domains, keep chatty host/device traffic off PCIe, and see the largest wins from a single fused dispatch. We'll add more fusion groups over time to cover more workloads.
Have a new fusion flavour in mind? Open an issue or submit a pull request so we can explore it together.
RunMat Currently Fuses the Following Patterns
Elementwise Chains
RunMat collapses arithmetic and transcendental expressions into one shader with full broadcasting.
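A minimal sketch of the kind of chain this covers (ordinary MATLAB; the variable names are illustrative, not a RunMat API):

```matlab
x = rand(4096, 4096);
% Each elementwise op below can collapse into a single fused kernel,
% so x is read once and y is written once.
y = exp(-x.^2) .* sin(2*pi*x) + 0.5;
```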
Reductions
`sum`, `mean`, and similar column/row reductions, with omit-NaN handling and scaling rules.
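For instance, a column reduction with omit-NaN handling of the sort described above (illustrative MATLAB, not RunMat-specific syntax):

```matlab
A = rand(10000, 64);
A(1, 1) = NaN;
% Column means that skip NaNs, plus a fused square-and-sum: the
% squaring and the reduction can run in one pass over A instead of
% materialising A.^2 first.
m = mean(A, 1, 'omitnan');
s = sum(A.^2, 1);
```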
Matmul Epilogues
Keep matmul outputs on device for scale, bias, clamp, pow, and diagonal extraction epilogues.
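As an illustration, a matmul whose scale, bias, and clamp steps can ride along as epilogues (the names alpha and b are ours):

```matlab
A = rand(512, 256);  B = rand(256, 512);
alpha = 0.1;  b = rand(1, 512);
% Scale, bias, and clamp apply as matmul epilogues, so the full
% 512x512 product is never re-read by a second kernel.
Y = max(alpha * (A * B) + b, 0);
```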
Centered Gram / Covariance
Mean subtraction plus covariance / Gram construction for any tall matrix stays resident.
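A sketch of the centred-covariance chain, equivalent to cov(X) for a tall data matrix X:

```matlab
X = rand(100000, 32);            % tall data matrix
mu = mean(X, 1);
Xc = X - mu;                     % mean subtraction via implicit expansion
% Centring and the Gram/covariance product stay in one resident chain,
% so Xc never round-trips through host memory.
C = (Xc' * Xc) / (size(X, 1) - 1);
```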
Power-Step Normalisation
Fuse matmul plus vector normalisation stages for iterative solvers and eigensolvers.
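For example, the inner loop of a plain power iteration, where each multiply-then-renormalise step is a candidate for this fusion (illustrative, not a prescribed RunMat idiom):

```matlab
A = rand(2048);  A = A' * A;     % symmetric test matrix
q = randn(2048, 1);
for k = 1:50
    % Matmul plus vector normalisation: fusing the pair keeps q on
    % device across iterations instead of syncing norm(q) to the host.
    q = A * q;
    q = q / norm(q);
end
lambda = q' * A * q;             % Rayleigh quotient estimate
```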
Explained Variance
Track `diag(Q' * G * Q)`-style diagnostics without leaving the GPU.
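A sketch of the diagnostic in question: the per-column explained variance of an orthonormal basis Q against a Gram matrix G (both constructed here purely for illustration):

```matlab
G = rand(64);  G = G' * G;       % example Gram matrix
[Q, ~] = qr(randn(64, 8), 0);    % orthonormal test basis
% diag(Q' * G * Q) measures the variance captured by each column of Q;
% fusing it avoids materialising the full 8x8 product just to keep
% its diagonal.
ev = diag(Q' * (G * Q));
```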
Image Normalisation
Batch × H × W whitening, gain, and bias fusion for image-like tensors.
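An illustrative per-frame whitening plus gain/bias adjustment over a batch of frames; the vecdim reductions assume MATLAB R2018b+ semantics, and the gain/bias values are made up:

```matlab
I = rand(16, 256, 256);                  % batch x H x W
gain = 1.5;  bias = 0.1;
mu = mean(I, [2 3]);                     % per-frame mean
sd = std(I, 0, [2 3]);                   % per-frame standard deviation
% The statistics and the affine transform fold into one kernel per
% frame instead of several separate launches.
J = gain .* (I - mu) ./ (sd + eps) + bias;
```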
How to Use These Docs
- Looking for coverage: Start with the link that matches your math. Each page lists the exact instruction patterns the fusion planner looks for and the operations that stay on device.
- Investigating surprises: If a workload isn't fusing, cross-check the prerequisites section (e.g. single-consumer chains for elementwise groups or constant epsilon for power steps).
- Extending RunMat: Combine these docs with `docs/HOW_RUNMAT_FUSION_WORKS.md` to see where to add new detection logic or builtin metadata.
- Telemetry correlation: Provider telemetry reports `fusion_kind` labels. Match those labels to the filenames above to understand what the GPU executed.