Fusion Reference

Fusion Guide

RunMat's acceleration layer recognises multiple flavours of fusible graphs and hands them to your GPU provider as single kernels.

RunMat fuses common patterns that show up across linear algebra, signal processing, imaging, and solver workloads into single GPU programs. Keeping them fused prevents redundant memory traffic and lets us re-use provider kernels to ship quickly. The sketch after the list below shows the typical shape of each group.

  • Elementwise & reductions: Collapse dozens of scalar operations into one dispatch and prevent repeated reads/writes of the same tensor.
  • Matmul epilogues: Fusing scale, bias, and activation work into the matrix multiply avoids launching a second kernel that touches the full output again, and is how RunMat meets its matmul + activation parity goals.
  • Covariance / Gram / power-step / explained-variance chains: Iterative factorisations spend most of their time in repeated "multiply, renormalise, measure" loops. Treating each stage as a fusion kind keeps eigensolvers and Krylov methods resident on the GPU.
  • Image normalisation: Imaging and sensor pipelines often start with per-frame whitening plus gain/bias adjustments. Folding statistics and affine transforms into one kernel removes several launches per frame.
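
To make these groups concrete, the sketch below shows the kind of script-level code each one targets. These are illustrative MATLAB-style snippets, not the planner's exact matching rules: the variable names (x, mu, sigma, A, B, bias, Xc, v, frame, gain, offset) are hypothetical, and the precise instruction patterns the planner accepts are listed on each pattern's page.

```matlab
% Elementwise & reductions: a chain of scalar ops ending in a reduction can
% run as one dispatch instead of one launch per operator.
y = sum(exp(-((x - mu).^2) ./ (2 * sigma.^2)), 2);

% Matmul epilogue: scale, bias, and activation folded onto the matrix
% multiply so the full output is written only once.
z = max(alpha * (A * B) + bias, 0);

% Covariance / Gram / power-step / explained-variance chain: the
% "multiply, renormalise, measure" loop behind iterative eigensolvers.
C = (Xc' * Xc) / (n - 1);    % covariance of centred data Xc
for k = 1:iters
    v = C * v;               % power step
    v = v / norm(v);         % renormalise
    lambda = v' * C * v;     % variance explained along v
end

% Image normalisation: per-frame whitening plus gain/bias in a single pass.
frame = (frame - mean(frame(:))) ./ std(frame(:));
frame = gain .* frame + offset;
```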

We prioritised these groups because they appear across many domains, keep chatty host/device traffic off PCIe, and gain the most from staying fused. We'll add more fusion groups over time to cover additional workloads.

Have a new fusion flavour in mind? Open an issue or submit a pull request so we can explore it together.

RunMat Currently Fuses the Following Patterns

How to Use These Docs

  1. Looking for coverage: Start with the link that matches your math. Each page lists the exact instruction patterns the fusion planner looks for and the operations that stay on device.
  2. Investigating surprises: If a workload isn't fusing, cross-check the prerequisites section (e.g. single-consumer chains for elementwise groups or constant epsilon for power steps); the sketch after this list illustrates the single-consumer rule.
  3. Extending RunMat: Combine these docs with docs/HOW_RUNMAT_FUSION_WORKS.md to see where to add new detection logic or builtin metadata.
  4. Telemetry correlation: Provider telemetry reports fusion_kind labels. Match those labels to the filenames above to understand what the GPU executed.
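
As a rough illustration of the single-consumer prerequisite mentioned in step 2: the variable names below are hypothetical, and whether a given chain actually fuses depends on the planner rules described in docs/HOW_RUNMAT_FUSION_WORKS.md.

```matlab
% Likely to fuse into one elementwise group: every intermediate feeds
% exactly one consumer, so the whole chain can become a single kernel.
t = (x - mu) ./ sigma;
y = tanh(t) .^ 2;

% May break the chain: t now has two consumers (y and z), so the planner
% can be forced to materialise t and split the work across kernels.
t = (x - mu) ./ sigma;
y = tanh(t);
z = exp(t);
```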