Matmul Epilogue Fusion
Matmul epilogue fusion keeps the product of mtimes(A, B) on the GPU while folding scalar/vector epilogues into the same provider call.
What Qualifies
- One matmul root. The planner looks for a mtimes node whose output feeds a single-consumer elementwise chain.
- Allowed epilogue ops. Addition, subtraction, multiplication, division (elementwise or scalar), broadcast multiply/divide by row/column vectors, clamp via min/max, power (.^ or pow), and optional final diag.
- Single consumer requirement. Branched matmul outputs break fusion so each consumer can run independently.
- Tensor residency. Both inputs must be GPU-friendly (the executor uploads as needed) and match provider precision.
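To make the single-consumer requirement concrete, here is a minimal sketch contrasting a chain that matches the criteria above with a branched one. The sizes and variable names are hypothetical, and the comments restate the criteria rather than a guaranteed planner outcome.

```matlab
% Hypothetical sizes, chosen only for illustration.
A = rand(4, 3);
B = rand(3, 5);
bias = rand(1, 5);          % one entry per output column

% Single-consumer chain: the product feeds only the elementwise
% epilogue (add, then clamp from below via max).
C = max(A * B + bias, 0);

% Branched output: P has two consumers (D and E), which the criteria
% above describe as breaking fusion.
P = A * B;
D = P + bias;
E = P .* 2;
```

Writing the epilogue as one expression, or as a chain where each intermediate is used exactly once, keeps the product from acquiring a second consumer.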
Why Matmul Epilogue?
A matmul followed by an elementwise epilogue is a common pattern in linear algebra and statistics: it shows up in affine transforms, normalisation, gating, and similar operations. Fusing the pattern keeps the matmul result and its epilogue resident on the GPU, which avoids the overhead of moving the intermediate product to the CPU and back, and lets the epilogue run without launching a second kernel.
Scenarios that Fuse
- Affine transforms: C = A*B .* scale + bias, where scale/bias can be scalars or row/column vectors.
- Normalisation / gating: C = clamp((A*B - mean) ./ sigma, lo, hi).
- Power heads: C = (A*B).^gamma for scalar gamma.
- Gram-diagonal extraction: diag(A*B) (after optional scaling) for explained-variance style metrics.
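The four scenarios above can be sketched as runnable MATLAB-style code. This is an illustration under stated assumptions: the sizes and values are hypothetical, the clamp is spelled with min/max as described under What Qualifies, and the names mu and gam stand in for mean and gamma so that the built-in functions of those names are not shadowed.

```matlab
% Hypothetical sizes and values, for illustration only.
M = 4; K = 3; N = 5;
A = rand(M, K);
B = rand(K, N);

% Affine transform: a [1, N] scale and an [M, 1] bias, both broadcast.
scale = rand(1, N);
bias  = rand(M, 1);
C_affine = A * B .* scale + bias;

% Normalisation / gating: clamp expressed as max followed by min.
mu = 0.5; sigma = 2; lo = -1; hi = 1;
C_gate = min(max((A * B - mu) ./ sigma, lo), hi);

% Power head with a scalar exponent.
gam = 2;
C_pow = (A * B) .^ gam;

% Gram-diagonal extraction: diagonal of A' * A, with the transpose
% computed beforehand so the root is still a plain matmul.
At = A.';
d = diag(At * A);
```

Because * and .* share precedence and associate left to right, A * B .* scale is evaluated as (A * B) .* scale, so the scale applies to the product rather than to B.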
Not Supported
- Non-elementwise consumers such as reshapes, reductions, or additional matmuls.
- Multiple chained matmuls (e.g. (A*B)*C): only the first mtimes and its immediate elementwise epilogue fuse today.
- Data-dependent epilogues that require branching.
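For contrast, the following sketch shows what the unsupported shapes look like in code. The sizes are hypothetical, and the comments only restate the list above; what the planner actually does with each line is not shown here.

```matlab
% Hypothetical sizes, for illustration only.
A = rand(4, 3);
B = rand(3, 5);
C = rand(5, 2);

% Reduction consumer: sum is not elementwise.
s = sum(A * B, 2);

% Reshape consumer: also not elementwise.
r = reshape(A * B, 10, 2);

% Chained matmuls: a second mtimes consumes the first product.
P = (A * B) * C;

% Data-dependent epilogue: the branch taken depends on the values
% of the product itself.
T = A * B;
if any(T(:) > 1)
    T = T ./ 2;
end
```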
Troubleshooting
- Missing fusion: Ensure the epilogue operations appear consecutively in the instruction stream with no intervening consumers. Broadcasting mismatches will also break detection.
- Row/column scale inference: The planner treats a vector of shape [1, N] (one entry per output column) as a column scale and a vector of shape [M, 1] (one entry per output row) as a row scale. Higher-rank tensors will not be treated as scale factors.
- Precision issues: GPU matmul epilogues follow the provider’s precision. If you request double on a device without FP64, RunMat will fall back to the CPU path.
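The two scale shapes are easy to confuse, so here is a small sketch of how each one broadcasts against an M-by-N product. The inputs are hypothetical and deliberately trivial (all ones) so the effect of each scale is visible in the result.

```matlab
% Hypothetical inputs: every entry of A * B equals 2.
M = 4; N = 5;
A = ones(M, 2);
B = ones(2, N);

col_scale = 1:N;            % shape [1, N]: one factor per column
row_scale = (1:M).';        % shape [M, 1]: one factor per row

by_col = A * B .* col_scale;    % column j is multiplied by col_scale(j)
by_row = A * B .* row_scale;    % row i is multiplied by row_scale(i)
```

If a scale that was meant to apply per row arrives with shape [1, M], it will either fail to broadcast or, when M equals N, silently scale columns instead, so checking size(scale) is a quick first step when a chain does not behave as expected.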
If a workload should fuse but does not, enable RUNMAT_DEBUG_FUSION=1 to have the planner print why a node was rejected, then compare against the criteria above.