# RunMat Accelerate

## Purpose

`runmat-accelerate` provides the high-level acceleration layer that integrates GPU backends with the language runtime. It implements provider(s) for `runmat-accelerate-api` so that `gpuArray`, `gather`, and (later) accelerated math and linear algebra can execute on devices transparently where appropriate.
## Architecture

- Depends on `runmat-accelerate-api` to register an `AccelProvider` implementation at startup.
- Backends (e.g., `wgpu`, `cuda`, `rocm`, `metal`, `vulkan`, `opencl`) are feature-gated. Only one provider is registered globally, but a future multi-device planner can fan out.
- `Planner` decides when to run ops on CPU vs GPU (size thresholds, op types, fusion opportunities); a size-threshold sketch follows this list.
- `Accelerator` exposes ergonomic entry points used by the runtime or higher layers.
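
To make the planner's decision concrete, here is a minimal sketch of a size-threshold placement rule. The `Planner` name comes from this crate, but the `Placement` enum, the field, and the method shown are illustrative assumptions, not the actual API:

```rust
/// Illustrative placement result; not part of the crate's real API.
#[derive(Debug, PartialEq)]
enum Placement {
    Cpu,
    Gpu,
}

struct Planner {
    /// Minimum element count before a device launch amortizes transfer cost.
    elementwise_threshold: usize,
}

impl Planner {
    /// Route an elementwise op to the GPU only when a provider is registered
    /// and the operand is large enough to be worth the round trip.
    fn place_elementwise(&self, len: usize, provider_registered: bool) -> Placement {
        if provider_registered && len >= self.elementwise_threshold {
            Placement::Gpu
        } else {
            Placement::Cpu
        }
    }
}

fn main() {
    let planner = Planner { elementwise_threshold: 1 << 16 };
    assert_eq!(planner.place_elementwise(100, true), Placement::Cpu);
    assert_eq!(planner.place_elementwise(1 << 20, true), Placement::Gpu);
}
```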
## Autograd and default optimization (planned)

- Tensor/Matrix operations will participate in reverse-mode autograd by default. The runtime records a compact tape of primitive ops; gradients are computed by chaining primitive derivatives (no provider changes required). A minimal sketch of such a tape follows this list.
- The planner and JIT will fuse common elementwise chains and simple BLAS sequences to reduce temporaries and host↔device transfers automatically.
- For providers that expose fused kernels, the planner can route differentiated graphs to those paths, improving both forward and backward performance.
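
As an illustration of the planned tape mechanism, the following is a reverse-mode autograd sketch reduced to scalars for brevity. Real tensors would record shapes and device buffers; the `Tape`/`Op` types and all names here are assumptions for exposition, not the runtime's actual design:

```rust
/// Primitive ops recorded on the tape during the forward pass.
enum Op {
    Add { lhs: usize, rhs: usize }, // out = lhs + rhs
    Mul { lhs: usize, rhs: usize }, // out = lhs * rhs
}

struct Tape {
    values: Vec<f64>,
    ops: Vec<(usize, Op)>, // (output slot, primitive op)
}

impl Tape {
    fn leaf(&mut self, v: f64) -> usize {
        self.values.push(v);
        self.values.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        let out = self.leaf(self.values[a] + self.values[b]);
        self.ops.push((out, Op::Add { lhs: a, rhs: b }));
        out
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        let out = self.leaf(self.values[a] * self.values[b]);
        self.ops.push((out, Op::Mul { lhs: a, rhs: b }));
        out
    }
    /// Walk the tape backwards, chaining primitive derivatives.
    fn backward(&self, output: usize) -> Vec<f64> {
        let mut grads = vec![0.0; self.values.len()];
        grads[output] = 1.0;
        for (out, op) in self.ops.iter().rev() {
            let g = grads[*out];
            match op {
                Op::Add { lhs, rhs } => {
                    grads[*lhs] += g;
                    grads[*rhs] += g;
                }
                Op::Mul { lhs, rhs } => {
                    grads[*lhs] += g * self.values[*rhs];
                    grads[*rhs] += g * self.values[*lhs];
                }
            }
        }
        grads
    }
}

fn main() {
    let mut t = Tape { values: vec![], ops: vec![] };
    let x = t.leaf(3.0);
    let y = t.leaf(4.0);
    let z = t.mul(x, y); // z = x * y
    let w = t.add(z, x); // w = x * y + x
    let g = t.backward(w);
    assert_eq!(g[x], 5.0); // dw/dx = y + 1
    assert_eq!(g[y], 3.0); // dw/dy = x
}
```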
## What it provides today

- An `Accelerator` scaffold with elementwise-add routing: it chooses the CPU path (delegating to `runmat-runtime`) or the GPU path (via provider methods). The GPU path currently uses upload/compute/download placeholders and is ready to be backed by a real backend; a routing sketch follows this list.
- Integration points for `gpuArray`/`gather`: when a provider is registered, runtime builtins route through the provider API defined in `runmat-accelerate-api`.
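
A minimal sketch of that routing decision, assuming a boxed provider and a simple size threshold; the `Provider` trait and all signatures here are placeholders for exposition, not the crate's real interfaces:

```rust
/// Illustrative provider interface mirroring the upload/compute/download
/// placeholders described above. Device handles are raw byte buffers here
/// purely for simplicity.
trait Provider {
    fn upload(&self, host: &[f64]) -> Vec<u8>;
    fn elementwise_add(&self, a: &[u8], b: &[u8]) -> Vec<u8>;
    fn download(&self, device: &[u8]) -> Vec<f64>;
}

struct Accelerator {
    provider: Option<Box<dyn Provider>>,
    gpu_threshold: usize, // below this size the CPU path always wins
}

impl Accelerator {
    fn add(&self, a: &[f64], b: &[f64]) -> Vec<f64> {
        match &self.provider {
            // GPU path: upload, compute on device, download the result.
            Some(p) if a.len() >= self.gpu_threshold => {
                let (da, db) = (p.upload(a), p.upload(b));
                p.download(&p.elementwise_add(&da, &db))
            }
            // CPU fallback: the real crate delegates this to runmat-runtime.
            _ => a.iter().zip(b).map(|(x, y)| x + y).collect(),
        }
    }
}

fn main() {
    // No provider registered, so the CPU fallback runs.
    let acc = Accelerator { provider: None, gpu_threshold: 1 << 16 };
    assert_eq!(acc.add(&[1.0, 2.0], &[3.0, 4.0]), vec![4.0, 6.0]);
}
```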
## How it fits with the runtime

- The MATLAB-facing builtins (`gpuArray`, `gather`) live in `runmat-runtime` for consistency with all other builtins. They call into `runmat-accelerate-api::provider()`, which is implemented and registered by this crate; a sketch of the registration pattern follows this list.
- This separation avoids dependency cycles and keeps the language surface centralized while enabling pluggable backends.
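
To make the separation concrete, here is a sketch of the register-then-look-up pattern, assuming a `OnceLock`-based global. The `AccelProvider` trait really lives in `runmat-accelerate-api`, but the method shown and the helper names are illustrative assumptions:

```rust
use std::sync::OnceLock;

/// Stand-in for the trait defined in runmat-accelerate-api; the single
/// method here is a placeholder, not the real interface.
trait AccelProvider: Send + Sync {
    fn device_name(&self) -> String;
}

static PROVIDER: OnceLock<Box<dyn AccelProvider>> = OnceLock::new();

/// Called once at process startup by the accelerate crate.
fn register_provider(p: Box<dyn AccelProvider>) {
    let _ = PROVIDER.set(p);
}

/// Called by runtime builtins such as gpuArray/gather.
fn provider() -> Option<&'static dyn AccelProvider> {
    PROVIDER.get().map(|b| b.as_ref())
}

struct CpuStub;
impl AccelProvider for CpuStub {
    fn device_name(&self) -> String {
        "cpu-stub".into()
    }
}

fn main() {
    register_provider(Box::new(CpuStub));
    if let Some(p) = provider() {
        println!("registered provider: {}", p.device_name());
    }
}
```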
## Backends

- `wgpu` (feature: `wgpu`) is the first cross-vendor target. CUDA/ROCm/Metal/Vulkan/OpenCL are planned (features already stubbed).
- Backend responsibilities (sketched as a trait after this list):
  - Allocate/free buffers, handle host↔device transfers
  - Provide kernels for core ops (elementwise, transpose, matmul/GEMM)
  - Report device information (for planner decisions)
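
The responsibilities above could be captured in a trait along these lines; this is a sketch with assumed names and signatures, not the crate's actual backend interface:

```rust
/// Illustrative device description used for planner decisions.
struct DeviceInfo {
    name: String,
    total_memory_bytes: u64,
}

/// Opaque handle to a device-side buffer.
struct BufferId(u64);

trait Backend {
    // Memory management and host<->device transfers.
    fn alloc(&mut self, bytes: usize) -> BufferId;
    fn free(&mut self, buf: BufferId);
    fn upload(&mut self, buf: &BufferId, host: &[f64]);
    fn download(&mut self, buf: &BufferId) -> Vec<f64>;

    // Kernels for core ops.
    fn elementwise_add(&mut self, a: &BufferId, b: &BufferId, out: &BufferId);
    fn transpose(&mut self, a: &BufferId, rows: usize, cols: usize, out: &BufferId);
    fn matmul(&mut self, a: &BufferId, b: &BufferId, m: usize, k: usize, n: usize, out: &BufferId);

    // Device information for planner decisions.
    fn device_info(&self) -> DeviceInfo;
}
```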
## Current state
- Compiles and wires through to the runtime via the API layer.
- CPU fallback path fully functional; GPU path ready for provider implementation.
## Roadmap

- Implement an in-process provider with a buffer registry (proof-of-concept) so that `gpuArray`/`gather` round-trip actual data without copying through a real device yet; a sketch follows this list.
- Implement the first real backend (likely `wgpu`): upload/download, elementwise add/mul/div/pow, transpose, matmul, with simple planner thresholds.
- Add streams/queues, memory pools, pinned/unified buffers, and multi-device support.
- Planner cost model and operator fusion (elementwise chains and simple BLAS fusions).
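
A sketch of that proof-of-concept registry, assuming a plain `HashMap` keyed by handle id; all names are illustrative:

```rust
use std::collections::HashMap;

/// Hypothetical in-process buffer registry: lets gpuArray/gather round-trip
/// data without a real device behind them.
#[derive(Default)]
struct BufferRegistry {
    next_id: u64,
    buffers: HashMap<u64, Vec<f64>>,
}

impl BufferRegistry {
    /// "gpuArray": park the host data in the registry and hand back a handle.
    fn upload(&mut self, host: Vec<f64>) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.buffers.insert(id, host);
        id
    }

    /// "gather": return the data behind a handle to the host.
    fn download(&mut self, id: u64) -> Option<Vec<f64>> {
        self.buffers.remove(&id)
    }
}

fn main() {
    let mut reg = BufferRegistry::default();
    let h = reg.upload(vec![1.0, 2.0, 3.0]);
    assert_eq!(reg.download(h), Some(vec![1.0, 2.0, 3.0])); // round-trip intact
}
```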
## Example usage

The provider is registered at process startup (REPL/CLI/app). Once registered, MATLAB-like code can use:

```matlab
G = gpuArray(A);  % move tensor to device
H = G + 2;        % elementwise add (planner may choose GPU path)
R = gather(H);    % bring results back to host
```