RunMat Accelerate Architecture
This document explains how RunMat routes high-level MATLAB-compatible builtins onto GPU hardware through the RunMat Accelerate subsystem. The focus is on the provider abstraction, auto-offload planner, and how new-style runtime builtins cooperate with the acceleration layer.
1. Architectural Overview
RunMat Accelerate defines a backend-agnostic façade that lives in crates/runmat-accelerate/src/lib.rs. The crate exposes three main responsibilities:
- Select and initialise an acceleration provider (wgpu, in-process fallback, etc.).
- Maintain heuristics that decide whether a builtin should execute on the CPU or on the selected device.
- Coordinate with the fusion planner to compile larger expression graphs into GPU kernels.
At runtime the Accelerator façade (crates/runmat-accelerate/src/lib.rs) sits between builtin dispatchers and the provider interface. It receives RunMat Value arguments, asks the planner where to execute, and either forwards to the legacy CPU implementation (runmat_runtime) or marshals tensors into GPU buffers before calling provider hooks.
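The sketch below models that routing with simplified stand-ins; the Value enum, gather, and cpu_add here are illustrative placeholders, not the real runmat types.

```rust
// Simplified stand-ins; the real facade works with full runmat tensors,
// shapes, and GpuTensorHandle bookkeeping.
enum Value {
    Tensor(Vec<f64>), // host-resident data (shape elided)
    GpuTensor(u64),   // opaque handle into provider-owned buffers
}

enum ExecutionTarget {
    Cpu,
    Gpu,
}

// Hypothetical entry point mirroring Accelerator::elementwise_add.
fn elementwise_add(a: Value, b: Value, target: ExecutionTarget) -> Value {
    // Gather device operands back into host vectors when required.
    let (a, b) = (gather(a), gather(b));
    match target {
        // CPU path: forward to the legacy host implementation.
        ExecutionTarget::Cpu => cpu_add(&a, &b),
        // GPU path: a real build uploads both tensors, invokes the
        // provider's elementwise hook, and downloads the result; the
        // provider calls are elided in this host-only sketch.
        ExecutionTarget::Gpu => cpu_add(&a, &b),
    }
}

fn gather(v: Value) -> Vec<f64> {
    match v {
        Value::Tensor(data) => data,
        Value::GpuTensor(_) => Vec::new(), // a real gather downloads here
    }
}

fn cpu_add(a: &[f64], b: &[f64]) -> Value {
    Value::Tensor(a.iter().zip(b).map(|(x, y)| x + y).collect())
}
```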
2. Provider Abstraction (runmat-accelerate-api)
Acceleration providers implement the AccelProvider trait defined in crates/runmat-accelerate-api/src/lib.rs. The trait is extensive: aside from core memory operations (upload, download, free), it contains optional hooks for elementwise math, reductions, linear algebra, signal processing, random number generation, sorting, set operations, cumulative scans, and fused kernels (fused_elementwise, fused_reduction).
Important pieces:
- Registration – unsafe fn register_provider installs a 'static provider instance into a global (GLOBAL_PROVIDER). The helper provider() returns an Option<&'static dyn AccelProvider> for callers.
- Structured metadata – providers can override device_info_struct() to surface vendor / backend / memory information for user-facing builtins such as gpuDevice.
- Fused kernel entry points – the fusion engine relies on fused_elementwise and fused_reduction to submit generated WGSL code and obtain new GpuTensorHandle results without falling back to the CPU.
- Convenience functions – the API offers helpers such as try_elem_add and random tensor initialisers that internally check whether a provider is registered.
Because the trait covers a superset of MATLAB semantics, providers can opt in incrementally. Missing features return Err so callers can fall back to host code.
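Below is a trimmed sketch of this opt-in shape. The hook names and signatures are illustrative rather than the real runmat-accelerate-api definitions, but the pattern is the same: required memory primitives plus optional hooks that default to an error.

```rust
// Hypothetical trimmed-down trait; the real AccelProvider covers many
// more hook families (reductions, linear algebra, RNG, sorting, ...).
type BufferId = u64;

#[derive(Debug)]
struct Unsupported;

trait AccelProvider: Sync {
    // Core memory operations every provider must supply.
    fn upload(&self, host: &[f64]) -> BufferId;
    fn download(&self, buf: BufferId) -> Vec<f64>;
    fn free(&self, buf: BufferId);

    // Optional hooks default to Err, so a provider opts in per feature
    // and callers fall back to host code on Unsupported.
    fn elem_add(&self, _a: BufferId, _b: BufferId) -> Result<BufferId, Unsupported> {
        Err(Unsupported)
    }
    fn fused_elementwise(&self, _wgsl: &str, _inputs: &[BufferId]) -> Result<BufferId, Unsupported> {
        Err(Unsupported)
    }
}
```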
3. Provider Lifecycle & Initialisation
AccelerateInitOptions in crates/runmat-accelerate/src/lib.rs packages all runtime configuration (provider preference, power preference, fallback behaviour, auto-offload options). initialize_acceleration_provider_with performs the following steps:
- Applies auto-offload configuration (configure_auto_offload ➜ global Lazy<RwLock<AutoOffloadOptions>>).
- Short-circuits if a provider is already registered.
- Registers the preferred backend:
  - With the wgpu feature enabled, it creates a WgpuProvider using backend::wgpu::provider::register_wgpu_provider, logging adapter details and triggering a warm-up to amortise shader compilation.
  - Otherwise it chooses the in-process provider (simple_provider::register_inprocess_provider) as a compatibility fallback.
During initialisation the crate also installs residency hooks via runmat_accelerate_api::register_residency_clear, enabling the runtime dispatcher to clear GPU residency metadata when tensors are gathered back to the host (crates/runmat-accelerate/src/fusion_residency.rs).
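The following condensed sketch mirrors that lifecycle, assuming a std::sync::OnceLock in place of the crate's actual global and stand-in provider types:

```rust
use std::sync::OnceLock;

// Hooks elided; see the trait sketch in section 2.
trait AccelProvider: Sync {}

// Stand-in for the crate's GLOBAL_PROVIDER behind register_provider.
static PROVIDER: OnceLock<&'static dyn AccelProvider> = OnceLock::new();

fn provider() -> Option<&'static dyn AccelProvider> {
    PROVIDER.get().copied()
}

// Hypothetical condensed initialize_acceleration_provider_with.
fn initialize(prefer_wgpu: bool) {
    // 1. Auto-offload options would be applied here (configure_auto_offload).
    // 2. Short-circuit if a provider is already registered.
    if provider().is_some() {
        return;
    }
    // 3. Register the preferred backend, leaking it to obtain 'static.
    let p: &'static dyn AccelProvider = if prefer_wgpu {
        Box::leak(Box::new(WgpuStandIn)) // real code: register_wgpu_provider + warm-up
    } else {
        Box::leak(Box::new(InProcessStandIn)) // real code: register_inprocess_provider
    };
    let _ = PROVIDER.set(p);
}

struct WgpuStandIn;
struct InProcessStandIn;
impl AccelProvider for WgpuStandIn {}
impl AccelProvider for InProcessStandIn {}
```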
4. Planner & Execution Path
The Planner (crates/runmat-accelerate/src/lib.rs) encapsulates the decision of whether to offload a builtin. The current implementation is intentionally simple: choose_elem_add checks tensor sizes and shape compatibility before returning ExecutionTarget::Gpu or ExecutionTarget::Cpu. The Accelerator::elementwise_add façade demonstrates the standard flow:
- Gather GPU handles into host tensors if required (e.g., when one operand is still a GpuTensorHandle but the provider lacks buffer bookkeeping).
- For CPU execution, call back into runmat_runtime.
- For GPU execution, upload host tensors with provider.upload, invoke the elementwise hook, download the result, and wrap it into a Value::Tensor.
Higher-level builtins follow the same pattern after passing through the new dispatcher. The planner will evolve to consult profiling data produced by NativeAutoOffload.
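A minimal planner sketch, with a hypothetical element-count threshold standing in for the real tuning values:

```rust
enum ExecutionTarget {
    Cpu,
    Gpu,
}

// Hypothetical cut-off; the real planner's numbers differ and will
// eventually be informed by NativeAutoOffload profiling data.
const MIN_GPU_ELEMS: usize = 1 << 14;

fn choose_elem_add(shape_a: &[usize], shape_b: &[usize]) -> ExecutionTarget {
    // Incompatible shapes never offload; the CPU path reports the error.
    if shape_a != shape_b {
        return ExecutionTarget::Cpu;
    }
    // Small tensors stay on the host, where transfer costs dominate.
    let elems: usize = shape_a.iter().product();
    if elems >= MIN_GPU_ELEMS {
        ExecutionTarget::Gpu
    } else {
        ExecutionTarget::Cpu
    }
}
```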
5. Native Auto-Offload Engine
crates/runmat-accelerate/src/native_auto.rs manages heuristics that automatically promote builtin arguments to GPU residency. Key details:
- Global singleton – the first call invokes initialize, checks configuration flags (environment variables or AutoOffloadOptions), and captures the active provider reference.
- Thresholds – ThresholdConfig stores minimum element counts / FLOPs before GPU execution becomes worthwhile. Defaults are conservative and can be overridden at runtime.
- Calibration – optional auto_calibrate benchmarks (compare_elemwise, compare_reduction, compare_matmul) run both CPU and provider kernels to refine thresholds.
- Dynamic decisions – methods like promote_binary and promote_reduction reuse cached GPU handles when possible, consult the fusion engine (active_fusion) to factor in upcoming element counts, and fall back to gathering values for sink builtins.
- 'like' handling – reduction helpers honour MATLAB semantics such as 'like' or 'native' prototypes by re-uploading host results when requested.
These heuristics let front-end code remain MATLAB-friendly while opportunistically keeping tensors resident on the GPU.
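A sketch of the threshold and promotion shapes, with hypothetical field names and default values:

```rust
// Hypothetical shape; the real ThresholdConfig lives in
// crates/runmat-accelerate/src/native_auto.rs and is refined by the
// optional calibration benchmarks.
struct ThresholdConfig {
    min_elemwise_elems: usize,
    min_reduction_elems: usize,
}

impl Default for ThresholdConfig {
    fn default() -> Self {
        // Conservative defaults: small workloads are cheaper on the CPU.
        Self {
            min_elemwise_elems: 1 << 16,
            min_reduction_elems: 1 << 18,
        }
    }
}

// Sketch of a promote_binary-style decision: reuse residency when an
// operand is already on the device, otherwise honour the threshold.
fn should_promote_binary(cfg: &ThresholdConfig, elems: usize, any_resident: bool) -> bool {
    any_resident || elems >= cfg.min_elemwise_elems
}
```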
6. Runtime Builtins & GPU Specs
New-style builtins inside crates/runmat-runtime/src/builtins record their GPU capabilities using inventory macros (register_builtin_gpu_spec!) and describe fusion templates via BuiltinFusionSpec (crates/runmat-runtime/src/builtins/common/spec.rs). For example:
- math/elementwise/exp.rs declares an elementwise GPU spec that points to the provider hook unary_exp and a fusion template that emits WGSL for single-input exponentials.
- math/reduction/sum.rs registers a reduction spec referencing reduce_sum_dim, marks omit-nan requirements, and exposes metadata used by the fusion planner (e.g., preferred workgroup size).
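The struct below is schematic only; the real register_builtin_gpu_spec! arguments and BuiltinFusionSpec fields are internal to runmat-runtime, but it illustrates the kind of declarative metadata a spec records instead of device plumbing.

```rust
// Schematic stand-in for the declarative GPU spec a builtin registers.
struct GpuSpecSketch {
    builtin: &'static str,
    provider_hook: &'static str, // which provider method to call
    fusible: bool,               // may join a fused WGSL kernel
    omit_nan: bool,              // MATLAB 'omitnan' semantics needed
}

const EXP_SPEC: GpuSpecSketch = GpuSpecSketch {
    builtin: "exp",
    provider_hook: "unary_exp",
    fusible: true,
    omit_nan: false,
};

const SUM_SPEC: GpuSpecSketch = GpuSpecSketch {
    builtin: "sum",
    provider_hook: "reduce_sum_dim",
    fusible: true,
    omit_nan: true,
};
```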
At execution time each builtin:
- Validates / broadcasts arguments using helpers from builtins/common.
- Consults provider hooks through runmat_accelerate_api::provider().
- Falls back to the CPU implementation when the provider reports Err.
- Delegates residency decisions to NativeAutoOffload for 'like' or auto-offload scenarios.
The dispatcher (crates/runmat-runtime/src/dispatcher.rs) automatically gathers GPU tensors when a builtin rejects device residency, ensuring MATLAB compatibility without extra user code.
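The pattern reduces to a small decision ladder; the sketch below uses simplified stand-ins for Value, the provider trait, and the gather step.

```rust
enum Value {
    Tensor(Vec<f64>),
    GpuTensor(u64),
}

struct Unsupported;

trait Provider {
    // Hypothetical hook mirroring unary_exp; Err means "not implemented".
    fn unary_exp(&self, _h: u64) -> Result<u64, Unsupported> {
        Err(Unsupported)
    }
}

fn builtin_exp(p: Option<&dyn Provider>, input: Value) -> Value {
    // Prefer the device hook when the input is resident and supported.
    if let (Some(p), Value::GpuTensor(h)) = (p, &input) {
        if let Ok(out) = p.unary_exp(*h) {
            return Value::GpuTensor(out);
        }
    }
    // Otherwise gather to the host and run the CPU implementation.
    let host = match input {
        Value::Tensor(data) => data,
        Value::GpuTensor(_) => Vec::new(), // a real gather downloads here
    };
    Value::Tensor(host.iter().map(|x| x.exp()).collect())
}
```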
7. WGPU Provider Implementation
The default GPU backend, WgpuProvider (crates/runmat-accelerate/src/backend/wgpu/provider_impl.rs), owns:
- The wgpu::Device / Queue.
- A handle table (Mutex<HashMap<u64, BufferEntry>>) used to translate GpuTensorHandle::buffer_id back into wgpu::Buffer objects.
- A WgpuPipelines bundle (crates/runmat-accelerate/src/backend/wgpu/pipelines.rs) that lazily creates compute pipelines for elementwise math, reductions, scans, permutations, convolution, random number generation, etc. Pipeline layouts share a consistent binding scheme so generated WGSL from the fusion engine can be compiled and cached (fused_pipeline_cache).
- Fused kernel helpers that hash shader sources to reuse compute pipelines across invocations. The warm-up pass executed during provider registration primes caches by dispatching representative kernels and reporting hit/miss counts for observability.
- Metrics (metrics::WgpuMetrics) that track transfer / dispatch timings. These counters feed the auto-offload cost model via the profiling subsystem.
Because the provider implements most AccelProvider hooks, builtins can offload a broad spectrum of operations while retaining a CPU fallback path.
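A sketch of that bookkeeping, with stand-in entry types in place of wgpu::Buffer and wgpu::ComputePipeline:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Simplified entry; real entries hold the wgpu::Buffer plus metadata.
struct BufferEntry {
    len: usize,
}

#[derive(Default)]
struct WgpuProviderSketch {
    buffers: Mutex<HashMap<u64, BufferEntry>>,
    fused_pipelines: Mutex<HashMap<u64, ()>>, // keyed by shader hash
    next_id: Mutex<u64>,
}

impl WgpuProviderSketch {
    // Hashing the generated WGSL lets identical fused kernels reuse a
    // compiled pipeline across invocations.
    fn pipeline_key(wgsl: &str) -> u64 {
        let mut h = DefaultHasher::new();
        wgsl.hash(&mut h);
        h.finish()
    }

    fn register_buffer(&self, len: usize) -> u64 {
        let mut id = self.next_id.lock().unwrap();
        *id += 1;
        self.buffers.lock().unwrap().insert(*id, BufferEntry { len });
        *id
    }
}
```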
8. In-Process Provider Fallback
When GPU acceleration is disabled or unavailable, simple_provider::register_inprocess_provider registers InProcessProvider from crates/runmat-accelerate/src/simple_provider.rs. This shim stores tensors in a HashMap keyed by buffer_id and implements hooks by delegating to host algorithms from runmat_runtime. It exists so that builtins which expect GPU handles keep working without forcing every configuration to ship a real GPU backend.
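A condensed sketch of the shim's storage and delegation pattern (hook names simplified):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// "Device" buffers are plain host vectors behind an id, so hooks are
// just the host algorithms wearing a provider interface.
#[derive(Default)]
struct InProcessSketch {
    buffers: Mutex<HashMap<u64, Vec<f64>>>,
    next_id: Mutex<u64>,
}

impl InProcessSketch {
    fn upload(&self, host: &[f64]) -> u64 {
        let mut id = self.next_id.lock().unwrap();
        *id += 1;
        self.buffers.lock().unwrap().insert(*id, host.to_vec());
        *id
    }

    fn elem_add(&self, a: u64, b: u64) -> Option<Vec<f64>> {
        let bufs = self.buffers.lock().unwrap();
        let (x, y) = (bufs.get(&a)?, bufs.get(&b)?);
        Some(x.iter().zip(y.iter()).map(|(p, q)| p + q).collect())
    }
}
```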
9. Residency Tracking & Gather Semantics
Some GPU tensors stay resident across fused kernels. The fusion subsystem marks handles via fusion_residency::mark and clears them when a gather occurs. The runtime dispatcher (gather_if_needed) invokes the runmat_accelerate_api::clear_residency hook before reconstructing a dense tensor, which lets higher-level logic know the buffer is no longer live on the device.
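A sketch of the hook wiring, assuming a simplified fn(u64) hook signature; the real hook is installed through runmat_accelerate_api::register_residency_clear.

```rust
use std::sync::OnceLock;

// Stand-in for the installed residency-clear hook.
static CLEAR_RESIDENCY: OnceLock<fn(u64)> = OnceLock::new();

fn register_residency_clear(hook: fn(u64)) {
    let _ = CLEAR_RESIDENCY.set(hook);
}

// Dispatcher-side gather: clear device residency metadata first, then
// materialise the dense host tensor (download elided in this sketch).
fn gather_if_needed(buffer_id: u64) -> Vec<f64> {
    if let Some(clear) = CLEAR_RESIDENCY.get() {
        clear(buffer_id);
    }
    Vec::new()
}
```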
10. Putting It Together
A typical accelerated builtin call flows as follows:
- User code invokes a builtin that was generated by runmatfunc (e.g., sum).
- The builtin checks for provider support and invokes NativeAutoOffload to promote inputs when profitable.
- The fusion planner inspects the surrounding AccelGraph (if the builtin participates in a fusion opportunity) and, when successful, emits WGSL via FusionGroupPlan::generate_wgsl or generate_reduction_wgsl. fusion_exec submits the kernel to the provider using fused_elementwise / fused_reduction. Otherwise the builtin calls a direct provider hook (e.g., reduce_sum_dim).
- Results stay on the GPU unless the caller explicitly requests a gather or the provider lacks an implementation, in which case the dispatcher materialises a CPU tensor.
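The cascade can be summarised in one illustrative dispatch function; try_fusion and try_provider_hook are hypothetical names standing in for the fusion planner and the direct provider hooks.

```rust
enum Value {
    Tensor(Vec<f64>),
    GpuTensor(u64),
}

fn dispatch_sum(input: Value) -> Value {
    // 1. Fusion first: submit generated WGSL via fused_reduction.
    if let Some(out) = try_fusion(&input) {
        return out;
    }
    // 2. Direct provider hook next (reduce_sum_dim in the real crate).
    if let Some(out) = try_provider_hook(&input) {
        return out;
    }
    // 3. Last resort: gather and run the CPU builtin.
    let host = match input {
        Value::Tensor(data) => data,
        Value::GpuTensor(_) => Vec::new(), // a real gather downloads here
    };
    Value::Tensor(vec![host.iter().sum()])
}

fn try_fusion(_v: &Value) -> Option<Value> {
    None // placeholder: no fusion group active in this sketch
}

fn try_provider_hook(_v: &Value) -> Option<Value> {
    None // placeholder: provider lacks the hook in this sketch
}
```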
This layered design lets RunMat balance MATLAB compatibility, performance portability, and incremental backend support while keeping the new builtin surface area focused on declarative specs rather than device plumbing.