RunMat Accelerate API
Purpose
`runmat-accelerate-api` defines the minimal, dependency-light interface between the language runtime and GPU acceleration providers. It exists so that:
- The runtime (`runmat-runtime`) can expose user-facing builtins like `gpuArray` and `gather` without taking a hard dependency on any particular backend implementation.
- Acceleration crates (e.g., `runmat-accelerate`) can implement providers and register themselves at process startup, enabling device-resident execution when available and falling back to CPU when not.
This crate intentionally avoids depending on other workspace crates to prevent dependency cycles.
What it contains
- `GpuTensorHandle`: A small, opaque handle describing a device-resident tensor (shape, device id, and a backend-managed buffer id). The actual device memory is fully managed by the backend.
- `AccelProvider`: The provider trait backends implement to upload/download/free GPU buffers and to report device info.
- Global provider registry: `register_provider` and `provider()` allow a single global provider to be installed by the host (e.g., `runmat-accelerate` or an application) and used by the runtime.
- Host tensor types: `HostTensorView` and `HostTensorOwned` provide simple host-side tensor representations (data + shape) without introducing a dependency on `runmat-builtins` (see the round-trip sketch below).
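For instance, a 2×2 host tensor can be wrapped in a `HostTensorView` and round-tripped through whichever provider is installed. A minimal sketch, assuming a provider has already been registered; `roundtrip` is an illustrative helper, not part of the API:

```rust
use runmat_accelerate_api::{provider, HostTensorView};

fn roundtrip() -> anyhow::Result<()> {
    // Flat element data plus an explicit shape; the view borrows, it does not own.
    let data = [1.0, 2.0, 3.0, 4.0];
    let shape = [2usize, 2];
    let view = HostTensorView { data: &data, shape: &shape };

    if let Some(p) = provider() {
        let handle = p.upload(&view)?;   // device-resident copy
        let back = p.download(&handle)?; // HostTensorOwned { data, shape }
        assert_eq!(back.shape, vec![2, 2]);
        p.free(&handle)?;                // release the backend buffer
    }
    Ok(())
}
```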
How it fits with the rest of the system
- `runmat-runtime` implements the MATLAB-facing builtins `gpuArray(x)` and `gather(x)`. These builtins call `runmat_accelerate_api::provider()` and, if a provider is registered, forward to it; otherwise they return CPU fallbacks, keeping semantics predictable even when no GPU is present (see the sketch below).
- `runmat-accelerate` implements one or more concrete providers and calls `register_provider(...)` during initialization (e.g., from the REPL/CLI startup or a host app), so the runtime automatically benefits from GPU acceleration where appropriate.
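Conceptually, that dispatch boils down to the pattern below. This is a sketch rather than the actual `runmat-runtime` code: the `Value` enum and the `gpu_array`/`gather` functions are stand-ins for the runtime's own value type and builtins.

```rust
use runmat_accelerate_api::{provider, GpuTensorHandle, HostTensorView};

// Hypothetical value type standing in for the runtime's own representation.
enum Value {
    Cpu(Vec<f64>, Vec<usize>),
    Gpu(GpuTensorHandle),
}

fn gpu_array(data: Vec<f64>, shape: Vec<usize>) -> anyhow::Result<Value> {
    match provider() {
        // A provider is registered: move the tensor to the device.
        Some(p) => {
            let view = HostTensorView { data: &data, shape: &shape };
            Ok(Value::Gpu(p.upload(&view)?))
        }
        // No provider: keep the data on the CPU with identical semantics.
        None => Ok(Value::Cpu(data, shape)),
    }
}

fn gather(v: Value) -> anyhow::Result<(Vec<f64>, Vec<usize>)> {
    match v {
        Value::Cpu(data, shape) => Ok((data, shape)),
        Value::Gpu(h) => {
            let p = provider().ok_or_else(|| anyhow::anyhow!("no provider registered"))?;
            let owned = p.download(&h)?;
            Ok((owned.data, owned.shape))
        }
    }
}
```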
Autograd & optimization (planned)
This API is intentionally minimal; reverse-mode autograd and higher-level optimizations live above this layer, in `runmat-accelerate` and the runtime/JIT. The expected integration is (see the sketch after this list):
- The runtime will provide a tape/graph for Tensor/Matrix ops (on CPU or device). Gradients are computed by composing known primitive gradients; providers can remain unaware of autograd.
- When a provider offers fused kernels, the planner/JIT can map differentiated op chains to fewer device launches (elementwise fusion, simple BLAS fusions), reducing temporaries and transfers by default.
- No changes to this API are required for basic autograd; future extensions (optional) may add hooks for fused gradient kernels and streams/queues.
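To make that layering concrete, the sketch below shows a toy, scalar-slot tape of the kind that could live above this API: it composes known primitive gradients in a reverse sweep and never touches `AccelProvider`. None of these types exist in this crate; they are purely illustrative.

```rust
// Illustrative only: a runtime-level tape that stays above the provider layer.
enum Op {
    Add { lhs: usize, rhs: usize }, // z = x + y
    Mul { lhs: usize, rhs: usize }, // z = x .* y (elementwise)
}

struct Tape {
    ops: Vec<(Op, usize)>, // (op, output slot)
}

impl Tape {
    // Reverse-mode sweep: seed the output gradient, walk ops backwards,
    // and accumulate per-slot adjoints using known primitive rules.
    fn backward(&self, values: &[f64], grads: &mut [f64], output: usize) {
        grads[output] = 1.0;
        for (op, out) in self.ops.iter().rev() {
            let g = grads[*out];
            match *op {
                Op::Add { lhs, rhs } => {
                    grads[lhs] += g; // d(x+y)/dx = 1
                    grads[rhs] += g; // d(x+y)/dy = 1
                }
                Op::Mul { lhs, rhs } => {
                    grads[lhs] += g * values[rhs]; // d(x*y)/dx = y
                    grads[rhs] += g * values[lhs]; // d(x*y)/dy = x
                }
            }
        }
    }
}

fn main() {
    // y = (a + b) * a, with slots: 0 = a, 1 = b, 2 = a + b, 3 = y
    let tape = Tape {
        ops: vec![
            (Op::Add { lhs: 0, rhs: 1 }, 2),
            (Op::Mul { lhs: 2, rhs: 0 }, 3),
        ],
    };
    let values = [3.0, 4.0, 7.0, 21.0];
    let mut grads = [0.0; 4];
    tape.backward(&values, &mut grads, 3);
    assert_eq!(grads[0], 10.0); // dy/da = 2a + b
    assert_eq!(grads[1], 3.0);  // dy/db = a
}
```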
Safety and lifetime
`register_provider` stores a `'static` reference; the caller must ensure the provider lives for the duration of the program (common for singletons created at startup). The handle (`GpuTensorHandle`) contains only POD metadata (no GC pointers) and is safe to move/copy.
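For providers that carry state (device contexts, memory pools), one common way to satisfy that contract is to leak a heap allocation once at startup. A minimal sketch; `install` is a hypothetical helper, not part of this crate:

```rust
use runmat_accelerate_api::{register_provider, AccelProvider};

/// Install a provider for the lifetime of the process (sketch).
fn install(provider: impl AccelProvider + 'static) {
    // Leaking is deliberate: the registry keeps a &'static reference,
    // so the provider must never be dropped.
    let leaked: &'static dyn AccelProvider = Box::leak(Box::new(provider));
    unsafe { register_provider(leaked) };
}
```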
API reference (concise)
```rust
pub struct GpuTensorHandle { pub shape: Vec<usize>, pub device_id: u32, pub buffer_id: u64 }

pub trait AccelProvider {
    fn upload(&self, host: &HostTensorView) -> anyhow::Result<GpuTensorHandle>;
    fn download(&self, h: &GpuTensorHandle) -> anyhow::Result<HostTensorOwned>;
    fn free(&self, h: &GpuTensorHandle) -> anyhow::Result<()>;
    fn device_info(&self) -> String;
}

pub struct HostTensorView<'a> { pub data: &'a [f64], pub shape: &'a [usize] }
pub struct HostTensorOwned { pub data: Vec<f64>, pub shape: Vec<usize> }

pub unsafe fn register_provider(p: &'static dyn AccelProvider);
pub fn provider() -> Option<&'static dyn AccelProvider>;
```
Example (provider skeleton)
```rust
use runmat_accelerate_api::{
    register_provider, AccelProvider, GpuTensorHandle, HostTensorOwned, HostTensorView,
};

struct MyProvider;

impl AccelProvider for MyProvider {
    fn upload(&self, host: &HostTensorView) -> anyhow::Result<GpuTensorHandle> {
        // Allocate a device buffer, copy host.data into it, and return a handle.
        Ok(GpuTensorHandle { shape: host.shape.to_vec(), device_id: 0, buffer_id: 1 })
    }

    fn download(&self, h: &GpuTensorHandle) -> anyhow::Result<HostTensorOwned> {
        // Copy device -> host and return an owned tensor.
        Ok(HostTensorOwned { data: vec![0.0; h.shape.iter().product()], shape: h.shape.clone() })
    }

    fn free(&self, _h: &GpuTensorHandle) -> anyhow::Result<()> { Ok(()) }

    fn device_info(&self) -> String { "MyDevice 0".to_string() }
}

fn main() {
    // At startup: install the provider for the lifetime of the process.
    unsafe { register_provider(&MyProvider) };
}
```
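Note that `&MyProvider` obtains a `'static` reference via constant promotion because `MyProvider` is a zero-sized unit struct; stateful providers can use the `Box::leak` pattern sketched under "Safety and lifetime" instead.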
Current state
- Stable trait and handle definitions.
- Used by `runmat-runtime` builtins (`gpuArray`, `gather`).
- Ready for concrete provider implementations in `runmat-accelerate`.
Roadmap
- Add optional fields (dtype, strides) without breaking existing providers (via feature flags or additive fields).
- Extend provider API for streams/queues, memory pools, unified/pinned memory, and device events.
- Multi-device provider model and selection heuristics.
- Document autograd integration points and optional provider hooks for fused gradient kernels.