GLASS is a comprehensive, header-only CUDA C++ __device__ template library for
block-local linear algebra on GPUs — BLAS, LAPACK-style factorizations and triangular
solves, dense linear-system solvers, and related algorithms, all under one single-block
calling convention. It is the foundational linear-algebra layer underneath
GRiD, MPCGPU, GATO, HJCD-IK, and other A2R Lab GPU solvers.
📖 Full documentation: https://a2r-lab.github.io/GLASS/ (source under docs/source/).
GLASS functions are __device__ helpers that operate on data already in shared or device
memory. Every function assumes it runs within one CUDA block — the caller launches one
block per independent data item — which makes GLASS a composable layer for kernels in
model-predictive control, trajectory optimization, and rigid-body dynamics.
It began as hand-rolled SIMT subroutines tuned for the very small matrices where vendor
launch/dispatch overhead dominates, and has grown into a unified single-block surface that
also wraps NVIDIA's device-side libraries — CUB (L1), cuBLASDx (L2/L3), cuSOLVERDx (LAPACK) —
under one __device__ calling convention, so one kernel can mix hand-rolled and vendor-backed
primitives without leaving the block.
GLASS exposes three primary interfaces — pick the one that matches how your problem maps onto the GPU. Two are block-scoped (one block per problem), one is warp-scoped (one warp per problem, for packing many tiny problems into a block); all share the same operations:
| Interface | Scope | What it is / when to choose it | Header |
|---|---|---|---|
| `glass::` (Block) | block | Hand-rolled SIMT, `threadIdx` / `blockDim`, no deps. The default; one moderate-to-large problem per block | `glass.cuh` |
| `glass::warp::` (Warp) | warp | Single-warp SIMT via `__shfl_*_sync` (selected L1/L2/L3 ops, no `__syncthreads`). Pack many small independent problems into one block | inline in the base headers (via `glass.cuh`) |
| `glass::nvidia::` (Nvidia) | block | CUB + cuBLASDx + cuSOLVERDx, auto-dispatched against SIMT by size (compile-time sizes). When a vendor tensor-core kernel wins at your size | `glass-nvidia.cuh` |

Note: `glass::cgrps::` (header `glass-cgrps.cuh`) is a convenience cooperative-groups alias of the Block interface: the same SIMT loop indexed via a `cooperative_groups::thread_group`, numerically identical and not a separately tuned backend.
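
To make the Scope column concrete, here is a minimal CPU-side sketch of how a caller might map CUDA thread coordinates onto problems under the two scopes. `Assignment`, `block_scoped`, and `warp_scoped` are hypothetical names used only for illustration and are not part of GLASS. The sketch assumes 32-thread warps and a block size that is a multiple of 32.

```cpp
#include <cassert>

// Hypothetical illustration type (not GLASS API): which problem a thread works
// on, its rank among the cooperating threads, and how many threads cooperate.
struct Assignment {
    int problem;  // index of the independent problem
    int rank;     // this thread's index within the cooperating group
    int group;    // number of threads cooperating on the problem
};

constexpr int kWarpSize = 32;  // assumption: 32-thread warps

// Block scope: every thread of block `block_idx` works on problem `block_idx`.
inline Assignment block_scoped(int block_idx, int thread_idx, int block_dim) {
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return {block_idx, thread_idx, block_dim};
}

// Warp scope: each warp of the block takes its own problem, so one block of
// `block_dim` threads covers block_dim / 32 problems.
inline Assignment warp_scoped(int block_idx, int thread_idx, int block_dim) {
    assert(block_dim % kWarpSize == 0);
    assert(thread_idx >= 0 && thread_idx < block_dim);
    const int warps_per_block = block_dim / kWarpSize;
    return {block_idx * warps_per_block + thread_idx / kWarpSize,
            thread_idx % kWarpSize, kWarpSize};
}
```

With 256 threads per block, the warp-scoped mapping packs eight problems into each block, whereas the block-scoped mapping gives all 256 threads to a single problem.
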
Both glass:: and glass::cgrps:: offer runtime (size as arg) and compile-time (size
as template arg) overloads. Reductions additionally offer _lowmem (no scratch)
and _fast (warp-shuffle) suffixed forms (e.g. glass::reduce_lowmem / glass::reduce_fast). The dense surface covers gemm/gemv/ger,
iamax, trsv/trmv, syrk/syr2k, inv/cholDecomp_InPlace (single and K-way fused),
ldlt/ldlt_solve, and posv/potrs; plus contraction-parallel *_reduced, tensor_*, and
congruence_* families. See the namespace & naming guide.
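
The block-scoped reductions work in place: as the notes at the end of this README state, the result lands in `x[0]` and the input may be consumed as scratch. The sketch below emulates one common way to write such a reduction on the CPU. `reduce_in_place` is a hypothetical helper, and the pairwise scheme is an assumption of this sketch rather than GLASS's actual loop.

```cpp
#include <cassert>
#include <cstddef>

// CPU emulation of an in-place pairwise (tree) reduction. On return x[0]
// holds the sum; other entries hold partial sums, so the input is lost.
// Hypothetical helper for illustration, not the GLASS implementation.
inline void reduce_in_place(float* x, std::size_t n) {
    assert(x != nullptr && n > 0);
    for (std::size_t active = n; active > 1;) {
        const std::size_t half = (active + 1) / 2;  // ceil(active / 2)
        // On a GPU each i would be handled by one thread, followed by a barrier.
        for (std::size_t i = 0; i + half < active; ++i) {
            x[i] += x[i + half];
        }
        active = half;
    }
}
```

Callers that still need the original vector after the reduction should copy it first.
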
Built on the primitives above (single-block) for the block-tridiagonal SPD systems of trajectory optimization / MPC:
| Function | What | Header |
|---|---|---|
| `glass::bdmv` | block-tridiagonal matvec (`[L\|D\|R]` strips, padded vectors) | `src/base/banded/bdmv.cuh` |
| `glass::pcg` | single-block preconditioned conjugate gradient (S x = b) | `src/base/pcg/solve.cuh` |

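
For checking a single-block PCG result on the host, a small CPU reference can help. The sketch below solves S x = b for a symmetric positive-definite S. It makes several assumptions that are not taken from GLASS: the matrix is stored dense and column-major instead of in block-tridiagonal strips, the preconditioner is Jacobi (diagonal), and `pcg_reference`, `matvec`, and `dot` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y = S * x for a dense n x n column-major matrix (stand-in for a
// block-tridiagonal matvec; the dense layout is an assumption of this sketch).
inline void matvec(const std::vector<double>& S, const std::vector<double>& x,
                   std::vector<double>& y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        double acc = 0.0;
        for (std::size_t j = 0; j < n; ++j) acc += S[i + j * n] * x[j];
        y[i] = acc;
    }
}

inline double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
    return acc;
}

// Hypothetical reference solver: Jacobi-preconditioned CG starting from x = 0.
// Returns the number of iterations performed before ||r||_2 <= tol.
inline int pcg_reference(const std::vector<double>& S, const std::vector<double>& b,
                         std::vector<double>& x, std::size_t n,
                         int max_iters, double tol) {
    assert(S.size() == n * n && b.size() == n);
    x.assign(n, 0.0);
    std::vector<double> r = b, z(n), p(n), Sp(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / S[i + i * n];  // z = M^-1 r
    p = z;
    double rz = dot(r, z);
    for (int it = 0; it < max_iters; ++it) {
        if (std::sqrt(dot(r, r)) <= tol) return it;  // converged
        matvec(S, p, Sp, n);
        const double alpha = rz / dot(p, Sp);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Sp[i];
        }
        for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / S[i + i * n];
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }
    return max_iters;
}
```

Comparing a device result against this reference with a loose tolerance is usually more robust than expecting bitwise equality, since the GPU version accumulates in a different order.
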
An internal box-constrained QP solver, glass::internal::box_qp, also lives in the tree but is
not part of the public surface (QP is optimization, not linear algebra).
```cpp
#include "glass.cuh"

__global__ void my_kernel(float* A, float* B, float* C, int m, int n, int k) {
    glass::gemm(m, n, k, 1.f, A, B, 0.f, C);  // all block threads cooperate on one problem
}

my_kernel<<<num_items, 256>>>(A, B, C, m, n, k);  // one block per data item
```

Runnable, self-contained programs (one concept each) live in `examples/`. GEMM
follows the standard BLAS convention: C is M×N with contraction dimension K (A is M×K,
B is K×N), with `TRANSPOSE_A` / `TRANSPOSE_B` operand flags and a single `ROW_MAJOR_C`
output flag; a row-major operand is just a transpose. See `examples/10_gemm_basics.cu`.
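
The shape and layout convention above can be written out as a CPU reference for checking results. This is a sketch under stated assumptions: `gemm_reference` and its boolean transpose arguments are hypothetical and do not mirror the GLASS signature, storage is column-major, and the output is always column-major (no equivalent of `ROW_MAJOR_C`).

```cpp
#include <cassert>
#include <cstddef>

// Reference for C = alpha * op(A) * op(B) + beta * C with column-major storage.
// op(A) is M x K and op(B) is K x N. A transposed operand is stored as its
// transpose: A as K x M column-major, which is the same bytes as M x K row-major.
// Hypothetical CPU helper for checking results, not the GLASS signature.
inline void gemm_reference(std::size_t M, std::size_t N, std::size_t K,
                           float alpha, const float* A, bool transpose_a,
                           const float* B, bool transpose_b,
                           float beta, float* C) {
    assert(A != nullptr && B != nullptr && C != nullptr);
    for (std::size_t j = 0; j < N; ++j) {
        for (std::size_t i = 0; i < M; ++i) {
            float acc = 0.0f;
            for (std::size_t l = 0; l < K; ++l) {
                const float a = transpose_a ? A[l + i * K] : A[i + l * M];
                const float b = transpose_b ? B[j + l * N] : B[l + j * K];
                acc += a * b;
            }
            float& c = C[i + j * M];
            // BLAS convention: with beta == 0 the old C is not read.
            c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}
```

Passing a row-major A together with `transpose_a = true` yields the same product as passing its column-major copy with `transpose_a = false`, which is the sense in which a row-major operand is just a transpose.
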
GLASS is header-only — add the repo root to your include path and #include "glass.cuh".
The pure-SIMT surface needs only nvcc -std=c++17. The glass::nvidia:: paths additionally
need NVIDIA MathDx (cuBLASDx / cuSOLVERDx) and extra flags:
| Surface | Build requirements |
|---|---|
| `glass.cuh`, `glass-cgrps.cuh` | C++17, no extra deps |
| `glass-nvidia.cuh` (L1) | C++17 + CUB (bundled with CUDA 11+) |
| `glass-nvidia.cuh` (L2/L3 GEMM/GEMV/batched) | C++17 + `--expt-relaxed-constexpr` + cuBLASDx |
| `glass-nvidia.cuh` (LAPACK) | C++17 + `--expt-relaxed-constexpr` + `-rdc=true -dlto -lcusolverdx -lcublas -lcusolver -lcudart` + cuSOLVERDx |

The nvidia wrappers auto-detect availability (GLASS_HAVE_CUBLASDX / GLASS_HAVE_CUSOLVERDX).
Full setup, linking, and the MathDx download are in bench/INSTALL.md and
the installation guide.
```shell
pip install -r test/requirements.txt
pytest test/   # compiles test/cuda/*.cu once, caches by source hash
```

`rm -rf test/build` forces a clean rebuild. Details in
`docs/source/user_guide/tutorials/running_tests.rst`.
The README is a landing page; the deep reference lives in the
hosted docs (sources in docs/source/):
| Topic | Page |
|---|---|
| API reference (L1 / L2 / L3 / nvidia / warp / banded) | api_reference/ |
| Namespaces, naming rules, and the two-axis taxonomy | concepts/namespaces.rst |
| Choosing a backend + tuning for your hardware | concepts/tuning.rst |
| `glass::nvidia::gemm` cuBLASDx-vs-SIMT dispatch | concepts/backend_dispatch.rst |
| `TRAILING_SYNC` and barrier conventions | concepts/trailing_sync.rst |
| Contraction-parallel (`*_reduced`) family | concepts/contraction_parallel.rst |
| Block-tridiagonal layout (`bdmv` / `pcg`) | concepts/block_tridiagonal.rst |
| Worked examples + quickstart | tutorials/ · examples/ |
| Benchmarks + measured sweep results | tutorials/benchmarks.rst · tutorials/sweep_results.rst |

- One block per problem. Every function runs inside a single block; launch `<<<num_items, threads>>>`.
- Column-major by default (Fortran order, matching cuBLAS). GEMM uses `TRANSPOSE_A` / `TRANSPOSE_B` + `ROW_MAJOR_C` (a row-major operand is just a transpose); GEMV keeps a per-matrix `ROW_MAJOR` flag (its transpose changes the math op); `glass::nvidia::` uses the `layout` enum per matrix (`LA` / `LB` / `LC`).
- Reductions are destructive. `dot` / `nrm2` / reduction variants write the result to `x[0]` and may consume the input as scratch; `nrm2` squares elements before reducing. The `glass::warp::` forms return the value instead.
- `cholDecomp_InPlace` fills only the lower triangle; the upper retains input values.
- `glass::nvidia::*` (default form) requires exactly `gemm_threads<T,M,N,K>()` threads; use the `BLOCK_THREADS` template parameter (with `DEFINE_NVIDIA_<NAME>_BLOCKDIM`) to launch any count ≥ `gemm_min_block_threads<T,M,N,K>()`. Compile without `-DNDEBUG` for a clean assertion instead of a silent deadlock if the launch is too small.
- `glass::nvidia::trsm` has no native non-`1.0` `alpha` (cuSOLVERDx limitation); the wrapper pre-scales `B` in shared memory before `execute`.
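
The lower-triangle note is easy to trip over when reading the factor back, so here is a minimal CPU sketch of an in-place Cholesky that leaves the strict upper triangle untouched. `chol_in_place_lower` is a hypothetical reference helper, written under the assumption of column-major storage; it is not the GLASS routine.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// In-place Cholesky A = L * L^T on a column-major n x n SPD matrix.
// Only entries with row >= col are written; the strict upper triangle keeps
// whatever the caller passed in. Hypothetical reference, not the GLASS routine.
inline void chol_in_place_lower(double* A, std::size_t n) {
    assert(A != nullptr && n > 0);
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal entry: l_jj = sqrt(a_jj - sum_k l_jk^2).
        double d = A[j + j * n];
        for (std::size_t k = 0; k < j; ++k) d -= A[j + k * n] * A[j + k * n];
        assert(d > 0.0);  // fails if the matrix is not positive definite
        const double ljj = std::sqrt(d);
        A[j + j * n] = ljj;
        // Entries below the diagonal in column j.
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = A[i + j * n];
            for (std::size_t k = 0; k < j; ++k) s -= A[i + k * n] * A[j + k * n];
            A[i + j * n] = s / ljj;
        }
    }
}
```

Code that multiplies by the factor afterwards should therefore read only the lower triangle, or zero the upper triangle explicitly before treating the buffer as L.
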