Tapp Standard Enables Performance Portability for Tensor Operations with C-Based Interface

Quantum Zeitgeist | Jan 19, 2026

Why It Matters

TAPP promises true performance portability for tensor‑heavy workloads, reducing code duplication and vendor lock‑in across scientific and AI domains. Its adoption could accelerate research by enabling seamless switching between optimized back‑ends.

Authors: Jan Brandejs (Université de Toulouse), Niklas Hörnblad (Umeå University), Edward F. Valeev (Virginia Tech) and colleagues

The lack of a unified standard for tensor operations presents a significant challenge to efficient and portable scientific computing. Jan Brandejs, Niklas Hörnblad, and Edward F. Valeev, together with collaborators, address this issue with the introduction of Tensor Algebra Processing Primitives (TAPP). This C‑based interface decouples application code from underlying hardware, promising improved performance portability and reduced dependency conflicts. The researchers provide a rigorous mathematical formulation of tensor contractions alongside a reference implementation, ensuring accuracy and enabling validation of optimised kernels. Demonstrations integrating TAPP with libraries such as TBLIS, cuTENSOR, and DIRAC highlight the potential of this community‑driven standard to streamline tensor‑based computations.

Tensor Algebra Background

The increasing prevalence of tensor algebra in diverse scientific domains necessitates a standardised approach to tensor operations. The central objective is to define a minimal set of primitive operations that can be combined to express a wide range of tensor‑algebra routines, promoting code reuse and optimisation. The research adopts a bottom‑up approach, beginning with an analysis of common tensor operations found in applications such as quantum chemistry, machine learning, and high‑performance computing.

This analysis identified a core set of 18 primitives, encompassing operations like tensor contraction, element‑wise addition, and reshaping. These primitives were then formalised with precise semantics and interfaces, ensuring unambiguous implementation and interoperability. Performance evaluations were conducted on representative hardware, including CPUs and GPUs, to demonstrate the efficiency of TAPP‑based implementations. Specific contributions include:

  • A formal specification of the 18 TAPP primitives, with detailed descriptions of inputs, outputs, and computational behaviour.

  • A reference implementation utilising both CPU and GPU back‑ends (a naive contraction loop in this correctness‑first spirit is sketched below).
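To make the contraction primitive concrete, the sketch below is a minimal, hedged illustration rather than the TAPP interface itself: a naive row‑major loop nest in C computing C[i,l] as the sum over the shared indices j and k of A[i,j,k]·B[k,j,l]. The function name, index labels, and double precision are illustrative choices; loop nests of this kind are the sort of correctness‑oriented baseline a reference implementation can use to validate optimised kernels.

    /* Minimal illustrative sketch (not the TAPP API): naive dense contraction
     * C[i,l] = sum over j,k of A[i,j,k] * B[k,j,l], with every tensor stored
     * row-major in a flat array. */
    #include <stddef.h>

    void contract_ijk_kjl(const double *A, const double *B, double *C,
                          size_t ni, size_t nj, size_t nk, size_t nl)
    {
        for (size_t i = 0; i < ni; ++i)
            for (size_t l = 0; l < nl; ++l) {
                double acc = 0.0;
                for (size_t j = 0; j < nj; ++j)
                    for (size_t k = 0; k < nk; ++k)
                        /* row-major offsets: A[i][j][k] and B[k][j][l] */
                        acc += A[(i * nj + j) * nk + k]
                             * B[(k * nj + j) * nl + l];
                C[i * nl + l] = acc;  /* row-major offset: C[i][l] */
            }
    }

An optimised back‑end would reorder loops, block for cache, or dispatch to a GPU, but it must reproduce exactly this arithmetic, which is what makes a simple reference loop nest useful for validation.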

Tensor Contraction Efficiency for Scientific Modelling

Tensor operations are fundamental to rapidly developing fields such as artificial intelligence and quantum‑science modelling, with tensor contraction being the most computationally intensive operation. The efficiency of tensor contraction directly impacts progress in material science, quantum chemistry, drug discovery, and life sciences, as it dictates the feasible size of models used in these disciplines. Improvements in contraction performance have been crucial for advancements in deep learning, quantum‑computing simulations, tensor‑network methods, and various quantum‑chemistry techniques.

Despite its importance, tensor‑contraction software lacks the maturity and organisation seen in matrix‑operation software, exhibiting fragmentation due to a diverse and growing developer base. A recent survey revealed a proliferation of scattered libraries and significant code duplication, largely attributable to the absence of standardised tensor‑operation primitives. Standardisation would enable modular code development and library reuse, mitigating the challenges of hard‑to‑replace dependencies.

This work represents an initial step toward establishing a standard for tensor contraction, defining the problem precisely and proposing a standard interface alongside a reference implementation. The approach draws inspiration from the Basic Linear Algebra Subprograms (BLAS), which have served as the de facto standard for linear‑algebra operations for over four decades. The evolution of BLAS, from Level 1 (vector operations) through Level 2 (matrix‑vector operations) to Level 3 (matrix‑matrix operations designed around memory‑hierarchy bottlenecks), highlights the importance of adapting standards to leverage hardware advances and optimise computational efficiency.

Key aspects of TAPP (a usage sketch follows this list):

  • Type‑agnostic execution API – flexible in data types; mixed‑type support depends on the underlying back‑end.

  • Reusable operation descriptions – the same tensor‑operation description can be reused for subsequent executions, encouraging efficient caching.

  • Virtual key‑value stores (VKVs) – a free‑form mechanism for providing information to the implementation (e.g., initialization, data locality).

  • Robust error handling – integer error codes with plain‑text descriptions, analogous to POSIX strerror.
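The calling pattern these bullets imply can be pictured as follows. The sketch below is a hypothetical illustration under stated assumptions: the tapp_* names are placeholders defined here as trivial stubs so the example compiles, not the published TAPP symbols. What it shows is the shape of the workflow described above: build an operation description once, execute it repeatedly so the back‑end can cache, and map integer error codes to text in the spirit of POSIX strerror.

    /* Hypothetical sketch only: the tapp_* names below are illustrative
     * placeholders (defined as trivial stubs so the example compiles),
     * not the published TAPP API. */
    #include <stdio.h>

    typedef struct { int unused; } tapp_plan;   /* placeholder operation description */

    static int tapp_plan_create(tapp_plan *p) { p->unused = 0; return 0; }
    static int tapp_execute(const tapp_plan *p, const double *A,
                            const double *B, double *C)
    { (void)p; (void)A; (void)B; (void)C; return 0; }
    static const char *tapp_error_string(int code)
    { return code ? "unspecified error" : "success"; }

    int main(void)
    {
        double A[4] = {0}, B[4] = {0}, C[4] = {0};
        tapp_plan plan;

        int err = tapp_plan_create(&plan);            /* describe the operation once */
        if (err) { fprintf(stderr, "create: %s\n", tapp_error_string(err)); return 1; }

        for (int batch = 0; batch < 3; ++batch) {     /* reuse the same description */
            err = tapp_execute(&plan, A, B, C);
            if (err) { fprintf(stderr, "execute: %s\n", tapp_error_string(err)); return 1; }
        }
        return 0;
    }

Reusing a single description across many executions, rather than re‑specifying the operation on every call, is what lets an implementation amortise plan construction or kernel selection, the caching behaviour the bullets above encourage.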

The reference implementation prioritises correctness and simplicity over raw performance, serving as a model for developers supporting TAPP. It supports arbitrary mixing of real and complex floating‑point numbers with 64‑, 32‑, and 16‑bit widths (other implementations may be more restrictive). Successful integration with existing libraries—TBLIS, cuTENSOR, and DIRAC—demonstrates the feasibility and potential of TAPP as a unifying layer for tensor algebra. By providing a common interface, TAPP enables applications to leverage the performance benefits of different tensor libraries without extensive code modifications.
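As a plain‑C illustration of what such type mixing means (generic C, not a TAPP call), the snippet below accumulates a 32‑bit real array against a 64‑bit complex one into a 64‑bit complex result; 16‑bit floating‑point types are omitted here because standard C support for them is not universal.

    /* Plain-C illustration of mixed real/complex precision, not a TAPP call. */
    #include <complex.h>
    #include <stdio.h>

    int main(void)
    {
        float          a[2] = { 0.5f, 2.0f };               /* 32-bit real */
        double complex b[2] = { 1.0 + 2.0*I, 3.0 - 1.0*I }; /* 64-bit complex */
        double complex c    = 0.0;                           /* 64-bit complex accumulator */

        for (int k = 0; k < 2; ++k)
            c += a[k] * b[k];   /* usual C conversions promote float to double complex */

        printf("c = %g%+gi\n", creal(c), cimag(c));          /* prints c = 6.5-1i */
        return 0;
    }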

The authors acknowledge that TAPP intentionally focuses on performance‑critical operations, excluding convenience functionalities like in‑memory array reshaping to maintain efficiency. Future development plans include:

  • A comprehensive, randomised benchmark suite.

  • “Multi‑TAPP”, a feature enabling dynamic selection of tensor libraries at runtime, allowing developers to benchmark and choose the optimal library for specific tasks.

The current work represents a draft standard, maintained by a dedicated decision‑making committee to ensure ongoing relevance and evolution.


More information

arXiv preprint: https://arxiv.org/abs/2601.07827
